How to Generate Audio Using Text-to-Speech AI Model Bark

Gandhali Joshi 06 Oct, 2023 • 7 min read

Introduction

Bark is an open-source, fully generative text-to-audio model created by Suno.ai. It can generate highly realistic, multilingual speech as well as background noise, music, and simple sound effects. Because it follows a GPT-style architecture, its output can deviate in unexpected ways from the given script. Typical text-to-speech (TTS) engines produce robotic, monotonous output. Bark instead generates natural-sounding voices that can feel like listening to an actual person.

Bark TTS text-to-speech generative model from Suno.ai

Learning Objectives

  • Learn about the basic usage and functionality of the Bark model, its limitations, and its applications.
  • Learn how to generate audio files from text using Python code.
  • Learn how to generate long-form speech using the NLTK and Bark libraries in Python.

This article was published as a part of the Data Science Blogathon.

Installing Bark

Let’s use a Google Colab notebook to understand the functionality and applications of Bark.

To install Bark, run the following command (in a Colab or Jupyter cell, prefix it with !):

pip install git+https://github.com/suno-ai/bark.git

Note: Don’t use ‘pip install bark’ as it installs a different package not managed by Suno.ai.
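Because two unrelated packages can both be imported as bark, it may help to confirm which one ended up in the environment. The sketch below is only a heuristic: it checks for the three names this article imports from bark. The helper names looks_like_suno_bark and check_bark_install are made up for this example.

```python
import importlib
from types import ModuleType


def looks_like_suno_bark(module: ModuleType) -> bool:
    """Return True if the module exposes the names this article imports."""
    required = ("SAMPLE_RATE", "generate_audio", "preload_models")
    return all(hasattr(module, name) for name in required)


def check_bark_install() -> str:
    """Describe which (if any) package named 'bark' is importable."""
    try:
        module = importlib.import_module("bark")
    except ImportError:
        return "bark is not installed"
    if looks_like_suno_bark(module):
        return "Suno's bark appears to be installed"
    return "a different package named bark is installed"
```

Calling print(check_bark_install()) in a notebook cell right after installation gives a quick sanity check before running the rest of the code.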

Generating Audio with Bark

Bark supports a variety of languages like English, Chinese, French, Hindi, German, etc. It also supports a Bark speaker library, which contains multiple voice prompts for supported languages. Please check the speaker library list here.

Bark comes with some pre-defined tags/notes like Background Noise, Auditorium, Silence at the Beginning, etc., which help understand speaker usage. You can set a suitable prompt in Python code using these tags based on the user’s requirement.
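Speaker presets in this article are addressed by strings such as "en_speaker_6" and "v2/en_speaker_6". The small helper below builds such a string from a language code and a speaker number. The function name speaker_prompt is hypothetical, and the 0 to 9 index range is an assumption that should be checked against the speaker library.

```python
def speaker_prompt(language: str, index: int, v2: bool = True) -> str:
    """Build a preset name such as 'v2/en_speaker_6'.

    Assumption: presets are numbered 0-9 per language; verify this
    against the speaker library before relying on the range check.
    """
    if not 0 <= index <= 9:
        raise ValueError("speaker index is assumed to be in 0..9")
    name = f"{language.lower()}_speaker_{index}"
    return f"v2/{name}" if v2 else name
```

The returned string is what you would pass as the history_prompt argument of generate_audio.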

Applications of Bark text-to-speech model

The below written Python code generates an audio file based on the selected speaker.

from bark import SAMPLE_RATE, generate_audio, preload_models
# Audio plays a generated audio array inside the notebook
from IPython.display import Audio

# download and load all models up front
preload_models()

For a given text input, the generate_audio function returns a NumPy audio array with a sample rate of 24 kHz. The history_prompt argument selects a speaker from the speaker library list. The code then uses SciPy to save the array as a .wav file for later use.

# text to convert into speech
text_prompt1 = """
A Learjet 45 aircraft with eight people on board
veered off on Thursday"""
# generate the audio NumPy array for the given text
speech_array1 = generate_audio(text_prompt1,
                history_prompt="en_speaker_6")
# save the audio as a .wav file
from scipy.io.wavfile import write as write_wav
write_wav("bark_out1.wav", SAMPLE_RATE, speech_array1)
# play the audio in the notebook (must be the last expression in the cell)
Audio(speech_array1, rate=SAMPLE_RATE)
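Some audio players handle 16-bit PCM files more reliably than floating-point WAV files. As an optional alternative to the SciPy call, the sketch below writes a 16-bit mono file using only the Python standard library. It assumes the samples are floats in roughly the range -1.0 to 1.0; the function name write_pcm16_wav is made up for this example.

```python
import struct
import wave


def write_pcm16_wav(path, samples, sample_rate=24_000):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for value in samples:
        # clip first so out-of-range values cannot overflow int16
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

For example, write_pcm16_wav("bark_out1_pcm16.wav", speech_array1, SAMPLE_RATE) would save the clip generated above. The per-sample loop is slow for long clips, so treat this as an illustration of the conversion rather than a tuned implementation.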

Bark automatically detects the language of the given text and, if no speaker is specified, generates audio with a speaker appropriate for that language. You can also add speaker hints such as Narrator, Man, or Woman to the text to steer the generated voice. However, these hints are not always respected, especially if a conflicting audio history prompt is given.

text_prompt2 = """
woman: Hi Shakira, how are you?
"""
speech_array2 = generate_audio(text_prompt2)
# play text in notebook
Audio(speech_array2, rate=SAMPLE_RATE)

Generating Non-Verbal Speech with Bark

Bark is a fully generative text-to-speech model developed for research and demo purposes. Unlike previous approaches, it converts the input text prompt directly to audio without the intermediate use of phonemes. It can, therefore, generalize to arbitrary instructions beyond speech, such as music lyrics, sound effects, or other non-speech sounds. Users can also produce non-verbal sounds with Bark, such as laughing, singing, and hesitation. Below is a list of some known non-speech sounds that Bark can generate.

  • [laughter]
  • [laughs]
  • [sighs]
  • [music]
  • [gasps]
  • [clears throat]
  • — or … for hesitations
  • ♪ for song lyrics
  • CAPITALIZATION for emphasis of a word
  • [MAN] and [WOMAN] to bias Bark toward male and female speakers, respectively

Bark can generate all types of audio and, in principle, doesn’t see a difference between speech and music. Sometimes, Bark chooses to generate text as music, but you can help it out by adding music notes around your lyrics.

The Python code below generates hesitation in speech and then a sung lyric.

text_prompt3 = """
I like Indian food but ... sometimes it's very SPICY
"""  # "..." adds hesitation; capitalization adds emphasis
speech_array3 = generate_audio(text_prompt3, history_prompt="en_speaker_4")
# play the audio in the notebook
Audio(speech_array3, rate=SAMPLE_RATE)

text_prompt4 = """
    ♪ 5 little ducks went swimming one day ♪
"""
speech_array4 = generate_audio(text_prompt4)
# play the audio in the notebook
Audio(speech_array4, rate=SAMPLE_RATE)
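The markers listed earlier can also be added to a prompt programmatically. The helpers below are a minimal sketch of that idea; their names (as_lyrics, with_speaker_bias, emphasize) are invented for this example and are not part of the Bark library.

```python
import re


def as_lyrics(text: str) -> str:
    """Wrap text in music notes to nudge Bark toward singing."""
    return f"♪ {text.strip()} ♪"


def with_speaker_bias(text: str, gender: str) -> str:
    """Prefix the text with [MAN] or [WOMAN]."""
    tag = gender.strip().upper()
    if tag not in {"MAN", "WOMAN"}:
        raise ValueError("gender must be 'man' or 'woman'")
    return f"[{tag}] {text.strip()}"


def emphasize(text: str, word: str) -> str:
    """Upper-case every whole-word occurrence of `word` in the text."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.sub(pattern, word.upper(), text, flags=re.IGNORECASE)
```

For instance, generate_audio(as_lyrics("5 little ducks went swimming one day")) would send the same prompt as text_prompt4, apart from the surrounding whitespace.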

Bark has the capability to fully clone voices, including tone, pitch, and emotion, from input audio. This can be misused to imitate well-known voices and generate fraudulent or malicious content. Because of this ethical concern, the original Bark library restricts audio history prompts to a limited set of fully synthetic options provided by Suno.ai for each supported language. These speaker prompts are listed in the Bark speaker library.

Large Sentence Audio Processing with Bark

Bark limits its output to roughly 13-14 seconds of speech per generation. If you give it a very long input text, it produces audio only for the portion that fits within that window. As Bark is a GPT-style model, its architecture is optimized to produce speech of roughly this length. To generate longer audio, split the text into smaller sentences, generate audio for each one, and then combine the resulting clips.

Follow the step-by-step process below for generating a short story audio speech using Bark.

Step 1: Use the NLTK library to split longer text into sentences and generate a list of sentences.

import nltk
nltk.download("punkt")  # tokenizer data; newer NLTK releases may need "punkt_tab" instead

story_1 = """
There was once a hare who was friends with a tortoise. One day,
he challenged the tortoise to a race. Seeing how slow the tortoise was going,
the hare thought he’d win this easily. So he took a nap while the tortoise
kept on going. When the hare woke up, he saw that the tortoise was already at
the finish line. Much to his chagrin, the tortoise won the race while he was
busy sleeping.""".replace("\n", " ")
sentences = nltk.sent_tokenize(story_1)
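Sentence-level splitting can produce very short fragments. One optional refinement, not part of the original workflow, is to greedily pack consecutive sentences into chunks up to a word budget. The budget of 30 words below is an illustrative guess rather than a documented Bark limit, and pack_sentences is a hypothetical helper name.

```python
def pack_sentences(sentences, max_words=30):
    """Greedily merge consecutive sentences into chunks of at most max_words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # start a new chunk if adding this sentence would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

You could then iterate over pack_sentences(sentences) instead of sentences when generating audio, adjusting max_words until each chunk fits comfortably within Bark's output window.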

Step 2: Loop over the sentences, generate an audio array for each one with Bark's generate_audio function, and append a quarter second of silence after each sentence.

import numpy as np

SPEAKER = "v2/en_speaker_6"
# quarter second of silence
silence = np.zeros(int(0.25 * SAMPLE_RATE))
pieces = []
for sentence in sentences:
    audio_array = generate_audio(sentence, history_prompt=SPEAKER)
    pieces += [audio_array, silence.copy()]

Step 3: Concatenate the generated audio arrays and play the combined clip to hear the full narration.

Audio(np.concatenate(pieces), rate=SAMPLE_RATE)
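The arithmetic behind the silence padding and the final clip length can be checked with plain Python. The sketch below assumes the 24 kHz sample rate mentioned earlier; the helper names are made up for this example.

```python
def seconds_to_samples(seconds: float, sample_rate: int = 24_000) -> int:
    """Number of samples that cover the given duration."""
    return int(seconds * sample_rate)


def combined_duration(piece_lengths, pause_seconds=0.25, sample_rate=24_000):
    """Duration in seconds of pieces joined with a pause after each one."""
    pause = seconds_to_samples(pause_seconds, sample_rate)
    total_samples = sum(length + pause for length in piece_lengths)
    return total_samples / sample_rate
```

A quarter second at 24 kHz is 6,000 samples, which is the length of the silence array above. Since pieces alternates audio and silence, combined_duration(len(p) for p in pieces[::2]) gives the expected length of the concatenated clip in seconds.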

The combined audio is a good-quality narration of the entire story.

Final audio file created using Suno AI's Bark

Improving Generated Speech

If the given text is too short, Bark adds a little extra audio at the end of the prompt on its own, which degrades the output. Here’s an example.

text_prompt5 = """
   what happened my friend?
"""
speech_array5 = generate_audio(text_prompt5,history_prompt="v2/en_speaker_6")
# play text in notebook
Audio(speech_array5, rate=SAMPLE_RATE)

Output of the above code:

Audio output of sample code

In the above example, the generated 5-second clip for a single short line has the last 3 seconds blank. To avoid this, use the min_eos_p parameter of the generate_text_semantic function. It sets the probability threshold for the end-of-sequence token at which Bark stops generating. Lowering this threshold makes generation stop earlier, which removes the extra audio.
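To make the role of the threshold concrete, here is a toy model of the stopping rule. It assumes that generation ends at the first step where the model's end-of-sequence probability reaches min_eos_p. The probabilities are made-up values, and this is not Bark's actual sampling loop.

```python
def steps_until_stop(eos_probs, min_eos_p):
    """Return the 1-based step at which generation would stop, or None."""
    for step, p in enumerate(eos_probs, start=1):
        if p >= min_eos_p:
            return step
    return None


# made-up end-of-sequence probabilities for five generation steps
eos_probs = [0.01, 0.03, 0.08, 0.15, 0.30]

print(steps_until_stop(eos_probs, 0.2))   # stops at step 5
print(steps_until_stop(eos_probs, 0.05))  # stops at step 3
```

With the lower threshold, the same sequence stops two steps earlier, which is the effect used to cut off the extra audio.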

Here are the steps to follow to improve the generated audio:

  1. Use the generate_text_semantic function to generate semantic tokens from a given text.
  2. Reduce the value of the min_eos_p parameter to 0.05 (the default value is 0.2).
  3. Use the semantic_to_waveform function for generating a numpy audio array.

Due to the reduced probability threshold min_eos_p, generation stops earlier and results in a shorter audio clip of about 2 seconds. Please check the reference code below for more details.

from bark.api import semantic_to_waveform
from bark.generation import generate_text_semantic

# a lower min_eos_p makes generation more likely to end early
semantic_token5 = generate_text_semantic(
    text_prompt5,
    history_prompt="v2/en_speaker_6",
    min_eos_p=0.05,
)
speech_array6 = semantic_to_waveform(semantic_token5, history_prompt="v2/en_speaker_6")
# play the audio in the notebook
Audio(speech_array6, rate=SAMPLE_RATE)
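A complementary post-processing idea, separate from the min_eos_p approach above, is to trim a silent tail after generation. The sketch below is an assumption-laden illustration: it only helps when the extra audio is actually near-silent, the amplitude threshold of 0.01 is an arbitrary starting point, and trim_trailing_silence is a hypothetical helper.

```python
def trim_trailing_silence(samples, threshold=0.01, keep=0):
    """Drop trailing samples whose absolute value is below threshold.

    keep: number of quiet samples to retain after the last loud one.
    Works on any sliceable sequence of numbers (lists or NumPy arrays).
    """
    end = len(samples)
    # walk backwards until a sample at or above the threshold is found
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[: min(len(samples), end + keep)]
```

For example, trim_trailing_silence(speech_array5, keep=int(0.1 * SAMPLE_RATE)) would keep a tenth of a second of quiet after the last audible sample.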

Conventional text-to-speech models generated robotic, monotonous speech that had limited uses. With deep learning algorithms, the latest TTS models can mimic human speech patterns and intonation. These advances make it possible to build more engaging, natural-sounding applications such as emotional TTS, singing TTS, multilingual TTS, and voice cloning.

Emerging trends in text-to-speech technology

Conclusion

Bark is an open-source, GPT-style generative text-to-speech model with a variety of applications. Its use cases include creating multilingual audiobooks and podcasts and generating sound effects for TV shows and video games. It is most helpful when you need natural-sounding output, multi-speaker conversation, or music creation. Since Bark focuses on generating highly realistic, human-like voices, it sometimes adds background music or noise to the audio. If this is undesirable for your use case, the noise can be removed with external audio editing tools.

Key Takeaways

  • Bark is a generative model that produces highly realistic, natural, human-sounding audio.
  • It’s a unique model that can produce sound effects like laughing, crying, and music.
  • You can generate high-quality speech with Bark by using audio formatting techniques and adjusting threshold parameters.

Frequently Asked Questions

Q1. What is the use of Bark AI?

Ans. Bark is a generative text-to-speech model which can produce highly expressive and emotive voices. It offers a fantastic experience of listening to actual human beings.

Q2. Is Bark AI available for commercial use?

Ans. Bark is a transformer-based text-to-speech model developed by Suno.ai. The model is licensed under the MIT License, meaning it’s available for commercial use.

Q3. Can I use a custom voice for cloning using Bark?

Ans. The original Bark model restricts voice generation to the limited set of speaker prompts available in the Bark speaker library. By combining HuBERT with Bark, it is possible to generate audio from a custom voice. Check here for more details.

The media shown in this article is not owned by Analytics Vidhya and is used at the author’s discretion.
