How to Clone Voice and Lip-Sync a Video Like a Pro Using Open-source Tools

Sunil Kumar Dash 14 Apr, 2024 • 9 min read

Introduction

AI voice cloning has taken social media by storm and opened up a world of creative possibilities. You have probably seen memes or AI voice-overs of famous personalities on social media. Have you wondered how it is done? Sure, many platforms such as Eleven Labs provide APIs, but can we do it for free, using open-source software? The short answer is yes. The open-source ecosystem has TTS models and lip-syncing tools that can achieve voice synthesis. So, in this article, we will explore open-source tools and models for voice cloning and lip-syncing.

AI voice cloning and lip syncing using open-source tools

Learning Objectives

  • Explore open-source tools for AI voice cloning and lip-syncing.
  • Use FFmpeg and Whisper to transcribe videos.
  • Use Coqui-AI’s xTTS model to clone voices.
  • Use Wav2Lip for lip-syncing videos.
  • Explore real-world use cases of this technology.

This article was published as a part of the Data Science Blogathon.

Open-Source Stack

As you already know, we will use OpenAI’s Whisper, FFmpeg, Coqui-ai’s xTTS model, and Wav2Lip as our tech stack. But before diving into the code, let’s briefly discuss these tools. Thanks are due to the authors of these projects.


Whisper: Whisper is OpenAI’s ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer trained on over 680k hours of diverse audio data and corresponding transcripts, which makes it very capable at multi-lingual transcription.

The encoder receives the log-mel spectrogram of 30-second chunks of audio. Each encoder block uses self-attention to capture different parts of the audio signal. The decoder receives the hidden states from the encoder along with learned positional encodings and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
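To make that flow concrete, here is a minimal sketch using Whisper’s lower-level API (for illustration only; sample.wav is a hypothetical audio file, and the pipeline later in this article uses the simpler transcribe() call):

import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the encoder expects
audio = whisper.load_audio("sample.wav")  # hypothetical file name
audio = whisper.pad_or_trim(audio)

# Compute the log-mel spectrogram that is fed to the encoder
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Detect the spoken language from the encoder representation
_, probs = whisper.detect_language(model, mel)
print("Detected language:", max(probs, key=probs.get))

# The decoder predicts text tokens from the encoder states
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(result.text)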

Coqui TTS: TTS is an open-source library from Coqui-ai that hosts multiple text-to-speech models. It has end-to-end models like Bark, Tortoise, and xTTS; spectrogram models like Glow-TTS and FastSpeech; and vocoders like HiFi-GAN and MelGAN. Moreover, it provides a unified API for inferencing, fine-tuning, and training text-to-speech models. In this project, we will use xTTS, an end-to-end multi-lingual voice-cloning model. It supports 17 languages, including English, Japanese, Hindi, Mandarin, etc. For more information about TTS, refer to the official TTS repository.
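As a quick illustration of that unified API, you can list the models the library serves (a small sketch; the exact catalogue depends on your TTS version):

from TTS.api import TTS

# Print the names of all models served by the TTS library
print(TTS().list_models())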

Wav2Lip: Wav2Lip is a Python repository for the paper “A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild.” It uses a lip-sync discriminator to recognize face and lip movements, which works out great for dubbing voices. For more information, refer to the official repository. We will use this forked repository of Wav2Lip.

Workflow

Now that we are familiar with the tools and models we will use, let’s understand the workflow. It is a simple one; here is what we will do.

  • Upload a video to the Colab runtime and resize it to 720p format for better lip-syncing.
  • Use FFmpeg to extract 24-bit audio from the video and use Whisper to transcribe the audio file.
  • Use Google Translate or an LLM to translate the transcribed script to another language.
  • Load the multi-lingual xTTS model with the TTS library and pass it the translated script and the reference audio for voice synthesis.
  • Clone the Wav2lip repository and download model checkpoints. Run the inference.py file to sync the original video with synthesized audio.
"

Now, let’s delve into the code.

Step 1: Install Dependencies

This project requires significant RAM and GPU resources, so it is prudent to use a Colab runtime. The free-tier Colab provides about 12 GB of system RAM and a 15 GB T4 GPU, which should be sufficient for this project. So, head over to Colab and connect to a GPU runtime; the GPU significantly speeds up both the voice synthesis and the lip-syncing steps.
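Before installing anything, you can confirm that a GPU is actually attached to the runtime:

# Should list a Tesla T4 (or similar) if a GPU runtime is connected
!nvidia-smi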

Now, install the TTS library and Whisper.

!pip install TTS
!pip install git+https://github.com/openai/whisper.git 

Step 2: Upload Videos to Colab

Now, we will upload a video and resize it to 720p. Wav2Lip tends to perform better when videos are in 720p format. This can be done with FFmpeg.

#@title Upload Video

from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
  global uploaded
  global video_path  # Declare video_path as global to modify it
  uploaded = files.upload()
  for filename in uploaded.keys():
    print(f'Uploaded {filename}')
    if resize_to_720p:
        filename = resize_video(filename)  # Get the name of the resized video
    video_path = filename  # Update video_path with either original or resized filename
    return filename


def resize_video(filename):
    output_filename = f"resized_{filename}"
    cmd = f"ffmpeg -i {filename} -vf 'scale=-1:720' {output_filename}"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
  with output:
    global video_path
    global resize_to_720p
    resize_to_720p = checkbox.value
    video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)

This will output a form button for uploading videos from your local device and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it with FFmpeg via subprocess, as sketched below.
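If you go the manual route, a minimal sketch of the FFmpeg resize looks like this (my_video.mp4 is a hypothetical filename; replace it with your uploaded file):

import subprocess

video_path = "my_video.mp4"  # hypothetical: a file uploaded via the Colab file browser

# Scale the video to 720p height while preserving the aspect ratio
subprocess.run(f"ffmpeg -i '{video_path}' -vf 'scale=-1:720' 'resized_{video_path}'", shell=True)
video_path = f"resized_{video_path}"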

Step 3: Audio Extraction and Whisper Transcription

Now that we have our video, the next thing we will do is extract audio using FFmpeg and use Whisper to transcribe.

# @title Audio extraction (24 bit) and Whisper transcription
import subprocess

# Ensure video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = (
        f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 "
        f"-q:a 0 -map a -y 'output_audio.wav'"
    )
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")

whisper_text = result["text"]
whisper_language = result['language']

print("Whisper text:", whisper_text)

This will extract 24-bit audio from the video and transcribe it with the Whisper base model. For better transcription, use the Whisper small or medium models, as shown below.
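For example, switching to the small model is a one-line change (a quick sketch; larger models need more GPU memory and take longer to run):

import torch
import whisper

# "small" or "medium" are usually noticeably more accurate than "base"
model = whisper.load_model("small")
result = model.transcribe("output_audio.wav", fp16=torch.cuda.is_available())
print(result["text"])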

Step 4: Voice Synthesis

Now, to the voice-cloning part. As mentioned before, we will use Coqui-ai’s xTTS model, one of the best open-source models for voice synthesis. Coqui-ai also provides many other TTS models for different purposes; do check them out. For our use case, which is voice cloning, we will use the xTTS v2 model.

Load the xTTS model. It is a large model, about 1.87 GB in size, so this will take a while.

# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display  # Import the Audio and display modules

device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

xTTS currently supports 17 languages. Here are the ISO codes of the languages the model supports.

print(tts.languages)


['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']

Note: Languages like English and French do not have a character limit per input, while Hindi has a limit of 250 characters. A few other languages may have limits as well.

For this project, we will use Hindi; you can experiment with other languages as well.

So, the first thing we need now is to translate the transcribed text into Hindi. This can be done either with a Google Translate package or with an LLM. In my observation, GPT-3.5 Turbo performs much better than Google Translate, so we will use the OpenAI API for the translation.

import openai

# Replace "api_key" with your OpenAI API key
client = openai.OpenAI(api_key="api_key")
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Translate the text to Hindi: {whisper_text}"},
    ],
)
# .content holds the translated string
translated_text = completion.choices[0].message.content
print(translated_text)
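If you would rather avoid a paid API, the same step can be done with a translation package. Here is a hedged sketch using the deep-translator package (an assumption on my part; it is not used in the original notebook and needs pip install deep-translator):

# !pip install deep-translator
from deep_translator import GoogleTranslator

# Translate the Whisper transcript to Hindi via the free Google Translate endpoint
translated_text = GoogleTranslator(source="auto", target="hi").translate(whisper_text)
print(translated_text)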

As we know, Hindi has a character limit, so we need to pre-process the text before passing it to the TTS model by splitting it into chunks of fewer than 250 characters.

# Split on the Hindi full stop and re-pack into chunks under 250 characters
text_chunks = translated_text.split(sep="।")
final_chunks = [""]
for chunk in text_chunks:
    chunk = chunk.strip()
    if not chunk:
        continue  # skip empty pieces (e.g., after the trailing "।")
    if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
        final_chunks[-1] += chunk + "।"
    else:
        final_chunks.append(chunk + "।")
final_chunks

This is a very simple splitter. You can write a more robust one or use LangChain’s recursive text splitter, as sketched below. Either way, each chunk is then passed to the TTS model, and the resulting audio files are merged with FFmpeg.
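Here is that alternative as a hedged sketch with LangChain’s RecursiveCharacterTextSplitter (an assumption on my part; it is not part of the original notebook and requires pip install langchain-text-splitters):

# !pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Prefer splitting on the Hindi full stop, then spaces, keeping chunks under 250 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,
    chunk_overlap=0,
    separators=["।", " ", ""],
    keep_separator=True,
)
final_chunks = splitter.split_text(translated_text)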

def audio_synthesis(text, file_name):
    # Clone the voice from the extracted reference audio and synthesize the Hindi text
    tts.tts_to_file(
        text,
        speaker_wav="output_audio.wav",
        file_path=file_name,
        language="hi",
    )
    return file_name

file_names = []
for i, chunk in enumerate(final_chunks):
    file_names.append(audio_synthesis(chunk, f"output_synth_audio_{i}.wav"))

As all the files have the same codec, we can easily merge them with FFmpeg. To do this, create a text file, say my_files.txt, and add the file paths in FFmpeg’s concat format.

# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
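Rather than typing this list by hand, you can generate it from the file_names list built in the previous step (a small sketch):

# Write the FFmpeg concat list from the synthesized chunk filenames
with open("my_files.txt", "w") as f:
    for name in file_names:
        f.write(f"file '{name}'\n")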

Now, run the code below to merge files.

import subprocess

cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)

This will output the final concatenated audio file. You can also play the audio in Colab.

from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))

Step 5: Lip-Syncing

Now, to the lip-syncing part. To lip-sync our synthesized audio with the original video, we will use the Wav2Lip repository. To use Wav2Lip, we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, free up memory by deleting the xTTS and Whisper models from the current Colab session, or restart the session.

import torch

# Free GPU memory before loading Wav2Lip
try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

torch.cuda.empty_cache()

Now, clone the Wav2Lip repository and download the checkpoints.

# @title Dependencies
%cd /content/

!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt

%cd /content/Wav2Lip

!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'

!pip install batch-face

Wav2Lip ships two checkpoints for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort around face detection but produces slightly inferior lip-sync, while the non-GAN model can produce better results with more manual padding and rescaling of the detection box. Try both and see which one works better for your video.

Run the inference with the model checkpoint path, video, and audio files.

%cd /content/Wav2Lip

# This is the detection-box padding; adjust in case of poor results.
# Usually, the bottom padding is the biggest issue.
pad_top = 0
pad_bottom = 15
pad_left = 0
pad_right = 0
rescaleFactor = 1

video_path_fix = f"'../{video_path}'"

!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
--face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
--pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor \
--nosmooth --outfile '/content/output_video.mp4'

This will output a lip-synced video. If the result doesn’t look good, adjust the padding and rescale parameters and retry.
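To pull the result out of Colab, you can download it directly from the runtime:

from google.colab import files

# Download the lip-synced video produced by inference.py
files.download("/content/output_video.mp4")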

So, here is the repository for the notebook and a few samples.

GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync

Real-world Use Cases

Video voice-cloning and lip-syncing technology have many use cases across industries. Here are a few areas where this can be beneficial.

Entertainment: The entertainment industry will be the most affected of all, and we are already witnessing the change. Voices of celebrities of current and bygone eras can be synthesized and reused. This also poses ethical challenges; synthesized voices should be used responsibly and within the bounds of the law.

Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.

Communication: Language has always been a barrier to all sorts of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one’s accent and voice would revolutionize the way we communicate, and it might become a reality in a few years.

Content Creation: Content creators will no longer depend on translators to reach a bigger audience; with efficient voice cloning and lip-syncing, cross-language content creation becomes easier. Podcast and audiobook narration can also be enhanced with voice synthesis.

Conclusion

Voice synthesis is one of the most sought-after applications of generative AI, and it has the potential to revolutionize the way we communicate. Ever since the dawn of civilization, the language barrier between communities has been a hurdle to forging deeper cultural and commercial relationships. AI voice synthesis can help close this gap. In this article, we explored the open-source way of voice cloning and lip-syncing.


Key Takeaways

  • TTS, a Python library by Coqui-ai, serves and maintains popular text-to-speech models.
  • xTTS is a multi-lingual voice-cloning model capable of cloning voices in 17 different languages.
  • Whisper is an ASR model from OpenAI for efficient transcription and English translation.
  • Wav2lip is an open-source tool for lip-syncing videos.
  • Voice cloning is one of the most active frontiers of generative AI, with a significant potential impact on industries from entertainment to marketing.

Frequently Asked Questions

Q1. Is AI voice cloning legal?

A. Cloning someone’s voice without consent can be illegal, as it may infringe on that person’s rights. Getting permission from the person before cloning their voice is the right way to go about it.

Q2. What is the AI tool for lip sync?

A. Several AI tools can lip-sync video to audio. This article uses the open-source Wav2Lip model, which matches lip movements in a video to a given audio track.

Q3. What is the AI that makes your lips move?

A. Lip-syncing models such as Wav2Lip make the lips in a video move in time with speech, using algorithms that synchronize lip movements with the audio.

Q4. Is lip-synching illegal?

A. Lip-synching itself isn’t illegal, but using it to deceive or misrepresent in certain contexts, like performances or presentations, could be considered fraud or breach of contract.

Q5. What is the use of voice cloning?

A. Voice cloning can be beneficial for a range of use cases, such as content creation, narration in games and movies, Ad campaigns, etc.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

