Video Summarization Using OpenAI Whisper and Hugging Chat API

Ritika Gupta 07 Sep, 2023
6 min read


“Less is more,” as architect Ludwig Mies van der Rohe famously said, and that is the essence of summarization. Summarization condenses voluminous textual content into succinct, relevant pieces, which suits today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, such as OpenAI’s GPT-3-based models, has changed this process: these models not only extract key elements from text but also generate coherent summaries that retain the source’s essence. Generative AI’s capabilities also extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and concepts from videos to create abridged representations of the content. You can achieve video summarization in several ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary from the video’s transcript.

OpenAI’s Whisper uses automatic speech recognition to convert spoken language into written text, which gives the summarization step an accurate transcript to work from. The Hugging Chat API, in turn, gives programmatic access to the chat-oriented large language models available through Hugging Face’s HuggingChat.

Learning Objectives

In this article, we will:

  • Learn about video summarization techniques
  • Understand the applications of video summarization
  • Explore the OpenAI Whisper model architecture
  • Implement textual video summarization using OpenAI Whisper and the Hugging Chat API

This article was published as a part of the Data Science Blogathon.

Video Summarization Techniques

Video Analytics

Video analytics is the process of extracting meaningful information from a video. It typically uses deep learning to detect and track objects and actions and to identify scenes. Some popular techniques for video summarization are:

Keyframe Extraction and Shot Boundary Detection

Keyframe extraction reduces the video to a limited number of representative still frames. When short clips (keyshots) are selected instead of single frames, the resulting shorter video is called a video skim.

A video shot is an uninterrupted, continuous series of frames. Shot boundary detection finds the transitions between shots, such as cuts, fades, or dissolves, and chooses frames from each shot to build a summary. The major steps for extracting a continuous short summary video from a longer video are:

  • Frame Extraction – Snapshots are extracted from the video at a reduced rate, for example 1 fps for a 30 fps video.
  • Face and Emotion Detection – Faces are detected in the extracted frames, for example with SSD (Single Shot MultiBox Detector), and each face is given an emotion score.
  • Frame Ranking & Selection – Frames are ranked by emotion score, and the highest-scoring frames are selected.
  • Final Extraction – Subtitles are extracted from the video along with their timestamps. The sentences corresponding to the selected frames are then retrieved, together with their start and end times in the video. Finally, the video parts corresponding to these intervals are merged to generate the final summary video.
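
The sampling, ranking, and merging steps above can be sketched in plain Python. This is a minimal illustration and not the pipeline of any particular tool: the function names are hypothetical, and the emotion scores and subtitle intervals are made-up inputs that a real system would obtain from a face/emotion detector and a subtitle parser.

```python
from typing import Dict, List, Tuple


def sample_frame_indices(total_frames: int, video_fps: int, target_fps: int = 1) -> List[int]:
    """Return the indices of the frames kept when sampling at target_fps."""
    step = max(1, video_fps // target_fps)  # every 30th frame for 30 fps -> 1 fps
    return list(range(0, total_frames, step))


def top_k_frames(scores: Dict[int, float], k: int) -> List[int]:
    """Rank frames by score (highest first) and return the k best in time order."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(best)


def merge_intervals(intervals: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Merge overlapping (start, end) intervals given in seconds."""
    merged: List[Tuple[float, float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical example: a 3-second clip at 30 fps, sampled at 1 fps
indices = sample_frame_indices(total_frames=90, video_fps=30)
scores = {0: 0.2, 30: 0.9, 60: 0.7}          # made-up emotion scores per sampled frame
selected = top_k_frames(scores, k=2)
# Made-up subtitle intervals (in seconds) belonging to the selected frames
summary_parts = merge_intervals([(1.0, 2.5), (2.0, 3.0)])
```

The merged intervals are the time ranges that would be cut from the source video and concatenated into the summary.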

Action Recognition and Temporal Subsampling

Action recognition tries to identify the human actions performed in a video and is a widely used application of video analytics. The video is broken down into short subsequences instead of individual frames, and the action performed in each segment is estimated using classification and pattern recognition techniques such as hidden Markov models (HMMs).
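
As a rough sketch of the segmentation step, the snippet below splits a frame sequence into fixed-length, overlapping subsequences. The segment length, stride, and function name are illustrative assumptions. Each resulting segment would then be passed to an action classifier, which is not shown here.

```python
from typing import List, Tuple


def split_into_segments(num_frames: int, segment_len: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, that cover the video."""
    if segment_len <= 0 or stride <= 0:
        raise ValueError("segment_len and stride must be positive")
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + segment_len, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break  # the last segment reached the end of the video
        start += stride
    return segments


# 100 frames, 40-frame segments, moving forward 30 frames each time
segments = split_into_segments(num_frames=100, segment_len=40, stride=30)
```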

Single and Multi-modal Approaches

This article uses a single-modal approach: only one aspect of the video, its audio, is used. The audio is converted to text, and the summary is generated from that text.

A multi-modal approach combines information from several modalities, such as audio, visuals, and text, to build a more holistic understanding of the video content and produce a more accurate summary.
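
One simple way to combine modalities is late fusion: score each video segment separately per modality, then take a weighted average. The sketch below assumes such per-segment scores already exist. The scores, weights, and function name are made up for illustration, and real multi-modal systems often learn the fusion rather than fixing the weights by hand.

```python
from typing import Dict, List


def fuse_scores(modality_scores: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-segment importance scores across modalities."""
    total_weight = sum(weights[name] for name in modality_scores)
    num_segments = len(next(iter(modality_scores.values())))
    fused = []
    for i in range(num_segments):
        weighted = sum(weights[name] * scores[i] for name, scores in modality_scores.items())
        fused.append(weighted / total_weight)
    return fused


# Made-up importance scores for three segments
scores = {
    "audio": [0.0, 1.0, 0.5],
    "visual": [1.0, 1.0, 0.0],
    "text": [0.5, 1.0, 0.5],
}
weights = {"audio": 1.0, "visual": 1.0, "text": 2.0}
fused = fuse_scores(scores, weights)
# Index of the segment with the highest fused score
best_segment = max(range(len(fused)), key=fused.__getitem__)
```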

Applications of Video Summarization

Before diving into the implementation, it helps to know where video summarization is used. Below are some examples from a variety of fields and domains:

  • Security and Surveillance: Video summarization makes it possible to analyze large amounts of surveillance footage and highlight important events without manually reviewing the whole video.
  • Education and Training: Key points from lectures and training videos can be delivered as summaries, so students can revise the content without watching the whole video.
  • Content Browsing: YouTube uses this to highlight the parts of a video relevant to a user’s search, helping users decide whether they want to watch that particular video.
  • Disaster Management: In emergencies and crises, video summarization allows responders to act on the situations highlighted in the video summary.

Open AI Whisper Model Overview

OpenAI’s Whisper is an automatic speech recognition (ASR) model. It is used for transcribing speech audio into text.

Architecture of Open AI Whisper Model

It is based on the transformer architecture, which stacks encoder and decoder blocks and uses an attention mechanism to propagate information between them. It takes the audio recording, divides it into 30-second pieces, and processes each one individually. For each 30-second piece, the encoder encodes the audio while preserving the position of each spoken word, and the decoder uses this encoded information to determine what was said.
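
The 30-second windowing can be illustrated with a few lines of Python. Whisper works on 16 kHz audio, so one window is 480,000 samples. The helper below is a hypothetical stand-in that only computes window boundaries; it is a simplification and not Whisper's own code.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # Whisper operates on 16 kHz audio
CHUNK_SECONDS = 30     # length of one processing window


def chunk_boundaries(num_samples: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of consecutive 30-second windows."""
    chunk_size = SAMPLE_RATE * CHUNK_SECONDS
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 70-second recording yields two full windows and one shorter remainder
windows = chunk_boundaries(70 * SAMPLE_RATE)
durations = [(end - start) / SAMPLE_RATE for start, end in windows]
```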

From this encoded information, the decoder predicts tokens, which roughly correspond to the words spoken. It then repeats the process for the following word, using the same encoded information together with the words predicted so far to identify the next one that makes the most sense.
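
This word-by-word loop is usually called autoregressive decoding. The toy sketch below shows the greedy version of that loop. The lookup table stands in for the real decoder network and is entirely made up; Whisper itself predicts sub-word tokens from the encoded audio and the tokens generated so far.

```python
from typing import Callable, List, Tuple

END = "<end>"

# Toy "model": maps the tokens generated so far to the most likely next token.
# A real decoder would also condition on the encoded audio.
TOY_NEXT_TOKEN = {
    (): "less",
    ("less",): "is",
    ("less", "is"): "more",
    ("less", "is", "more"): END,
}


def toy_predict(prefix: Tuple[str, ...]) -> str:
    """Look up the next token for a prefix; unknown prefixes end the sequence."""
    return TOY_NEXT_TOKEN.get(prefix, END)


def greedy_decode(predict: Callable[[Tuple[str, ...]], str], max_tokens: int = 10) -> List[str]:
    """Repeatedly append the predicted next token until END or max_tokens."""
    tokens: List[str] = []
    for _ in range(max_tokens):
        next_token = predict(tuple(tokens))
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens


decoded = greedy_decode(toy_predict)
```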

Whisper model task flowchart

Coding Example for Video Textual Summarization

Flowchart of Textual Video Summarization

1 – Install and Load Libraries

!pip install yt-dlp openai-whisper hugchat

import yt_dlp
import whisper
from hugchat import hugchat

2 – Download Audio from the YouTube Video

# Download the audio track of a YouTube video, given its video ID
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        # Convert the downloaded stream to m4a with FFmpeg
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }],
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
        if error_code != 0:
            raise Exception('Failed to download video')

    return f'audio/{video_id}.m4a'

# Call the function with the bare video ID (without a "&t=..." suffix),
# so that the returned path matches the file name yt-dlp writes
file_path = download('A_JQK_k4Kyc')
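
The download function builds its return path from the string it receives, while yt-dlp names the file after the video's actual ID. Passing anything other than a bare ID (a full URL, or an ID followed by parameters such as &t=99s) therefore produces a path that does not match the saved file. A small helper like the following normalizes the input first. It is an addition for illustration, not part of the original code, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(value: str) -> str:
    """Return the bare YouTube video ID from a watch URL, a youtu.be link,
    or an ID with trailing query parameters such as '&t=99s'."""
    parsed = urlparse(value)
    if parsed.netloc.endswith("youtu.be"):
        # Short link: the ID is the path component
        return parsed.path.lstrip("/")
    if parsed.netloc:
        # Standard watch URL: the ID is the 'v' query parameter
        return parse_qs(parsed.query)["v"][0]
    # Bare ID, possibly followed by '&...' parameters
    return value.split("&")[0]


video_id = extract_video_id("A_JQK_k4Kyc&t=99s")
```

The cleaned ID can then be passed to download, for example download(extract_video_id(some_url)).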

3 – Transcribe audio to text using Whisper

# Load the Whisper model ("tiny" is the smallest and fastest, but least accurate)
whisper_model = whisper.load_model("tiny")

# Transcribe an audio file and return the recognized text
def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True` (half precision, intended for GPU).
    # Setting it to `False` uses FP32, which is what CPU inference needs.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']

# Call the transcribe function with the path of the downloaded audio file
transcript = transcribe('/content/audio/A_JQK_k4Kyc.m4a')
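
Besides the full text, the dictionary returned by Whisper's transcribe method also contains a "segments" list whose items carry "start", "end", and "text" fields. The helper below is an illustrative addition that formats those segments as timestamped lines, which can be useful for locating summarized passages in the video. The sample dictionary is invented so the snippet runs without loading a model.

```python
def format_timestamp(seconds: float) -> str:
    """Convert a number of seconds to MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def timestamped_lines(transcription: dict) -> list:
    """Turn Whisper-style segments into '[MM:SS - MM:SS] text' lines."""
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] {seg['text'].strip()}"
        for seg in transcription.get("segments", [])
    ]


# Invented sample with the same shape as a Whisper transcription result
sample = {
    "text": " Hello everyone. Today we talk about summarization.",
    "segments": [
        {"start": 0.0, "end": 2.4, "text": " Hello everyone."},
        {"start": 2.4, "end": 65.0, "text": " Today we talk about summarization."},
    ],
}
lines = timestamped_lines(sample)
```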

4 – Summarize transcribed text using Hugging Chat

Note: to use the Hugging Chat API, you need to log in or sign up on the Hugging Face platform. Then replace “username” and “password” in the code with your Hugging Face credentials.

from hugchat.login import Login

# Log in to Hugging Face (replace the placeholders with your own credentials)
sign = Login("username", "password")
cookies = sign.login()

# Optional: reuse cookies saved in a directory by an earlier session instead of
# logging in again. This raises an Exception if no cookie JSON file exists there.
# cookies = sign.loadCookiesFromDir("/content")

# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict())  # or cookie_path="usercookies/<email>.json"

# Summarize the transcript
print(chatbot.chat('''Summarize the following :-''' + transcript))
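
Long transcripts may exceed what a chat model accepts in a single message. A common workaround is to summarize the transcript in chunks and then summarize the partial summaries. The sketch below is an illustration under assumptions: the 2,000-character limit is arbitrary, the function names are hypothetical, and summarize_fn is a placeholder for whatever function sends a prompt to the chatbot and returns the reply as a string.

```python
from typing import Callable, List


def split_text(text: str, max_chars: int = 2000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace.
    A single word longer than max_chars is kept whole in its own chunk."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text: str, summarize_fn: Callable[[str], str], max_chars: int = 2000) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    chunks = split_text(text, max_chars)
    if len(chunks) <= 1:
        return summarize_fn(text)
    partial = [summarize_fn(chunk) for chunk in chunks]
    return summarize_fn(" ".join(partial))


# Stand-in summarizer for demonstration only: keeps the first three words
def fake_summarizer(text: str) -> str:
    return " ".join(text.split()[:3])


result = summarize_long("one two three four five six seven eight", fake_summarizer, max_chars=20)
```

In the real pipeline, fake_summarizer would be replaced by a function that prepends the summarization instruction to the chunk and sends it to the chatbot.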


Conclusion

In conclusion, summarization is a transformative force in information management. It is a powerful tool that distills voluminous content into concise, meaningful forms, tailored to the fast-paced consumption of today’s world.

Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has moved beyond its traditional boundaries, evolving into a process that not only extracts key content but also generates coherent and contextually accurate summaries.

The journey into video summarization shows its relevance across diverse sectors. The implementation demonstrates how audio extraction, transcription using Whisper, and summarization through Hugging Chat can be integrated to create textual video summaries.

Key Takeaways

1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.

2. Applications in Fields: Video summarization is beneficial in many important fields where large amounts of video must be analyzed to extract crucial information.

3. Basic Implementation: In this article, we explored a basic code implementation of video summarization based on the audio track.

4. Model Architecture: We also learned about the basic architecture of the OpenAI Whisper model and its process flow.

Frequently Asked Questions

Q1. What are the limits of the Whisper API?

A. At the time of writing, the Whisper API is limited to 50 calls per minute. There is no limit on audio length, but a file can be at most 25 MB. You can reduce the file size by lowering the audio bitrate.
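
Since file size is roughly bitrate multiplied by duration, it is easy to estimate a suitable target bitrate. The function below is a back-of-the-envelope sketch with a hypothetical name: it ignores container overhead, and it treats the 25 MB figure from the answer above as 25 × 1024 × 1024 bytes, which is an assumption.

```python
def max_bitrate_kbps(duration_seconds: float, limit_mb: float = 25.0) -> float:
    """Highest average bitrate (kbit/s) that keeps the file under limit_mb."""
    limit_bits = limit_mb * 1024 * 1024 * 8
    return limit_bits / duration_seconds / 1000


# Estimate for a one-hour recording
one_hour = max_bitrate_kbps(3600)
```

Under these assumptions, an hour of audio would need to be encoded at roughly 58 kbit/s or lower to stay within the limit.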

Q2. Which file formats does the Whisper API support?

A. It supports the following file formats: m4a, mp3, webm, mp4, mpga, wav, and mpeg.
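
A quick pre-upload check against this list can save a failed request. The helper below is a minimal sketch with a hypothetical name; it only looks at the file extension and does not inspect the actual audio encoding.

```python
import os

# Extensions taken from the list in the answer above
SUPPORTED_EXTENSIONS = {"m4a", "mp3", "webm", "mp4", "mpga", "wav", "mpeg"}


def has_supported_extension(filename: str) -> bool:
    """Check the file extension against the supported list (case-insensitive)."""
    extension = os.path.splitext(filename)[1].lstrip(".").lower()
    return extension in SUPPORTED_EXTENSIONS


ok = has_supported_extension("audio/A_JQK_k4Kyc.m4a")
```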

Q3. What are the alternatives to the Whisper API?

A. Some of the major alternatives for automatic speech recognition are Twilio Voice, Deepgram, Azure Speech to Text, and Google Cloud Speech-to-Text.

Q4. What are the limitations of Automatic Speech Recognition (ASR) systems?

A. ASR systems have difficulty understanding diverse accents of the same language, and they need specialized training for applications in specialized fields.

Q5. What are the alternatives to Automatic Speech Recognition (ASR)?

A. Advanced research is taking place in the field of speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This could allow people with speech disabilities to communicate their intended speech to the outside world with the help of devices. One such interesting paper is here.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
