Step-by-Step Deep Learning Tutorial to Build Your Own Video Classification Model

PulkitS 25 Feb, 2024
14 min read

Overview

  • Learn how to use computer vision and deep learning techniques for video data.
  • We will build our video classification model in Python.
  • This is a hands-on tutorial for video classification, so get your Jupyter notebooks ready.

Introduction

I have written extensive articles and guides on how to build computer vision models using image data. Detecting objects in images, classifying those objects, generating labels from movie posters – there is so much we can do using computer vision and deep learning (subset of Machine Learning).

This time, I decided to turn my attention to the less-heralded aspect of computer vision – videos! We are consuming video content at an unprecedented pace. I feel this area of computer vision holds a lot of potential for data scientists.

I was curious about applying the same computer vision algorithms to video data. The approach I used for building image classification models – was it generalizable?

Videos can be tricky for machines to handle. Their dynamic nature, as opposed to an image’s static one, can make it complex for a data scientist to build those models.

But don’t worry, it’s not that different from working with image data. In this article, we will build our very own video classification model in Python. This is a very hands-on tutorial, so fire up your Jupyter notebooks – this is going to be a very fun ride.

What we’ll cover in this Video Classification Tutorial

  1. Overview of Video Classification
  2. Steps to build our own Video Classification model
  3. Exploring the Video Classification dataset
  4. Training our Video Classification Model
  5. Evaluating our Video Classification Model

Overview of Video Classification

When you really break it down – how would you define videos?

We can say that a video is a collection of images arranged in a specific order. These images are also referred to as frames.

That’s why a video classification problem is not that different from an image classification problem. We take images for an image classification task, use feature extractors (like convolutional neural networks or CNNs) to extract features from images, and then classify that image based on these extracted features. Video classification involves just one extra step.

We first extract frames from the given video. We can then follow the same steps for an image classification task. This is the simplest way to deal with video data.

There are multiple other ways to deal with videos, and there is even a niche field of video analytics. I highly recommend going through the article below to understand how to deal with videos and extract frames in Python:

Deep Learning Tutorial to Calculate the Screen Time of Actors in any Video (with Python codes)

Also, we will be using CNNs to extract features from the frames of videos. Given their effectiveness and status as a state-of-the-art model in computer vision, CNNs are the chosen architecture for feature extraction in our video classification task. If you need a quick refresher on what CNNs are and how they work, this is where you should begin:

Architecture of Convolutional Neural Networks (CNNs) demystified

A Comprehensive Tutorial to learn Convolutional Neural Networks from Scratch

In our video classification task, we will be working with videos of human actions. We will encode the extracted frames into tensors and assign class labels based on the features extracted from them. To enhance the model’s generalization capability, we will utilize transfer learning with a pre-trained architecture (VGG-16 in this tutorial; ResNet is another common choice).

Steps to build a Video Classification model

Excited to build a model that is able to classify videos into their respective categories? We will be working on the UCF101 – Action Recognition Data Set, which consists of 13,320 different video clips belonging to 101 distinct categories.

Let me summarize the steps that we will be following to build our video classification model:

  1. Explore the video dataset and create the training and validation sets. We will use the training set to train the model and the validation set to evaluate the trained model.
  2. Extract frames from all the videos in the training as well as the validation set.
  3. Preprocess these frames and then train a model using the frames in the training set. Evaluate the model using the frames present in the validation set.
  4. Once we are satisfied with the performance on the validation set, use the trained model to classify new videos.

Let’s now start exploring the data!

Exploring the Video Classification dataset

You can download the dataset from the official UCF101 site. The dataset is in a .rar format so we first extract the videos from it. Create a new folder, let’s say ‘Videos’ (you can pick any other name as well), and then use the following command to extract all the downloaded videos:

unrar e UCF101.rar Videos/

The official documentation of UCF101 states that:

“It is very important to keep the videos belonging to the same group separate in training and testing. Since the videos in a group are obtained from a single long video, sharing videos from the same group in training and testing sets would give high performance.”

So, we will split the dataset into the train and test sets as suggested in the official documentation. You can download the train/test split from here. Remember that you might require high computation power since we are dealing with a large dataset.
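The grouping rule quoted above can be checked mechanically. The sketch below is only an illustration: it assumes the usual UCF101 file naming of `v_<Class>_g<group>_c<clip>.avi`, and the helper names `group_key` and `shared_groups` are hypothetical, not part of any dataset tooling.

```python
import re

# Assumed UCF101 naming: v_<ClassName>_g<group>_c<clip>.avi
NAME_PATTERN = re.compile(r"v_(?P<cls>[^_]+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi")

def group_key(video_path):
    """Return (class, group) for a path such as 'ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi'."""
    match = NAME_PATTERN.search(video_path)
    if match is None:
        raise ValueError(f"unexpected file name: {video_path}")
    return match.group("cls"), match.group("group")

def shared_groups(train_videos, test_videos):
    """Groups that appear in both lists; an empty set means the split keeps groups apart."""
    train_groups = {group_key(v) for v in train_videos}
    test_groups = {group_key(v) for v in test_videos}
    return train_groups & test_groups

train_videos = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi",
                "ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c02.avi"]
test_videos = ["ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"]
print(shared_groups(train_videos, test_videos))   # set()
```

If the function returns a non-empty set for your own split, clips cut from the same source video sit on both sides of it.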

We now have the videos in one folder and the train/test splitting file in another folder. Next, we will create the dataset. Open your Jupyter notebook and follow the code block below. We will first import the required libraries:

import cv2     # for capturing videos
import math    # for mathematical operations
import matplotlib.pyplot as plt    # for plotting the images
%matplotlib inline
import pandas as pd
from keras.preprocessing import image    # for preprocessing the images
import numpy as np    # for mathematical operations
from skimage.transform import resize    # for resizing images
from sklearn.model_selection import train_test_split
from glob import glob
from tqdm import tqdm

We will now store the name of videos in a dataframe:

# open the .txt file which has the names of the training videos
with open("trainlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split('\n')

# creating a dataframe having video names
train = pd.DataFrame()
train['video_name'] = videos
train = train[:-1]    # drop the empty entry left by the trailing newline
print(train.head())

This is how the names of videos are given in the .txt file. It is not properly aligned and we will need to preprocess it. Before that, let’s create a similar dataframe for test videos as well:

# open the .txt file which has the names of the test videos
with open("testlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split('\n')

# creating a dataframe having video names
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]    # drop the empty entry left by the trailing newline
test.head()

Next, we will add the tag of each video (for both training and test sets). Did you notice that the entire part before the ‘/’ in the video name represents the video’s tag? Hence, we will split the entire string on ‘/’ and select the tag for all the videos:

# creating tags for training videos
train_video_tag = []
for i in range(train.shape[0]):
    train_video_tag.append(train['video_name'][i].split('/')[0])
train['tag'] = train_video_tag

# creating tags for test videos
test_video_tag = []
for i in range(test.shape[0]):
    test_video_tag.append(test['video_name'][i].split('/')[0])
test['tag'] = test_video_tag
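To make the string handling above easier to follow, here is a small sketch that parses a single line of the split files. It assumes that lines in trainlist01.txt look like `ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1` (path followed by a numeric class index), while lines in testlist01.txt carry only the path. `parse_split_line` is a hypothetical helper introduced for illustration.

```python
def parse_split_line(line):
    """Split one line of a UCF101 split file into (tag, file_name, class_index).

    class_index is None when the line has no trailing number (assumed test-list format).
    """
    parts = line.strip().split(' ')
    path = parts[0]
    class_index = int(parts[1]) if len(parts) > 1 else None
    tag, file_name = path.split('/')
    return tag, file_name, class_index

print(parse_split_line("ApplyEyeMakeup/v_ApplyEyeMakeup_g08_c01.avi 1"))
# ('ApplyEyeMakeup', 'v_ApplyEyeMakeup_g08_c01.avi', 1)
print(parse_split_line("ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi"))
# ('ApplyEyeMakeup', 'v_ApplyEyeMakeup_g01_c01.avi', None)
```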

So what’s next? Now, we will extract the frames from the training videos which will be used to train the model. I will be storing all the frames in a folder named train_1.

So, first of all, make a new folder and rename it to ‘train_1’ and then follow the code given below to extract frames:

# storing the frames from training videos
for i in tqdm(range(train.shape[0])):
    count = 0
    videoFile = train['video_name'][i]
    # capturing the video from the given path (this code expects the videos in a folder named 'UCF')
    cap = cv2.VideoCapture('UCF/' + videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(cv2.CAP_PROP_FPS)    # frame rate
    while cap.isOpened():
        frameId = cap.get(cv2.CAP_PROP_POS_FRAMES)    # current frame number
        ret, frame = cap.read()
        if not ret:
            break
        if frameId % math.floor(frameRate) == 0:
            # storing the frames in a new folder named train_1
            filename = 'train_1/' + videoFile.split('/')[1].split(' ')[0] + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()
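The condition `frameId % math.floor(frameRate) == 0` keeps roughly one frame per second of video. The following sketch reproduces just that selection rule without OpenCV, so you can see which frame numbers survive. The clip length and frame rate are made-up values, and `sampled_frame_ids` is a hypothetical helper.

```python
import math

def sampled_frame_ids(total_frames, frame_rate):
    """Frame numbers kept by the rule `frame_id % floor(frame_rate) == 0`."""
    step = math.floor(frame_rate)
    if step < 1:
        raise ValueError("frame rate must be at least 1 frame per second")
    return [frame_id for frame_id in range(total_frames) if frame_id % step == 0]

# Hypothetical clip: 164 frames recorded at 25 frames per second
kept = sampled_frame_ids(total_frames=164, frame_rate=25.0)
print(kept)        # [0, 25, 50, 75, 100, 125, 150]
print(len(kept))   # 7
```

A clip of about six and a half seconds therefore contributes seven frames to the training folder.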

This will take some time as there are more than 9,500 videos in the training set. Once the frames are extracted, we will save the names of these frames with their corresponding tag in a .csv file. Creating this file will help us read the frames, as we will see in the next section.

# getting the names of all the images
images = glob("train_1/*.jpg")
train_image = []
train_class = []
for i in tqdm(range(len(images))):
    # creating the image name
    train_image.append(images[i].split('/')[1])
    # creating the class of image
    train_class.append(images[i].split('/')[1].split('_')[1])

# storing the images and their class in a dataframe
train_data = pd.DataFrame()
train_data['image'] = train_image
train_data['class'] = train_class

# converting the dataframe into csv file
train_data.to_csv('UCF/train_new.csv', header=True, index=False)
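The class of a frame is recovered purely from its file name, so it helps to see that step in isolation. The sketch below assumes frame names of the form `v_<Class>_g<group>_c<clip>.avi_frame<N>.jpg`, as produced by the extraction loop. It also normalizes backslashes, since splitting on '/' alone would not work for paths that use Windows separators. `frame_name_and_class` is a hypothetical helper.

```python
import os

def frame_name_and_class(frame_path):
    """Return (file_name, class_tag) for a path like 'train_1/v_ApplyEyeMakeup_g08_c01.avi_frame0.jpg'."""
    file_name = os.path.basename(frame_path.replace('\\', '/'))
    class_tag = file_name.split('_')[1]
    return file_name, class_tag

print(frame_name_and_class("train_1/v_ApplyEyeMakeup_g08_c01.avi_frame0.jpg"))
# ('v_ApplyEyeMakeup_g08_c01.avi_frame0.jpg', 'ApplyEyeMakeup')
print(frame_name_and_class("train_1\\v_YoYo_g25_c05.avi_frame3.jpg"))
# ('v_YoYo_g25_c05.avi_frame3.jpg', 'YoYo')
```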

So far, we have extracted frames from all the training videos and saved their names in a .csv file along with their corresponding tags. It’s time to train our model, which we will use to predict the video tags in the test set.

Training the Video Classification Model

It’s finally time to train our video classification model! I’m sure this is the most anticipated section of the tutorial. I have divided this step into sub-steps for ease of understanding:

  1. Read all the frames that we extracted earlier for the training images
  2. Create a validation set that will help us examine how well our model will perform on unseen data
  3. Define the architecture of our model
  4. Finally, train the model and save its weights

Reading all the video frames

So, let’s get started with the first step, where we will read the frames that we extracted earlier. We will import the libraries first:

import keras
from keras.models import Sequential
from keras.applications.vgg16 import VGG16
from keras.layers import Dense, InputLayer, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
from sklearn.model_selection import train_test_split

Remember, we created a .csv file that contains the names of each frame and their corresponding tag? Let’s read it as well:

train = pd.read_csv('UCF/train_new.csv')

train.head()

This is what the first five rows look like. We have the corresponding class or tag for each frame. Now, using this .csv file, we will read the frames that we extracted earlier and then store those frames as a NumPy array:

# creating an empty list
train_image = []

# for loop to read and store frames
for i in tqdm(range(train.shape[0])):
    # loading the image and resizing it to 224 x 224 pixels
    img = image.load_img('train_1/' + train['image'][i], target_size=(224, 224))
    # converting it to array
    img = image.img_to_array(img)
    # normalizing the pixel value
    img = img / 255
    # appending the image to the train_image list
    train_image.append(img)

# converting the list to numpy array
X = np.array(train_image)

# shape of the array
X.shape

Output: (73844, 224, 224, 3)

We have 73,844 images each of size (224, 224, 3). Next, we will create the validation set.
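A quick back-of-the-envelope calculation shows why this step needs a lot of memory. The sketch assumes the array holds 32-bit floats (4 bytes per value), which is the usual default for Keras image arrays; if your setup stores 64-bit floats, double the result.

```python
# Shape reported above: 73,844 frames of 224 x 224 pixels with 3 colour channels
num_frames, height, width, channels = 73844, 224, 224, 3
bytes_per_value = 4   # assumption: float32 values

num_values = num_frames * height * width * channels
size_gib = num_values * bytes_per_value / 1024**3

print(num_values)               # 11115589632
print(f"{size_gib:.1f} GiB")    # 41.4 GiB
```

If that does not fit on your machine, working with a subset of the classes or loading the frames in batches are possible ways around it.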

Creating a validation set

To create the validation set, we need to ensure that each class’s distribution is similar in both training and validation sets. We can use the stratify parameter to do that:

# separating the target
y = train['class']

# creating the training and validation set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)

Here, stratify = y (which is the class or tags of each frame) keeps a similar distribution of classes in both the training and the validation set.
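To illustrate what stratification does, here is a pure-Python sketch of the idea: split each class separately so that both sets keep the same class proportions. This is not scikit-learn's implementation, only a minimal illustration with toy labels; `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) so that every class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels: 80 frames of one class, 20 of another
labels = ["Archery"] * 80 + ["YoYo"] * 20
train_idx, val_idx = stratified_split(labels)
print(Counter(labels[i] for i in train_idx))   # Counter({'Archery': 64, 'YoYo': 16})
print(Counter(labels[i] for i in val_idx))     # Counter({'Archery': 16, 'YoYo': 4})
```

Both sets keep the original 80/20 ratio between the two classes.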

Remember – there are 101 categories in which a video can be classified. So, we will have to create 101 different columns in the target, one for each category. We will use the get_dummies() function for that:

# creating dummies of target variable for train and validation set

y_train = pd.get_dummies(y_train)
y_test = pd.get_dummies(y_test)
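The encoding creates one column per class, with a 1 in the column of the frame's own class. The sketch below shows the same idea in plain Python with a class list fixed up front, so that training and validation targets are guaranteed to share the same column order. `one_hot` is a hypothetical helper, and the three class names are just a toy subset.

```python
def one_hot(labels, classes):
    """Encode each label as a list with a single 1 at the position of its class."""
    position = {name: idx for idx, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[position[label]] = 1
        rows.append(row)
    return rows

classes = sorted({"YoYo", "Archery", "Biking"})   # fixed, alphabetical column order
print(classes)                                   # ['Archery', 'Biking', 'YoYo']
print(one_hot(["Biking", "YoYo"], classes))      # [[0, 1, 0], [0, 0, 1]]
```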

Next step – define the architecture of our video classification model.

Defining the architecture of the video classification model

Since we do not have a very large dataset, creating a model from scratch might not work well. So, we will use a pre-trained model and take its learnings to solve our problem.

For this particular dataset, we will be using the VGG-16 pre-trained model. Let’s create a base model of the pre-trained model:

# creating the base model of pre-trained VGG16 model

base_model = VGG16(weights='imagenet', include_top=False)

This model was trained on the ImageNet dataset, which has 1,000 classes. We will adapt this model to our requirements. include_top = False removes the fully connected classification layers at the top of the network, so that we can attach our own classifier.

Now, we will extract features from this pre-trained model for our training and validation images:

# extracting features for training frames

X_train = base_model.predict(X_train)
X_train.shape

Output: (59075, 7, 7, 512)

We have 59,075 images in the training set and the shape has been changed to (7, 7, 512) since we have passed these images through the VGG16 architecture. Similarly, we will extract features for validation frames:

# extracting features for validation frames

X_test = base_model.predict(X_test)
X_test.shape

Output: (14769, 7, 7, 512)

There are 14,769 images in the validation set, and the shape of these images has also changed to (7, 7, 512). We will use a fully connected network now to fine-tune the model. This fully connected network takes input in a single dimension. So, we will reshape the images into a single dimension:

# reshaping the training as well as validation frames in single dimension

X_train = X_train.reshape(59075, 7*7*512)
X_test = X_test.reshape(14769, 7*7*512)
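The numbers 7, 512 and 25,088 follow directly from the VGG16 architecture. As a worked example: each of its five max-pooling stages halves the height and width, and the last convolutional block outputs 512 channels.

```python
input_size = 224          # height and width of each frame
num_pooling_layers = 5    # VGG16 has five 2x2 max-pooling stages
channels_out = 512        # channels produced by the last convolutional block

spatial = input_size // (2 ** num_pooling_layers)
flattened = spatial * spatial * channels_out

print(spatial)     # 7
print(flattened)   # 25088
```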

It is always advisable to normalize the input values, i.e., keep them between 0 and 1. This helps the model to converge faster.

# normalizing the feature values
max_value = X_train.max()
X_train = X_train / max_value
X_test = X_test / max_value

Next, we will create the architecture of the model. We have to define the input shape for that. So, let’s check the shape of our images:

# shape of images

X_train.shape

Output: (59075, 25088)

The input shape will be 25,088. Let’s now create the architecture:

# defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))

We have multiple fully connected dense layers. I have added dropout layers as well so that the model will not overfit. The number of neurons in the final layer is equal to the number of classes that we have and hence the number of neurons here is 101.
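To get a feel for the size of this classifier, we can count its trainable parameters by hand. Each Dense layer has one weight per input-output pair plus one bias per output unit; Dropout layers add no parameters.

```python
layer_sizes = [25088, 1024, 512, 256, 128, 101]   # input followed by the five Dense layers

total = 0
for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
    params = n_in * n_out + n_out    # weights plus one bias per output unit
    total += params
    print(f"Dense {n_in:>5} -> {n_out:>4}: {params:,} parameters")

print(f"Total: {total:,}")   # Total: 26,393,189
```

Almost all of the parameters sit in the first layer, which connects the 25,088 input features to 1,024 units.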

Training the video classification model

We will now train our model using the training frames and validate the model using validation frames. We will save the weights of the model so that we will not have to retrain the model again and again.

So, let’s define a callback to save the weights of the model:

# defining a callback to save the weights of the best model
from keras.callbacks import ModelCheckpoint
mcp_save = ModelCheckpoint('weights.hdf5', save_best_only=True, monitor='val_loss', mode='min')

We will decide the optimum model based on the validation loss. Note that the weights will be saved as weights.hdf5. You can rename the file if you wish. Before training the model, we have to compile it:

# compiling the model

model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])

We are using the categorical_crossentropy as the loss function, and the optimizer is Adam. Let’s train the model:

# training the model
model.fit(X_train, y_train, epochs=200, validation_data=(X_test, y_test), callbacks=[mcp_save], batch_size=128)
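With save_best_only=True and mode='min', the weights file is only overwritten when the validation loss improves on the best value seen so far. The sketch below mimics that rule in plain Python. The loss values are made up, and `epochs_that_save` is a hypothetical helper rather than part of Keras.

```python
def epochs_that_save(val_losses):
    """Return the 1-based epochs at which a 'save only the best' checkpoint would write a file."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:      # mode='min': lower validation loss is better
            best = loss
            saved.append(epoch)
    return saved

# Made-up validation losses for six epochs
print(epochs_that_save([2.9, 2.1, 2.3, 1.8, 1.9, 1.8]))   # [1, 2, 4]
```

So even after 200 epochs, the saved file corresponds to the epoch with the lowest validation loss, not necessarily the last one.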

I have trained the model for 200 epochs. You can use this link to download the weights I got after training the model.

We now have the weights we will use to make predictions for the new videos. So, in the next section, we will see how well this model performs the task of video classification!

Evaluating our Video Classification Model

Let’s open a new Jupyter Notebook to evaluate the model. The evaluation part can also be split into multiple steps to understand the process more clearly:

  1. Define the model architecture and load the weights
  2. Create the test data
  3. Make predictions for the test videos
  4. Finally, evaluate the model

Defining model architecture and loading weights

You’ll be familiar with the first step – importing the required libraries:

from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.preprocessing import image
import numpy as np
import pandas as pd
from tqdm import tqdm
from keras.applications.vgg16 import VGG16
import cv2
import math
import os
from glob import glob
from scipy import stats as s

Next, we will define the model architecture which will be similar to what we had while training the model:

base_model = VGG16(weights='imagenet', include_top=False)

This is the pre-trained model and we will fine-tune it next:

# defining the model architecture
model = Sequential()
model.add(Dense(1024, activation='relu', input_shape=(25088,)))
model.add(Dropout(0.5))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(101, activation='softmax'))

Now that we have defined the architecture, we will load the trained weights, which we stored as weights.hdf5:

# loading the trained weights

model.load_weights("weights.hdf5")

Compile the model as well:

# compiling the model

model.compile(loss='categorical_crossentropy',optimizer='Adam',metrics=['accuracy'])

Make sure that the loss function, optimizer, and the metrics are the same as we used while training the model.

If you’re new to the world of deep learning and computer vision, we have the perfect course for you to begin your journey:

Computer Vision using Deep Learning

Creating the test data

You should have downloaded the train/test split files as per the official documentation of the UCF101 dataset. If not, download it from here. In the downloaded folder, there is a file named “testlist01.txt” which contains the list of test videos. We will make use of that to create the test data:

# getting the test list
with open("testlist01.txt", "r") as f:
    temp = f.read()
videos = temp.split('\n')

# creating the dataframe
test = pd.DataFrame()
test['video_name'] = videos
test = test[:-1]
test_videos = test['video_name']
test.head()

We now have the list of all the videos stored in a dataframe. To map the predicted categories with the actual categories, we will use the train_new.csv file:

# creating the tags
train = pd.read_csv('UCF/train_new.csv')
y = train['class']
y = pd.get_dummies(y)

Now, we will make predictions for the videos in the test set.

Generating predictions for test videos

Let me summarize what we will be doing in this step before looking at the code. The below steps will help you understand the prediction part:

  1. First, we will create two empty lists – one to store the predictions and the other to store the actual tags
  2. Then, we will take each video from the test set, extract frames for this video and store it in a folder (create a folder named temp in the current directory to store the frames). We will remove all the other files from this folder at each iteration
  3. Next, we will read all the frames from the temp folder, extract features for these frames using the pre-trained model, predict tags, and then take the mode to assign a tag for that particular video and append it in the list
  4. We will append actual tags for each video in the second list

Let’s code these steps and generate predictions:

# creating two lists to store predicted and actual tags
predict = []
actual = []

# for loop to extract frames from each test video
for i in tqdm(range(test_videos.shape[0])):
    count = 0
    videoFile = test_videos[i]
    # capturing the video from the given path
    cap = cv2.VideoCapture('UCF/' + videoFile.split(' ')[0].split('/')[1])
    frameRate = cap.get(cv2.CAP_PROP_FPS)    # frame rate
    # removing all other files from the temp folder
    files = glob('temp/*')
    for f in files:
        os.remove(f)
    while cap.isOpened():
        frameId = cap.get(cv2.CAP_PROP_POS_FRAMES)    # current frame number
        ret, frame = cap.read()
        if not ret:
            break
        if frameId % math.floor(frameRate) == 0:
            # storing the frames of this particular video in temp folder
            filename = 'temp/' + "_frame%d.jpg" % count
            count += 1
            cv2.imwrite(filename, frame)
    cap.release()

    # reading all the frames from temp folder
    images = glob("temp/*.jpg")
    prediction_images = []
    for j in range(len(images)):
        img = image.load_img(images[j], target_size=(224, 224))
        img = image.img_to_array(img)
        img = img / 255
        prediction_images.append(img)

    # converting all the frames for a test video into numpy array
    prediction_images = np.array(prediction_images)
    # extracting features using pre-trained model
    prediction_images = base_model.predict(prediction_images)
    # converting features in one dimensional array
    prediction_images = prediction_images.reshape(prediction_images.shape[0], 7*7*512)
    # predicting tags for each array (argmax over the softmax output gives the class index)
    prediction = np.argmax(model.predict(prediction_images), axis=1)
    # appending the mode of predictions in predict list to assign the tag to the video
    predict.append(y.columns.values[np.ravel(s.mode(prediction)[0])[0]])
    # appending the actual tag of the video
    actual.append(videoFile.split('/')[1].split('_')[1])
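The key idea in the loop above is the majority vote: every frame gets its own predicted class, and the video receives the class that occurs most often. Here is that step in isolation, as a plain-Python sketch. The frame predictions are made up, `majority_vote` is a hypothetical helper, and ties are resolved in favour of the smallest class index (which, as far as I know, matches how SciPy's mode behaves).

```python
from collections import Counter

def majority_vote(frame_predictions):
    """Return the most frequent class index; ties go to the smallest index."""
    counts = Counter(frame_predictions)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

# Hypothetical per-frame class indices for one video
print(majority_vote([12, 12, 47, 12, 3]))   # 12
print(majority_vote([5, 9, 9, 5]))          # 5
```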

This step will take some time as there are around 3,800 videos in the test set. Once we have the predictions, we will calculate the performance of the model. 

Evaluating the model

Time to evaluate our model and see what all the fuss was about.

We have the actual tags as well as the tags predicted by our model. We will make use of these to get the accuracy score. On the official documentation page of UCF101, the reported accuracy is 43.90%. Can our model beat that? Let’s check!

# checking the accuracy of the predicted tags
from sklearn.metrics import accuracy_score
accuracy_score(actual, predict)*100

Output: 44.80570975416337

Great! Our model’s accuracy of 44.8% is comparable to what the official documentation states (43.9%).

You might be wondering why we are satisfied with an accuracy below 50%. Well, the main reason behind this low accuracy is the lack of data. We only have around 13,000 videos, and even those are of very short duration.
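A single accuracy number hides which actions the model struggles with. As a possible next step, the sketch below computes the accuracy separately for each class from two lists of tags, in the same style as the actual and predicted lists built earlier. The tags here are toy values, and `per_class_accuracy` is a hypothetical helper.

```python
from collections import defaultdict

def per_class_accuracy(actual, predicted):
    """Return {class: percentage of videos of that class that were predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(actual, predicted):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {name: 100 * correct[name] / total[name] for name in total}

# Toy example with made-up tags
actual = ["Archery", "Archery", "YoYo", "YoYo", "YoYo", "Biking"]
predicted = ["Archery", "Biking", "YoYo", "YoYo", "Archery", "Biking"]
print(per_class_accuracy(actual, predicted))
```

Classes with a low score are good candidates for a closer look, for example by inspecting which frames were extracted for them.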

Conclusion

In this article, we covered one of the most interesting applications of computer vision – video classification. We first understood how to deal with videos, then we extracted frames, trained a video classification model, and finally got a comparable accuracy of 44.8% on the test videos.

We can now try different approaches and aim to improve the performance of the model. One approach I can think of is to use 3D convolutions (3D CNNs), which can deal with videos directly.

Since videos are a sequence of frames, we can treat this as a sequence problem as well. So, there are several more possible solutions, and I suggest you explore them. Feel free to share your findings with the community.

As always, if you have any suggestions or doubts related to this article, post them in the comments section below and I will be happy to answer them. And as I mentioned earlier, do check out the computer vision course if you’re new to this field.

Frequently Asked Questions

Q1. Can LSTM be used for video classification?

A. Yes, Long Short-Term Memory (LSTM) networks are suitable for video classification, especially when capturing long-term dependencies and temporal sequences in videos is essential.

Q2. What is video classification in deep learning?

A. Video classification involves categorizing videos into predefined classes or labels. Deep learning models analyze temporal patterns to recognize actions, events, or objects within the video.

Q3. What are the challenges of video classification using deep learning techniques?

A. Video classification with deep learning poses challenges in capturing temporal dependencies, managing computational complexity, and handling intricate data annotation.

Q4. What are the best practices for building and training neural networks using TensorFlow or Keras?

A. Best practices for building and training neural networks with TensorFlow or Keras include preprocessing the data (normalization and augmentation), choosing an appropriate model architecture, incorporating regularization techniques (dropout, batch normalization), experimenting with optimizers and learning rates, and monitoring and tuning hyperparameters regularly.

Q5. What is the difference between RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks) for video classification?

A. RNNs capture temporal dependencies for video classification, while CNNs extract spatial features from individual frames. Models like 3D CNNs capture both spatial and temporal aspects for comprehensive analysis.

Q6. What are some of the most innovative ways neural networks are transforming computer vision?

A. Neural networks in computer vision innovate through efficient object detection (e.g., YOLO, Faster R-CNN), detailed semantic segmentation, and the generation of realistic images using models like GANs and VAEs.
