Computer vision applications are ubiquitous right now. I honestly can’t remember the last time I went through an entire day without encountering or interacting with at least one computer vision use case (hello facial recognition on my phone!).
But here’s the thing – people who want to learn computer vision tend to get stuck in the theoretical concepts. And that’s the worst path you can take! To truly learn and master computer vision, we need to combine theory with practical experience.
And that’s where open source computer vision projects come in. You don’t need to spend a dime to practice your computer vision skills – you can do it sitting right where you are right now!
So in this article, I have compiled a list of open-source computer vision projects, grouped by the application of computer vision they cover. There’s a LOT to go through and this is quite a comprehensive list so let’s dig in!
If you are completely new to computer vision and deep learning and prefer learning in video form, check this out:
Image classification is a fundamental task in computer vision. Here, the goal is to classify an image by assigning a specific label to it. It’s easy for us humans to comprehend and classify the images we see. But the case is very different for a machine. Telling a car apart from an elephant is a hard task for a machine.
Here are two of the most prominent open-source projects for image classification:
The CIFAR-10 dataset is a collection of images commonly used to train machine learning and computer vision algorithms. It is one of the most popular datasets for machine learning research. It contains 60,000 colour images of 32×32 pixels in 10 different classes. The classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
The ImageNet dataset is a large visual database for use in computer vision research. More than 14 million images have been hand-annotated by the project to indicate what objects are pictured and in at least one million of the images, bounding boxes are also provided. ImageNet contains more than 20,000 categories!
As a beginner, you can start by building a neural network from scratch using Keras or PyTorch. For better results, and to take your learning a step further, I advise using transfer learning with pre-trained models like VGG-16, ResNet-50, GoogLeNet, etc.
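To make the transfer learning idea concrete, here is a minimal NumPy sketch: a frozen feature extractor is reused as-is and only a new classification head is trained. Everything here is an assumption for illustration. The "backbone" is a fixed random projection standing in for a real pre-trained network such as VGG-16, and the images are random arrays with CIFAR-10's shape, not the real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained backbone (an assumption for illustration):
# a fixed random projection that is never updated during training.
W_frozen = rng.normal(size=(32 * 32 * 3, 64)) / np.sqrt(32 * 32 * 3)


def extract_features(images):
    """Flatten the images and pass them through the frozen layer plus ReLU."""
    flat = images.reshape(len(images), -1)
    return np.maximum(flat @ W_frozen, 0.0)


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def train_head(features, labels, num_classes=10, lr=0.1, epochs=200):
    """Train only a new linear softmax head with plain gradient descent."""
    W = np.zeros((features.shape[1], num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]
    losses = []
    for _ in range(epochs):
        probs = softmax(features @ W + b)
        true_class_probs = probs[np.arange(len(labels)), labels]
        losses.append(float(-np.mean(np.log(true_class_probs + 1e-12))))
        grad = (probs - one_hot) / len(labels)
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b, losses


# Synthetic data with CIFAR-10's shape: 32x32 colour images, 10 classes.
images = rng.random((200, 32, 32, 3))
labels = rng.integers(0, 10, size=200)

features = extract_features(images)
W_head, b_head, losses = train_head(features, labels)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With a real pre-trained model the structure stays the same: the backbone's weights are kept fixed and only the small head on top is fitted to your own labels.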
I recommend going through the below article to know more about image classification:
I’d also suggest going through the below papers for a better understanding of image classification:
Face recognition is one of the prominent applications of computer vision. It’s used for security, surveillance, and unlocking your devices. It is the task of identifying the faces in an image or video against a pre-existing database. We can use deep learning methods to learn the features of the faces and recognize them.
It is a multi-stage process, consisting of the following steps:
The following open-source datasets will give you good exposure to face recognition:
MegaFace is a large-scale public face recognition training dataset that serves as one of the most important benchmarks for commercial face recognition problems. It includes 4,753,320 faces of 672,057 identities.
Labeled Faces in the Wild (LFW) is a database of face photographs designed for studying the problem of unconstrained face recognition. It has 13,233 images of 5,749 people that were detected and collected from the web. Also, 1,680 of the people pictured have two or more distinct photos in the dataset.
In addition, to take the project to an advanced stage, you can use pre-trained models like FaceNet.
FaceNet is a deep learning model that provides unified embeddings for face recognition, verification, and clustering tasks. The network maps each face image to a point in Euclidean space such that the distance between images of the same face is small and the distance between images of different faces is large.
You can easily use pre-trained Facenet models available in Keras and PyTorch to make your own face recognition system.
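Once a model gives you embeddings, recognition comes down to comparing distances. Here is a minimal sketch of that step. The 4-dimensional vectors, the names in the database, and the threshold of 1.0 are all made up for illustration; real embeddings come from a pre-trained network and the threshold has to be tuned on validation data.

```python
import numpy as np


def l2_normalize(vector):
    return vector / np.linalg.norm(vector)


def identify(query, database, threshold=1.0):
    """Return (name, distance) of the closest enrolled face.

    The name is None when even the closest face is farther than `threshold`.
    """
    best_name, best_dist = None, float("inf")
    for name, embedding in database.items():
        dist = float(np.linalg.norm(l2_normalize(query) - l2_normalize(embedding)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > threshold:
        return None, best_dist
    return best_name, best_dist


# Hypothetical embeddings of enrolled people (illustrative values only).
database = {
    "alice": np.array([0.9, 0.1, 0.0, 0.1]),
    "bob": np.array([0.0, 0.2, 0.9, 0.1]),
}

name, dist = identify(np.array([0.8, 0.2, 0.1, 0.1]), database)
unknown, _ = identify(np.array([-1.0, 0.0, 0.0, 0.0]), database)
print(name, round(dist, 3), unknown)
```

The same distance check covers verification (is this the claimed person?) by comparing against a single enrolled embedding instead of the whole database.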
There are more state-of-the-art face recognition models you can experiment with. DeepFace is a deep CNN-based network developed by Facebook researchers. It was a major milestone in the use of deep learning for face recognition.
To better understand the development in face recognition technology in the last 30 years, I’d encourage you to read an interesting paper titled:
Neural style transfer is a computer vision technique that recreates the content of one image in the style of another image. The classic approach uses a pre-trained convolutional neural network (CNN) as a feature extractor; variants based on Generative Adversarial Networks (GANs) also exist. Here, we take two images – a content image and a style reference image – and blend them together such that the output image looks like the content image painted in the style of the reference image.
This is implemented by optimizing the output image so that its content statistics match those of the content image and its style statistics match those of the style reference image.
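To show what "content statistics" and "style statistics" can mean in practice, here is a small NumPy sketch of one common formulation: content is compared directly on feature maps, and style is compared through Gram matrices (channel-to-channel correlations) of those feature maps. The feature maps below are random arrays standing in for CNN activations, and the loss weights are arbitrary assumptions.

```python
import numpy as np


def gram_matrix(features):
    """Turn a (height, width, channels) feature map into a (channels, channels)
    matrix of channel correlations, averaged over all spatial positions."""
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    return flat.T @ flat / (h * w)


def content_loss(output_feat, content_feat):
    return float(np.mean((output_feat - content_feat) ** 2))


def style_loss(output_feat, style_feat):
    diff = gram_matrix(output_feat) - gram_matrix(style_feat)
    return float(np.mean(diff ** 2))


def total_loss(output_feat, content_feat, style_feat,
               content_weight=1.0, style_weight=100.0):
    # The weights are illustrative; they control the content/style trade-off.
    return (content_weight * content_loss(output_feat, content_feat)
            + style_weight * style_loss(output_feat, style_feat))


rng = np.random.default_rng(0)
content_feat = rng.random((8, 8, 4))  # stand-in for CNN activations
style_feat = rng.random((8, 8, 4))

# Starting the output at the content image gives zero content loss,
# so the whole loss comes from the style term.
loss_at_start = total_loss(content_feat, content_feat, style_feat)
print(loss_at_start)
```

In a full implementation this loss is minimized with respect to the pixels of the output image, using gradients from a deep learning framework.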
Here is the list of some awesome datasets to practice:
COCO is a large-scale object detection, segmentation, and captioning dataset. The images in the dataset are everyday objects captured from everyday scenes. Further, it provides multi-object labeling, segmentation mask annotations, image captioning, and key-point detection across 80 object categories, making it a very versatile and multi-purpose dataset.
In case you are wondering how to implement the style transfer model, here is a TensorFlow tutorial that can help you out. Also, I will suggest you read the following papers if you want to dig deeper into the technology:
Detecting text in any given scene is another very interesting problem. Scene text is the text that appears on the images captured by a camera in an outdoor environment. For example, number plates of cars on roads, billboards on the roadside, etc.
The text in scene images varies in shape, font, color, and position. Non-uniform illumination and focus make scene text even harder to recognize.
The following popular datasets will help you build your scene text detection skills:
The Street View House Numbers (SVHN) dataset is one of the most popular open source datasets out there. It has been used in neural networks created by Google to read house numbers and match them to their geolocations. This is a great benchmark dataset to play with, learn and train models that accurately identify street numbers. This dataset contains over 600k labeled real-world images of house numbers taken from Google Street View.
The scene text dataset comprises 3,000 images captured in different environments, including outdoor and indoor scenes under different lighting conditions. Images were captured with either a high-resolution digital camera or a low-resolution mobile phone camera. Moreover, all images have been resized to 640×480.
Further, reading scene text is a two-step process: text detection, which locates the text in the image, followed by text recognition, which reads it. For text detection, I found a state-of-the-art deep learning method called EAST (Efficient Accurate Scene Text Detector). It can find horizontal and rotated bounding boxes. You can use it in combination with any text recognition method.
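The two steps can be wired together as in the sketch below. The detector and recognizer are hypothetical stubs that only make the example runnable; in a real project a model such as EAST would take the detector's place and an OCR model the recognizer's. The sketch also assumes axis-aligned boxes, so rotated boxes would need to be rectified before cropping.

```python
import numpy as np


def read_scene_text(image, detect_text, recognize_text):
    """Step 1: find text boxes. Step 2: read the text inside each box."""
    results = []
    for (x_min, y_min, x_max, y_max) in detect_text(image):
        crop = image[y_min:y_max, x_min:x_max]
        results.append(((x_min, y_min, x_max, y_max), recognize_text(crop)))
    return results


# --- Hypothetical stand-ins, not real models ---
def fake_detector(image):
    # Pretend two text regions were found: (x_min, y_min, x_max, y_max).
    return [(10, 20, 110, 60), (200, 300, 320, 340)]


def fake_recognizer(crop):
    # Report the crop size instead of actually reading any text.
    height, width = crop.shape[:2]
    return f"<text {width}x{height}>"


image = np.zeros((480, 640, 3), dtype=np.uint8)  # a blank 640x480 image
results = read_scene_text(image, fake_detector, fake_recognizer)
for box, text in results:
    print(box, text)
```

Keeping the two steps behind separate functions is what lets you swap in any recognition method without touching the detector.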
Here are some other interesting papers on scene text detection:
Object detection is the task of locating each object of interest in an image with a bounding box and assigning it the proper label.
A few months back, Facebook open-sourced its object detection framework, DEtection TRansformer (DETR). DETR is an efficient and innovative solution to object detection problems. It streamlines the training pipeline by viewing object detection as a direct set prediction problem. Further, it adopts an encoder-decoder architecture based on transformers.
To know more about DETR, here is the paper and Colab notebook.
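A standard way to judge whether a predicted box matches a ground-truth box is Intersection over Union (IoU): the area the two boxes share divided by the area they cover together. Here is a small sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the coordinates and the label are made up.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


prediction = {"label": "car", "box": (50, 50, 150, 150)}
ground_truth = {"label": "car", "box": (100, 100, 200, 200)}

score = iou(prediction["box"], ground_truth["box"])
print(f"IoU = {score:.3f}")
```

A detection is usually counted as correct only if the label matches and the IoU is above a chosen cut-off; 0.5 is a common choice.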
Diversify your portfolio by working on the following open-sourced datasets for object detection:
Open Images is a dataset of ~9M images annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. The dataset is split into a training set (9,011,219 images), a validation set (41,620 images), and a test set (125,436 images).
MS-COCO is a large scale dataset popularly used for object detection problems. It consists of 330K images with 80 object categories having 5 captions per image and 250,000 people with key points.
You can read the following resources to learn more about Object Detection:
When we talk about complete scene understanding in computer vision, semantic segmentation comes into the picture. It is the task of assigning every pixel in an image to the class of the object it belongs to.
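In code, "classifying every pixel" usually means the model outputs one score per class for each pixel, and the predicted label map is the highest-scoring class at each position. Here is a tiny NumPy sketch with a made-up 2×2 image and 3 classes; the scores are illustrative values, not the output of a real model.

```python
import numpy as np


def segment(class_scores):
    """Turn (height, width, num_classes) scores into a (height, width) label map."""
    return np.argmax(class_scores, axis=-1)


def pixel_accuracy(predicted, target):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float(np.mean(predicted == target))


# Scores for a 2x2 image and 3 classes (say 0 = road, 1 = car, 2 = sky).
scores = np.array([
    [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
])
target = np.array([[0, 1],
                   [2, 2]])

label_map = segment(scores)
accuracy = pixel_accuracy(label_map, target)
print(label_map)
print(accuracy)
```

Pixel accuracy is the simplest metric; on datasets where a few classes dominate the image, per-class measures such as mean IoU are more informative.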
Below is the list of open-source datasets to practice this topic:
This database is one of the first semantically segmented datasets to be released. It is often used in (real-time) semantic segmentation research. The dataset contains:
This dataset is a processed subsample of the original Cityscapes dataset. It has still images taken from the original videos, and the semantic segmentation labels are shown in images alongside the original image. This is one of the best datasets around for semantic segmentation tasks. It has 2,975 training image files and 500 validation image files, each of 256×512 pixels.
To read further about semantic segmentation, I will recommend the following article:
Here are some papers available with code for semantic segmentation:
An autonomous car is a vehicle capable of sensing its environment and operating without human involvement. Such cars create and maintain a map of their surroundings using a variety of sensors fitted to different parts of the vehicle.
These vehicles have radar sensors that monitor the position of nearby vehicles. Video cameras detect traffic lights, read road signs, and track other vehicles, while Lidar (light detection and ranging) sensors bounce pulses of light off the car’s surroundings to measure distances, detect road edges, and identify lane markings.
Lane detection is an important part of these vehicles. In road transport, a lane is part of a carriageway that is designated to be used by a single line of vehicles to control and guide drivers and reduce traffic conflicts.
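There are many ways to detect lanes. As one simple classical baseline (my own illustration, not a method tied to any dataset here), suppose the lane-marking pixels have already been isolated in a binary mask, for example by edge detection or a segmentation model. A straight lane boundary can then be estimated with a least-squares line fit. The mask below is synthetic.

```python
import numpy as np


def fit_lane_line(mask):
    """Fit x = slope * y + intercept through the nonzero pixels of a binary mask.

    Fitting x as a function of y keeps the fit stable for near-vertical lanes.
    """
    ys, xs = np.nonzero(mask)
    slope, intercept = np.polyfit(ys, xs, deg=1)
    return float(slope), float(intercept)


# Synthetic 100x200 mask with lane pixels along the line x = y + 20.
mask = np.zeros((100, 200), dtype=np.uint8)
for y in range(40, 100):
    mask[y, y + 20] = 1

slope, intercept = fit_lane_line(mask)
print(f"x = {slope:.2f} * y + {intercept:.2f}")
```

Real roads curve and masks are noisy, so practical systems typically fit higher-order curves or use learned models, but the fit above is a reasonable first experiment.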
Lane detection is an exciting project to add to your data science resume. The following are some datasets available to experiment with:
This dataset was part of the TuSimple Lane Detection Challenge. Each video clip is 1 second long and contains 20 frames, of which the last frame is annotated. The training set has 3,626 video clips with 3,626 annotated frames, and the test set has 2,782 video clips.
In case you are looking for a tutorial on developing this project, check the article below:
Have you ever wished for some technology that could caption your social media images because neither you nor your friends are able to come up with a cool caption? Deep Learning for image captioning comes to your rescue.
Image captioning is the process of generating a textual description for an image. It is a combined task of computer vision and natural language processing (NLP).
Computer vision methods help in understanding the input image and extracting features from it. NLP then turns those features into a textual description with the words in the correct order.
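The "words in the correct order" part is typically done one word at a time: given the image features and the words generated so far, a language model scores every candidate next word. Here is a minimal sketch of greedy decoding, the simplest such strategy. The scoring table is a toy stand-in that ignores the image; in a real system the scores would come from a trained decoder conditioned on CNN features.

```python
def greedy_caption(image_features, next_word_scores, max_len=10):
    """Build a caption by always picking the highest-scoring next word."""
    caption = ["<start>"]
    for _ in range(max_len):
        scores = next_word_scores(image_features, caption)
        word = max(scores, key=scores.get)
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption[1:])


# Toy stand-in for a trained decoder (illustrative values only):
# it scores the next word from the previous word alone.
TOY_TABLE = {
    "<start>": {"a": 0.9, "dog": 0.1},
    "a": {"dog": 0.8, "<end>": 0.2},
    "dog": {"runs": 0.7, "<end>": 0.3},
    "runs": {"<end>": 1.0},
}


def toy_scores(image_features, caption_so_far):
    return TOY_TABLE[caption_so_far[-1]]


caption = greedy_caption(image_features=None, next_word_scores=toy_scores)
print(caption)
```

Greedy decoding is easy to implement but can miss better captions; beam search, which keeps several candidate captions alive at once, is a common upgrade.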
The following are some useful datasets to get your hands dirty with image captioning:
COCO is a large-scale object detection, segmentation, and captioning dataset. It consists of 330K images (>200K labeled) with 1.5 million object instances and 80 object categories, with 5 captions per image.
It is an image caption corpus consisting of 158,915 crowd-sourced captions describing 31,783 images. This is an extension of Flickr 8k Dataset. The new images and captions focus on people doing everyday activities and events.
If you are looking for the implementation of the project, I will suggest you look at the following article:
Also, I suggest you go through this prominent paper on Image Captioning.
Human pose estimation is an interesting application of computer vision. You must have heard about PoseNet, which is an open-source model for human pose estimation. In brief, pose estimation is a computer vision technique to infer the pose of a person or object present in an image or video.
Before discussing how pose estimation works, let us first understand the ‘human pose skeleton’. It is the set of coordinates that defines the pose of a person. Each coordinate is a key point (such as a shoulder or an elbow), and a connected pair of key points forms a limb. Pose estimation is performed by identifying, locating, and tracking the key points of the human pose skeleton in an image or video.
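A pose skeleton maps naturally onto a small data structure: named key points with image coordinates, plus a list of key-point pairs that form limbs. Here is a minimal sketch. The key-point names and coordinates are made up, and real models and datasets each define their own key-point set.

```python
import math

# Key points as (x, y) pixel coordinates (illustrative values).
keypoints = {
    "left_shoulder": (100, 50),
    "left_elbow": (100, 90),
    "left_wrist": (130, 130),
}

# Each limb connects a pair of key points.
limbs = [
    ("left_shoulder", "left_elbow"),
    ("left_elbow", "left_wrist"),
]


def limb_lengths(keypoints, limbs):
    """Return the pixel length of every limb in the skeleton."""
    lengths = {}
    for start, end in limbs:
        (x1, y1), (x2, y2) = keypoints[start], keypoints[end]
        lengths[(start, end)] = math.hypot(x2 - x1, y2 - y1)
    return lengths


lengths = limb_lengths(keypoints, limbs)
for limb, length in lengths.items():
    print(limb, length)
```

Tracking a pose in video then amounts to estimating these key points frame by frame and following how they move.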
The following are some datasets if you want to develop a pose estimation model:
MPII Human Pose dataset is a state of the art benchmark for evaluation of articulated human pose estimation. The dataset includes around 25K images containing over 40K people with annotated body joints. Overall the dataset covers 410 human activities and each image has an activity label.
The HumanEva-I dataset contains 7 calibrated video sequences that are synchronized with 3D body poses. The database contains 4 subjects performing 6 common actions (e.g. walking, jogging, gesturing, etc.) that are split into training, validation, and testing sets.
I found DeepPose by Google to be a very interesting research paper on using deep learning models for pose estimation. In addition, you can go through the many other research papers available on pose estimation to understand it better.
Facial expressions play a vital role in the process of non-verbal communication, as well as for identifying a person. They are very important in recognizing a person’s emotions. Consequently, information on facial expressions is often used in automatic systems of emotion recognition.
Emotion Recognition is a challenging task because emotions may vary depending on the environment, appearance, culture, and face reaction which leads to ambiguous data.
The face expression recognition system is a multistage process consisting of face image processing, feature extraction, and classification.
Below is a dataset you can practice on:
Real-world Affective Faces Database (RAF-DB) is a large-scale facial expression database with around 30K highly diverse facial images. It consists of 29,672 real-world images, with a 7-dimensional expression distribution vector for each image.
You can read these resources to increase your understanding further:
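A 7-dimensional expression distribution can be turned into a single label by normalizing it and picking the largest entry. In the sketch below, the emotion names and their order are my assumption (six basic emotions plus neutral), and the vote counts are made up, so check the dataset's own documentation before relying on the mapping.

```python
import numpy as np

# Assumed label order; verify it against the dataset documentation.
EMOTIONS = ["surprise", "fear", "disgust", "happiness",
            "sadness", "anger", "neutral"]


def dominant_emotion(distribution):
    """Return (label, share) for the strongest entry of a 7-dim distribution."""
    distribution = np.asarray(distribution, dtype=float)
    if distribution.shape != (len(EMOTIONS),):
        raise ValueError(f"expected {len(EMOTIONS)} values")
    probabilities = distribution / distribution.sum()
    index = int(np.argmax(probabilities))
    return EMOTIONS[index], float(probabilities[index])


# Illustrative annotator votes for one image (they sum to 40).
label, share = dominant_emotion([2, 0, 1, 30, 3, 1, 3])
print(label, share)
```

Keeping the full distribution instead of only the top label is also an option: it lets a classifier learn that some faces are genuinely ambiguous.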
To conclude, in this article we discussed 10 interesting computer vision projects you can implement as a beginner. This is not an exhaustive list. So if you feel we missed something, feel free to add in the comments below!
Also, here I am listing down some useful CV resources to help you explore the deep learning and Computer vision world:
There is a big difference between the data science we learn in courses and self-practice and the data science we do in the industry. I’d recommend going through these crystal clear free courses to understand everything about analytics, machine learning, and artificial intelligence:
I hope you find the discussion useful. Now it’s your turn to start implementing these computer vision projects on your own.
Very well written Shipra. Can you share some code examples also to practice these datasets?