10 Data Science Projects Every Beginner should add to their Portfolio

Shipra Saxena 30 Jul, 2021

Overview

Projects are a practical way to deepen your knowledge of data science.
To boost your resume, here are 10 data science projects you can work on as a beginner.
This list is by no means exhaustive. Feel free to add more data science projects in the comments below.

Introduction

With demand for data scientists rising rapidly in recent times, reports show that people are enrolling in data science programs in high numbers. Still, the industry lacks a skilled workforce in the AI domain. Why?

When hiring data scientists, companies expect candidates to have worked on related projects. Just knowing the tools and algorithms or holding certifications is not enough. As a data scientist, you must have experience working on full-stack data science projects and know-how of the tasks involved: preparing the problem statement, forming hypotheses, gathering and cleaning the relevant data, building the ML pipeline, and deploying the model.

To simplify your search for relevant projects as a beginner, here we list some data science projects you can try your hand at. The projects are divided into four parts:

CV projects
NLP projects
Time series projects
Miscellaneous

Computer vision Projects

Computer vision is one of the most popular applications of machine learning, and everyone wants to explore it. This year we saw many interesting CV use cases. Here I am sharing a few you can get your hands dirty with.

If you are looking to master computer vision, check out our course Computer Vision using Deep Learning 2.0.

Object detection with YOLOv4

In recent times we have seen tremendous change in state-of-the-art real-time object detection models. The latest is the release of YOLOv4. You Only Look Once (YOLO) is a family of one-stage object detectors that are fast and accurate. YOLOv4 showed very good results compared to other object detectors.

In experiments, YOLOv4 obtained an AP value of 43.5 percent (65.7 percent AP50) on the MS COCO dataset, and achieved a real-time speed of ∼65 FPS on the Tesla V100, beating the fastest and most accurate detectors in terms of both speed and accuracy.

YOLOv4 is twice as fast as EfficientDet with comparable performance. In addition, compared with YOLOv3, the AP and FPS have increased by 10 percent and 12 percent, respectively.
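
To make the AP50 figure above a little more concrete: in COCO-style evaluation, a predicted box is usually counted as correct when its intersection over union (IoU) with a ground-truth box is at least 0.5. Here is a minimal sketch of that check. The function names and the (x1, y1, x2, y2) box format are my own assumptions for illustration, not part of YOLOv4 itself.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, truth_box, threshold=0.5):
    """AP50-style match: the detection counts if IoU >= threshold."""
    return iou(pred_box, truth_box) >= threshold


# Hypothetical boxes: the prediction is shifted 2 pixels to the right.
print(is_true_positive((2, 0, 12, 10), (0, 0, 10, 10)))
```

Average precision is then computed from how many detections pass this test at each confidence level; that part is left out of the sketch.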

I recommend going through the following links if you want to learn object detection.

Image classification with Microsoft Lobe

Recently, Microsoft launched its machine learning app, Lobe, which aims to make developing a machine learning model easier without writing a single line of code. It makes for an exciting image classification project for beginners.

Lobe automatically selects the right machine learning architecture and starts training without any setup or configuration. Further, users can evaluate the model’s strengths and weaknesses with real-time visual results. Once training is done, the model can be deployed on a website or device.

As a beginner, if you want to develop an image classification model, I recommend trying Lobe.

Color Your Pictures with ChromaGAN

Image colorization is a very interesting problem. Here you have to fill a grayscale image with plausible colors, and there can be multiple correct solutions. Before the emergence of deep learning techniques, the most effective methods relied on human intervention. Now we have various AI techniques, including generative networks.

ChromaGAN is one such solution. It combines the strength of generative adversarial networks with semantic class distribution learning. As a result, ChromaGAN is able to perceptually colorize a grayscale image from the semantic understanding of the captured scene.

It is an interesting project to enhance your profile as a computer vision expert. Here is the ChromaGAN paper; I suggest you definitely go through it.

Natural Language Processing

NLP is one of the hottest fields in the machine learning industry, with applications like chatbots, topic modelling, and many more. Hence, the AI giants are investing large amounts in NLP research.

Don’t forget to check the following links if you are looking into NLP.

ELECTRA

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a pre-training approach. It aims to match or exceed the downstream performance of models pre-trained with masked language modelling, as used by BERT, while using significantly less compute for the pre-training stage.

Setup for ELECTRA pre-training (Source — ELECTRA paper)

The pre-training task in ELECTRA is based on detecting replaced tokens in the input sequence. This setup requires two Transformer models, a generator and a discriminator, similar to a GAN.
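
Here is a minimal sketch of how the training data for replaced-token detection can be built. Note the assumptions: a random sampler stands in for the generator (in ELECTRA the generator is a small masked language model), and the function name, vocabulary, and sentence are hypothetical. The discriminator would then be trained to predict the labels from the corrupted sequence.

```python
import random


def corrupt_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for replaced-token detection.

    A random sampler plays the role of the generator here (an assumption
    made to keep the sketch dependency-free).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            replacement = rng.choice(vocab)  # the "generator" proposes a token
        else:
            replacement = token
        corrupted.append(replacement)
        # Label 1 = replaced, 0 = original. A sampled token that happens to
        # equal the original is labelled as original.
        labels.append(int(replacement != token))
    return corrupted, labels


# Hypothetical toy input.
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
sentence = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt_tokens(sentence, vocab, mask_prob=0.5, seed=1)
print(corrupted, labels)
```

Because every position gets a label, the discriminator learns from all input tokens rather than only from the masked ones.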

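The paper reports that the original ELECTRA approach yields a GLUE score of 85.0, while ELECTRA 15% (a variant whose discriminator loss covers only the 15% of tokens that were masked out) gets 82.4. (For comparison, BERT scored 82.2.)

Here are the GitHub link and the ELECTRA paper.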

Topic Modeling with Top2Vec

Top2Vec is an algorithm for discovering semantic structure, or topics, in a given set of documents. Under the hood, Top2Vec uses doc2vec to build a semantic space in which documents and words are embedded together.

This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with the distance between them representing semantic similarity.
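
To illustrate what "jointly embedded" buys you, here is a minimal sketch: if a topic vector is taken as the centroid of a group of document vectors, the words closest to it by cosine similarity describe the topic. The 2-D vectors and names below are hypothetical toy values, and the step that finds dense groups of documents is skipped entirely, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_words(doc_vectors, word_vectors, top_n=2):
    """Words whose vectors lie closest to the centroid of the documents."""
    topic = centroid(doc_vectors)
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical embeddings: axis 0 ~ "sports", axis 1 ~ "finance".
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.3],
    "stock": [0.1, 0.9],
    "market": [0.2, 0.8],
}
sports_docs = [[0.85, 0.2], [0.95, 0.15]]
print(topic_words(sports_docs, word_vectors))  # ['goal', 'match']
```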

The Top2Vec authors have also provided an open-source API to experiment with. Further, you can dig deeper into the model through the paper.

ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations

ALBERT is a language representation model proposed in the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. The model is a modified version of the traditional BERT model.

Generally, increasing model size in language representation problems improves performance but also increases memory consumption and training time. To address this, the authors proposed two methods to reduce the memory consumption and training time of traditional BERT:

Splitting the embedding matrix into two smaller matrices.
Sharing parameters across repeating layers, which are split among groups.
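
A quick back-of-the-envelope calculation shows why the two reductions listed above matter. The sketch below only counts weights; the sizes are illustrative assumptions of roughly base-model scale, the function names are mine, and the sharing case assumes a single group of layers.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Number of weights in the token-embedding part of the model."""
    if embedding_size is None:
        # One big V x H matrix, as in the traditional BERT layout.
        return vocab_size * hidden_size
    # Two smaller matrices: a V x E lookup followed by an E x H projection.
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(params_per_layer, num_layers, shared=False):
    """Weights stored for the stack of encoder layers."""
    # With sharing, the same layer weights are reused at every depth.
    return params_per_layer if shared else params_per_layer * num_layers


# Illustrative sizes (assumptions): vocabulary, hidden size, embedding size.
V, H, E = 30_000, 768, 128
print(embedding_params(V, H))     # 23040000
print(embedding_params(V, H, E))  # 3938304
```

With these numbers the embedding block shrinks from about 23 million weights to about 4 million, and sharing one set of layer weights means the stored encoder size no longer grows with depth.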

According to the researchers, this model achieved state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks for natural language understanding.

For a better understanding, don’t forget to read the ALBERT paper. Here, you can find the model documentation and implementation for ALBERT.

Time-series Analysis

Time-series analysis is a powerful modelling technique that deals with observations recorded at successive time stamps. It is highly useful for companies, for example in forecasting sales or website traffic, predicting stock prices, and more.

In case you are interested in digging further, here is your guide for time series analysis.

Rocket

Time series classification is an interesting problem because the features here have an order, or sequence, that we cannot ignore. Examples include classifying a patient's ECG signals or motion sensor data.

Source: https://arxiv.org/pdf/1910.13051.pdf

Most state-of-the-art methods for time series classification have high complexity and significant training time even on smaller datasets, and they are effectively unusable for large datasets. Rocket (RandOm Convolutional KErnel Transform) can achieve the same level of accuracy in a fraction of the time taken by competing SOTA algorithms, including convolutional neural networks.

To achieve accuracy and scalability, Rocket first uses random convolutional kernels to transform the time series into features. It then passes these transformed features to a classifier.
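
The transform step can be sketched in plain Python. This is a simplified illustration under my own assumptions, not the reference implementation: padding is omitted, the function names are hypothetical, and each kernel contributes two features, the maximum output and the proportion of positive values (PPV). For real work, use the sktime implementation.

```python
import math
import random


def make_kernels(num_kernels, series_length, seed=0):
    """Draw random (weights, bias, dilation) kernels; series_length >= 12."""
    rng = random.Random(seed)
    kernels = []
    for _ in range(num_kernels):
        length = rng.choice([7, 9, 11])
        weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
        mean = sum(weights) / length
        weights = [w - mean for w in weights]  # mean-centre the weights
        bias = rng.uniform(-1.0, 1.0)
        # Largest exponent such that the dilated kernel still fits the series.
        max_exp = max(0.0, math.log2((series_length - 1) / (length - 1)))
        dilation = max(1, int(2 ** rng.uniform(0.0, max_exp)))
        kernels.append((weights, bias, dilation))
    return kernels


def apply_kernel(series, kernel):
    """Slide one dilated kernel over the series; return (max, ppv)."""
    weights, bias, dilation = kernel
    span = (len(weights) - 1) * dilation
    outputs = [
        bias + sum(w * series[start + i * dilation] for i, w in enumerate(weights))
        for start in range(len(series) - span)
    ]
    ppv = sum(1 for value in outputs if value > 0) / len(outputs)
    return max(outputs), ppv


def transform(series, kernels):
    """Feature vector: [max_1, ppv_1, max_2, ppv_2, ...]."""
    features = []
    for kernel in kernels:
        features.extend(apply_kernel(series, kernel))
    return features


# Hypothetical series; the same kernels must be reused for every series.
series = [math.sin(0.3 * t) for t in range(50)]
kernels = make_kernels(20, len(series), seed=0)
print(len(transform(series, kernels)))  # 40 features from 20 kernels
```

Since the kernels are never trained, the only learning happens in the classifier that consumes these features, which is where the speed advantage comes from.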

You can find its implementation in sktime, and here is an example notebook. Further, you can also read the paper to understand the approach better.

Prophet

Prophet is an open-source tool by Facebook for forecasting time series data. It decomposes a time series into trend, seasonality, and holidays. In addition, Prophet has intuitive parameters that are easy to tune.

It is fully automatic, accurate, and fast, which makes Prophet easy to use for someone who lacks deep expertise in time series forecasting.

It works best with time series that have strong seasonal effects and several seasons of historical data. Also, Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
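
The decomposition idea can be illustrated with a toy additive model, forecast = trend + seasonality. To be clear about the assumptions: this sketch fits a single straight-line trend and averages the residuals per weekday, it ignores holidays, and it is not how Prophet itself estimates its components. All names and the synthetic data are hypothetical.

```python
def fit_trend(values):
    """Least-squares straight line through (t, value) for t = 0, 1, 2, ..."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / variance
    return y_mean - slope * t_mean, slope


def decompose(values, period=7):
    """Split a series into a linear trend and a repeating seasonal profile."""
    intercept, slope = fit_trend(values)
    residuals = [y - (intercept + slope * t) for t, y in enumerate(values)]
    seasonal = []
    for phase in range(period):
        bucket = residuals[phase::period]  # all residuals at this phase
        seasonal.append(sum(bucket) / len(bucket))
    return intercept, slope, seasonal


def forecast(intercept, slope, seasonal, t):
    """Additive forecast for time index t."""
    return intercept + slope * t + seasonal[t % len(seasonal)]


# Synthetic daily data: upward trend plus a weekly pattern (4 full weeks).
weekly = [3, 1, 0, -1, -2, -1, 0]
history = [10 + 2 * t + weekly[t % 7] for t in range(28)]
intercept, slope, seasonal = decompose(history)
print(round(slope, 2), round(forecast(intercept, slope, seasonal, 28), 1))
```

Even this crude version recovers a slope close to 2 and the weekly peak on the first day of the cycle, which is why several seasons of history help.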

Here is a guide, if you want to know more about the implementation of time series forecasting using Prophet.

Generate Quick and Accurate Time Series Forecasts using Facebook’s Prophet (with Python & R codes)

GluonTS: Probabilistic Time Series Models in Python

Here we have another library available for time series prediction: GluonTS. It is a library for deep-learning-based time series modeling. It simplifies the development of and experimentation with time-series models for tasks such as forecasting or anomaly detection.

The library provides all the necessary tools that scientists need for quickly building new models, for efficiently running and analyzing experiments, and for evaluating model accuracy.
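
Since the models here are probabilistic, it helps to see what evaluating such a forecast can look like. A common representation is a set of sampled future paths, which can be summarised into quantile bands and scored with the quantile (pinball) loss. The sketch below is a generic illustration under that assumption; the sample paths and function names are hypothetical and it does not use the GluonTS API.

```python
def quantile(samples, q):
    """Simple empirical quantile (0 <= q <= 1) by rank in the sorted samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def forecast_bands(sample_paths, lower=0.1, upper=0.9):
    """Summarise sample paths into (lower, median, upper) per future step."""
    bands = []
    for step_values in zip(*sample_paths):  # all samples for one time step
        bands.append(
            (
                quantile(step_values, lower),
                quantile(step_values, 0.5),
                quantile(step_values, upper),
            )
        )
    return bands


def pinball_loss(actual, predicted, q):
    """Quantile loss: penalises under- and over-prediction asymmetrically."""
    diff = actual - predicted
    return max(q * diff, (q - 1) * diff)


# Five hypothetical sampled futures, three steps ahead.
paths = [
    [10, 11, 12],
    [12, 13, 15],
    [11, 12, 13],
    [9, 10, 11],
    [13, 14, 16],
]
print(forecast_bands(paths)[0])  # (9, 11, 13)
```

A wide band signals an uncertain forecast, which is information a single point prediction cannot give you.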

GluonTS is available as open-source software on GitHub under the Apache License, version 2.0.

Miscellaneous

Here are some other projects that you can add to your portfolio to enhance it.

Recommender system with TensorFlow Recommenders

From suggesting movies to suggesting products, recommender systems are an important machine learning application. Recently, TensorFlow launched its package TensorFlow Recommenders. It is an open-source package that makes building, evaluating, and serving recommender systems easy.

Further, the library is built on Keras, which gives it a smooth learning curve while still giving you the flexibility to develop complex models. I suggest you read the official documentation and the tutorials provided by TensorFlow.
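
Before diving into the library, it may help to see the core idea behind a typical retrieval-style recommender: users and items are mapped to vectors, and items are ranked by their dot product with the user vector. In the sketch below the embeddings are hand-written hypothetical values (a real model would learn them from interaction data), and the function names are my own.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vector, item_vectors, k=2):
    """Return the names of the k items with the highest affinity score."""
    scored = sorted(
        item_vectors.items(),
        key=lambda pair: dot(user_vector, pair[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]


# Hypothetical 2-D embeddings: axis 0 ~ "science", axis 1 ~ "romance".
item_vectors = {
    "space documentary": [0.9, 0.1],
    "sci-fi thriller": [0.8, 0.4],
    "romantic comedy": [0.1, 0.9],
}
user_vector = [1.0, 0.2]
print(recommend(user_vector, item_vectors))
# ['space documentary', 'sci-fi thriller']
```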

Anomaly Detection with PyOD

Anomaly or outlier detection is the problem of identifying unusual patterns in the data. It is a process of identifying what is normal and what is not. In other words, anomaly detection defines a boundary around normal data points in order to distinguish them from the outliers.

PyOD (Python Outlier Detection) is a comprehensive and scalable Python toolkit for outlier detection. It implements more than 30 algorithms. PyOD is developed with a comprehensive API to support multiple techniques, and you can take a look at the official documentation of PyOD here.
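
To give a feel for what one of these algorithms does, here is a minimal pure-Python sketch of a classic approach: score each point by the distance to its k-th nearest neighbour and flag the highest-scoring fraction as outliers. This is my own illustration of the idea with hypothetical names and data, not PyOD code; in practice you would use the toolkit's tested detectors.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


def flag_outliers(points, k=2, contamination=0.2):
    """Label the top `contamination` fraction of scores as outliers (1)."""
    scores = knn_outlier_scores(points, k)
    n_outliers = max(1, round(contamination * len(points)))
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [int(score >= threshold) for score in scores]


# Hypothetical data: four clustered points and one far away.
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]
print(flag_outliers(points))  # [0, 0, 0, 0, 1]
```

The threshold on the score is exactly the "boundary around normal data points" mentioned earlier: everything at or beyond it is treated as an outlier.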

Here is the detailed tutorial for you.

An Awesome Tutorial to Learn Outlier Detection in Python using PyOD Library

Develop Machine Learning web app with Streamlit

Suppose you have created a tweet sentiment analysis project that works efficiently with high accuracy. If you want to demonstrate your project, you need to develop a dashboard using HTML or JavaScript. That is a tedious task in itself if you do not know any of these languages.

With the launch of Streamlit, developing a dashboard or web application for a machine learning project has become incredibly easy using Python only. Isn’t it exciting?

Streamlit is an open-source Python library to build efficient, beautiful, and shareable web-based apps in very little time.

To install the library, you can use the command below, and it’s done:

pip install streamlit

Here is an interesting GIF to help you understand how it works.

Using Streamlit, we can develop anything from very simple to complex machine learning applications with a few lines of code. I personally like this tool because:

It uses Python scripting; no other language is needed.
Less code is required to create efficient applications.
Data caching speeds up the application.
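
To show how little code a demo needs, here is a minimal sketch of a sentiment app for the tweet example above. The keyword-based scorer is a hypothetical stand-in for your trained model, and the word lists and function names are my own assumptions; only the st.title, st.text_input, and st.write calls come from Streamlit.

```python
try:
    import streamlit as st  # third-party; install with: pip install streamlit
except ImportError:
    st = None  # lets the scoring logic be imported and tested without Streamlit

# Tiny hypothetical lexicon standing in for a real sentiment model.
POSITIVE = {"love", "great", "good", "awesome"}
NEGATIVE = {"hate", "terrible", "bad", "awful"}


def score_sentiment(text):
    """Return 'positive', 'negative', or 'neutral' from keyword counts."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if balance > 0:
        return "positive"
    if balance < 0:
        return "negative"
    return "neutral"


def main():
    """Build the page: a title, a text box, and the prediction."""
    st.title("Tweet sentiment demo")
    tweet = st.text_input("Enter a tweet", "I love this library!")
    st.write("Predicted sentiment:", score_sentiment(tweet))


if st is not None:
    main()
```

Save this as a script (say app.py, a hypothetical name) and start it with streamlit run app.py. The page reruns the script whenever the input changes, so the prediction updates as you type a new tweet.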

Excited to explore further? Here is the link for you: Streamlit.

Endnote

The data science world is advancing at a high pace. Hence, to stay competitive, you need to be aware of the latest tools and breakthrough techniques in the industry.

In this article, I tried to cover a diverse set of projects in the data science domain that you, as a beginner, should definitely know about. Now it’s your turn to get hands-on experience.
