10 Data Science Projects Every Beginner Should Add to Their Portfolio
- Projects are a practical way to deepen your knowledge of the data science domain.
- To boost your resume, here are 10 data science projects you can work on as a beginner.
- This list is by no means exhaustive. Feel free to suggest more data science projects in the comments below.
Demand for data scientists has grown rapidly in recent times, and reports show that people are enrolling in data science programs in large numbers. Still, the industry lacks a skilled workforce in the AI domain. Why?
When hiring data scientists, companies expect candidates to have worked on related projects. Just knowing the tools and algorithms, or holding certifications, is not enough. As a data scientist, you need experience working on end-to-end data science projects and know-how of the tasks involved: preparing the problem statement, forming hypotheses, gathering and cleaning the relevant data, building the ML pipeline, and deploying the model.
To simplify your search for relevant projects as a beginner, here are some data science projects you can try your hand at. The projects are divided into four parts:
- CV projects
- NLP projects
- Time series projects
- Other projects
Computer Vision Projects
Computer vision is one of the most popular applications of machine learning, and everyone wants to explore it. This year we saw many interesting CV use cases. Here are a few you can get your hands dirty with.
If you are looking to master computer vision, check out our course Computer Vision using Deep Learning 2.0.
Object Detection with YOLOv4
In recent times we have seen tremendous change in state-of-the-art real-time object detection models. The latest is the release of YOLOv4. You Only Look Once (YOLO) is a family of one-stage object detectors that are fast and accurate. YOLOv4 showed very good results compared to other object detectors.
In experiments, YOLOv4 obtained an AP value of 43.5 percent (65.7 percent AP50) on the MS COCO dataset, and achieved a real-time speed of ∼65 FPS on the Tesla V100, beating the fastest and most accurate detectors in terms of both speed and accuracy.
YOLOv4 runs twice as fast as EfficientDet with comparable performance. In addition, compared with YOLOv3, AP and FPS improve by 10 percent and 12 percent, respectively.
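The AP50 figure quoted above counts a detection as correct when its Intersection over Union (IoU) with a ground-truth box is at least 0.5. Below is a minimal sketch of that overlap check. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration and are not taken from the YOLOv4 code base.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    # Union = sum of both areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction matches a ground-truth box when IoU >= threshold (0.5 for AP50)."""
    return iou(pred_box, gt_box) >= threshold
```

For example, two 10x10 boxes offset horizontally by 5 pixels overlap with an IoU of 50 / 150, roughly 0.33, so the prediction would not count as a match at the 0.5 threshold.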
I recommend going through the following links if you want to learn object detection.
- A Step-by-Step Introduction to the Basic Object Detection Algorithms
- A Practical Guide to Object Detection using the Popular YOLO Framework
Image Classification with Microsoft Lobe
Recently, Microsoft launched its machine learning app Lobe, which aims to make developing a machine learning model easier, without writing a single line of code. It is an exciting image classification project for beginners.
Lobe automatically selects the right machine learning architecture and starts training without any setup or configuration. Further, users can evaluate the model’s strengths and weaknesses with real-time visual results. Once the training is done, the model can be deployed on a website or device.
As a beginner, if you want to develop an image classification model, I recommend trying Lobe.
Color Your Pictures with ChromaGAN
Image colorization is a very interesting problem: you have to fill a grayscale image with plausible colors, and there can be multiple correct solutions. Before the emergence of deep learning techniques, the most effective methods relied on human intervention. Now we have various AI techniques, including generative networks.
ChromaGAN is one such solution. It combines the strength of generative adversarial networks with semantic class distribution learning. As a result, ChromaGAN is able to perceptually colorize a grayscale image from the semantic understanding of the captured scene.
It is an interesting project to enhance your profile in computer vision. Here is the ChromaGAN paper; I suggest you definitely go through it.
Natural Language Processing
NLP is one of the hottest fields in the machine learning industry, with applications like chatbots, topic modelling, and many more. Hence, the AI giants are investing large amounts in NLP research.
Don’t forget to check the following links if you are looking into NLP.
- Introduction to Natural Language Processing
- Certified Program: NLP for Beginners
- Natural Language Processing (NLP) Using Python
ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a pre-training approach. It aims to match or exceed the downstream performance of a model pre-trained with masked language modelling (MLM), as used by BERT, while using significantly less compute for the pre-training stage.
Setup for ELECTRA pre-training (Source: ELECTRA paper)
The pre-training task in ELECTRA is based on detecting replaced tokens in the input sequence. This setup requires two Transformer models, a generator and a discriminator, similar to a GAN.
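To make the replaced-token detection task concrete, here is a toy sketch of how the discriminator's training labels could be built. It is not the ELECTRA implementation: a random pick from a small vocabulary stands in for the generator model, and `corrupt_tokens`, `mask_prob`, and the vocabulary are hypothetical names chosen for illustration.

```python
import random


def corrupt_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for replaced-token detection.

    labels[i] is 1 if position i now holds a different token than the
    original, else 0. A random vocabulary sample stands in for the
    generator network (an assumption made to keep the sketch small).
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            new_tok = rng.choice(vocab)  # stand-in for the generator's prediction
        else:
            new_tok = tok
        corrupted.append(new_tok)
        # If the sampled token happens to equal the original, it counts as "original".
        labels.append(int(new_tok != tok))
    return corrupted, labels


sentence = ["the", "chef", "cooked", "the", "meal"]
toy_vocab = ["the", "chef", "ate", "cooked", "meal", "a"]
corrupted, labels = corrupt_tokens(sentence, toy_vocab, mask_prob=0.3)
```

The discriminator would then be trained to predict `labels` from `corrupted`, which gives it a learning signal at every input position rather than only at the masked ones.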
The paper reports a GLUE score of 85.0 for the original ELECTRA approach, while ELECTRA 15% (a variant whose discriminator loss is computed only on the 15% of tokens that were masked) gets 82.4. (For comparison, BERT scored 82.2.)
Topic Modeling with Top2Vec
Top2Vec is an algorithm for discovering semantic structure, or topics, in a given set of documents. Top2Vec uses doc2vec to generate a semantic space.
This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity.
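Because topics, documents, and words share one embedding space, a topic can be described by the words whose vectors lie closest to it. The sketch below illustrates that idea with cosine similarity on made-up 2-dimensional vectors. It does not use the Top2Vec library, and the vectors and function names are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(topic_vector, word_vectors, top_n=2):
    """Rank words by similarity to the topic vector and keep the top_n."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy embeddings; real ones would come from a trained doc2vec model.
word_vectors = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.3],
    "election": [0.1, 0.9],
}
topic_vector = [1.0, 0.2]
print(nearest_words(topic_vector, word_vectors))  # ['goal', 'match']
```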
ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations
ALBERT is a language representation model proposed in the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. The model is a modified version of the traditional BERT model.
Generally, increasing model size in language representation problems improves performance but also brings a proportional increase in training time. To address this, the authors propose two methods to reduce the memory consumption and training time of traditional BERT:
- Splitting the embedding matrix into two smaller matrices (factorized embedding parameterization).
- Sharing parameters across repeating layers (cross-layer parameter sharing).
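The effect of the first method on parameter count can be sketched with a little arithmetic. A single embedding matrix has V x H parameters, while the factorized version has V x E + E x H. The sizes below (V = 30000, H = 768, E = 128) are illustrative assumptions rather than figures quoted in this article.

```python
def embedding_params(vocab_size, hidden_size):
    """Parameters in a single vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
    """Parameters after splitting into (vocab x E) and (E x hidden) matrices."""
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # illustrative sizes (assumed)
full = embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)
print(full)        # 23040000
print(factorized)  # 3938304
```

With these assumed sizes, the factorized embedding needs roughly one sixth of the parameters of the single matrix.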
According to the researchers, this model achieved state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks for natural language understanding.
Time-series analysis is a powerful modelling technique that deals with observations recorded at different time stamps. It is highly useful for companies, for example in forecasting sales and website traffic, predicting stock prices, and more.
In case you want to dig further, here are some guides to time series analysis:
- 7 methods to perform Time Series forecasting (with Python codes)
- A Complete Tutorial on Time Series Modeling in R
- Free Course: Time Series Forecasting using Python
Time series classification is an interesting problem because the features have an order (sequence) that we cannot ignore. Examples include classifying a patient’s ECG signals or motion sensor data.
Most state-of-the-art methods for time series classification have high complexity and significant training time even on smaller datasets, and they are effectively unusable for large datasets. Rocket (RandOm Convolutional KErnel Transform) can achieve the same level of accuracy in just a fraction of the time taken by competing SOTA algorithms, including convolutional neural networks.
To achieve accuracy and scalability, Rocket first transforms the time series using random convolutional kernels and then passes the transformed features to a classifier.
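The two-step idea above can be sketched in plain Python. This is a simplified illustration, not the reference Rocket implementation: it omits the random dilation and padding described in the Rocket paper, and the function names are hypothetical. Each kernel contributes two features, the maximum output and the proportion of positive values (PPV).

```python
import random


def random_kernel(rng):
    """Draw one random 1-D kernel: random length, mean-centred weights, random bias."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]
    bias = rng.uniform(-1.0, 1.0)
    return weights, bias


def apply_kernel(series, weights, bias):
    """Slide the kernel over the series (no padding, no dilation).

    Returns two features: the maximum output and the proportion of
    positive outputs (PPV).
    """
    k = len(weights)
    outputs = [
        sum(w * series[i + j] for j, w in enumerate(weights)) + bias
        for i in range(len(series) - k + 1)
    ]
    ppv = sum(1 for o in outputs if o > 0) / len(outputs)
    return max(outputs), ppv


def rocket_transform(series, num_kernels=100, seed=0):
    """Turn one series (length >= 11) into 2 * num_kernels features.

    Using the same seed for every series means every series is
    transformed with the same set of kernels.
    """
    rng = random.Random(seed)
    features = []
    for _ in range(num_kernels):
        weights, bias = random_kernel(rng)
        features.extend(apply_kernel(series, weights, bias))
    return features
```

The resulting feature vectors, one per series, would then be passed to a classifier of your choice.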
Prophet is an open-source tool by Facebook for forecasting time series data. It decomposes a time series into trend, seasonality, and holidays. In addition, Prophet has intuitive parameters that are easy to tune.
It is fully automatic, accurate, and fast, which makes Prophet easy to use for someone who lacks deep expertise in time series forecasting.
It works best with time series that have strong seasonal effects and several seasons of historical data. Also, Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
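The decomposition into trend, seasonality, and holidays mentioned above is usually written as an additive model. The notation below follows the form given in the Prophet paper, as far as I recall it, and is shown here only as a reading aid:

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here g(t) is the trend, s(t) the periodic seasonal component, h(t) the effect of holidays, and \varepsilon_t the error term not explained by the model.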
Here is a guide if you want to know more about implementing time series forecasting using Prophet.
GluonTS: Probabilistic Time Series Models in Python
GluonTS is another library available for time series prediction. It is a library for deep-learning-based time series modeling that simplifies the development of, and experimentation with, time series models for tasks such as forecasting or anomaly detection.
The library provides the tools that scientists need for quickly building new models, for efficiently running and analyzing experiments, and for evaluating model accuracy.
GluonTS is available as open-source software on GitHub under the Apache License, version 2.0.
Here are some other projects that you can add to your portfolio to enhance it.
Recommender System with TensorFlow Recommenders
From suggesting movies to suggesting products, recommender systems are an important machine learning application. Recently, TensorFlow launched its package TensorFlow Recommenders. It is an open-source package that makes building, evaluating, and serving recommender systems easy.
Further, the library is built on Keras, which gives it a smooth learning curve while still giving you the flexibility to develop complex models. I suggest you read the official documentation and the tutorials provided by TensorFlow.
Anomaly Detection with PyOD
Anomaly or outlier detection is the problem of identifying unusual patterns in the data. It is a process of identifying what is normal and what is not. In other words, anomaly detection defines a boundary around normal data points in order to distinguish them from the outliers.
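As a minimal illustration of drawing such a boundary, the sketch below flags points whose modified z-score, based on the median and the median absolute deviation (MAD), exceeds a threshold. This is a simple stand-in technique and not one of PyOD's detectors. The 0.6745 scaling constant and the 3.5 threshold are a common rule of thumb, assumed here for illustration.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return one boolean per value: True if it lies outside the boundary.

    The boundary is defined by the modified z-score
    0.6745 * |x - median| / MAD exceeding the threshold.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # All points sit on the median; nothing can be flagged reliably.
        return [False] * len(values)
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


readings = [10, 11, 9, 10, 12, 10, 11, 50]
flags = mad_outliers(readings)
print([r for r, is_outlier in zip(readings, flags) if is_outlier])  # [50]
```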
PyOD (Python Outlier Detection) is a comprehensive and scalable Python toolkit for outlier detection. It implements more than 30 algorithms behind a comprehensive API that supports multiple techniques. You can take a look at the official documentation of PyOD here.
Here is the detailed tutorial for you.
Develop a Machine Learning Web App with Streamlit
With the launch of Streamlit, developing a dashboard or web application for a machine learning project has become incredibly easy, using Python only. Isn’t it exciting?
Streamlit is an open-source Python library to build efficient, beautiful, and shareable web-based apps in very little time.
To install the library, you can use the command below:
pip install streamlit
Using Streamlit, we can develop anything from very simple to complex machine learning applications with a few lines of code. I personally like this tool because:
- It uses Python scripting; no other language is needed.
- Less code is required to create efficient applications.
- Data caching speeds up the application.
Excited to explore further? Here is the link for you: Streamlit
The data science world is advancing at a high pace. Hence, to stay competitive, you need to be aware of the latest tools and techniques emerging as breakthroughs in the industry.
In this article, I tried to cover a diverse set of projects in the data science domain that you should definitely know about as a beginner. Now it’s your turn to get hands-on experience.