7 Data Science Projects on GitHub to Showcase your Skills!

Last Updated : 06 Dec, 2023

Overview

Working on Data Science projects is a great way to stand out from the competition
Check out these 7 data science projects on GitHub that will enhance your budding skillset
These GitHub repositories include projects from a variety of data science fields – machine learning, computer vision, reinforcement learning, among others

Introduction

Are you ready to take that next big step in your data science journey? Working on small datasets and using popular data science libraries and frameworks is a good start. But if you truly want to stand out from the competition, you need to take a leap and differentiate yourself.

A brilliant way to do this is to do a project on the latest breakthroughs in data science. Want to become a Computer Vision expert? Learn how the latest object detection algorithm works. If Natural Language Processing (NLP) is your calling, then learn about the various aspects and off-shoots of the Transformer architecture.

My point is – always be ready and willing to work on new data science techniques. This is one of the fastest-growing fields in the industry and we as data scientists need to grow along with it.

So, let’s check out 7 data science GitHub projects that can help you upskill. As always, I have kept the domain broad to include projects from machine learning to reinforcement learning.

And if you have come across any library that isn’t on this list, let the community know in the comments section below this article!

Top Data Science GitHub Projects

I have divided these data science projects into three broad categories:

Machine Learning Projects
Deep Learning Projects
Programming Projects

Machine Learning Projects

pyforest – Importing all Python Data Science Libraries in One Line of Code

I really, really like this Python library. As the above heading suggests, your typical data science libraries are imported using just one library – pyforest. Check out this quick demo I’ve taken from the library’s GitHub repository:

[Figure: pyforest demo]

Excited yet? pyforest currently includes pandas, NumPy, matplotlib, and many more data science libraries.

Just use pip install pyforest to install the library on your machine and you’re good to go. And you can import all the popular Python libraries for data science in just one line of code:

from pyforest import *

Awesome! I’m thoroughly enjoying using this and I’m certain you will as well. You should check out the below free course on Python if you’re new to the language:

Python for Data Science

Click here to explore this GitHub data science project.

HungaBunga – A Different Way of Building Machine Learning Models using sklearn

How do you pick the best machine learning model from the ones you’ve built? How do you ensure the right hyperparameter values are in play? These are critical questions a data scientist needs to answer.

And the HungaBunga project will help you reach that answer faster than most data science libraries. It runs through all the sklearn models (yes, all!) with all the possible hyperparameters and ranks them using cross-validation.

Here’s how to import all the models (both classification and regression):

from hunga_bunga import HungaBungaClassifier, HungaBungaRegressor
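To make the "try everything, rank by cross-validation" idea concrete, here is a minimal pure-Python sketch. It is not HungaBunga's code and does not use sklearn: the two toy models, the hyperparameter grid, the dataset and every function name below are assumptions made up for illustration.

```python
from collections import Counter
from itertools import product


def majority_model(train, _params):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def knn_model(train, params):
    """k-nearest neighbours on 1-D inputs; `k` is the hyperparameter."""
    k = params["k"]

    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]

    return predict


def cross_val_accuracy(build, params, data, folds=4):
    """Mean accuracy over `folds` interleaved train/test splits."""
    scores = []
    for i in range(folds):
        test = data[i::folds]
        train = [row for j, row in enumerate(data) if j % folds != i]
        predict = build(train, params)
        scores.append(sum(predict(x) == y for x, y in test) / len(test))
    return sum(scores) / folds


def rank_models(search_space, data):
    """Try every model with every hyperparameter combination; best score first."""
    results = []
    for name, (build, grid) in search_space.items():
        keys = list(grid)
        for values in product(*(grid[key] for key in keys)):
            params = dict(zip(keys, values))
            results.append((cross_val_accuracy(build, params, data), name, params))
    return sorted(results, key=lambda result: result[0], reverse=True)


# Toy data: the label is 1 when x >= 5 (hypothetical, for illustration only).
data = [(x, int(x >= 5)) for x in range(12)]
search_space = {
    "majority": (majority_model, {}),
    "knn": (knn_model, {"k": [1, 3, 5]}),
}
ranking = rank_models(search_space, data)
best_score, best_name, best_params = ranking[0]
print(best_name, best_params, round(best_score, 3))
```

An exhaustive search like this gets expensive quickly, since the number of fits is models × hyperparameter combinations × folds, so it suits small datasets best.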

You should check out the below comprehensive article on supervised machine learning algorithms:

Commonly used Machine Learning Algorithms (with Python and R Codes)

Click here to explore this GitHub data science project.

Deep Learning Projects

Behavior Suite for Reinforcement Learning (bsuite) by DeepMind

DeepMind has been in the news recently for the huge losses it has posted year-on-year. But let’s face it, the company is still clearly ahead in terms of its research in reinforcement learning. They have bet big on this field as the future of artificial intelligence.

So here comes their latest open source release – the bsuite. This project is a collection of experiments that aims to understand the core capabilities of a reinforcement learning agent.

I like this area of research because it is essentially trying to fulfill two objectives (as per their GitHub repository):

Collect informative and scalable problems that capture key issues in the design of efficient and general learning algorithms
Study the behavior of agents via their performance on these shared benchmarks

The GitHub repository contains a detailed explanation of how to use bsuite in your projects. You can install it using the below code:

pip install git+https://github.com/deepmind/bsuite.git

If you’re new to reinforcement learning, here are a couple of articles to get you started:

Click here to explore this GitHub data science project.

DistilBERT – A Lighter and Cheaper Version of Google’s BERT

You must have heard of BERT at this point. It is one of the most popular Natural Language Processing (NLP) frameworks and is quickly becoming widely adopted. BERT is based on the Transformer architecture.

But it comes with one caveat – it can be quite resource-intensive. So how can data scientists work on BERT on their own machines? Step up – DistilBERT!

DistilBERT, short for Distillated-BERT, comes from the team behind the popular PyTorch-Transformers framework. It is a small and cheap Transformer model built on the BERT architecture. According to the team, DistilBERT runs 60% faster while preserving over 95% of BERT’s performance.
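The "distillation" in the name refers to knowledge distillation, where a small student model is trained to imitate the output distribution of a large teacher. Below is a minimal sketch of the generic soft-target loss behind that idea. It is not DistilBERT's actual training code; the logits, the temperature value and the function names are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's softened outputs."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(loss_close < loss_far)  # the student that mimics the teacher is penalised less
```

Minimising this loss pushes the student toward the teacher's full probability distribution, not just its top prediction.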

This GitHub repository explains how DistilBERT works along with the Python code. You can learn more about PyTorch-Transformers and how to use it in Python here:

Introduction to PyTorch-Transformers: An Incredible Library for State-of-the-Art NLP (with Python code)

Click here to explore this GitHub data science project.

ShuffleNet Series – An Extremely Efficient Convolutional Neural Network for Mobile Devices

A computer vision project for you! ShuffleNet is an extremely computation-efficient convolutional neural network (CNN) architecture. It has been designed for mobile devices with very limited computing power.

This GitHub repository includes the below ShuffleNet models (yes, there are multiple):

ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
ShuffleNetV2: Practical Guidelines for Efficient CNN Architecture Design
ShuffleNetV2+: A strengthened version of ShuffleNetV2.
ShuffleNetV2.Large: A deeper version based on ShuffleNetV2.
OneShot: Single Path One-Shot Neural Architecture Search with Uniform Sampling
DetNAS: DetNAS: Backbone Search for Object Detection

So are you looking to understand CNNs? You know I have you covered:

A Comprehensive Tutorial to learn Convolutional Neural Networks from Scratch

Click here to explore this data science project.

RAdam – Improving the Variance of Learning Rates

RAdam was released less than two weeks ago and it has already accumulated 1200+ stars. That tells you a lot about how well this repository is doing!

The developers behind RAdam show in their paper that the convergence issue we face in deep learning techniques is due to the undesirably big variance of the adaptive learning rate in the early stages of model training.

RAdam is a new variant of Adam that rectifies the variance of the adaptive learning rate. This release brings a solid improvement over the vanilla Adam optimizer, which does suffer from the issue of variance.
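To give a feel for what "rectifying" means, here is a small sketch of the rectification factor as I understand it from the RAdam paper. Treat the formula and the threshold of 4 as my reading of the paper, not as the repository's exact code; the function name and the default `beta2` are assumptions.

```python
import math


def rectification(step, beta2=0.999):
    """Return the rectification factor r_t, or None while the variance is intractable.

    `step` counts optimizer updates starting from 1; `beta2` is Adam's
    second-moment decay rate.
    """
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        return None  # too few samples: skip the adaptive term for this step
    return math.sqrt(
        ((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


# First step at which the adaptive learning rate is switched on.
first_rectified_step = next(t for t in range(1, 100) if rectification(t) is not None)
early = rectification(first_rectified_step)
late = rectification(100_000)
print(first_rectified_step, early, late)
```

In the first few steps the factor is undefined, so the update falls back to plain momentum. Once it kicks in, it starts small and approaches 1, at which point the update behaves like regular Adam.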

Here is the performance of RAdam compared to Adam and SGD with different learning rates (X-axis is the number of epochs):

You should definitely check out the below guide on optimization in machine learning (including Adam):

Introduction to Gradient Descent Algorithm (along with variants) in Machine Learning

Check out this data science project here.

Programming Projects

ggtext – Improved Text Rendering for ggplot2

This one is for all the R users in our community. And especially all of you who work regularly with the awesome ggplot2 package (which is basically everyone).

The ggtext package enables us to produce rich-text rendering for the plots we generate. Here are a few things you can try out using ggtext:

A new theme element called element_markdown() renders the text as markdown or HTML
You can include images on the axis (as shown in the above picture)
Use geom_richtext() to produce markdown/HTML labels (as shown below)

The GitHub repository contains a few intuitive examples which you can replicate on your own machine.

ggtext is now available on CRAN, so install.packages("ggtext") works; to get the development version from GitHub instead, use this command:

devtools::install_github("clauswilke/ggtext")

Want to learn more about ggplot2 and how to work with interactive plots in R? Here you go:

Click here to explore this data science project.

End Notes

I love working on these monthly articles. The amount of research, and hence the number of breakthroughs, happening in data science is extraordinary. No matter which era or standard you compare it with, the rapid advancement is staggering.

Which data science project did you find the most interesting? Will you be trying anything out soon? Let me know in the comments section below and we’ll discuss ideas!

Januka

Great article. I am not in data science but I am fascinated by these techniques, in particular RAdam optimization. Can you show an example where waveform fitting (e.g. seismic waves) is optimized with a similar technique? Thanks!

Saikiran

Thanks Pranav. These projects are fabulous. One question on Hunga Bunga. For some of the classifiers like LR and Naive Bayes, we need to do scaling or normalization of features. So does this library handle them as well, without us doing it at our end?

A. Matheny

This article is written with a brilliance and enthusiasm about a topic I enjoy reading about and trying to understand, but I'm no data scientist. I'm a bit reluctant to download on my phone but I may look into the links on my laptop. Thanks for sharing.

Mitesh Sharma

Great informative article on Data Science. I am a student of Data Science and I have bought one of your courses on Data Science from Intershala.
