6 Open Source Data Science Projects to Make you Industry Ready!

Last Updated : 14 Apr, 2020

Overview

This is the ideal time to build your data science portfolio with these open source projects
From datasets on COVID-19 to a collection of AutoML libraries by Google Brain, there are plenty of data science projects to learn from

Introduction

We are living in the midst of an unprecedented lockdown as governments around the world scramble to get a grip on the prevailing situation. But it’s not all doom and gloom, especially if you’re looking to strengthen your data science portfolio and emerge with a solid, industry-relevant resume after the crisis abates!

This is an opportunity to really dig in and work on data science projects. A lot of folks suddenly have time on their hands which they did not see coming. Why not utilize that and work on grooming yourself for your dream data science role?

And there is no shortage of open source data science projects and ideas in the community. From computer vision and Natural Language Processing (NLP) projects to Python and data engineering ideas, there is a project out there for everyone. The only question is – where should you start?

And that’s the question I have tried to answer in this open source data science project series. This is the 27th edition of the series and I feel this has never been more relevant than it is today. So strap in, get your coding environment ready, and start working on your data science skills!

You can check out the entire archive of open source data science projects here. And if you’re a beginner in the world of machine learning, Analytics Vidhya has launched an awesome program to get you started. The Machine Learning Starter Program will teach you the basics of machine learning in a hands-on practical manner (and you get 14 days FREE access!).

6 Open-Source Data Science Projects to Enhance your Skills

Coronavirus Time Series Data

Where else could we possibly begin? The coronavirus is dominating the world and no matter which site I turn to, COVID-19 is writ large in the headlines.

Thankfully, a lot of research labs and organizations globally have been collecting data around this and have open-sourced it for us. So why not use our data science knowledge and skills to work on a social welfare problem?

The GitHub repository I’ve linked here includes time series data tracking the number of people affected by the coronavirus globally, including:

confirmed cases of the coronavirus
the number of people who have died due to the coronavirus, and
the number of people who have recovered from the deadly infection

The authors of this project update the dataset daily in CSV format, so you can download it and start analyzing today!
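As a rough starting point, here is a minimal sketch of how cumulative counts like these can be turned into daily new cases. It uses only the Python standard library and a tiny made-up sample. The column names (Date, Country, Confirmed, Recovered, Deaths), the country names, and the numbers are assumptions for illustration, so check the header of the CSV you actually download.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in a "one row per date and country" layout.
# The column names and numbers are made up for illustration; check the
# header of the CSV you actually download.
SAMPLE_CSV = """Date,Country,Confirmed,Recovered,Deaths
2020-04-01,Atlantis,100,10,2
2020-04-02,Atlantis,130,15,3
2020-04-03,Atlantis,170,22,5
2020-04-01,Elbonia,40,1,0
2020-04-02,Elbonia,45,3,1
"""


def daily_new_cases(csv_text, column="Confirmed"):
    """Turn cumulative counts into day-over-day increases per country."""
    cumulative = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        cumulative[row["Country"]].append((row["Date"], int(row[column])))

    increases = {}
    for country, series in cumulative.items():
        series.sort()  # ISO dates sort correctly as plain strings
        increases[country] = [
            (date, count - previous_count)
            for (_, previous_count), (date, count) in zip(series, series[1:])
        ]
    return increases


new_cases = daily_new_cases(SAMPLE_CSV)
print(new_cases["Atlantis"])
```

To work with a real download, you could read the file's text and pass it in place of SAMPLE_CSV, adjusting the column names to match.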

You can also check out this GitHub repository containing datasets for the coronavirus cases exclusively in the United States (broken down by state and county).

Here are a few resources to help you understand how time series forecasting works:

NLP Paper Summaries

The Natural Language Processing (NLP) field has advanced by leaps and bounds in the last 3 years. Since the Transformer architecture arrived in 2017, we have seen a slew of breakthroughs and ground-breaking NLP models, including Google’s BERT and OpenAI’s GPT-2.

This GitHub repository is a collection of key NLP papers summarized for a broader set of data science professionals. Here is a key list of topics covered in this repository:

Dialogue and Interactive Systems
Ethics and NLP
Text Generation
Information Extraction
Information Retrieval and Text Mining
Interpretability and Analysis of Models for NLP
Language Grounding to Vision, Robotics and Beyond
Language Modeling
Machine Learning for NLP
Machine Translation
Multi-Task Learning
NLP Applications
Question Answering
Resources and Evaluation
Semantics
Sentiment Analysis, Stylistic Analysis, and Argument Mining
Speech and Multimodality
Text Summarization
Syntax: Tagging, Chunking, and Parsing

That is a LOT of knowledge available under one umbrella, and there are plenty more NLP topics inside. This is as good a project as any to pass the time during the lockdown! Pick an NLP paper and start working through it.

If you’re new to NLP, I suggest going through the below tutorials and resources:

Google Brain AutoML

Automated Machine Learning, or AutoML, automates certain tasks of the typical machine learning pipeline. What started off as a side project a few years ago to save time is now a full-blown area of research. There are tons of AutoML tools on the market that can automate the entire ML pipeline for organizations.

AutoML is gaining traction especially with businesses that don’t have a dedicated data science team or can’t afford to build one from scratch. Almost every tech giant has an AutoML solution on the market, from Google’s Cloud AutoML to Baidu’s EZDL.

This data science project by the Google Brain team contains a list of AutoML-related models and libraries. The GitHub repository has amassed over 1,600 stars since it was open-sourced 6 days ago. Amazing!

Here are a few key articles and tutorials around AutoML you should check out:

Google’s ELECTRA

Here’s another awesome open source project by the Google Research team. This pertains to the Natural Language Processing (NLP) domain and the Transformer architecture I mentioned earlier.

Here’s how the Google Research team defines ELECTRA:

“ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish “real” input tokens vs “fake” input tokens generated by another neural network.”
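To make the “real” vs “fake” idea concrete, here is a toy sketch of how the training labels for such a discriminator could be built. This is not Google’s implementation: the function names and the tiny vocabulary are hypothetical, and uniform random replacement stands in for the generator network, which in ELECTRA is a small masked language model.

```python
import random


def corrupt_tokens(tokens, vocabulary, replace_prob, rng):
    """Stand-in for the generator: randomly swap some tokens.

    In ELECTRA the replacements come from a small masked language model;
    uniform random sampling is used here only to keep the sketch short.
    """
    return [
        rng.choice(vocabulary) if rng.random() < replace_prob else token
        for token in tokens
    ]


def discriminator_labels(original, corrupted):
    """Label each position: 1 = replaced ("fake"), 0 = original ("real").

    A sampled token that happens to equal the original counts as real.
    """
    return [int(o != c) for o, c in zip(original, corrupted)]


rng = random.Random(0)
vocabulary = ["the", "chef", "cooked", "ate", "meal", "a"]
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = corrupt_tokens(original, vocabulary, replace_prob=0.3, rng=rng)
labels = discriminator_labels(original, corrupted)

# Show which positions the discriminator should flag as replaced.
for token, label in zip(corrupted, labels):
    print(f"{token:>8}  {'fake' if label else 'real'}")
```

The discriminator would then be trained to predict these per-token labels from the corrupted sequence alone, so every position contributes to the training signal.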

What impressed me about ELECTRA is the accuracy it can achieve even when trained on a single GPU. At large scale, ELECTRA goes to a different level entirely and achieves state-of-the-art performance on the SQuAD 2.0 benchmark.

You can read about ELECTRA in-depth in Google’s research paper. The team has released three pretrained models for now: ELECTRA-Small, ELECTRA-Base, and ELECTRA-Large.

You need to have the below requirements installed on your machine before you begin:

Python 3
TensorFlow 1.15
NumPy
scikit-learn and SciPy
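A quick way to see whether your environment has these packages is a small check like the one below. This is only a sketch: the helper name is hypothetical, it confirms that the packages can be found but does not verify that TensorFlow is specifically version 1.15, and it assumes the usual import names (scikit-learn is imported as sklearn).

```python
import importlib.util
import sys

# Import names for the packages listed above. Note that scikit-learn is
# imported as "sklearn".
REQUIRED_MODULES = ["tensorflow", "numpy", "sklearn", "scipy"]


def missing_modules(module_names):
    """Return the names that cannot be found in the current environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if sys.version_info.major < 3:
    print("Python 3 is required")

missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable")
```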

You can go through the below tutorials to understand what pretrained models and transfer learning are:

GAN Compression

GANs, or Generative Adversarial Networks, took the data science world by storm when Ian Goodfellow introduced them in 2014. These GANs have since morphed into useful (and often entertaining) applications, such as generating art and creating movies.

But a significant issue with training a GAN model is the sheer computational power required. This is where GAN Compression comes in.

GAN Compression is “a general-purpose method for compressing conditional GANs”. It reduces the computation of popular GAN-based models, such as pix2pix, CycleGAN, etc. Just check out this awesome example:

You can learn more about GANs, how they work, and their real-world applications here:

StyleGAN2 – A New State-of-the-Art GAN!

I’m thrilled to feature another state-of-the-art GAN architecture in this article. StyleGAN was a hit in the computer vision community, and StyleGAN2 pushes the realism of generated images even further.

“StyleGAN2 is a state-of-the-art network in generating realistic images. Besides, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors.”

That is the power of StyleGAN2. Slightly freakish but incredibly powerful. You can read about StyleGAN2 in the official research paper here.
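The idea of “varying latent factors” can be illustrated with a toy sketch: take a latent vector and shift it along one direction by a chosen strength. Everything here is hypothetical, including the 4-dimensional latent, the “smile” direction, and the function name. A real setup would feed each edited latent to the StyleGAN2 generator to render an image, which is omitted here.

```python
def move_along_direction(latent, direction, strength):
    """Return latent + strength * direction, element by element."""
    if len(latent) != len(direction):
        raise ValueError("latent and direction must have the same length")
    return [w + strength * d for w, d in zip(latent, direction)]


# Toy 4-dimensional latent code; StyleGAN2 latents are far larger.
latent = [0.5, -1.0, 0.25, 2.0]

# Hypothetical direction that changes a single factor, e.g. "smile".
smile_direction = [0.0, 1.0, 0.0, 0.0]

# Sweep the strength to get a sequence of edited latent codes.
for strength in (-2.0, 0.0, 2.0):
    edited = move_along_direction(latent, smile_direction, strength)
    print(strength, edited)
```

When directions are disentangled, a sweep like this changes one attribute of the generated image while leaving the others largely untouched.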

End Notes

This is the ideal time to pick up a data science project and start working on it. We don’t know when this crisis will end but we can utilize this time to invest in our learning and our future.

Which project are you planning to start next? Are there other open source data science projects you want to share with the community? Let me know in the comments section below and I’ll do my best to get the word out!
