6 Open Source Data Science Projects to Make you Industry Ready!
- The ideal time to work on your data science portfolio with these open source projects
- From datasets on COVID-19 to a collection of AutoML libraries by Google Brain, there are plenty of data science projects to learn from
We are living in the midst of an unprecedented lockdown as governments around the world scramble to get a grip on the prevalent situation. But it’s not all doom and gloom, especially if you’re looking to strengthen your data science portfolio and emerge with a solid, industry-relevant resume after the crisis abates!
This is an opportunity to really dig in and work on data science projects. A lot of folks suddenly have time on their hands which they did not see coming. Why not utilize that and work on grooming yourself for your dream data science role?
And there is no shortage of open source data science projects and ideas in the community. From computer vision and Natural Language Processing (NLP) projects to Python and data engineering ideas, there is a project out there for everyone. The only question is – where should you start?
And that’s the question I have tried to answer in this open source data science project series. This is the 27th edition of the series and I feel this has never been more relevant than it is today. So strap in, get your coding environment ready, and start working on your data science skills!
You can check out the entire archive of open source data science projects here. And if you’re a beginner in the world of machine learning, Analytics Vidhya has launched an awesome program to get you started. The Machine Learning Starter Program will teach you the basics of machine learning in a hands-on practical manner (and you get 14 days FREE access!).
6 Open-Source Data Science Projects to Enhance your Skills
Where else could we possibly begin? The coronavirus is dominating the world and no matter which site I turn to, COVID-19 is writ large in the headlines.
Thankfully, a lot of research labs and organizations globally have been collecting data around this and have open-sourced it for us. So why not use our data science knowledge and skills to work on a social welfare problem?
The GitHub repository I’ve linked here includes time series data tracking the number of people affected by the coronavirus globally, including:
- confirmed cases of the coronavirus
- the number of people who have died due to the coronavirus, and
- the number of people who have recovered from the deadly infection
The authors of this project update the dataset daily in CSV format, so you can download it and start analyzing today!
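Here is a minimal sketch of a first pass over this kind of data: turning cumulative counts into daily new cases and smoothing them with a moving average. The sample rows, the `region` column, and the helper names are all my own assumptions for illustration; the actual files in the repository may be laid out differently, so adjust the parsing to match what you download.

```python
import csv
import io

# Hypothetical sample in a wide layout (one row per region, one column per date).
# The real repository's file names and column names may differ.
SAMPLE_CSV = """region,2020-03-01,2020-03-02,2020-03-03,2020-03-04
Region A,10,14,20,29
Region B,3,3,5,9
"""


def load_series(csv_text):
    """Parse the wide CSV into {region: [cumulative counts in date order]}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    date_columns = [name for name in reader.fieldnames if name != "region"]
    return {row["region"]: [int(row[d]) for d in date_columns] for row in reader}


def daily_new_cases(cumulative):
    """Convert cumulative counts into day-over-day increases."""
    return [today - yesterday for yesterday, today in zip(cumulative, cumulative[1:])]


def moving_average(values, window):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


series = load_series(SAMPLE_CSV)
new_cases = {region: daily_new_cases(counts) for region, counts in series.items()}
smoothed = {region: moving_average(values, window=2) for region, values in new_cases.items()}

print(new_cases)
print(smoothed)
```

Differencing the cumulative series first matters because most forecasting methods work better on the daily increments than on an ever-growing total.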
You can also check out this GitHub repository containing datasets for the coronavirus cases exclusively in the United States (broken down by state and county).
Here are a few resources to help you understand how time series forecasting works:
- 7 Methods to Perform Time Series Forecasting
- A Gentle Introduction to Handling a Non-Stationary Time Series in Python
- Time Series Forecasting using Python (FREE Course)
- All tutorials on time series
The Natural Language Processing (NLP) field has advanced by leaps and bounds in the last three years. Starting with the Transformer architecture in 2017, we have seen a slew of breakthroughs and ground-breaking NLP models and libraries, including Google’s BERT and OpenAI’s GPT-2.
This GitHub repository is a collection of key NLP papers summarized for a broader set of data science professionals. Here is a key list of topics covered in this repository:
- Dialogue and Interactive Systems
- Ethics and NLP
- Text Generation
- Information Extraction
- Information Retrieval and Text Mining
- Interpretability and Analysis of Models for NLP
- Language Grounding to Vision, Robotics and Beyond
- Language Modeling
- Machine Learning for NLP
- Machine Translation
- Multi-Task Learning
- NLP Applications
- Question Answering
- Resources and Evaluation
- Sentiment Analysis, Stylistic Analysis, and Argument Mining
- Speech and Multimodality
- Text Summarization
- Syntax: Tagging, Chunking, and Parsing
There are plenty more NLP topics inside. This is as good a project as any to pass the time during the lockdown! Pick an NLP paper and start parsing through it. That is a LOT of knowledge available under one umbrella.
If you’re new to NLP, I suggest going through the below tutorials and resources:
- All NLP Tutorials on Analytics Vidhya
- Introduction to Natural Language Processing (NLP) (Free Course!)
Automated Machine Learning, or AutoML, focuses on automating certain tasks of the typical machine learning pipeline. What started off as a side project a few years ago to save time is now a full-blown area of research. There are tons of AutoML tools in the market that can automate the entire ML pipeline for organizations.
AutoML is especially gaining traction for businesses that don’t have a dedicated data science team or can’t afford to hire one from scratch. Almost every tech giant has an AutoML solution in the market, from Google’s Cloud AutoML to Baidu’s EZDL.
This data science project by the Google Brain team contains a list of AutoML-related models and libraries. The GitHub repository has amassed over 1,600 stars since it was open-sourced 6 days ago. Amazing!
Here are a few key articles and tutorials around AutoML you should check out:
- A Hands-On Guide to Automated Feature Engineering using Featuretools in Python
- 19 Data Science and Machine Learning Tools for People who Don’t Know Programming
Here’s another awesome open source project by the Google Research team. This pertains to the Natural Language Processing (NLP) domain and the Transformer architecture I mentioned earlier.
Here’s how the Google Research team defines ELECTRA:
“ELECTRA is a new method for self-supervised language representation learning. It can be used to pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish “real” input tokens vs “fake” input tokens generated by another neural network.”
What impressed me about ELECTRA is the accuracy we can achieve even on a single GPU. ELECTRA goes to a different level entirely on large-scale datasets and achieves state-of-the-art performance on the SQuAD 2.0 benchmark.
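To make the “real” vs “fake” token idea concrete, here is a toy sketch of how training labels for such a discriminator could be built. This is not the ELECTRA codebase: the `corrupt_tokens` function is a hypothetical stand-in that swaps tokens at random, whereas the quote above describes the replacements as coming from another neural network. All names and the tiny vocabulary are assumptions.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob, rng):
    """Hypothetical stand-in for the generator: randomly swap some tokens."""
    corrupted = []
    for token in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))  # sampled replacement
        else:
            corrupted.append(token)  # token left untouched
    return corrupted


def discriminator_labels(original, corrupted):
    """Label 1 where the token was changed ("fake"), 0 where it matches ("real")."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "chef", "cooked", "the", "meal"]
hand_corrupted = ["the", "chef", "ate", "the", "meal"]
labels = discriminator_labels(original, hand_corrupted)
print(labels)

# Random corruption with a fixed seed, just to show the two pieces together.
rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]
sampled = corrupt_tokens(original, vocab, replace_prob=0.3, rng=rng)
print(sampled, discriminator_labels(original, sampled))
```

Note that a sampled replacement which happens to equal the original token is labeled as real here, since the comparison only looks at whether the token actually changed.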
You can read about ELECTRA in-depth in Google’s research paper. The team has released three pretrained models for now.
You need to have the below requirements installed on your machine before you begin:
- Python 3
- TensorFlow 1.15
- scikit-learn and SciPy
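If you want a quick way to see what is already installed, a small check like the one below can help. It is only a sketch: the distribution names (`tensorflow`, `scikit-learn`, `scipy`) are the usual package names on PyPI, and the `installed_versions` helper is something I made up for this example, not part of the ELECTRA repository.

```python
import sys
from importlib import metadata

# Assumed PyPI distribution names for the requirements listed above.
REQUIRED = ["tensorflow", "scikit-learn", "scipy"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
print("Python major version:", sys.version_info.major)
for name, version in report.items():
    print(name, version or "not installed")

# The list above asks for TensorFlow 1.15, so flag anything else.
tf_version = report["tensorflow"]
if tf_version is not None and not tf_version.startswith("1.15"):
    print("Note: found TensorFlow", tf_version, "but 1.15 is listed as required")
```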
You can go through the below tutorials to understand what pretrained models and transfer learning are:
- 6 Pretrained Models to Master Text Classification
- 8 Pretrained Models to Learn Natural Language Processing (NLP)
- Transfer Learning and the Art of using Pretrained Models
GANs, or Generative Adversarial Networks, took the data science world by storm when Ian Goodfellow introduced them in 2014. These GANs have since morphed into useful (and often entertaining) applications, such as generating art and creating movies.
But a significant issue with training a GAN model is the sheer computational power required. This is where GAN Compression comes in.
GAN Compression is “a general-purpose method for compressing conditional GANs”. It reduces the computation required by popular GAN-based models such as pix2pix and CycleGAN.
You can learn more about GANs, how they work, and their real-world applications here:
I’m thrilled to bring you another state-of-the-art GAN architecture in this article. StyleGAN was a hit in the computer vision community, and StyleGAN2 pushes the realism of generated images even further.
“StyleGAN2 is a state-of-the-art network in generating realistic images. Besides, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors.”
That is the power of StyleGAN2. Slightly freakish but incredibly powerful. You can read about StyleGAN2 in the official research paper here.
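The “varying latent factors” idea from the quote can be sketched in a few lines: take a latent vector and nudge it along a direction. Everything below is an assumption for illustration. The latent vector is random, the direction is a made-up vector rather than one learned from a real model, and there is no actual StyleGAN2 generator call; in practice you would feed each edited vector to the generator to render an image.

```python
import math
import random


def normalize(vector):
    """Scale a vector to unit length so `strength` has a consistent meaning."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def move_along_direction(latent, direction, strength):
    """Shift a latent vector along a (unit-length) direction by `strength`."""
    unit = normalize(direction)
    return [z + strength * d for z, d in zip(latent, unit)]


rng = random.Random(42)
latent_dim = 8  # tiny for readability; real latent spaces are much larger

latent = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
direction = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # hypothetical direction

# Sweep the strength to get a sequence of edited latents.
for strength in (-2.0, 0.0, 2.0):
    edited = move_along_direction(latent, direction, strength)
    print(strength, [round(v, 2) for v in edited])
```

If the latent space is well disentangled, moving along a single direction should change one attribute of the image while leaving the rest mostly intact.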
This is the ideal time to pick up a data science project and start working on it. We don’t know when this crisis will end but we can utilize this time to invest in our learning and our future.
Which project are you planning to start next? Are there other open source data science projects you want to share with the community? Let me know in the comments section below and I’ll do my best to get the word out!