7 Open Source Data Science Projects you Should Add to your Resume

Pranav Dar Last Updated : 02 Jul, 2020

Overview

Open source data science projects add a lot of value to your resume and help you stand out in an interview
Here are 7 such open source data science projects you should work on this month

Introduction

I’m going to give you a tip I wish someone had given me when I started my data science career. When I was navigating the obstacle-filled journey through the backwaters of data science, I had quite a struggle before I landed my first role. I had all the qualifications (or so I thought) but something seemed to be off.

That gap between what I brought to the table and what the interviewer expected was data science project experience.

Data science projects add a lot of value to your resume, especially if you’re a beginner. Most newcomers will have certifications but adding open source data science projects will give you a significant advantage over the competition. And trust me, there are an astonishing number of open source data science projects for you.

Here, I’ve put together a list of the top open-source data science projects that were created or released in June. This is part of my monthly project series where I bring out the best data science projects open-sourced on GitHub.

If you want to check out the previous projects, I’ve put them together in the form of a free course. They’re structured by the domain (computer vision projects, NLP projects, etc.) so you can focus on the project you want. And if you’re new to GitHub, make sure you’re enrolled in this free introduction to Git and GitHub course.

Open Source Data Science Projects to Enhance your Resume

I have divided the projects into three categories based on their domain:

Machine Learning
Computer Vision
Other open-source data science projects, including an awesome dataset

Let’s look at each category individually.

Open Source Machine Learning Projects

This is where you’ll get the lay of the machine learning land. We’ll cover three useful open source machine learning projects here. You can pick a project based on your interests or try all of them. I have tried to keep them as diverse as possible, so you’ll see a project on machine learning papers and another on building machine learning pipelines.

If you’re looking for guidance or are new to this field, I’ll direct you to a few helpful learning resources:

Machine Learning Papers with Illustrations and Annotations

Reading machine learning research papers is quite a daunting prospect for most professionals, let alone beginners. Data scientists and machine learning researchers tend to write extremely technical papers that even experts have a hard time decoding. This is actually one of the biggest pain points in our field.

So any effort to break down the complexity is always welcome. This helpful project is a collection of data science and machine learning papers “with illustrations, annotations, and brief explanations of technical keywords, terms, and previous studies which makes it easier to read the paper and to get the main idea”.

This project was open sourced on GitHub just last week so it’s being updated regularly. Right now we can see a few papers there already so you can go through them to get an idea of how the annotations have been done. I especially love the YOLOv1 annotation:

Pretty cool! Go ahead and explore this plus the other papers. There’s a lot to learn!

NeoML – A Machine Learning Framework

This is quite an interesting project for anyone who has a bit of data science knowledge.

NeoML is a comprehensive machine learning framework that enables us to build, train, and deploy machine learning models. In short, we can build an end-to-end machine learning pipeline without the hassle of spending big money on out-of-the-box solutions.

Data scientists and data engineers can use it for computer vision and Natural Language Processing (NLP) tasks, such as image preprocessing, classification, document layout analysis, OCR, and data extraction from structured and unstructured documents.

Here are the key features of NeoML, taken from its GitHub repository:

Neural networks with support for over 100 layer types
Traditional machine learning: 20+ algorithms (classification, regression, clustering, etc.)
CPU and GPU support, fast inference
ONNX support
Languages: C++, Java, Objective-C
Cross-platform: the same code can be run on Windows, Linux, macOS, iOS, and Android

Here’s a beginner-friendly article on how to build machine learning pipelines:

Build your First Machine Learning Pipeline using scikit-learn

Google’s Caliban for Machine Learning

Here’s another project that any data scientist would love, especially if you’re inclined towards research. We often struggle to go from a test environment to a full-scale deployment – it’s not an easy step to take (we really should appreciate the role data engineers play).

Google, of course, has a potential solution for us in the form of Caliban. This is a tool that will help you launch and track your numerical experiments in an isolated, reproducible computing environment. Caliban was developed by machine learning researchers and engineers over at Google.

As they put it, Caliban “makes it easy to go from a simple prototype running on a workstation to thousands of experimental jobs running on Cloud”. Here are the key highlights you should be aware of:

Develop your experimental code locally and test it inside an isolated (Docker) environment
Easily sweep over experimental parameters
Submit your experiments as Cloud jobs, where they will run in the same isolated environment
Control and keep track of jobs
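To make the idea of sweeping over experimental parameters concrete, here is a minimal Python sketch. It is not Caliban's code or API; the `expand_sweep` function and the config keys are hypothetical names used only for illustration. It assumes a sweep is described as a dictionary in which any list-valued entry should be expanded into one experiment per value.

```python
from itertools import product


def expand_sweep(config):
    """Expand a config dict into one dict per experiment.

    Hypothetical helper, not part of Caliban: list values are treated as
    options to sweep over, scalar values are kept fixed in every experiment.
    """
    keys = list(config)
    # Wrap scalars in a one-element list so product() can handle every key.
    value_lists = [v if isinstance(v, list) else [v] for v in config.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


# Made-up example: two learning rates x two batch sizes, fixed epoch count.
experiments = expand_sweep(
    {"learning_rate": [0.1, 0.01], "batch_size": [32, 64], "epochs": 5}
)
for experiment in experiments:
    print(experiment)
```

Each resulting dictionary is one job's worth of settings, which is the kind of unit you would then run locally or submit to the cloud.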

Open Source Computer Vision Projects

I’m amazed by the progress we are seeing in computer vision (no pun intended!). It seems every month when I sit down to write this article, I come across more and more groundbreaking frameworks and new approaches that enhance the state-of-the-art in this field.

Organizations are scouring the globe for computer vision talent right now so it’s a great time to work on these projects and get into the field. If you haven’t yet started reading about computer vision, here are a few helpful resources:

Genetic Drawing

What if I gave you a target image and asked you to write a computer vision program that created the image from scratch? Yes, that’s the power of computer vision!

This really cool open source project enables us to imitate a drawing process when we’re provided with a target image. Here’s a small demo of what the process looks like:

I can’t wait to get my hands on this and start drawing up all sorts of stuff. You’ll need the below Python libraries to run this:

OpenCV 3.4.1
NumPy 1.16.2
matplotlib 3.0.3

The developer has also given us an example so you can execute that and watch the magic of computer vision unfold. I’d also suggest going through the below OpenCV articles if you haven’t worked with it before:

16 OpenCV Functions to Start your Computer Vision journey (with Python code)
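If you want a feel for how a program can "draw" towards a target image, here is a small sketch in plain Python. It is not the project's code: it swaps in simple hill climbing with random rectangular strokes, and the tiny 8x8 gradient target is made up for illustration. The idea is to propose a random stroke, keep it only if the canvas gets closer to the target, and repeat.

```python
import random


def squared_error(canvas, target):
    """Sum of squared pixel differences between two equally sized grids."""
    return sum(
        (c - t) ** 2
        for row_c, row_t in zip(canvas, target)
        for c, t in zip(row_c, row_t)
    )


def evolve_drawing(target, steps=500, seed=0):
    """Approximate `target` by accepting only strokes that reduce the error."""
    rng = random.Random(seed)
    height, width = len(target), len(target[0])
    canvas = [[0.0] * width for _ in range(height)]  # start from a black canvas
    best = squared_error(canvas, target)
    history = [best]
    for _ in range(steps):
        # Propose a random rectangular "stroke" with a random grey level.
        y0, x0 = rng.randrange(height), rng.randrange(width)
        y1, x1 = rng.randint(y0 + 1, height), rng.randint(x0 + 1, width)
        level = rng.random()
        candidate = [row[:] for row in canvas]
        for y in range(y0, y1):
            for x in range(x0, x1):
                candidate[y][x] = level
        # Keep the stroke only if it brings the canvas closer to the target.
        error = squared_error(candidate, target)
        if error < best:
            canvas, best = candidate, error
        history.append(best)
    return canvas, history


# Made-up target: an 8x8 diagonal gradient with values between 0 and 1.
target = [[(x + y) / 14 for x in range(8)] for y in range(8)]
canvas, history = evolve_drawing(target)
print(f"error before: {history[0]:.3f}, after: {history[-1]:.3f}")
```

The real project works with brush strokes on actual images, so treat this only as a way to build intuition before reading its code.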

PULSE – Face Depixelizer

This open source project caters to slightly more advanced data scientists. To understand what this project is about, we need to grasp the concept of single-image super-resolution. In simple terms, the aim here is to construct a high-resolution image from a corresponding low-resolution input.

Sounds like a classic computer vision project!

PULSE is a novel solution to this problem. Short for Photo Upsampling via Latent Space Exploration, PULSE generates realistic, high-resolution images from low-resolution inputs. It does this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training.

Here’s an example of how PULSE works:

I’d encourage you to first read the research paper before looking at the code. This will give you a better idea of how PULSE works underneath so you can tackle the code with much more clarity.
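Before diving into the paper, it can help to see why single-image super-resolution is hard. The sketch below is not PULSE itself. It assumes average pooling as the downscaling step, and the function names and tiny number grids are made up for illustration. It shows that very different high-resolution images can shrink to exactly the same low-resolution image, which is why a method has to search among many valid candidates.

```python
def downscale(image, factor):
    """Average-pool a 2D grid of numbers by `factor` in each direction."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(
                image[y + dy][x + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / factor ** 2
            for x in range(0, width, factor)
        ]
        for y in range(0, height, factor)
    ]


def consistency_loss(high_res, low_res, factor):
    """Mean squared difference between the downscaled candidate and the input."""
    down = downscale(high_res, factor)
    diffs = [
        (d - l) ** 2
        for row_d, row_l in zip(down, low_res)
        for d, l in zip(row_d, row_l)
    ]
    return sum(diffs) / len(diffs)


# Made-up 2x2 low-resolution input and three 4x4 candidates.
low_res = [[1.0, 4.0], [1.0, 4.0]]
smooth = [[1, 1, 4, 4], [1, 1, 4, 4], [1, 1, 4, 4], [1, 1, 4, 4]]
textured = [[0, 2, 4, 4], [2, 0, 4, 4], [1, 1, 8, 0], [1, 1, 0, 8]]
blank = [[0] * 4 for _ in range(4)]

for name, candidate in [("smooth", smooth), ("textured", textured), ("blank", blank)]:
    print(name, consistency_loss(candidate, low_res, 2))
```

Both the smooth and the textured candidates have zero loss, so the low-resolution input alone cannot tell them apart. Roughly speaking, PULSE looks for candidates that pass this kind of downscaling check while also looking like realistic images.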

Other Open Source Data Science Projects

Here are a couple of open-source data science projects that didn’t quite fit the above two categories. These are actually two contrasting projects – one caters to beginners in data science while the other deals with the world of reinforcement learning.

Pick whichever one works best for you and start exploring it.

PalmerPenguins – An Awesome Dataset for Exploration and Visualization

I’m sure most of you have worked with the Iris dataset. In fact, it might even have been the very first dataset you used to understand the concept of classification in machine learning. I love how simple the dataset is to understand and explore.

But working with the same dataset over and over can become a bit dull, especially when you’re learning the ins and outs of machine learning.

This is where the PalmerPenguins dataset comes in. Open sourced last month, this dataset positions itself as an alternative to Iris and aims to provide a great dataset for data exploration & visualization, especially for beginners. Here’s a taste of the visualizations you can come up with:

The link I’ve mentioned above contains examples of how to start exploring this data. They’ve even provided details about the different variables but wouldn’t you want to explore that yourself? 🙂

You can install PalmerPenguins in R from GitHub using the code below:

# Uncomment the next line if the remotes package is not installed yet
# install.packages("remotes")
remotes::install_github("allisonhorst/palmerpenguins")

I also recommend checking out the below popular articles on data exploration and visualization:

Slime Volleyball Gym Environment

Ah, here’s an open source project for all you reinforcement learning folks. SlimeVolleyGym is a simple gym environment for testing single and multi-agent reinforcement learning algorithms. This has been created and open-sourced by hardmaru, a legend in the machine learning space.

Here’s how the game works according to him (he created the game himself in JavaScript):

The game is very simple: the agent’s goal is to get the ball to land on the ground of its opponent’s side, causing its opponent to lose a life. Each agent starts off with five lives. The episode ends when either agent loses all five lives, or after 3000 timesteps have passed. An agent receives a reward of +1 when its opponent loses or -1 when it loses a life.
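The scoring rules above are easy to misread, so here is a tiny Python sketch of them. This is not the slimevolleygym environment or its API. The `play_episode` function is a hypothetical helper that takes a made-up sequence of per-timestep outcomes (+1 when the opponent loses a life, -1 when the agent loses one, 0 otherwise) and applies the lives, reward and 3000-timestep rules described in the quote.

```python
def play_episode(outcomes, lives=5, max_steps=3000):
    """Apply the episode rules to a sequence of per-timestep outcomes.

    Hypothetical helper for illustration only. Returns a tuple of
    (total reward, timesteps played, agent lives left, opponent lives left).
    """
    agent_lives = opponent_lives = lives
    total_reward = 0
    steps = 0
    for outcome in outcomes:
        # The episode ends on the time limit or when either side is out of lives.
        if steps >= max_steps or agent_lives == 0 or opponent_lives == 0:
            break
        steps += 1
        if outcome == 1:  # ball landed on the opponent's side
            opponent_lives -= 1
            total_reward += 1
        elif outcome == -1:  # ball landed on the agent's side
            agent_lives -= 1
            total_reward -= 1
    return total_reward, steps, agent_lives, opponent_lives


# A perfect agent wins five points in a row and the episode stops early.
print(play_episode([1] * 10))
# Nobody ever scores, so the episode runs into the 3000-timestep limit.
print(play_episode([0] * 4000))
```

So the total reward for an episode always falls between -5 and +5, and a drawn-out rally with no points simply times out.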

You can install slimevolleygym directly from pip:

pip install slimevolleygym

Here are a couple of excellent tutorials by our resident reinforcement learning expert Ankit Choudhary:

End Notes

Phew – that’s a lot of projects. My aim, as always, was to keep the projects as diverse as possible so you can pick the ones that fit into your data science journey. If you’re a beginner, I would suggest starting with the PalmerPenguins dataset as most folks aren’t even aware of it right now. A great chance to get a head start.

I would love to hear your thoughts on which open source project you found the most useful. Or let me know if you want me to feature any other data science projects here or in next month’s edition.

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Rahul

I would like to enroll in your "Top Data Science Projects for Analysts and Data Scientists" course but I did not find any option to enroll.

Hi Rahul - This was a temporary error - can you please check again and enroll?

Alisher Abdulkhaev

Thank you Pranav for featuring us in your post :) Looking forward for the contributions!

Dan Elbert

Hi Pranav Very interesting article. How would a beginner use these projects to showcase his skills and his own work? Are those projects that accept contributions? Dan
