- Open source data science projects add a lot of value to your resume and help you stand out in an interview
- Here are 7 such open source data science projects you should work on this month
I’m going to give you a tip I wish someone had given me when I started my data science career. When I was navigating the obstacle-filled journey through the backwaters of data science, I had quite a struggle before I landed my first role. I had all the qualifications (or so I thought) but something seemed to be off.
That gap between what I brought to the table and what the interviewer expected was data science project experience.
Data science projects add a lot of value to your resume, especially if you’re a beginner. Most newcomers will have certifications but adding open source data science projects will give you a significant advantage over the competition. And trust me, there are an astonishing number of open source data science projects for you.
Here, I’ve put together a list of the top open-source data science projects that were created or released in June. This is part of my monthly project series where I bring out the best data science projects open-sourced on GitHub.
If you want to check out the previous projects, I’ve put them together in the form of a free course. They’re structured by the domain (computer vision projects, NLP projects, etc.) so you can focus on the project you want. And if you’re new to GitHub, make sure you’re enrolled in this free introduction to Git and GitHub course.
Open Source Data Science Projects to Enhance your Resume
I have divided the projects into three categories based on their domain:
- Machine Learning
- Computer Vision
- Other open-source data science projects, including an awesome dataset
Let’s look at each category individually.
Open Source Machine Learning Projects
This is where you’ll get the lay of the machine learning land. We’ll cover three useful open source projects here related to machine learning. You can pick a project based on your interests or try all of them. I have tried to keep them as diverse as possible so you’ll see a project on machine learning papers and another on building machine learning pipelines.
If you’re looking for guidance or are new to this field, I’ll direct you to a few helpful learning resources:
Reading machine learning research papers is quite a daunting prospect for most professionals, let alone beginners. Data scientists and machine learning researchers tend to write extremely technical papers that even experts have a hard time decoding. This is actually one of the biggest pain points in our field.
So any effort to break down the complexity is always welcome. This helpful project is a collection of data science and machine learning papers “with illustrations, annotations, and brief explanations of technical keywords, terms, and previous studies which makes it easier to read the paper and to get the main idea”.
This project was open sourced on GitHub just last week so it’s being updated regularly. Right now we can see a few papers there already so you can go through them to get an idea of how the annotations have been done. I especially love the YOLOv1 annotation:
Pretty cool! Go ahead and explore this plus the other papers. There’s a lot to learn!
This is quite an interesting project for anyone who has a bit of data science knowledge.
NeoML is a comprehensive machine learning framework that enables us to build, train, and deploy machine learning models. In short, we can build an end-to-end machine learning pipeline without the hassle of spending big money on out-of-the-box solutions.
Data scientists and data engineers can use it for computer vision and Natural Language Processing (NLP) tasks, such as image preprocessing, classification, document layout analysis, OCR, and data extraction from structured and unstructured documents.
Here are the key features of NeoML I’ve taken from their GitHub repository:
- Neural networks with support for over 100 layer types
- Traditional machine learning: 20+ algorithms (classification, regression, clustering, etc.)
- CPU and GPU support, fast inference
- ONNX support
- Languages: C++, Java, Objective-C
- Cross-platform: the same code can be run on Windows, Linux, macOS, iOS, and Android
Here’s a beginner-friendly article on how to build machine learning pipelines:
Here’s another project that any data scientist would love, especially if you’re inclined towards research. We often struggle to go from a test environment to a full-scale deployment – it’s not an easy step to take (we really should appreciate the role data engineers play).
Google, of course, has a potential solution for us in the form of Caliban. This is a tool that will help you launch and track your numerical experiments in an isolated, reproducible computing environment. Caliban was developed by machine learning researchers and engineers over at Google.
As they put it, Caliban “makes it easy to go from a simple prototype running on a workstation to thousands of experimental jobs running on Cloud”. Here are the key highlights you should be aware of:
- Develop your experimental code locally and test it inside an isolated (Docker) environment
- Easily sweep over experimental parameters
- Submit your experiments as Cloud jobs, where they will run in the same isolated environment
- Control and keep track of jobs
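To make the idea of sweeping over experimental parameters concrete, here is a minimal Python sketch that expands a grid of options into one configuration per job. This is not Caliban's API: the helper name `expand_sweep` and the parameter names are hypothetical, chosen purely for illustration.

```python
from itertools import product


def expand_sweep(grid):
    """Expand a dict of parameter options into one config dict per combination.

    Values that are lists are swept over; any other value is held fixed.
    Hypothetical helper for illustration only, not part of Caliban.
    """
    keys = list(grid)
    # Wrap fixed values in a one-element list so product() treats them uniformly.
    options = [v if isinstance(v, list) else [v] for v in grid.values()]
    return [dict(zip(keys, combo)) for combo in product(*options)]


if __name__ == "__main__":
    # Hypothetical experiment parameters: 2 learning rates x 3 batch sizes.
    grid = {"learning_rate": [0.01, 0.001], "batch_size": [32, 64, 128], "epochs": 10}
    for config in expand_sweep(grid):
        print(config)
```

Each dictionary in the result stands for one experiment run. The number of runs grows multiplicatively with every swept parameter, which is one reason handing the jobs off to the cloud becomes attractive.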
Open Source Computer Vision Projects
I’m amazed by the progress we are seeing in computer vision (no pun intended!). It seems every month when I sit down to write this article, I come across more and more groundbreaking frameworks and new approaches that enhance the state-of-the-art in this field.
Organizations are scouring the globe for computer vision talent right now so it’s a great time to work on these projects and get into the field. If you haven’t yet started reading about computer vision, here are a few helpful resources:
What if I gave you a target image and asked you to write a computer vision program that created the image from scratch? Yes, that’s the power of computer vision!
This really cool open source project enables us to imitate a drawing process when we’re provided with a target image. Here’s a small demo of what the process looks like:
I can’t wait to get my hands on this and start drawing up all sorts of stuff. You’ll need the below Python libraries to run this:
- OpenCV 3.4.1
- NumPy 1.16.2
- matplotlib 3.0.3
The developer has also given us an example so you can execute that and watch the magic of computer vision unfold. I’d also suggest going through the below OpenCV articles if you haven’t worked with it before:
This open source project caters to slightly more advanced data scientists. To understand what this project is about, we need to grasp the concept of single-image super-resolution. In simple terms, the aim here is to construct a high-resolution image from a corresponding low-resolution input.
Sounds like a classic computer vision project!
PULSE is a novel solution to this problem statement. Short for Photo Upsampling via Latent Space Exploration, PULSE generates ultra-realistic images at very high resolutions. It does this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training.
Here’s an example of how PULSE works:
I’d encourage you to first read the research paper before looking at the code. This will give you a better idea of how PULSE works underneath so you can tackle the code with much more clarity.
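If you want some intuition before opening the paper, here is a rough NumPy sketch of the consistency check at the heart of super-resolution: a candidate high-resolution image is plausible only if it downscales back to the low-resolution input. This is not the PULSE code. The function names are hypothetical, average pooling is an assumed downscaling operator, and the actual method searches the latent space of a generative model rather than scoring random candidates.

```python
import numpy as np


def downscale(image, factor):
    """Downscale a 2-D grayscale image by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape
    if h % factor or w % factor:
        raise ValueError("image dimensions must be divisible by factor")
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def downscaling_loss(candidate_hr, target_lr, factor):
    """Mean squared error between the downscaled candidate and the low-res target."""
    return float(np.mean((downscale(candidate_hr, factor) - target_lr) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_hr = rng.random((8, 8))       # stand-in for a real high-res image
    target_lr = downscale(true_hr, 4)  # the 2x2 low-res input we are given

    # The true image downscales to the target exactly, so its loss is 0.
    print(downscaling_loss(true_hr, target_lr, 4))
    # An unrelated candidate will generally score worse.
    print(downscaling_loss(rng.random((8, 8)), target_lr, 4))
```

Many different high-resolution images can share the same low-resolution version, so a loss like this narrows down the candidates without picking a single answer. That is why a strong prior over realistic images matters so much in this problem.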
Other Open Source Data Science Projects
Here are a couple of open-source data science projects that didn’t quite fit the above two categories. These are actually two contrasting projects – one caters to beginners in data science while the other deals with the world of reinforcement learning.
Pick whichever one works best for you and start exploring it.
I’m sure most of you have worked with the Iris dataset. In fact, it might even have been the very first dataset you used to understand the concept of classification in machine learning. I love how simple the dataset is to understand and explore.
But working with the same dataset can become a bit dull, especially when you’re learning the ins and outs of machine learning.
This is where the PalmerPenguins dataset comes in. Open sourced last month, this dataset positions itself as an alternative to Iris and aims to provide a great dataset for data exploration & visualization, especially for beginners. Here’s a taste of the visualizations you can come up with:
The link I’ve mentioned above contains examples of how to start exploring this data. They’ve even provided details about the different variables but wouldn’t you want to explore that yourself? 🙂
You can get PalmerPenguins on your machine using the below code:
# install.packages("remotes")
remotes::install_github("allisonhorst/palmerpenguins")
I also recommend checking out the below popular articles on data exploration and visualization:
- 6 Essential Data Visualization Python Libraries – Matplotlib, Seaborn, Bokeh, Altair, Plotly, GGplot
- 10 matplotlib Tricks to Master Data Visualization in Python
- Analytics Vidhya’s Entire Collection of Data Visualization Tutorials
Ah, here’s an open source project for all you reinforcement learning folks. SlimeVolleyGym is a simple gym environment for testing single and multi-agent reinforcement learning algorithms. This has been created and open-sourced by hardmaru, a legend in the machine learning space.
The game is very simple: the agent’s goal is to get the ball to land on the ground of its opponent’s side, causing its opponent to lose a life. Each agent starts off with five lives. The episode ends when either agent loses all five lives, or after 3000 timesteps have passed. An agent receives a reward of +1 when its opponent loses a life and -1 when it loses a life itself.
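The scoring rules above are easy to express in a few lines of plain Python. The sketch below replays a list of point outcomes and tracks lives, reward, and episode termination. It is a hypothetical helper for illustration only and does not use the slimevolleygym package or model any of the game physics.

```python
def play_episode(point_losers, lives=5, max_timesteps=3000):
    """Replay the scoring rules for one episode.

    point_losers: iterable with one entry per timestep, either "left" or
    "right" when that agent loses a life on that step, or None otherwise.
    Returns (left_lives, right_lives, left_total_reward, timesteps_played).
    Hypothetical helper; it does not use the slimevolleygym package.
    """
    remaining = {"left": lives, "right": lives}
    left_reward = 0
    steps = 0
    for loser in point_losers:
        if steps >= max_timesteps:
            break  # time limit reached
        steps += 1
        if loser is not None:
            remaining[loser] -= 1
            # Zero-sum reward: -1 for the agent that lost a life, +1 for the other.
            left_reward += -1 if loser == "left" else 1
            if remaining[loser] == 0:
                break  # episode ends once either agent is out of lives
    return remaining["left"], remaining["right"], left_reward, steps


if __name__ == "__main__":
    # The right agent loses five points in a row, so the episode ends early.
    print(play_episode(["right"] * 5))
```

Because the reward is zero-sum, the right agent's total is simply the negative of the left agent's, so tracking one side is enough.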
You can install slimevolleygym directly from pip:
pip install slimevolleygym
Here are a couple of excellent tutorials by our resident reinforcement learning expert Ankit Choudhary:
- Introduction to Monte Carlo Learning using the OpenAI Gym Toolkit
Phew – that’s a lot of projects. My aim, as always, was to keep the projects as diverse as possible so you can pick the ones that fit into your data science journey. If you’re a beginner, I would suggest starting with the PalmerPenguins dataset as most folks aren’t even aware of it right now. A great chance to get a head start.
I would love to hear your thoughts on which open source project you found the most useful. Or let me know if you want me to feature any other data science projects here or in next month’s edition.