5 Open Source Machine Learning Projects to Challenge your Inner Data Scientist

datascience22 14 Jun, 2020

Overview

Start 2020 on the right note with these 5 challenging open-source machine learning projects
These machine learning projects cover a diverse range of domains, including Python programming and NLP

Introduction

More people than ever before are looking for a way to transition into data science. Whether you’re a fresh college graduate, a relatively new entrant in the industry, a mid-level professional, or someone who’s just curious about machine learning – everyone wants a piece of the data science pie.

And if you’re from India, you have surely read about the Government’s investment in the data field (in the 2020 Union Budget). This is a great time to invest in your career!

And one of the best ways to get your data science career off the ground is to invest in yourself. Here’s a simple path to do that:

Find an open-source machine learning project that you are passionate about
Understand the current benchmark solution for that project
If it exists, learn from it. If it doesn’t, carve out a solution using your existing machine learning skillset

I’ve picked out 5 open-source machine learning projects (created in January 2020) to acquaint you with the latest state-of-the-art frameworks and libraries. As always, I tried to diversify the list as much as possible. You’ll see a bit of everything sprinkled in, from Natural Language Processing (NLP) to Python programming ideas.

You can also check out the previous projects we’ve showcased in this monthly series. This is the 3rd year of this series – thanks to our community for the overwhelming response!

Without further ado, here are the 5 open-source machine learning projects:

Reformer – The Efficient Transformer in PyTorch

The Transformer architecture changed the Natural Language Processing (NLP) landscape. It has spawned a plethora of NLP models, including BERT, XLNet, and GPT-2.

But there’s an issue I’m sure most of you will relate to – these Transformer-powered models are LARGE. They achieve state-of-the-art results but they’re way too expensive and beyond the scope of most folks who want to learn and implement them.

This is where the Reformer model comes in. Reformer aims to perform on par with these Transformer models while using far fewer compute and memory resources, which also makes it cheaper to train and run.
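To get a feel for why standard Transformers are so expensive on long inputs, here is a back-of-the-envelope sketch. It assumes the framing used in the Reformer paper: full attention stores an L x L matrix of scores, while the paper's attention variant scales roughly as L log L. The function names are hypothetical, constant factors are ignored, and this is not a measurement of any real model.

```python
import math

BYTES_PER_FLOAT32 = 4


def full_attention_bytes(seq_len: int) -> int:
    """Memory for one seq_len x seq_len attention matrix in 32-bit floats."""
    return seq_len * seq_len * BYTES_PER_FLOAT32


def scaling_ratio(seq_len: int) -> float:
    """How many times larger L^2 is than L * log2(L), ignoring constants."""
    return (seq_len ** 2) / (seq_len * math.log2(seq_len))


seq_len = 64 * 1024  # 64K tokens, an illustrative long-sequence length
gib = full_attention_bytes(seq_len) / 2**30
print(f"Full attention matrix at {seq_len} tokens: {gib:.0f} GiB")
print(f"L^2 vs L*log2(L): about {scaling_ratio(seq_len):,.0f}x more entries")
```

A single attention matrix at this length already needs 16 GiB, which is more than most consumer GPUs offer. That is the kind of cost Reformer is designed to avoid.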

The GitHub repository linked above contains a PyTorch implementation of Reformer. The author of the project has provided a simple but effective example along with the entire code to help you build your own model.

I encourage you to read about the inner workings of Reformer in the official research paper, “Reformer: The Efficient Transformer”.

You can install Reformer on your machine using the command below:

pip install reformer_pytorch

PandaPy – Your New Favorite Python Library

I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to become mainstream.

If you are working on a machine learning project with mixed data types (int, float, datetime, str, etc.), you should try out PandaPy instead of Pandas. According to its authors, it consumes roughly one-third less memory than Pandas for these data types!

“If you have smaller Pandas dataframes (<50K number of records) in a production environment, then it is worth considering PandaPy.”

Here are three key areas you’ll find interesting (I’ve taken these points verbatim from the PandaPy GitHub repository):

For simple calculations on a small dataset (i.e, plus, mult, log) PandaPy is 25x – 80x faster than Pandas
For table functions (i.e., group, pivot, drop, concat, fillna) on a small data set PandaPy is 5x – 100x times faster than Pandas
For most use cases with small data, PandaPy is faster than Dask, Modin Ray and Pandas

Install PandaPy using pip (the leading ! is only needed when running the command from a Jupyter notebook cell):

!pip3 install pandapy

If you still want to stick with Pandas, then check out its latest major release (v1.0.0).

Google Earth Engine – 300+ Jupyter Notebooks to Analyze Geospatial Data

What a brilliant GitHub repository! I’ve had a lot of aspiring data scientists reach out to me on LinkedIn asking about how to get started with geospatial analysis. It’s a very interesting field with petabytes of data available. We just need a structured approach to clean and analyze it.

This amazing repository is a collection of 300+ Jupyter notebooks that contain examples of using Google Earth Engine data.

[GIF: one of the visualizations you can generate using these notebooks]

These notebooks rely on three Python libraries to execute the code:

Earth Engine Python API
Folium
Geehydro

The GitHub repository contains plenty of examples with Python code to get you started. Dig in and have fun!

Here’s an excellent article to get started with Geospatial Data:

Geospatial Data and its Role in Data Science

AVA – Automated Visual Analytics

Here’s another quality data visualization idea for you. The idea of automating the data exploration step has been floating around for a while, but no substantial framework had emerged. Until now.

AVA, short for Automated Visual Analytics, is a framework by Alibaba that aims to make visual analytics AI-driven and automated.

[Demo: AVA in action]

Fast Neptune – Speed up your Machine Learning Projects

Reproducibility is a crucial aspect of any machine learning project these days, whether that’s in research or the industry. We need to track every test we perform, every iteration, and every parameter of our machine learning model, along with the results.

The Fast Neptune library enables us to quickly record all the information we need to launch our machine learning experiments. In other words, Fast Neptune is your answer to the reproducibility question you might have asked while reading the above paragraph.

Here is the information Fast Neptune records to help us run quick experiments (quoting from the project page):

Metadata about the machine where the code is run, including OS, and OS version
Requirements of the notebook where the experiments are run
Parameters used during the experiment, meaning the names and values of the variables you want to track
Code you used during the run that you want to record

Pretty neat, right? Install Fast Neptune using just one line of code:

pip install fast-neptune
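To make the four kinds of recorded information concrete, here is a minimal sketch that collects them using only the Python standard library. This is not Fast Neptune's API: the snapshot_experiment helper, its field names, and the sample parameters are all hypothetical and only illustrate what such a record could look like.

```python
import importlib.metadata
import json
import platform
from datetime import datetime, timezone


def snapshot_experiment(params: dict, code: str) -> dict:
    """Build a reproducibility record for one run (hypothetical helper)."""
    # Installed packages, pinned the way a requirements file would list them.
    requirements = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in importlib.metadata.distributions()
    )
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "machine": {
            "os": platform.system(),
            "os_version": platform.release(),
            "python_version": platform.python_version(),
        },
        "requirements": requirements,
        "parameters": dict(params),  # names and values of tracked variables
        "code": code,  # source of the cell or script being recorded
    }


# Hypothetical parameters and code, for illustration only.
params = {"learning_rate": 0.01, "epochs": 5}
code = "model.fit(X, y, epochs=epochs)"

record = snapshot_experiment(params, code)
summary = {**record, "requirements": f"{len(record['requirements'])} packages"}
print(json.dumps(summary, indent=2))
```

Storing a record like this next to each result is what lets you rerun an experiment later under the same conditions. A dedicated library saves you from writing and maintaining this bookkeeping yourself.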

Couple of noteworthy frameworks to keep an eye on:

I wanted to highlight a couple of other major releases in January 2020 that you should be aware of:

Thinc: This is a lightweight deep learning library from the makers of spaCy. Thinc “offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow or MXNet”.
Google’s Incredible Human-Like Generative Chatbot: Google has created Meena, a 2.6 billion parameter end-to-end trained neural conversational model. According to Google, Meena can conduct conversations that are more sensible and specific than existing state-of-the-art chatbots. Will they open-source the code? That remains to be seen, but this is one to keep your eye on.

End Notes

2020 is off to a fast start in the machine learning space. The state of the art continues to evolve at a rapid pace, and it can become overwhelming for newcomers to keep up.

That’s why I publish these monthly articles where I aim to bring out the most relevant and useful open-source machine learning projects for our community.

Is there any other machine learning project or framework you want to highlight? I would love to hear your thoughts and ideas in the comments section below. Let’s connect and brainstorm together.