6 Powerful Open Source Machine Learning GitHub Repositories for Data Scientists

Pranav Dar 27 Apr, 2020 • 5 min read

Overview

Check out the top 6 machine learning GitHub repositories created in June 2019
There’s a heavy focus on NLP again, with XLNet outperforming Google’s BERT on several state-of-the-art benchmarks
All machine learning GitHub repositories are open source; download the code and start experimenting!

Introduction

Do you sometimes feel that machine learning is too broad and vast to keep up with? I certainly feel that way. Just check out the list of major developments in Natural Language Processing (NLP) in the last year:

Google’s BERT
OpenAI’s GPT-2
Google’s Transformer-XL

As a data scientist, it can become overwhelming simply to keep track of all that’s happening in machine learning. My aim in running this GitHub series since January 2018 has been to take that pain away for our community.

We trawl through every open source machine learning release each month and pick out the top developments we feel you should absolutely know. This is an ever-evolving field – and data scientists should always be on top of these breakthroughs. Otherwise, we risk being left behind.

This month’s machine learning GitHub collection is quite broad in scope. I’ve covered one of the biggest NLP releases in recent times (XLNet), a unique reinforcement learning environment from Google, and a toolbox for understanding actions in videos, among other repositories.

Fun times ahead so let’s get rolling!

Top Machine Learning GitHub Repositories

XLNet: The Next Big NLP Framework

Of course we are starting with NLP. It is the hottest field in machine learning right now. If you thought 2018 was a big year (and it was), 2019 has taken up the mantle now.

The latest state-of-the-art NLP framework is XLNet. It has taken the NLP (and machine learning) community by storm. XLNet uses Transformer-XL at its core. The developers have released a pretrained model as well to help you get started with XLNet.

XLNet has so far outperformed Google’s BERT on 20 NLP tasks and achieved state-of-the-art performance on 18 of them. Here are a few results on popular reading comprehension benchmarks (higher is better):

Model	RACE accuracy	SQuAD1.1 EM	SQuAD2.0 EM
BERT	72.0	84.1	78.98
XLNet	81.75	88.95	86.12

Want more? Here are the results for text classification. These numbers are error rates, so lower is better:

Model	IMDB	Yelp-2	Yelp-5	DBpedia	Amazon-2	Amazon-5
BERT	4.51	1.89	29.32	0.64	2.63	34.17
XLNet	3.79	1.55	27.80	0.62	2.40	32.26
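To put the two tables in perspective, here is a small sketch that turns them into gaps between the models. The scores are copied from the tables above. The helper names are my own, and I am treating the text classification numbers as error rates (lower is better), which is how the XLNet authors report them.

```python
# Scores copied from the two tables above as (BERT, XLNet) pairs.
reading_comprehension = {  # accuracy / exact match, higher is better
    "RACE accuracy": (72.0, 81.75),
    "SQuAD1.1 EM": (84.1, 88.95),
    "SQuAD2.0 EM": (78.98, 86.12),
}
text_classification = {  # error rate, lower is better
    "IMDB": (4.51, 3.79),
    "Yelp-2": (1.89, 1.55),
    "Yelp-5": (29.32, 27.80),
    "DBpedia": (0.64, 0.62),
    "Amazon-2": (2.63, 2.40),
    "Amazon-5": (34.17, 32.26),
}


def absolute_gain(bert, xlnet):
    """Points gained by XLNet over BERT on a higher-is-better metric."""
    return round(xlnet - bert, 2)


def relative_error_reduction(bert, xlnet):
    """Percentage of BERT's error rate that XLNet removes."""
    return round(100 * (bert - xlnet) / bert, 1)


for name, (bert, xlnet) in reading_comprehension.items():
    print(f"{name}: +{absolute_gain(bert, xlnet)} points")

for name, (bert, xlnet) in text_classification.items():
    print(f"{name}: {relative_error_reduction(bert, xlnet)}% fewer errors")
```

Looking at relative error reduction rather than raw differences makes the small-looking gaps on the easier datasets comparable with the larger ones.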

XLNet is, to put it mildly, very impressive. You can read the full research paper here.

Implementation of XLNet in PyTorch

Wait – were you wondering how you can implement XLNet on your machine? Look no further – this repository will get you started in no time.

If you’re well versed in NLP, this will be pretty simple to understand. But if you’re new to this field, take a few moments to go through the documentation I mentioned above and then try this out.

The developers have also provided the entire code in Google Colab so you can leverage GPU power for free! This is a framework you DON’T want to miss out on.

Google Research Football – A Unique Reinforcement Learning Environment

I’m a huge football fan so the title of the repository instantly had my attention. Google Research and football – what in the world do these two have to do with each other?

Well, this “repository contains a reinforcement learning environment based on the open-source game Gameplay Football”. This environment was created exclusively for research purposes by the Google Research team. Here are a few scenarios produced within the environment:

Agents are trained to play football in an advanced, physics-based 3D simulator. I’ve seen a few RL environments in the last couple of years but this one takes the cake.
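If you haven’t worked with an RL environment before, the basic interaction loop looks roughly like the sketch below. It assumes a classic Gym-style interface (reset, step returning four values, and action_space.sample). ToyShootingEnv is a hypothetical stand-in I made up so the code runs without the football simulator installed.

```python
import random


def run_random_episode(env, max_steps=100):
    """Play one episode with random actions; return (total reward, steps taken)."""
    env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        action = env.action_space.sample()
        # Classic Gym-style step: observation, reward, done flag, info dict.
        _obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


class _ToyActionSpace:
    """Minimal discrete action space with a seeded sampler."""

    def __init__(self, n, rng):
        self.n = n
        self._rng = rng

    def sample(self):
        return self._rng.randrange(self.n)


class ToyShootingEnv:
    """Hypothetical stand-in: the episode ends with reward 1 when action 0 is taken."""

    def __init__(self, seed=0):
        self.action_space = _ToyActionSpace(3, random.Random(seed))

    def reset(self):
        return 0

    def step(self, action):
        scored = action == 0
        return 0, (1.0 if scored else 0.0), scored, {}


total, steps = run_random_episode(ToyShootingEnv(seed=0), max_steps=50)
print(f"reward={total}, steps={steps}")
```

With the real package installed, I would expect the same function to accept an environment built with gfootball.env.create_environment(env_name="academy_empty_goal_close"). Treat that call as an assumption and check it against the repository’s README.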

The research paper makes for interesting reading, especially if you’re a football or reinforcement learning enthusiast (or both!). Check it out here.

Implementation of the CRAFT Text Detector

This is a fascinating concept. CRAFT stands for Character Region Awareness for Text Detection. This should be on your to-read list if you’re interested in computer vision. Just check out this GIF:

Can you figure out how the algorithm works? CRAFT detects the text area by exploring each character region present in the image. And the bounding box of the text? That is obtained by simply finding minimum bounding rectangles on a binary map.
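To make that last step concrete, here is a minimal sketch of pulling bounding boxes out of a binary map. This is not the repository’s code. It is a plain Python illustration that finds 4-connected regions and returns axis-aligned boxes, which is a simplification of the minimum (possibly rotated) rectangles described above. The function name and the toy map are made up for the example.

```python
from collections import deque


def bounding_boxes(binary_map):
    """Return one (top, left, bottom, right) box per 4-connected region of 1s."""
    if not binary_map:
        return []
    rows, cols = len(binary_map), len(binary_map[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary_map[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill over one connected region,
            # tracking the extreme row and column indices as we go.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and binary_map[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy binary map with two separate "text" regions.
toy_map = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]
boxes = bounding_boxes(toy_map)
print(boxes)
```

In practice you would use a computer vision library for this step rather than hand-rolled flood fill, but the idea is the same: group the foreground pixels, then wrap each group in a rectangle.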

You’ll grasp CRAFT in a jiffy if you’re familiar with the concept of object detection. This repository includes a pretrained model so you don’t have to code this algorithm from scratch!

You can find more details and an in-depth explanation of CRAFT in this paper.

MMAction – Open Source Toolbox for Action Understanding in Videos

Ever worked with video data before? It’s a really challenging but rewarding experience. Just imagine the sheer amount of things we can do and extract from a video.

How about understanding the action being performed in a particular video frame? That’s what the MMAction repository does. It is an “open source toolbox for action understanding based on PyTorch”. MMAction can perform the below tasks, as per the repository:

Action recognition from trimmed videos
Temporal action detection (also known as action localization) in untrimmed videos
Spatial-temporal action detection in untrimmed videos

MMAction’s developers have also provided tools to deal with different kinds of video datasets. The repository contains a healthy number of steps to at least get you up and running.

Here is the getting started guide for MMAction.

TRAINS – Auto-Magical Experiment Manager & Version Control for AI

Software engineering is one of the most crucial, and yet most overlooked, aspects of a data scientist’s skillset. It is an intrinsic part of the job. Knowing how to build models is great, but it’s equally important to understand the software side of your project.

If you’ve never heard of version control before, rectify that immediately. TRAINS “records and manages various deep learning research workloads and does so with practically zero integration costs”.

The best part about TRAINS (and there are many) is that it’s free and open source. You only need to write two lines of code to fully integrate TRAINS into your environment. It currently integrates with PyTorch, TensorFlow, and Keras and also supports Jupyter notebooks.
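As far as I can tell from the project’s README, those two lines are an import plus a Task.init call. Here is a hedged sketch that wraps them in a helper. The helper name and the example project and task names are my own, and the import sits inside the function only so the sketch can be defined without TRAINS installed.

```python
def start_experiment(project_name, task_name):
    """Register the current script as a TRAINS experiment and return the task handle."""
    # The "two lines": import Task, then initialise it.
    # Assumes the package is installed (pip install trains) and a server is reachable.
    from trains import Task
    return Task.init(project_name=project_name, task_name=task_name)
```

You would call it once at the top of a training script, for example start_experiment("examples", "my first run"), and leave the rest of your code as it is.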

The developers have set up a demo server here. Go ahead and try out TRAINS using whatever code you want to test.

End Notes

My pick for this month is surely XLNet. It has opened up endless opportunities for NLP scientists. There’s only one caveat though – it requires strong computational power. Will Google Colab come to the rescue? Let me know if you’ve tried it out yet.

On a related note, NLP is THE field to get into right now. Developments are happening at breakneck speed and I can easily predict there’s a lot more coming this year. If you haven’t already, start delving into this as soon as you can.

Are there any other machine learning GitHub repositories I should include in this list? Which one did you like from this month’s collection? Let’s discuss in the comments section below.

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Responses From Readers

Rohit 02 Jul, 2019

Surprisingly good read! Well thought out and presented.

Pranav Dar 02 Jul, 2019

Thanks, Rohit! Glad you liked the article.

Sophia 03 Jul, 2019

Very good article. Thank you

Pranav Dar 03 Jul, 2019

Glad you liked it, Sophia!

Richa 04 Jul, 2019

Very good article. Thanks for sharing it.

Pranav Dar 05 Jul, 2019

Glad you liked it, Richa!

kasper 04 Jul, 2019

Thank you! Always good articles. Are there any regarding real-time performance issues with OpenCV vs. Python as a frontend, or do we still work with C++? Greetings from Sweden, Kasper