The 25 Best Data Science and Machine Learning GitHub Repositories from 2018
What’s the best platform for hosting your code, collaborating with team members, and also acts as an online resume to showcase your coding skills? Ask any data scientist, and they’ll point you towards GitHub. It has been a truly revolutionary platform in recent years and has changed the landscape of how we host and even do coding.
But that’s not all. It acts as a learning tool as well. How, you ask? I’ll give you a hint – open source!
The world’s leading tech companies open source their projects on GitHub by releasing the code behind their popular algorithms. 2018 saw a huge spike in such releases, with the likes of Google and Facebook leading the way. The best part about these releases is that the researchers behind the code also provide pretrained models so folks like you and I don’t have to waste time building difficult models from scratch.
Additionally, we regularly see the top trending repositories aimed towards coders and developers – this includes resources like cheatsheets, video links, e-books, research paper links, among other things. No matter which level you are at in your professional career (beginner, established or advanced), you will always find something new to learn on GitHub.
2018 was a transcendent one in a lot of data science sub-fields, as we will shortly see. Natural Language Processing (NLP) was easily the most talked about domain within the community with the likes of ULMFiT and BERT being open-sourced. In my quest to bring the best to our awesome community, I ran a monthly series throughout the year where I hand-picked the top 5 projects every data scientist should know about. You can check out the entire collection below:
There will be some overlap here with my article covering the biggest breakthroughs in AI and ML in 2018. Do check out that article as well – it is essentially a list of all the major developments I feel everyone in this field needs to know about. As a bonus, there are predictions from experts as well – not something you want to miss. 🙂
And now, get ready to explore new projects in your quest to attain data science stardom in 2019 and scroll down! Simply click on each project title to head over to the code repository on GitHub.
Topics we will cover in this article
- Tools and Frameworks
- Computer Vision
- Generative Adversarial Networks (GANs)
- Other Deep Learning Projects
- Natural Language Processing (NLP)
- Automated Machine Learning (AutoML)
- Reinforcement Learning
Tools and Frameworks
Let’s get the ball rolling with a look at the top projects in terms of tools, libraries and frameworks. Since we are speaking about a software repository platform, it feels right to open things with this section.
Technology is advancing rapidly and computational costs are lower than ever, so we’re being treated to one massive release after another. Can we call this the golden age of coding in machine learning? That is an open question, but one thing we can all agree on – it’s a great time to be a programmer in data science. In this section (and the article overall), I have tried to diversify the languages as much as possible, but Python inevitably ruled the roost.
How about all you .NET developers wanting to learn a bit of machine learning to complement your existing skills? Here’s the perfect repository to get that idea started! ML.NET, a Microsoft project, is an open-source machine learning framework that allows you design and develop models in .NET.
You can even integrate existing ML models into your application, all without requiring explicit knowledge of how ML models are developed. ML.NET is actually used in multiple Microsoft products, like Windows, Bing Search, MS Office, among others.
ML.NET runs on Windows, Linux and MacOS.
Machine learning in the browser! A fictional thought a few years back, a stunning reality now. A lot of us in this field are welded to our favorite IDEs, but TensorFlow.js has the potential to change your habits. It’s become a very popular release since it’s release earlier this year and continues to amaze with its flexibility.
As the repository states, there are primarily three major features of TensorFlow.js:
- Develop machine learning and deep learning models in your browser itself
- Run pre-existing TensorFlow models within the browser
- Retrain or fine-tune these pre-existing models as well
If you’re familiar with Keras, the high-level layers API will seem quite familiar. There are plenty of examples available on the GitHub repository, so check those out to quicken your learning curve.
What a year it has been for PyTorch. It has won the hearts and now projects of data scientists and ML researchers around the globe. It is easy to grasp, flexible, and is already being implemented across high profile researches (as you’ll see later in this article). The latest version (v1.0) already powers many Facebook products and services at scale, including performing 6 billion text translations a day. If you’ve been wondering when to start dabbling with PyTorch, the time is NOW.
If you’re new to this field, ensure you check out Faizan Shaikh’s guide to getting started with PyTorch.
While not strictly a tool or framework, this repository is a gold mine for all data scientists. Most of us struggle with reading through a paper and then implementing it (at least I do). There are a lot of moving parts that don’t seem to work on our machines.
And that’s where ‘Papers with Code’ comes in. As the name suggests, they have a code implementation of all the major papers that have been released in the last 6 years or so. It is a mind-blowing collection that you will find yourself fawning over. They have even added code from papers presented at NIPS (NeurIPS) 2018. Get yourself over there now!
Thanks to falling computational costs and a surge of breakthroughs from the top researchers (something tells me those two might be linked), deep learning is accessible to more people than ever before. And within deep learning, computer vision projects are ubiquitous – most of the repositories you’ll see in this section will cover one computer vision technique or another.
It is simply the hottest field in deep learning right now and will continue to be so for the foreseeable future. Whether it’s object detection or pose estimation, there’s a repository for seemingly all computer vision tasks. Never a better time to get acquainted with these developments – a lot of job openings might come your way soon.
Detectron made a HUGE splash when it was launched in early 2018. Developed by Facebook’s AI Research team (FAIR), it implements state-of-the-art object detection frameworks. It is (surprise, surprise) written in Python and has helped enable multiple projects, including DensePose (which we will talk about soon).
This repository contains the code and over 70 pretrained models. Too good an opportunity pass up, would’t you agree?
Object detection in images is awesome, but what about doing it in videos? And not just that, can we extend this concept and translate the style of one video to another? Yes, we can! It is a really cool concept and NVIDIA have been generous enough to release the PyTorch implementation for you to play around with.
The repository contains videos of how the technique looks, the full research paper, and of course the code. The Cityscapes dataset, available publicly post registration, is used in NVIDIA’s examples. One of my favorite projects from 2018.
Training a deep learning model in 18 minutes? While not having access to high-end computational resources? Believe me, it’s already been done. Fast.ai’s Jeremy Howard and his team of students built a model on the popular ImageNet dataset that even outperformed Google’s approach.
I encourage you to at least go through this project to get a sense of how these researchers structured their code. Not everyone has access to multiple GPUs (or even one) so this was quite a win for the minnows.
Another research paper collection repository! It’s always helpful to know how your subject of choice has evolved over a span of multiple years, and this one-stop shop will help you do just that for object detection. It’s a comprehensive collection of papers from 2014 till date, and even include code wherever possible.
The above image shows how object detection frameworks have evolved and transformed in the last five years. Quite fascinating, isn’t it? There’s even a 2019 entry included, so you have quite a lot of catching up to do.
Let’s turn our attention to the field of pose detection. I came across this concept this year itself and have been fascinated with it ever since. That above image captures the essence of this repository – dense human pose estimation in the wild.
The code to train and evaluate your own DensePose-RCNN model is included here. There are notebooks available as well to visualize the DensePose COCO dataset. Pretty good place to kick off your pose estimation learning.
The above image (taken from a video) really piqued my interest. I covered the release of the research paper back in August and have continued to be in awe of this technique. This technique enables us to transfer the motion between human objects in different videos. The video I mentioned is available within the repository – it will blow your mind!
This repository further contains the PyTorch implementation of this approach. The amount of intricate details this approach is capable of picking up and replicating is incredible.
I’m sure most of you must have come across a GAN application (even if you perhaps didn’t realize it at the time). GANs, or Generative Adversarial Networks, were introduced by Ian Goodfellow back in 2014 and have caught fire since. They specilize in performing creative tasks, especially artistic ones. Check out this amazing introductory guide by Faizan Shaikh to the world of GANs, along with an implementation in Python.
We saw a plethora of GAN based projects in 2018 and hence I wanted to create a separate section for this.
Let’s start off with one of my favorites. I want you to take a moment to just admire the above images. Can you tell which one was done by a human and which one by a machine? I certainly couldn’t. Here, the first frame is the input image (original) and the third frame has been generated by this technique.
Amazing, right? The algorithm adds an external object of your choosing to any image and manages to make it look like nothing touched it. Make sure you check out the code and try to implement it on a different set of images yourself. It’s really, really fun.
What if I gave you an image and asked you to extend the boundaries by imagining what it would look like when the entire scene was captured? You would understandably turn to some image editing software. But here’s the awesome news – you can achieve it in a few lines of code!
This project is a Keras implementation of Stanford’s Image Outpainting paper (incredibly cool and illustrated paper – this is how most research papers should be!). You can either build a model from scratch or use the one provided by this repository’s author. Deep learning wonders never cease to amaze.
If you haven’t got a handle on GANs yet, try out this project. Pioneered by researchers from MIT’s CSAIL division, it helped you visualize and understand GANs. You can explore what your GAN model has learned by inspecting and manipulating it’s neurons.
I would like to point you towards the official MIT project page, which has plenty of resources to get you familiar with the concept, including a video demo.
This algorithm enables you to change the facial expression of any person in an image. It’s as exciting as it is concerning. The images above inside the green border at the originals, the rest have been generated by GANimation.
The link contains a beginner’s guide, data preparation resources, prerequisites, and the Python code. As the author mentioned, do NOT use it for immoral purposes.
This project is quite similar to the Deep Painterly Harmonization one we saw earlier. But it deserved a mention given it came from NVIDIA themselves. As you can see in the image above, the FastPhotoStyle algorithm requires two inputs – a style photo and a content photo. The algorithm then works in one of two ways to generate the output – it either uses photorealistic image stylization code or uses semantic label maps.
Other Deep Learning Projects
The computer vision field has the potential to overshadow other work in deep learning but I wanted to highlight a few projects outside it.
Audio processing is another field where deep learning has started to make it’s mark. It’s not just limited to generating music, you can do tasks like audio classification, fingerprinting, segmentation, tagging, etc. There is a lot that’s still yet to be explored and who knows, perhaps you could use these projects to pioneer your way to the top.
Here are two intuitive articles to help you get acquainted with this line of work:
- Getting Started with Audio Data Analysis using Deep Learning (with case study)
- 10 Audio Processing Tasks to get you started with Deep Learning Applications (with Case Studies)
And here comes NVIDIA again. WaveGlow is a flow-based network capable of generating really high quality audio. It is essentially a single network for speech synthesis.
This repository includes a PyTorch implementation of WaveGlow along with a pre-trained model which you can download. The researchers have also listed down the steps you can follow if you want to train your own model from scratch.
Want to discover your own planet? That might perhaps be overstating things a bit, but this AstroNet repository will definitely get you close. The Google Brain team discovered two new planets in December 2017 by applying AstroNet. It’s a deep neural network meant for working with astronomical data. It goes to show the far-ranging applications of machine learning and was a truly monumental development.
And now the team behind the technology has open sourced the entire code (hint: the model is based on CNNs!) that powers AstroNet.
Who doesn’t love visualizations? But it can get a tad bit intimidating to imagine how a deep learning model works – there are too many moving parts involved. But VisualDL does a great job mitigating those challenges by designing specific deep learning jobs.
VisualDL currently supports the below components for visualizing jobs (you can see examples of each in the repository):
- high dimensional
Natural Language Processing (NLP)
Surprised to see NLP so down in this list? That’s primarily because I covered almost all the major open source releases in this article. I highly recommend checking out that list to stay on top of your NLP game. The frameworks I have mentioned here include ULMFiT, Google’s BERT, ELMo, and Facebook’s PyText. I will briefly mention BERT and a couple of other respositories here as I found them very helpful.
I couldn’t possibly let this section pass by without mentioning BERT. Google AI’s release has smashed records on it’s way to winning the hearts of NLP enthusiasts and experts alike. Following ULMFiT and ELMo, BERT really blew away the competition with it’s performance. It obtained state-of-the-art results on 11 NLP tasks.
Apart from the official Google repository I have linked to above, a PyTorch implementation of BERT is worth checking out. Whether it marks a new era of not in NLP we will soon find out.
It often helps to know how well your model is performing against a certain benchmark. For NLP, and specifically deep text matching models, I have found the MatchZoo toolkit quite reliable. Potential tasks related to MatchZoo include:
- Question Answer
- Textual Entailment
- Information Retrieval
- Paraphrase Identification
MatchZoo 2.0 is currently under development so expect to see a lot more being added to this already useful toolkit.
This repository was created by none other than Sebastian Ruder. The aim of this project is to track the latest progress in NLP. This includes both datasets and state-of-the-art models.
Any NLP technique you’ve ever wanted to know more about – there’s a good chance it’ll already be present here. The repository covers both traditional and core NLP tasks such as reading comprehension and parts-of-speech tagging. It’s mandatory to star/bookmark this repository if you’re even vaguely interested in this field.
Automated Machine Learning (AutoML)
What an year for AutoML. With industries look to integrate machine learning into their core mission, the need to data science specialists continues to grow. There is currently a massive gap between the demand and the supply. This gap could potentially be filled by AutoML tools.
These tools are designed for those people who do not have data science expertise. While there are certainly some incredible tools out there, most of them are priced significantly higher than most individuals can afford. So our amazing open source community came to the rescue in 2018, with two high profile releases.
This made quite a splash upon it’s release a few months ago. And why wouldn’t it? Deep learning has been long considered a very specialist field, so a library that can automate most tasks came as a welcome sign. Quoting from their official site, “The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background”.
You can install this library from pip:
pip install autokeras
The repository contains a simple example to give you a sense of how the whole thing works. You’re welcome, deep learning enthusiasts. 🙂
AdaNet is a framework for automatically learning high-quality models without requiring programming expertise. Since it’s a Google invention, the framework is based on TensorFlow. You can build ensemble models using AdaNet, and even extend it’s use to training a neural network.
The GitHub page contains the code, an example, the API documentation, and other things to get your hands dirty. Trust me, AutoML is the next big thing in our field.
Since I already covered a few reinforcement learning releases in my 2018 overview article, I will keep this section fairly brief. My hope in including a RL section where I can is to foster a discussion among our community and to hopefully accelerate research in this field.
First, make sure you check out OpenAI’s Spinning Up repository, an exhaustive educational resource for beginners. Then head over to Google’s Dopamine page. It is a research framework for accelerating research in this still nascent field. Now let’s look at a couple of other resources as well.
If you follow a few researchers on social media, you must have come across the above images in video form. A stick human running across a terrain, or trying to stand up, or some such sort. That, dear reader, is reinforcement learning in action.
Here is a signature example of it – a framework to train a simulated humanoid to imitate multiple motion skills. You can get the code, examples, and a step-by-step run-through on the above link.
This repository is a collection of reinforcement learning algorithms from Richard Sutton and Andrew Barto’s book and other research papers. These algorithms are presented in the form of Python notebooks.
As the author of this repo mentioned, you will only truly learn if you implement the learning as you go along. It’s a complex topic, and giving up or reading the resources like a storybook will lead you nowhere.
And that bring us to the end of our journey for 2018. What a year! It was a joyful ride putting this article together and I learned a lot of new stuff along the way.
I would love to hear your feedback on this article. Which repository have you used? Which one did you find the most useful? And which one(s) did I miss out on? Use the comments section below and let me know.