The Top GitHub Repositories & Reddit Threads Every Data Scientist should know (June 2018)
Half the year has flown by and that brings us to the June edition of our popular series – the top GitHub repositories and Reddit threads from last month. During the course of writing these articles, I have learned so much about machine learning from either open source codes or invaluable discussions among the top data science brains in the world.
What makes GitHub special is not just it’s code hosting and social collaboration features for data scientists. It has lowered the entry barrier into the open source world and has played a MASSIVE role in spreading knowledge and expanding the machine learning community.
We saw some amazing open source code being released in June. One of the most intriguing repositories was ‘NLP Progress’ – created with the aim of keeping everyone updated regarding the latest updates in this field. Facebook also released the code for it’s popular DensePose framework which could be a game changer in the pose estimation field.
When it comes to Reddit, it is so rich in knowledge and perspective from data scientists and ML experts from around the globe. In this article you’ll see discussions on reinforcement learning applications, machine learning setups, a wonderful computer vision example, and much more. I highly recommend participating in this discussions to enhance your skillset.
You can check out the top GitHub repositories and top Reddit discussions (from April onwards) for the last five months below:
Human pose estimation has garnered a lot of attention in the deep learning community this year. And Facebook took things to a new level when they open sourced the code to ‘DensePose’, their popular pose estimation framework. This technique identifies more than 5000 nodes in the human body (for context, other approaches operate with 10 or 20 joints). You can get an idea of this node mapping technique in the above image.
DensePose has been created in the Detectron framework and is powered by Caffe2. Apart from the code, this repository also contains notebooks to visualize the DensePose-COCO dataset. Read more details about this release here.
Natural Language Processing (NLP) is an often difficult field to get into, despite it’s attractions. There is tons of unstructured text lying around which you have to work with, and that is no easy task. This repository has been created especially to track the progress in the NLP field. It’s a very informative list of datasets and current state-of-the-art tasks, like dependency parsing, part-of-speech tagging, reading comprehension, etc.
Make sure you star this and follow the progress if you’re even vaguely interested in NLP. There is still a lot that can (and will) be added to this list, like information extraction, relation extraction, grammatical error correction, etc. If you have anything to contribute to this repository, the creator is open to ideas and suggestions so feel free to do that.
Getting your model into production is one of the biggest challenges data scientists face when they enter this field. Designing and building the model is what attracts most people to machine learning but if you can’t get that model into production, it essentially becomes just a piece of useless code.
So Databricks (founded by the Apache Spark creators) decided to build and open source a solution to all ML framework challenges. Called MLflow, it is a platform that manages the entire machine learning lifecycle (from start to production) and has been designed to work with any library. Ever since it’s release, it has gained a huge following (1,355 stars on GitHub) and you can check out our coverage of the library here.
Another NLP entry in this article! When it comes to NLP tasks like sentiment analysis, or machine translation, the norm has been to build models specific to that task. Have you ever built a sentiment analysis model that can also do semantic parsing and question answering at the same time? That’s what Salesforce researchers intend to do with this repository.
They have published a research paper that outlines a model which can do 10 different NLP tasks at the same time. In this paper, they have thrown a chalenge (which they are calling decaNLP) to the community – can you build such a model and improve on the approach they’ve provided? The model Salesforce have built is being called the “Swiss Army Knife for Natural Language Processing”.
Read more details about this on AV’s post here.
Reinforcement learning is becoming popular by the day and so is the open source community for it. This repository is a collection of reinforcement learning algorithms from Richard Sutton and Andrew Barto’s book and other research papers. These algorithms are presented in the form pf Python notebooks.
The creator of the repository recommends using these notebooks when you read the book as they will significantly enhance your understand of what’s being presented. The notes are detailed and anyone entering this field should definitely refer to this collection.
Your interest in this thread will be piqued by the above video, which is put together and presented really neatly. It sent the machine learning subreddit into overdrive and received almost 100 comments! The thread has a lot of useful information on how the technique was created (there’s a step-by-step explanation from the developer), how long it took, what kind of other things it can do, etc. You’ll learn a lot about computer vision in this thread.
The creator of this technique and video has also open sourced his code on GitHub. So open your Jupyter notebooks and get cracking!
OpenAI Five is a group of 5 neural networks designed and developed to beat human opponents in the popular Dota 2 game. It has been developed by the Elon Musk co-founded OpenAI venture, which explains the immediate popularity it has received since it’s release.
Why I’m recommending this thread is the rich discussion around what else data scientists want to see from this technique, it’s comparison with the popular DeepMind AlphaGo algorithm, and how much computational power it required to pull this off. There is a lot of perspective in this thread that will greatly benefit you.
Additionally, you can also read our article on OpenAI Five here.
If this topic didn’t get your attention, the first few comments surely will. This discussion is like a wish list of what data scientists and machine learning practitioners want to see from the community. This thread made my list because of the discussion each idea spawned. Once a person added their idea to the thread, multiple folks replied with their ideas on how to implement it and if similar research was already present.
This is a MUST-READ discussion – for both enthusiasts and practitioners. Take some time out to go through it and you’ll come out with a lot of knowledge (and perhaps even more questions).
What hardware you use for machine learning plays a critical role in determining how good your model will perform, especially when the amount of data to be trained is huge. Read this thread to find out what other data scientists use for building their ML processes and models. The original poster has listed down a structured list of questions which helped keep the thread neat and understandable. There questions are as below:
- Desktop or laptop? What model?
- Operating System?
- Programming language?
- What kind of work/research do you do?
You can also take part in the discussion or use the comments section below this article to let us know your setup!
As I mentioned above, reinforcement learning is a popular field these days. But due to the complex nature of the work, most of the research and use cases are limited to games and lab environments. In this thread, people already working in this field give their take on where they see RL penetrating in the near future. Some of the comments are of a more skeptical nature but are worth reading to understand what experts and enthusiasts feel about RL.
Phew! So much to read and learn in this past month. This list has something for everybody – NLP, reinforcement learning, open source code you can download and start working on, computer vision, discussions on various machine learning related things, and much more. Use the comments section below to let us know which repository and/or discussion you found the most interesting!