Don’t miss out on these awesome GitHub Repositories & Reddit Threads for Data Science & Machine Learning (May 2018)

Pranav Dar 31 May, 2020


GitHub and Reddit both serve as interesting discovery platforms for me. I not only learn about some of the best applications of data science but also see how they have been written, and I hope to contribute to some of these repositories in the near future.

GitHub was acquired by Microsoft recently in a multi-billion dollar deal. GitHub has been the ultimate platform for collaboration between developers and we have seen the data science and machine learning community embrace it with equal enthusiasm. We hope this continues under Microsoft’s umbrella as well.

As for Reddit, it continues to be a wonderful source of knowledge and opinion for data scientists. People share links to their own code and other people’s code, post general data science news and research papers, and ask for help and opinions, among other things. It’s a truly powerful community that continues to provide a solid platform for interacting with fellow data science enthusiasts.

We saw a few great Reddit discussions in May, including the role of data scientists in the next 3 years and a collection of the best ML papers ever written. In the GitHub community, Intel open sourced its NLP Architect library and Microsoft unveiled ML.NET to enable machine learning for .NET developers, among other releases.

Let’s dive into the list and look at the top repositories on GitHub and intriguing discussions on Reddit that occurred last month.



GitHub Repositories


ML.NET

ML.NET is an open source machine learning framework which aims to make ML accessible to, you guessed it, .NET developers. It enables them to develop their own models in .NET, all without requiring prior experience in building machine learning models. This is currently a preview release and includes basic classification and regression algorithms.

ML.NET was originally created by Microsoft and has been used across its wide range of products, like Windows, Excel, Access, Bing, etc. This release also comes bundled with .NET APIs for various model training tasks.


NLP Architect

NLP Architect is an open source Python library that enables data scientists to explore state-of-the-art deep learning techniques in the fields of natural language processing (NLP) and natural language understanding (NLU). It has been developed and open sourced by researchers at Intel AI Lab.

One of my favorite components of this library is a visualization component that shows the model’s annotations in a tidy and neat fashion. Check out our coverage of NLP Architect here.


Amazon Scraper

This Python package gives you the ability to search for and extract product information from Amazon. Instead of writing lines of code to figure out which products you need to analyze, you can simply use this package. All you need to input is the keyword you want to search for and, optionally, the maximum number of products. You get the output in CSV format, which you can then plug into your favorite tool and start analyzing.
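Once you have that CSV, a few lines of Python are enough for a first look at the data. The sketch below is only an illustration: the column names `title` and `price`, the helper `summarize_prices`, and the sample rows are all assumptions, so adjust them to whatever the package actually writes.

```python
import csv
import io
import statistics


def summarize_prices(csv_file, price_column="price"):
    """Return (count, mean, median) of the numeric prices in a CSV file object.

    The column name is an assumption; rows with a missing or
    non-numeric price are skipped.
    """
    prices = []
    for row in csv.DictReader(csv_file):
        raw = (row.get(price_column) or "").replace("$", "").replace(",", "").strip()
        try:
            prices.append(float(raw))
        except ValueError:
            continue  # skip rows without a usable price
    if not prices:
        return 0, None, None
    return len(prices), statistics.mean(prices), statistics.median(prices)


# Hypothetical sample standing in for the scraper's output file.
sample = io.StringIO(
    "title,price\n"
    "USB-C cable,$9.99\n"
    "Wireless mouse,$24.50\n"
    "Laptop stand,\n"
    "Mechanical keyboard,\"$1,049.00\"\n"
)
count, mean_price, median_price = summarize_prices(sample)
print(count, round(mean_price, 2), median_price)
```

With a real file, pass `open(path, newline="")` instead of the in-memory sample.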

PIGO – Face Detection in Go

Pigo is a face detection library that has been developed in the Go programming language. It is based on the ‘Pixel Intensity Comparison-based Object detection’ research paper. According to the repository, some of the key features of this library are:
  • High processing speed
  • There is no need for image preprocessing prior to detection
  • There is no need for the computation of integral images, image pyramid, HOG pyramid or any other similar data structure
  • The face detection is based on pixel intensity comparison encoded in the binary file tree structure
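To get a feel for what a "pixel intensity comparison" means, here is a toy sketch of a decision tree whose internal nodes compare the intensities of two pixels. It is written in Python for brevity and is not Pigo's code: the `Node` class, the offsets, and the leaf scores are hand-made assumptions, whereas the actual library loads trained trees from a binary file and evaluates them across many positions and scales.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale image as rows of 0-255 intensities


@dataclass
class Node:
    # Internal nodes hold two pixel offsets to compare; leaves hold a score.
    p1: Optional[Tuple[int, int]] = None
    p2: Optional[Tuple[int, int]] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    score: float = 0.0


def pixel_test(image: Image, row: int, col: int,
               p1: Tuple[int, int], p2: Tuple[int, int]) -> bool:
    """Binary test: is the pixel at offset p1 no brighter than the one at p2?"""
    a = image[row + p1[0]][col + p1[1]]
    b = image[row + p2[0]][col + p2[1]]
    return a <= b


def evaluate(tree: Node, image: Image, row: int, col: int) -> float:
    """Walk the tree from the root, branching on each pixel comparison."""
    node = tree
    while node.left is not None and node.right is not None:
        go_right = pixel_test(image, row, col, node.p1, node.p2)
        node = node.right if go_right else node.left
    return node.score


# Hand-built tree with made-up offsets and scores (purely illustrative).
tree = Node(
    p1=(0, 0), p2=(0, 2),
    left=Node(score=-1.0),
    right=Node(p1=(1, 1), p2=(2, 0),
               left=Node(score=0.25),
               right=Node(score=1.0)),
)

image = [
    [10, 50, 200],
    [10, 50, 200],
    [10, 50, 200],
]
print(evaluate(tree, image, 0, 0))
```

Each test reads only two raw pixel values, which is why this family of detectors needs no preprocessing, integral images, or pyramids.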


This one is for all the reinforcement learning (RL) enthusiasts. Deep learning has propelled RL to the point where an AI agent can play Atari games at the level of a human expert. This repository covers interesting new extensions to the policy gradient algorithm, one of the most popular default choices for solving RL problems. These extensions have led to improvements in training time as well as in the overall performance of reinforcement learning.
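If you haven't met policy gradients before, the minimal sketch below shows the basic idea (plain REINFORCE with a running-average baseline) on a toy two-armed bandit. It is not taken from the repository and implements none of its extensions; the function names, the reward probabilities, and the hyperparameters are made up for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(reward_probs, steps=5000, lr=0.1, seed=0):
    """Learn a softmax policy over bandit arms with a REINFORCE-style update.

    reward_probs[i] is the (hypothetical) chance that arm i pays a reward of 1.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(reward_probs)
    baseline = 0.0  # running average reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        advantage = reward - baseline
        # d/d prefs[i] of log pi(action) is 1[i == action] - probs[i]
        for i in range(len(prefs)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            prefs[i] += lr * advantage * grad
        baseline += (reward - baseline) / t
    return softmax(prefs)


policy = train_bandit([0.2, 0.8])
print(policy)  # probability mass should shift toward the better arm
```

The update raises the probability of actions that did better than the baseline and lowers it for those that did worse. Extensions to the algorithm typically change how the advantage is estimated or how large a step the policy is allowed to take.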

Reddit Discussions

Real-time Multihand Pose Estimation Demo

This thread took off as soon as the author posted the concept in video form. It is a fascinating idea, and seeing it come alive using deep learning is a wonderful thing. It caught the attention of data scientists and ML enthusiasts, as you can tell from the number of questions in the thread. I encourage you to scroll through them all to get a very good idea of how this technology was implemented.


Which Research paper would you choose to show that Machine Learning is Beautiful?

If you’re new to machine learning, or are looking for papers to read or refer to, this is a magnificent thread. It mentions some excellent machine learning research papers that every data scientist, aspiring or established, will hugely benefit from. The papers range from basic machine learning concepts like Gaussian models to advanced concepts like neural artistic style transfer and rapid object detection using a boosted cascade of simple features. This is essentially a MUST-READ.


What do we currently know about Generalization? What should we be asking next about it?

Generalization in deep learning has been a topic of constant debate. As the author of this post mentioned, there are still quite a few scenarios where we struggle to achieve any generalization at all. This led to a deep discussion around the current state of generalization and why it has been so hard to understand in deep and reinforcement learning. The discussion includes lengthy posts which can get a little complex if you’re new to this field. However, I suggest reading through them anyway because these are the opinions of some highly experienced and knowledgeable data scientists.


State of Machine Learning in the Healthcare Industry

This thread delves into the current state of machine learning specifically in the healthcare industry (and not in research areas). Data scientists in this industry have shared their experience and opinions about what they have seen in their work. Refer to this thread whenever anyone asks you anything about ML and DL in the life sciences domain!


Potential Career Paths for Data Scientists after 3 Years

This is a very pertinent question most people ask before getting into this field. With the rapid adoption of automated machine learning tools, will companies even need data scientists in a few years? This thread is a collection of opinions about where different people in data science see this role expanding or diversifying in the next few years. There is some excellent career advice in here so make sure you check this out.


End Notes

Some of the above Reddit discussions were really insightful, like the healthcare industry and generalization threads. I personally love curating these GitHub repositories and Reddit discussion threads because it gives me a high-level overview of all that’s happening in the ML research community.

Which repositories and/or discussions did you find the most interesting out of these? Get involved and let us know in the comments section below!



Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.


Responses From Readers


Ajay Chander R. 17 Jun, 2018

Thank you for sharing the useful information about data science stuff with repositories from github. Will check into it and pass out the details to my other friends whoever needs it.

Felix 18 Jun, 2018

Thanks, very interesting update.