Don’t miss out on these awesome GitHub Repositories & Reddit Threads for Data Science & Machine Learning (May 2018)

Last Updated : 31 May, 2020

Introduction

GitHub and Reddit both serve as interesting discovery platforms for me. I not only learn about some of the best applications of data science, but also see how they were written, and I hope to contribute to some of these repositories in the near future.

GitHub was acquired by Microsoft recently in a multi-billion dollar deal. GitHub has been the ultimate platform for collaboration between developers and we have seen the data science and machine learning community embrace it with equal enthusiasm. We hope this continues under Microsoft’s umbrella as well.

As for Reddit, it continues to be a wonderful source of knowledge and opinion for data scientists. People share links to their own code and other people’s code, post general data science news and research papers, and ask for help and opinions, among other things. It’s a truly powerful community that continues to provide a solid platform for interacting with fellow data science enthusiasts.

We saw a few great Reddit discussions in May, including the role of data scientists in the next 3 years and a collection of the best ML papers ever written. In the GitHub community, Intel open sourced its NLP Architect library and Microsoft unveiled ML.NET to enable machine learning for .NET developers, among other releases.

Let’s dive into the list and look at the top repositories on GitHub and intriguing discussions on Reddit that occurred last month.

You can check out the top GitHub repositories and top Reddit discussions (from April onwards) for the last four months below:

GitHub Repositories

ML.NET

Source: MSPowerUser

ML.NET is an open source machine learning framework which aims to make ML accessible to, you guessed it, .NET developers. It enables them to develop their own models in .NET, all without requiring prior experience in building machine learning models. This is currently a preview release and includes basic classification and regression algorithms.

ML.NET was originally created by Microsoft and has been used across its wide range of products, like Windows, Excel, Access, Bing, etc. This release also comes bundled with .NET APIs for various model training tasks.

NLP Architect

NLP Architect is an open source Python library that enables data scientists to explore state-of-the-art deep learning techniques in the fields of natural language processing (NLP) and natural language understanding (NLU). It has been developed and open sourced by researchers at Intel AI Lab.

One of my favorite components of this library is a visualization component that shows the model’s annotations in a tidy and neat fashion. Check out our coverage of NLP Architect here.

Amazon Scraper

This Python package gives you the ability to search for and extract product information from Amazon. Instead of writing your own code to figure out which products you need to analyze, just use this package. All you need to input is the keyword you want to search and, optionally, the maximum number of products. You get the output in CSV format, which you can then plug into your favorite tool and start analyzing.
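Once you have the CSV, a first pass at analysis can be very short. The snippet below is only a sketch: the column names (title, price, rating) and the three sample rows are made up for illustration and are not taken from the package, so adjust them to whatever the real output contains.

```python
import csv
import io

# Made-up sample standing in for the scraper's CSV output.
# The column names here are assumptions; the real file may differ.
SAMPLE_CSV = """title,price,rating
USB-C Cable,9.99,4.5
Wireless Mouse,24.50,4.2
Laptop Stand,31.00,4.7
"""


def summarize(csv_text):
    """Return the product count, mean price and top-rated product title."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    prices = [float(row["price"]) for row in rows]
    best = max(rows, key=lambda row: float(row["rating"]))
    return {
        "count": len(rows),
        "mean_price": sum(prices) / len(prices),
        "top_rated": best["title"],
    }


summary = summarize(SAMPLE_CSV)
print(summary)
```

To run this on a real file, read it with open(path, newline="") and pass the text to summarize, after renaming the columns to match.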

PIGO – Face Detection in Go

Pigo is a face detection library that has been developed in the Go programming language. It is based on the ‘Pixel Intensity Comparison-based Object detection’ research paper. According to the repository, some of the key features of this library are:

High processing speed
There is no need for image preprocessing prior to detection
There is no need for the computation of integral images, image pyramid, HOG pyramid or any other similar data structure
The face detection is based on pixel intensity comparisons encoded in a binary file as a tree structure
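To make the last point more concrete, here is a toy sketch of a decision tree whose internal nodes each compare the intensities of two pixels. It is written in Python for readability and is not Pigo's code or file format; the class names, the offset convention and the leaf scores are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# A grayscale image as rows of 0-255 intensities.
Image = List[List[int]]


@dataclass
class Leaf:
    score: float  # hypothetical "face-likeness" contribution


@dataclass
class Node:
    # Two (row, col) offsets relative to the window's top-left corner.
    p1: Tuple[int, int]
    p2: Tuple[int, int]
    darker: "Tree"    # followed when pixel p1 <= pixel p2
    brighter: "Tree"  # followed when pixel p1 > pixel p2


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, image: Image, top: int, left: int) -> float:
    """Walk the tree for the window whose top-left corner is (top, left)."""
    while isinstance(tree, Node):
        a = image[top + tree.p1[0]][left + tree.p1[1]]
        b = image[top + tree.p2[0]][left + tree.p2[1]]
        # The only image operation is a raw intensity comparison:
        # no preprocessing, integral image or pyramid is needed.
        tree = tree.darker if a <= b else tree.brighter
    return tree.score


# Toy example: one node asking "is the left pixel darker than the right?"
toy_tree = Node(p1=(0, 0), p2=(0, 1), darker=Leaf(1.0), brighter=Leaf(-1.0))
dark_left = [[10, 200], [10, 200]]
bright_left = [[200, 10], [200, 10]]

print(evaluate(toy_tree, dark_left, 0, 0))
print(evaluate(toy_tree, bright_left, 0, 0))
```

A real detector would combine the scores of many such trees and slide the window across the image at several scales, but each individual test stays this cheap, which is where the speed comes from.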

RL-Adventure-2: Policy Gradients

This one is for all the reinforcement learning (RL) enthusiasts. Combined with deep learning, RL has produced agents that play Atari games at the level of a human expert. This repository covers interesting new extensions to the policy gradient algorithm, one of the favorite default choices for solving RL problems. These extensions have improved both training time and the overall performance of reinforcement learning.
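If policy gradients are new to you, the basic idea fits in a few lines. The sketch below is the plain policy gradient update (REINFORCE with a running-average baseline) on a toy two-armed bandit. It is not one of the extensions from the repository and is not taken from it; the function names, the reward values and the hyperparameters are assumptions picked for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences (logits) into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Learn a softmax policy over arms that pay fixed rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)  # policy parameters, one per action
    baseline = 0.0                    # running average of observed rewards
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = arm_rewards[action]
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        for a in range(len(prefs)):
            # Gradient of log pi(action) with respect to preference a.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad_log
    return softmax(prefs)


# Arm 1 always pays more, so the policy should come to prefer it.
final_probs = train_bandit([0.0, 1.0])
print(final_probs)
```

Actions that do better than the baseline become more likely and actions that do worse become less likely. Extensions to the algorithm generally keep this core and change how the advantage is estimated or how large each update may be.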

Reddit Discussions

Real-time Multihand Pose Estimation Demo

This thread took off as soon as the author posted the above concept in video form. It is a fascinating concept and to see it come alive using deep learning is a wonderful thing. It caught the attention of data scientists and ML enthusiasts, as you can tell by the number of questions in the thread. I encourage you to scroll through them all to get a very good idea of how this technology was implemented.

Which Research paper would you choose to show that Machine Learning is Beautiful?

If you’re new to machine learning, or are looking for papers to read or refer to, this is a magnificent thread. It mentions some excellent machine learning research papers that every data scientist, aspiring or established, will hugely benefit from. The papers range from basic machine learning concepts like Gaussian models to advanced topics like neural artistic style transfer and rapid object detection using a boosted cascade of simple features. This is essentially a MUST-READ.

What do we currently know about Generalization? What should we be asking next about it?

Generalization in deep learning has been a topic of constant debate. As the author of this post mentioned, there are still quite a few scenarios where we struggle to achieve any generalization at all. This led to a deep discussion around the current state of generalization and why it has been so hard to understand in deep and reinforcement learning. The discussion includes lengthy posts which can get a little complex if you’re new to this field. However, I suggest reading through them anyway because these are the opinions of some highly experienced and knowledgeable data scientists.

State of Machine Learning in the Healthcare Industry

This thread delves into the current state of machine learning specifically in the healthcare industry (and not in research areas). Data scientists in this industry have shared their experience and opinions about what they have seen in their work. Refer to this thread whenever anyone asks you anything about ML and DL in the life sciences domain!

Potential Career Paths for Data Scientists after 3 Years

This is a very pertinent question most people ask before getting into this field. With the rapid adoption of automated machine learning tools, will companies even need data scientists in a few years? This thread is a collection of opinions about where different people in data science see this role expanding or diversifying in the next few years. There is some excellent career advice in here so make sure you check this out.

End Notes

Some of the above Reddit discussions were really insightful, like the healthcare industry and generalization threads. I personally love curating these GitHub repositories and Reddit discussion threads because it gives me a high-level overview of all that’s happening in the ML research community.

Which repository and/or discussion did you find the most interesting out of these? Get involved and let us know in the comments section below!
