Top 5 Data Science GitHub Repositories and Reddit Discussions (January 2019)

Sharoon Saxena 24 May, 2020 • 7 min read

Introduction

There’s nothing quite like GitHub and Reddit for data science. Both platforms have been of immense help to me in my data science journey.

GitHub is the ultimate one-stop platform for hosting your code. It excels at easing the collaboration process between team members. Most leading data scientists and organizations use GitHub to open-source their libraries and frameworks. So not only do we stay up-to-date with the latest developments in our field, we get to replicate their models on our own machines!

Reddit discussions are in the same league. Leading researchers and brilliant minds come together to discuss and dissect the latest topics and breakthroughs in machine learning and data science. There is A LOT to learn from these two platforms.

I have made it a habit to check both these platforms at least twice a week. It’s changed the way I learn data science. I encourage everyone reading this to do the same!

In this article, we’ll focus on the latest open-source GitHub libraries and Reddit discussions from January 2019. Happy learning!

You can also browse through the 25 best GitHub repositories from 2018. The list contains libraries covering diverse domains, including NLP, Computer Vision, GANs and AutoML, among others.

GitHub Repositories

Flair (State-of-the-Art NLP Library)


2018 was a watershed year for Natural Language Processing (NLP). Models like ELMo and Google’s BERT were ground-breaking releases. As Sebastian Ruder said, “NLP’s ImageNet moment has arrived”!

Let’s keep that trend going into the new year! Flair is another superb NLP library that’s easy to understand and implement. And the best part? It’s very much state-of-the-art!

Flair was developed and open-sourced by Zalando Research and is based on PyTorch. The library has outperformed previous approaches on a wide range of NLP tasks:

Here, F1 (a score that balances precision and recall) is the evaluation metric. I am currently exploring this library and plan to pen down my thoughts in an article soon. Keep watching this space!
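If F1 is new to you, it is the harmonic mean of precision and recall. Here is a minimal sketch in plain Python. The function name and the counts in the toy example are my own assumptions for illustration, and this is not part of Flair's API:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return F1, the harmonic mean of precision and recall."""
    if true_positives == 0:
        # No correct predictions at all: define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up counts: 8 entities tagged correctly,
# 2 spurious predictions and 2 missed entities.
print(f1_score(8, 2, 2))
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are high, which is why it is popular for tasks like named entity recognition.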

face.evoLVe – High Performance Face Recognition Library

Face recognition algorithms for computer vision are ubiquitous in data science now. We covered a few libraries in last year’s GitHub series as well. Add this one to the growing list of face recognition libraries you must try out.

face.evoLVe is a “High Performance Face Recognition Library” based on PyTorch. It provides comprehensive functions for face related analytics and applications, including:

Face alignment (detection, landmark localization, affine transformation)
Data pre-processing (e.g., augmentation, data balancing, normalization)
Various backbones (e.g., ResNet, DenseNet, LightCNN, MobileNet, etc.)
Various losses (e.g., Softmax, Center, SphereFace, AmSoftmax, Triplet, etc.)
A bag of tricks for improving performance (e.g., training refinements, model tweaks, knowledge distillation, etc.).

This library is a must-have for the practical use and deployment of high performance deep face recognition, especially for researchers and engineers.

YOLOv3

YOLO is a supremely fast and accurate framework for performing object detection tasks. It was launched three years back and has seen a few iterations since, each better than the last.

This repository is a complete pipeline of YOLOv3 implemented in TensorFlow. This can be used on a dataset to train and evaluate your own object detection model. Below are the key highlights of this repository:

Efficient tf.data pipeline
Weights converter
Extremely fast GPU non-maximum suppression
Full training pipeline
K-means algorithm to select prior anchor boxes
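To make one of these highlights concrete, here is a rough sketch of what non-maximum suppression does: it keeps the highest-scoring box and drops any lower-scoring box that overlaps a kept box too much. This is a simple CPU version in plain Python for illustration only, not the repository's GPU implementation. The function names, the (x1, y1, x2, y2) box format and the 0.5 threshold are all my own assumptions:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    # Visit boxes from the most confident to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Toy example: the first two boxes overlap heavily, the third stands alone.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

In the toy example, the second box is dropped because it overlaps the higher-scoring first box, while the third box survives since it overlaps nothing.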

If you’re new to YOLO and are looking to understand how it works, I highly recommend checking out this essential tutorial.

FaceBoxes: A CPU Real-Time Face Detector with High Accuracy

One of the biggest challenges in computer vision is managing computational resources. Not everyone has multiple GPUs lying around. It’s been quite a hurdle to overcome.

Step up, FaceBoxes. It’s a novel face detection approach that’s shown impressive performance on both speed and accuracy using CPUs.

This repository is a PyTorch implementation of FaceBoxes. It contains the code to install, train and evaluate a face detection model. No more complaining about a lack of computation power: give FaceBoxes a try today!

Transformer-XL from Google AI

Here’s another game-changing NLP framework. It’s no surprise to see the Google AI team behind it (they’re the ones who came up with BERT as well).

Long-range dependencies have been a thorn in the side of NLP. Even with the significant progress made last year, this problem wasn’t quite dealt with. RNNs and vanilla Transformers were used, but they were not quite good enough. That gap has now been filled by Google AI’s Transformer-XL. A few key points to note about this library:

Transformer-XL learns dependencies that are about 80% longer than those learned by RNNs and 450% longer than those learned by vanilla Transformers
Even on the computational front, Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation!
Transformer-XL has better performance in perplexity (more accurate at predicting a sample) on long sequences because of long-term dependency modeling
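A quick note on perplexity, since it comes up a lot in language modeling: it is the exponential of the average negative log-probability the model assigns to each token, so lower is better. Here is a minimal sketch in plain Python. The function name and the probabilities are made up for illustration and have nothing to do with Transformer-XL's actual code or results:

```python
import math


def perplexity(token_probabilities):
    """Exponential of the average negative log-probability per token."""
    total_log_prob = sum(math.log(p) for p in token_probabilities)
    average_nll = -total_log_prob / len(token_probabilities)
    return math.exp(average_nll)


# Made-up probabilities a model assigned to the correct next token.
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

# The confident model gets the lower (better) perplexity.
print(perplexity(confident_model), perplexity(uncertain_model))
```

One way to read the number: a perplexity of about 10 means the model is, on average, as unsure as if it were choosing uniformly among 10 tokens at each step.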

This repository contains the code for Transformer-XL in both TensorFlow and PyTorch. See if you can match (or even beat) the state-of-the-art results in NLP!

There were a few other awesome data science repositories created in January. Make sure you check them out:

Reddit Discussions

Data Scientist is the new Business Analyst

Don’t be fooled by the hot-take in the headline. This is a serious discussion about the current state of data science and how it’s taught around the world.

It’s always been difficult to pin down specific labels on different data science roles. The functions and tasks vary – so who should learn exactly what? This thread looks at how educational institutes are only covering the basic concepts and claiming to teach data science.

For all of you who are in the early stage of learning – make sure you browse through this discussion. You’ll learn a lot about how recruiters perceive potential candidates holding a certification or degree from an institute claiming they are data scientists.

You’ll of course learn a bit about what a business analyst does as well, and how that’s different to the data scientist role.

What is Something in Data Science that Blew your Mind?

What is that one thing about data science that made you go “WOW”? For me, it was when I realized how I could use data science as a game-changer in the sports industry.

There are a lot of uncanny theories and facts in this discussion thread that will keep you engaged. Here are a couple of cool answers taken from the thread:

“How much of the world can be modeled with well known distributions. The fact that so many things are normally distributed makes me think we are in a simulation.”

“The first thing that ever blew my mind and wanted me to pursue a career in data science was United Airlines saving 170,000 of fuel each year by changing the type of paper used to make their in flight magazine.”

The Things Top Data Scientists Struggled with Early in their Career

Most data scientists will vouch that they had a difficult time understanding certain concepts during their initial days. Even something as straightforward as imputing missing values can become an arduous exercise in frustration.

This thread is a goldmine for all you data science enthusiasts. It features experienced data scientists sharing how they managed to learn or get past concepts they initially found hard to grasp. Some of these might even seem familiar to you:

“Hardest part was learning how the input shapes of different types (DNN, RNN, CNN) work. I think i spend ~20 hours on figuring out the RNN input shape.”

“What was and is still challenging each time, is to setup up the development environment on a system. Installing CUDA, Tensorflow, PyCharm. Those are always days of horror and despair.”

“Configuring TensorFlow to work with my GPU took hours of Googling and trial and error.”

Why do Deep Neural Networks Generalize Well?

Neural networks have long had a “black box” reputation (it’s not really true anymore). Things get even murkier when the concept expands to deep neural networks (DNNs). These DNNs are at the heart of plenty of recent state-of-the-art results, so it’s essential to understand how they work.

A key question discussed in this thread is how deep neural networks generalize so well. If you thought there was no answer to that, prepare to have your mind blown!

This thread comprises views and perspectives put forth by deep learning experts. There are a lot of links and resources included to dive deeper into the topic as well. But do note that a basic understanding of neural networks will help you get more involved in the discussion.

You can learn more about Neural Networks here.

AMA with DeepMind’s AlphaStar Team!

Google’s DeepMind stunned the world when their AlphaGo creation beat Go champion Lee Sedol. They’ve gone and done it again!

Their latest algorithm, AlphaStar, was trained on the popular game StarCraft II. AlphaStar emphatically swatted aside two professional StarCraft players, winning by an impressive 10-1 margin overall.

This Reddit discussion thread was an AMA (Ask Me Anything) hosted by two of AlphaStar’s creators at DeepMind. They discussed a wide variety of topics with the Reddit community, explaining how the algorithm works, how much training data was used, what the hardware setup was like, etc.

A couple of interesting questions covered in the discussion:

“How many games needed to be played out in order to get to the current level? Or in other words: how many games is 200 years of learning in your case?”

“What other approaches were tried? I know people were quite curious about whether any tree searches, deep environment models, or hierarchical RL techniques would be involved, and it appears none of them were; did any of them make respectable progress if tried?”

End Notes

What a way to start 2019! Progress in NLP is happening at a breakneck pace. Do watch out for my article on Flair soon. Of course, DeepMind’s AlphaStar has also been a huge breakthrough in reinforcement learning. Let’s hope this can be modelled in a real-world scenario soon.

What are your thoughts on this? Which library did you find the most useful? Let me know your feedback in the comments section below.