Top 5 Data Science GitHub Repositories and Reddit Discussions (January 2019)

Last Updated : 24 May, 2020


Introduction

There’s nothing quite like GitHub and Reddit for data science. Both platforms have been of immense help to me in my data science journey.

GitHub is the ultimate one-stop platform for hosting your code. It excels at easing the collaboration process between team members. Most leading data scientists and organizations use GitHub to open-source their libraries and frameworks. So not only do we stay up-to-date with the latest developments in our field, we get to replicate their models on our own machines!

Reddit discussions sit in the same league. Leading researchers and brilliant minds come together to discuss and dissect the latest topics and breakthroughs in machine learning and data science. There is A LOT to learn from these two platforms.

I have made it a habit to check both these platforms at least twice a week. It’s changed the way I learn data science. I encourage everyone reading this to do the same!

In this article, we’ll focus on the latest open-source GitHub libraries and Reddit discussions from January 2019. Happy learning!

You can also browse through the 25 best GitHub repositories from 2018. The list contains libraries covering diverse domains, including NLP, Computer Vision, GANs and AutoML, among others.

GitHub Repositories

Flair (State-of-the-Art NLP Library)


2018 was a watershed year for Natural Language Processing (NLP). Models like ELMo and Google’s BERT were ground-breaking releases. As Sebastian Ruder said, “NLP’s ImageNet moment has arrived”!

Let’s keep that trend going into the new year! Flair is another superb NLP library that’s easy to understand and implement. And the best part? It’s very much state-of-the-art!

Flair was developed and open-sourced by Zalando Research and is based on PyTorch. The library has outperformed previous approaches on a wide range of NLP tasks:

Here, F1 (the harmonic mean of precision and recall) is the evaluation metric. I am currently exploring this library and plan to pen down my thoughts in an article soon. Keep watching this space!
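Since F1 is the metric used to compare Flair against earlier approaches, here is a minimal sketch of how it is computed from raw prediction counts. The function name and the example counts are my own illustrations and are not taken from the Flair codebase or its reported results.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1 score from raw counts (0.0 when it is undefined)."""
    predicted = true_positives + false_positives   # everything the model flagged
    actual = true_positives + false_negatives      # everything it should have flagged
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical tagger output: 8 correct entities, 2 spurious, 2 missed
print(round(f1_score(8, 2, 2), 3))
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are high, which is why it is a popular choice for tasks like named entity recognition.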

face.evoLVe – High Performance Face Recognition Library

Face recognition algorithms for computer vision are ubiquitous in data science now. We covered a few libraries in last year’s GitHub series as well. Add this one to the growing list of face recognition libraries you must try out.

face.evoLVe is a “High Performance Face Recognition Library” based on PyTorch. It provides comprehensive functions for face related analytics and applications, including:

Face alignment (detection, landmark localization, affine transformation)
Data pre-processing (e.g., augmentation, data balancing, normalization)
Various backbones (e.g., ResNet, DenseNet, LightCNN, MobileNet, etc.)
Various losses (e.g., Softmax, Center, SphereFace, AmSoftmax, Triplet, etc.)
A bag of tricks for improving performance (e.g., training refinements, model tweaks, knowledge distillation, etc.).

This library is a must-have for the practical use and deployment of high performance deep face recognition, especially for researchers and engineers.

YOLOv3

YOLO is a supremely fast and accurate framework for performing object detection tasks. It was launched three years back and has seen a few iterations since, each better than the last.

This repository is a complete pipeline of YOLOv3 implemented in TensorFlow. This can be used on a dataset to train and evaluate your own object detection model. Below are the key highlights of this repository:

Efficient tf.data pipeline
Weights converter
Extremely fast GPU non-maximum suppression
Full training pipeline
K-means algorithm to select prior anchor boxes
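To make the non-maximum suppression highlight a little more concrete, here is a minimal pure-Python sketch of the greedy version of the idea. It is only an illustration on the CPU and is not the repository's GPU implementation. The function names, the (x1, y1, x2, y2) box format and the 0.5 threshold are my own assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not heavily overlap any box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections and one separate detection (made-up numbers)
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))
```

The same overlap measure (IoU) is also what is typically used as the distance when clustering box shapes with K-means to pick prior anchor boxes.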

If you’re new to YOLO and are looking to understand how it works, I highly recommend checking out this essential tutorial.

FaceBoxes: A CPU Real-Time Face Detector with High Accuracy

One of the biggest challenges in computer vision is managing computational resources. Not everyone has multiple GPUs lying around. It’s been quite a hurdle to overcome.

Step up, FaceBoxes. It’s a novel face detection approach that’s shown impressive performance on both speed and accuracy using CPUs.

This repository is a PyTorch implementation of FaceBoxes. It contains the code to install, train and evaluate a face detection model. No more complaining about a lack of computation power – give FaceBoxes a try today!

Transformer-XL from Google AI

Here’s another game-changing NLP framework. It’s no surprise to see the Google AI team behind it (they’re the ones who came up with BERT as well).

Long-range dependencies have been a thorn in the side of NLP. Even with the significant progress made last year, this problem wasn’t quite dealt with. RNNs and vanilla Transformers were used, but they were not quite good enough. That gap has now been filled by Google AI’s Transformer-XL. A few key points to note about this library:

Transformer-XL is able to learn long range dependencies about 80% longer than RNNs and 450% longer than vanilla Transformers
Even on the computational front, Transformer-XL is up to 1,800+ times faster than a vanilla Transformer during evaluation!
Transformer-XL has better performance in perplexity (more accurate at predicting a sample) on long sequences because of long-term dependency modeling
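Perplexity comes up in the points above, so here is a minimal sketch of how it is usually computed for a language model: the exponential of the average negative log-probability the model assigned to each actual token. The function name and the probabilities below are made up for illustration and have nothing to do with Transformer-XL's actual outputs.

```python
import math


def perplexity(token_probabilities):
    """Perplexity of a sequence, given the probability the model gave each true token."""
    if not token_probabilities:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token
    avg_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_nll)


# A model that is always choosing uniformly among 4 options has perplexity close to 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))

# A more confident (and correct) model gets a lower perplexity
print(perplexity([0.9, 0.8, 0.95, 0.85]))
```

Lower is better: a perplexity of 1 would mean the model predicted every token with complete certainty.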

This repository contains the code for Transformer-XL in both TensorFlow and PyTorch. See if you can match (or even beat) the state-of-the-art results in NLP!

There were a few other awesome data science repositories created in January. Make sure you check them out:

Reddit Discussions

Data Scientist is the new Business Analyst

Don’t be fooled by the hot-take in the headline. This is a serious discussion about the current state of data science and how it’s taught around the world.

It’s always been difficult to pin down specific labels on different data science roles. The functions and tasks vary – so who should learn exactly what? This thread looks at how educational institutes are only covering the basic concepts and claiming to teach data science.

For all of you who are in the early stage of learning – make sure you browse through this discussion. You’ll learn a lot about how recruiters perceive potential candidates holding a certification or degree from an institute claiming they are data scientists.

You’ll of course learn a bit about what a business analyst does as well, and how that’s different to the data scientist role.

What is Something in Data Science that Blew your Mind?

What is that one thing about data science that made you go “WOW”? For me, it was when I realized how I could use data science as a game-changer in the sports industry.

There are a lot of uncanny theories and facts in this discussion thread that will keep you engaged. Here are a couple of cool answers taken from the thread:

“How much of the world can be modeled with well known distributions. The fact that so many things are normally distributed makes me think we are in a simulation.”

“The first thing that ever blew my mind and wanted me to pursue a career in data science was United Airlines saving 170,000 of fuel each year by changing the type of paper used to make their in flight magazine.”

The Things Top Data Scientists Struggled with Early in their Career

Most data scientists will vouch that they had a difficult time understanding certain concepts during their initial days. Even something as straightforward as imputing missing values can become an arduous exercise in frustration.

This thread is a goldmine for all you data science enthusiasts. It features experienced data scientists sharing how they managed to learn or get past concepts they initially found hard to grasp. Some of these might even seem familiar to you:

“Hardest part was learning how the input shapes of different types (DNN, RNN, CNN) work. I think i spend ~20 hours on figuring out the RNN input shape.”

“What was and is still challenging each time, is to setup up the development environment on a system. Installing CUDA, Tensorflow, PyCharm. Those are always days of horror and despair.”

“Configuring TensorFlow to work with my GPU took hours of Googling and trial and error.”

Why do Deep Neural Networks Generalize Well?

Neural networks have long had a “black box” reputation (it’s not really true anymore). Things get even murkier when the concept expands to deep neural networks (DNNs). These DNNs are at the heart of plenty of recent state-of-the-art results, so it’s essential to understand how they work.

A key question discussed in this thread is how deep neural networks generalize so well. If you thought there was no answer to that, prepare to have your mind blown!

This thread comprises views and perspectives put forth by deep learning experts. There are a lot of links and resources included to dive deeper into the topic as well. But do note that a basic understanding of neural networks will help you get more involved in the discussion.

You can learn more about Neural Networks here.

AMA with DeepMind’s AlphaStar Team!

Google’s DeepMind stunned the world when their AlphaGo creation beat Go champion Lee Sedol. They’ve gone and done it again!

Their latest algorithm, AlphaStar, was trained on the popular StarCraft II game. AlphaStar emphatically swatted aside two professional StarCraft players, winning by an impressive 10-1 margin.

This Reddit discussion thread was an AMA (Ask Me Anything) hosted by two of the creators of DeepMind’s AlphaStar. They discussed a wide variety of topics with the Reddit community, explaining how the algorithm works, how much training data was used, what the hardware setup was like, etc.

A couple of interesting questions covered in the discussion:

“How many games needed to be played out in order to get to the current level? Or in other words: how many games is 200 years of learning in your case?”

“What other approaches were tried? I know people were quite curious about whether any tree searches, deep environment models, or hierarchical RL techniques would be involved, and it appears none of them were; did any of them make respectable progress if tried?”

End Notes

What a way to start 2019! Progress in NLP is happening at a breakneck pace. Do watch out for my article on Flair soon. Of course, DeepMind’s AlphaStar has also been a huge breakthrough in reinforcement learning. Let’s hope this can be modelled in a real-world scenario soon.

What are your thoughts on this? Which library did you find the most useful? Let me know your feedback in the comments section below.
