Top 5 Data Science GitHub Repositories and Reddit Discussions (February 2019)

Pranav Dar 12 May, 2020


Introduction

I love GitHub. I have been a daily user of the various features the platform offers. That wasn’t always the case, however.

I had vaguely heard about GitHub during my early data science learning days. The people I spoke to, even some of the influencers, espoused the value of GitHub as a code hosting / sharing / showcase platform. And since I was only just starting to learn R, I couldn’t really see the need for such a platform.

How wrong I was! GitHub is a goldmine for data science professionals, regardless of whether you’re established or just starting out. GitHub will be of tremendous help irrespective of whether you are learning / following NLP, Computer Vision, GANs or any other data science development.

I was truly won over once I realized all the big data science focused companies (Google, Facebook, Amazon, Uber, etc.) regularly open sourced their code on the platform.

It is the best way to keep up with the breakneck developments happening in our field. You even get to download the code and replicate it on your own machine! What more could a data scientist ask for?

In this article, we continue our monthly series of showcasing the best GitHub repositories and Reddit discussions from the month just gone by. February was a HUGE month in terms of open source data science libraries.

Let’s get cracking!

You should also check out our top GitHub and Reddit picks for January here:

January 2019 Edition

Top Data Science GitHub Repositories (February 2019)

StyleGAN – Generating Life-Like Human Faces

The above image seems like a typical collage – nothing to see here. What if I told you none of the people in this collection are real? That’s right – these folks do not exist.

All these faces were produced by an algorithm called StyleGAN. While GANs have been getting steadily better since their introduction in 2014, StyleGAN raises the bar by several notches. The developers have proposed two new, automated methods to quantify interpolation quality and disentanglement, and have also open sourced a massive high-quality dataset of faces.

This repository contains the official TensorFlow implementation of the algorithm. Below are a few key resources to learn more about StyleGAN:

Link	Description
http://stylegan.xyz/paper	Paper PDF.
http://stylegan.xyz/video	Result video.
http://stylegan.xyz/code	Source code.
http://stylegan.xyz/ffhq	Flickr-Faces-HQ dataset.
http://stylegan.xyz/drive	Google Drive folder.

OpenAI’s Ground-Breaking Language Model – GPT-2

GPT-2 won the unofficial “most talked about” Natural Language Processing (NLP) release award in February. The way OpenAI went about launching GPT-2 raised quite a few eyebrows. The team claims that the model works so well they cannot fully open source it for fear of malicious use.

You can imagine why that attracted headlines and questions. They have, however, released a smaller version of the model which is available on this GitHub repository we’ve linked above.

GPT-2 is a large language model with 1.5 billion parameters. The model has been trained on a dataset of 8 million web pages. The aim behind the model is to predict the next word, given all the previous words within some text. Is it state-of-the-art? We’ll have to take OpenAI’s word for it (for now).
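To make the "predict the next word" objective concrete, here is a toy sketch in Python. It is not GPT-2 and not OpenAI's code: it only counts which word follows which in a tiny made-up corpus (a bigram model) and predicts the most frequent follower. GPT-2 conditions on all the previous words using a neural network, whereas this sketch looks at a single previous word. The function names and the corpus are my own illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers


def predict_next_word(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Tiny made-up corpus, purely for illustration.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept"
model = train_bigram_model(corpus)

print(predict_next_word(model, "the"))  # most frequent word after "the"
print(predict_next_word(model, "dog"))  # unseen word, so no prediction
```

The idea carries over to larger models: learn from a lot of text which continuation is most likely, then pick (or sample) the next word.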

Here are a couple of additional resources to learn more about GPT-2:

SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color

Another GAN library?! That’s right – GANs are taking the data science world by storm. SC-FEGAN is as cool in terms of style as the StyleGAN algorithm we covered above.

The above image perfectly illustrates what SC-FEGAN does. You can edit all sorts of facial images using the deep neural network the developers have trained. We can all become artists just sitting in front of our computers!

The repository includes steps to help you build the SC-FEGAN model on your own machine. Give it a try! And if computational power is a challenge, hop over to Google Colaboratory and utilize their free GPU offering.

LazyNLP for Creating Massive Text Datasets

The premise behind LazyNLP is simple – it enables you to crawl, clean up and deduplicate websites to create massive monolingual datasets.
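To give a feel for the "clean up and deduplicate" part, here is a minimal sketch using only the Python standard library. It does not use LazyNLP's own API. The function names and sample pages are hypothetical, and it only catches exact duplicates after whitespace and case normalization.

```python
import hashlib
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Keep the first occurrence of each document, judged by a hash of its normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        fingerprint = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(doc)
    return unique


# Made-up "crawled pages" for illustration.
pages = [
    "GANs are taking the data science world by storm.",
    "GANs  are taking the data science\nworld by storm.",  # same text, different whitespace
    "Subtitles are often out of sync with the video.",
]
unique_pages = deduplicate(pages)
print(len(unique_pages))
```

Storing only fixed-size hashes keeps memory use modest on large crawls. Pages that differ by a few words would slip through this exact-match check, so a real pipeline would likely need some form of near-duplicate detection as well.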

What do I mean by massive? According to the developer, LazyNLP will allow you to create datasets larger than the one used by OpenAI for training the GPT-2 model. The full scale one. That certainly had my full attention.

This GitHub repository lists the 5 steps you’ll need to follow to create your own custom NLP dataset. If you’re in any way interested in NLP, you should definitely check out this release.

Subsync – Automating Subtitles Synchronization with the Video

How often have you found yourself frustrated at subtitles being out of sync with the video? This repository happens to be your savior in such situations.

Subsync is about “language-agnostic automatic synchronization of subtitles to video, so that subtitles are aligned to the correct starting point within the video”. The algorithm was built using the Fast Fourier Transform technique in Python.
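How can a Fourier transform help line up subtitles? Here is a rough sketch of one way it could work; it is not Subsync's actual code. I am assuming the audio and the subtitle file have each been reduced to a 0/1 sequence per time window (1 = speech detected, or subtitle on screen). The FFT then computes the cross-correlation of the two sequences, and the peak gives the best shift. The signals and function names are made up for illustration.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT. len(values) must be a power of two.
    The inverse transform is returned unscaled (no division by n)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def best_offset(reference, signal):
    """Return the delay (in samples) to apply to `signal` so it best matches `reference`.
    A negative value means `signal` should be moved earlier."""
    size = 1
    while size < len(reference) + len(signal):
        size *= 2  # zero-pad so the circular correlation behaves like a linear one
    ref_f = fft(list(reference) + [0] * (size - len(reference)))
    sig_f = fft(list(signal) + [0] * (size - len(signal)))
    # Cross-correlation theorem: corr = IFFT(FFT(reference) * conj(FFT(signal)))
    product = [a * b.conjugate() for a, b in zip(ref_f, sig_f)]
    correlation = [c.real for c in fft(product, inverse=True)]
    peak = max(range(size), key=lambda i: correlation[i])
    return peak if peak < size // 2 else peak - size


# Hypothetical data: subtitles appear 3 windows earlier than the speech in the audio.
subtitle = [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
audio = [0, 0, 0] + subtitle[:-3]
print(best_offset(audio, subtitle))
```

Trying every shift directly costs roughly n² operations, while the FFT route costs roughly n log n, which matters when a film is split into hundreds of thousands of short windows.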

Subsync works inside the VLC Media Player as well! The model takes about 20-30 seconds to train (depending on the video length).

BONUS: Flickr-Faces-HQ Dataset (FFHQ)

I wanted to include this in the article for anyone searching for high-quality images. The dataset consists of 70,000 super high-quality images (1024 x 1024). There’s a lot of variety in the faces in terms of age, ethnicity and image background.

It’s ideal for learning and experimenting with GANs. Let me know in the comments section below if you use it!

Reddit Discussions

Are you Expected to Solve Hard Coding Challenges to work in the Machine Learning Industry?

I like this question because of how relevant it is in today’s world. The thread has close to 200 comments from experienced data scientists and machine learning researchers debating whether these coding challenges are a good or bad thing in an interview round.

There’s a lot of experience here so this is a discussion you really should pay close attention to. The essential question it comes down to is – should data science/machine learning professionals be judged strictly on their coding skills or should algorithms/concepts take precedence?

We also aim to help you crack these data science interviews in our course offering. Make sure you check it out!

Key Points Every Student (Graduate or Post-Graduate) Should Keep in Mind While Pursuing Machine Learning

If you’re a full-time student trying to pursue machine learning on the side – this thread is for you. The author of the post has very lucidly written down the pain points he/she is facing in this regard. I’m sure a lot of you will relate to these challenges.

There is a lot of solid advice in this thread. I personally liked this bit:

It sounds like you may be casting too wide a net initially. If you don’t have a comfortable mental framework in place yet for how to organize information about different sub-parts of ML, then each sub-part of ML is going to be its own independent thing for you to learn about. Research papers, etc aren’t a good place to start.

The Significance of p=0.05

Have you ever wondered why the cut-off for the p-value is 0.05? This Reddit discussion initially started as a humorous one but turned into a full-blown statistics discussion pretty quickly.

This is a good place to understand the significance not just of the p-value, but of statistics in general. I really liked this comment from one of the users:

P-values are one of a number of useful indicators but that’s about it. People look to stats expecting it to have hard and fast rules, but that’s not how it works. Stats isn’t about giving you the right answer, it’s about giving you a less-wrong answer.
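A quick worked example shows how thin the 0.05 line really is. This is my own illustration, not something from the thread. Suppose we flip a coin 100 times and ask how surprising the number of heads is if the coin is fair. The sketch below computes an exact two-sided p-value from the binomial distribution, using only the standard library.

```python
from math import comb


def two_sided_p_value(heads, flips):
    """Probability, under a fair coin, of a result at least as far from flips / 2 as `heads`."""
    distance = abs(heads - flips / 2)
    extreme_outcomes = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme_outcomes / 2 ** flips


p_60 = two_sided_p_value(60, 100)
p_61 = two_sided_p_value(61, 100)
print(f"60 heads: p = {p_60:.4f}")  # just above 0.05
print(f"61 heads: p = {p_61:.4f}")  # just below 0.05
```

A single extra head moves the result from "not significant" to "significant", even though the evidence about the coin has barely changed. That is the "less-wrong answer" point in a nutshell.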

A List of Very Useful but Lesser-Known Python Libraries

We are all familiar with Pandas, NumPy and Scikit-Learn. We see these in articles all the time and rely heavily on them during our own projects. We can call them “essential” to our day-to-day work.

But there are plenty of lesser-known Python libraries that can be useful too. These libraries go under the radar for various reasons. This thread is a collection of tons of such libraries with their potential applications.

I particularly found plotnine helpful for my work with visualizations. Which one(s) did you like?

A List of Lesser-Known R Packages

I get a lot of messages from R users asking when we’ll show the tool some love. Here you go! Much like the above Python thread, this one focuses on some of the more useful R packages.

My favourite? Catterplots! The package allows you to create scatter plots with cat shaped points. It’s majestic. If you work with MS Office tools a lot, you might find the ‘officer’ package helpful.

End Notes

I have been curating this monthly list for over a year now and each month I learn something that blows my mind. I strongly encourage you to regularly check the ‘Trending’ GitHub section to see what’s doing well.

Which GitHub repository and Reddit discussion did you like? Let me know in the comment section below!

Also – if your discovery sensors want more – check out the top GitHub repositories from 2018.


Pranav Dar, Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
