Coding is one of the best things about being a data scientist. There are days when I find myself immersed in programming something from scratch. And that feeling when you see your hard work culminate in a successful model? Exhilarating and unparalleled!
But as a data scientist (or a programmer), it's equally important to create checkpoints of your code at regular intervals. It's incredibly helpful to know where you left off last time, so that if you have to roll back your code or simply branch out in a different direction, there's always a fallback option. And that's why GitHub is such an excellent platform.
The previous posts in this monthly series have expounded on why every data scientist should have an active GitHub account. Whether it’s for collaboration, resume/portfolio, or educational purposes, it’s simply the best place to enhance your coding skills and knowledge.
And now let’s get to the core of our article – machine learning code! I have picked out some really interesting repositories which I feel every data scientist should try out on their own.
Apart from coding, there are tons of aspects associated with being a data scientist. We need to be aware of the latest developments in the community, what other machine learning professionals and thought leaders are talking about, the ethical implications of working on a controversial project, and so on. That is what I aim to bring out in the Reddit discussion threads I showcase every month.
To make things easier for you, here’s the entire collection so far of the top GitHub repositories and Reddit discussions (from April onwards) we have covered each month:
Continuing our run of reinforcement learning resources in this series, here's one of the best so far – OpenAI's Spinning Up! This is an educational resource open-sourced with the aim of making deep RL easier to learn. Given how complex the field can appear to most folks, this is quite a welcome repository.
The repo contains a few handy resources:
- An introduction to RL terminology, kinds of algorithms, and basic theory
- An essay about how to grow into an RL research role
- A curated list of important papers organized by topic
- A code repo of short, standalone implementations of key algorithms
- A few exercises to get your hands dirty
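To ground the basic terminology the introduction covers (actions, rewards, value estimates, exploration vs. exploitation), here's a toy epsilon-greedy bandit in plain NumPy. This is my own illustrative sketch, not code from Spinning Up, and every name and parameter in it is made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

true_means = np.array([0.1, 0.5, 0.8])   # hidden reward probability per arm
estimates = np.zeros(3)                  # running value estimate per arm
counts = np.zeros(3)                     # how often each arm was pulled
epsilon = 0.1                            # exploration rate

for _ in range(2000):
    if rng.random() < epsilon:
        action = int(rng.integers(3))        # explore: pick a random arm
    else:
        action = int(np.argmax(estimates))   # exploit: best estimate so far
    reward = float(rng.random() < true_means[action])  # Bernoulli reward
    counts[action] += 1
    # incremental mean update of the action-value estimate
    estimates[action] += (reward - estimates[action]) / counts[action]

best_arm = int(np.argmax(estimates))
print(best_arm)  # with enough pulls, this should home in on the best arm
```

It's a far cry from deep RL, but the explore/exploit trade-off and the value-estimate update are the same ideas the repo's theory section builds on.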
This one is for all the audio/speech processing people out there. WaveGlow is a flow-based generative network for speech synthesis. In other words, it's a network (yes, a single network!) that can generate impressive, high-quality speech from mel-spectrograms.
This repo contains the PyTorch implementation of WaveGlow and a pre-trained model to get you started. It’s a really cool framework, and you can check out the below links as well if you wish to delve deeper:
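Since WaveGlow conditions on mel-spectrograms, here's a quick refresher on what those involve: a mel filterbank of triangular filters that pool linear FFT bins onto the perceptual mel scale. This is a minimal NumPy sketch of my own; the sample rate, FFT size, and mel count are illustrative defaults, not WaveGlow's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=8):
    # filter edges, evenly spaced in mel space
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):          # rising slope of the triangle
            fb[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):         # falling slope of the triangle
            fb[i - 1, b] = (right - b) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # (8, 513): one triangular filter per mel band
```

Multiplying this matrix with a power spectrogram gives you a (tiny) mel-spectrogram – the kind of representation WaveGlow turns back into audio.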
We covered the PyTorch implementation of BERT in last month's article, and here's a different take on it. For those who are new to BERT, it stands for Bidirectional Encoder Representations from Transformers. It's basically a method for pre-training language representations.
BERT has set the NLP world ablaze with its results, and the folks at Google have been kind enough to release quite a few pre-trained models to get you on your way.
This repository “uses BERT as the sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code”. It’s easy to use, extremely quick, and scales smoothly. Try it out!
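Those "two lines of code" look like the snippet below, per the bert-as-service client API. Note the assumption that you've already started the server separately (the repo's README covers the `bert-serving-start` command and the pre-trained model it needs), so this is a usage sketch rather than something you can run standalone:

```python
# Assumes a bert-as-service server is already running, e.g.:
#   bert-serving-start -model_dir /path/to/bert_model -num_worker=2
from bert_serving.client import BertClient

bc = BertClient()  # connects to the server (localhost by default)
vectors = bc.encode(['First do it', 'then do it right'])
# `vectors` is a 2D numpy array: one fixed-length vector per input sentence.
```

The fixed-length vectors can then be fed straight into any downstream classifier or similarity search.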
Quick, Draw is a popular online game developed by Google where a neural network tries to guess what you're drawing. The neural network learns from each drawing, hence increasing its already impressive ability to correctly guess the doodle. The developers have built up a HUGE dataset from the sheer number of drawings users have made previously. It's an open-source dataset which you can check out here.
And now you can build your own Quick, Draw game in Python with this repository. There is a step-by-step explanation of how to do this. Using this code, you can run an app to either draw in front of the computer’s webcam, or on a canvas.
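If you grab the dataset's simplified bitmap files, each class file stores one flattened 28x28 grayscale doodle per row. Here's a small sketch of loading and reshaping them – the helper name is mine, and I use a synthetic in-memory array as a stand-in for a real downloaded class file:

```python
import io
import numpy as np

def load_doodles(source):
    """Load a Quick, Draw bitmap-format class file into 28x28 images."""
    data = np.load(source)            # shape: (num_drawings, 784)
    return data.reshape(-1, 28, 28)   # one 28x28 grayscale image per drawing

# Demo with a synthetic array standing in for a real class file (e.g. cat.npy):
fake = np.random.randint(0, 256, size=(5, 784), dtype=np.uint8)
buf = io.BytesIO()
np.save(buf, fake)
buf.seek(0)

images = load_doodles(buf)
print(images.shape)  # (5, 28, 28)
```

From there it's a short hop to training a small CNN classifier on the doodles, which is essentially what the game's guessing network does at scale.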
GAN Dissection, pioneered by researchers at MIT’s Computer Science & Artificial Intelligence Laboratory, is a unique way of visualizing and understanding the neurons of Generative Adversarial Networks (GANs). But it isn’t just limited to that – the researchers have also created GANPaint to showcase how GAN Dissection works.
This helps you explore what a particular GAN model has learned by inspecting and manipulating its internal neurons. Check out the research paper here and the below video demo, and then head straight to the GitHub repository to dive straight into the code!
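The core intervention in GAN Dissection is turning individual convolutional units on or off and watching how the generated image changes. Here's a minimal NumPy sketch of that ablation step on a fake feature map – the shapes and unit indices are illustrative, and the real method of course operates inside a trained GAN's layers:

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.standard_normal((512, 8, 8))  # (units, height, width) at one layer

def ablate(feature_map, units):
    """Zero out the given units, as in a dissection-style intervention."""
    out = feature_map.copy()
    out[list(units)] = 0.0
    return out

edited = ablate(features, units=[3, 41, 200])
print(edited[3].max())  # the ablated unit's activations are all zero
```

In the actual repo this edit happens mid-forward-pass, so you can see which visual concepts (trees, doors, clouds) disappear from the output when their units are silenced.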
Has this question ever crossed your mind while learning basic machine learning concepts? This is one of the fundamental algorithms we come across in our initial learning days and has proven to be quite effective in ML competitions as well. But once you start going through this thread, prepare to seriously question what you’ve studied previously.
What started off as a straightforward question turned into a full-blown discussion among the top minds on Reddit. I thoroughly enjoyed browsing through the comments, and I'm sure anyone with an interest in this field (and mathematical rigour) will find it useful.
What do you do when the developer of a complex and massive neural network vanishes without leaving behind the documentation needed to understand it? This isn’t a fictional plot, but a rather common situation the original poster of the thread found himself in.
It’s a situation that happens regularly with developers but takes on a whole new level of intrigue when it comes to deep learning. This thread explores the different ways a data scientist can go about examining how a deep neural network model was initially designed. The responses range from practical to absurd, but each adds a layer of perspective which could help you one day if you ever face this predicament.
My attention was drawn to this thread by the sheer number of comments (110 at the time of writing) – what in the world could be so controversial about this topic? But once you start scrolling down, the stark difference in opinions among the debaters is mind-boggling. Apart from TensorFlow being derided as "not the best framework", there's a lot of love being shown to PyTorch (which isn't all that surprising if you've used PyTorch).
It all started when Francois Chollet posted his thoughts on GitHub and lit a (metaphorical) fire under the machine learning community.
Another OpenAI entry in this post – and yet another huge breakthrough by them. The title might not leap off the page as anything special, but it's important to understand what the OpenAI team have conjured up here. As one of the Redditors pointed out, this takes us one step closer to machines mimicking human behavior.
It took the agent around a year of total gameplay experience to beat Montezuma's Revenge at a superhuman level – pretty impressive!
This one is for all the aspiring data scientists reading this article. The author of the thread expounds on how he landed the coveted job, his background, where he studied data science, etc. After answering these standard questions, he has actually written a very nice post on what others in a similar position can do to further their ambitions.
There are some helpful comments as well if you scroll down a little bit. And of course, you can post your own question(s) to the author there.
Quite a collection this month. I found the GAN Dissection repository quite absorbing. I’m currently in the process of trying to replicate it on my own machine – should be quite the ride. I’m also keeping an eye on the ‘Reverse Engineering a Massive Neural Network’ thread as the ideas spawning there could be really helpful in case I ever find myself in that situation.
Which GitHub repository and Reddit thread stood out for you? Which one will you tackle first? Let me know in the comments section below!