The 5 Best Machine Learning GitHub Repositories & Reddit Threads from August 2018

Pranav Dar 25 May, 2020 • 7 min read

Introduction

When I started using GitHub early last year, I never imagined how useful it would become for me. Initially I only used it to upload my own code, assuming that was the extent of its usefulness. But as I joined Analytics Vidhya and my scope of research expanded, I was enthralled by how vast this platform really is.

Apart from giving me access to open source code and projects from top companies like Google, Microsoft, NVIDIA, Facebook, etc., it opened up avenues to collaborate on existing projects with fellow machine learning enthusiasts. I cannot tell you how amazing it feels to have contributed to a project that other people use. It’s a feeling like no other. And this, of course, led me to write this monthly series, which I hope you have found beneficial in your own line of work.

This month’s article contains some pretty sweet repositories. There’s a project from NVIDIA which looks at video-to-video translation, a neat Google repository that makes reinforcement learning easier to learn than ever before, and I’ve also included a useful automated object detection library. There’s a ton more below, including an entertaining R package.

In our Reddit section, we have diverse discussions ranging from multiple expert reviews of Julia to real-life data leakage stories. As a data scientist, you need to be on top of your game at all times, and that includes staying up to date with the latest developments. Reddit, and AVBytes, should definitely be on your go-to list.

You can check out the top GitHub repositories and top Reddit discussions (from April onwards) we have covered each month below:

 

GitHub Repositories

NVIDIA’s vid2vid Technique

There has been tremendous progress in the image-to-image translation field. However, the video processing field has seen few breakthroughs in recent times. Until now.

NVIDIA, already leading the way in using deep learning for image and video processing, has developed a technique that does video-to-video translation, with mind-blowing results. They have open sourced their code on GitHub so you can get started with this technique NOW. The code is a PyTorch implementation of vid2vid and you can use it for:

  • Converting semantic labels into realistic real-world videos
  • Creating multiple outputs for synthesizing people talking from edge maps
  • Generating a human body from a given pose (not just the structure, but the entire body!)

Check out our coverage of this repository here.

 

Dopamine by Google

If you’ve worked or researched in the field of reinforcement learning, you will have an idea of how difficult (if not impossible) it is to reproduce existing approaches. Dopamine is a TensorFlow framework that has been created and open sourced with the hope of accelerating progress in this field and making it more flexible and reproducible.

If you’ve been wanting to learn reinforcement learning but were scared by how complex it is, this repository comes as a golden opportunity. Available in just 15 Python files, the code comes with detailed documentation and a free dataset!

You can additionally read up on this repository here.

 

Automating Object Detection

Object detection is thriving in the deep learning community, but it can be a daunting challenge for newcomers. How many pixels and frames to map? How to increase the accuracy of a very basic model? Where do you even begin? You don’t need to fret too much about this anymore – thanks to MIT’s algorithm that automates object detection with stunning precision.

Their approach is called ‘Semantic Soft Segmentation (SSS)’. What takes an expert, say, 10 minutes to edit manually, you can now do in a matter of seconds! The above image is a nice illustration of how this algorithm works, and how it’ll look when you implement it on your machine.

View our coverage of this technique in more detail here.

 

Human Pose Estimation

Pose estimation is seeing a ton of interest from researchers this year, and institutions like MIT have published studies marking progress in this field. From helping elderly people receive the right treatment to commercial applications like making a human virtually dance, pose estimation is poised to become the next big thing commercially.

This repository is Microsoft’s official PyTorch implementation of their popular paper – Simple Baselines for Human Pose Estimation and Tracking. They have offered baseline models and benchmarks that are good enough to hopefully inspire new ideas in this line of research.

 

Chorrrds

This one is for all the R users out there. We usually download R packages from CRAN so I personally haven’t felt the need to go to GitHub, but this package is one that I found very interesting. chorrrds helps you extract, analyze, and organize music chords. It even comes pre-loaded with several music datasets.

You can install it directly from CRAN, or use the devtools package to install it from GitHub. Find out more about how to do this, along with other details, in this article.

 

Reddit Discussions

OpenAI Five Lose their First Professional Dota Game

In case you haven’t been following OpenAI in the last couple of months, their team has been hard at work trying to hype up their latest innovation: OpenAI Five. It’s a team of five neural networks working together to become better at playing Dota. And these neural networks were doing extremely well, until they ran into their first professional Dota team.

This Reddit thread looks at the team’s defeat from all angles, and the machine learning perspective really stands out. Even if you haven’t read their research paper, this thread has enough information to get you up to speed in a jiffy. There are well over 100 comments on this topic, a truly knowledge-rich discussion.

 

A Different Perspective on using Notebooks for Machine Learning Tasks

Most of us in the data science and machine learning space have used Notebooks for various tasks, like data cleaning, model building, etc. I’m actually yet to meet someone who hasn’t used Notebooks at some point in their data science journey. We don’t usually question the limitations of these notebooks, do we?

Now here’s an interesting take on why Notebooks aren’t actually as useful as we think. Make sure you scroll through the entire discussion, as there are some curious as well as insightful comments from fellow data scientists. And as a bonus, you can also check out the really well made presentation deck.

 

TensorFlow 2.0 is coming

TensorFlow 2.0 was teased a couple of weeks ago by Google and is expected to be launched in the next few months. This thread is equal parts funny and serious. TensorFlow users from around the world have given their take on what they are expecting, and what they want to see added. Quite a lot of the comments are about the usefulness of Eager Execution.

This has been a long-awaited update, so expectations are high. Will Google deliver?

 

Review of Julia for Machine Learning

The Julia programming language has been doing the rounds on social media lately after a few articles were written on how it might replace Python in the future. I’ve had requests to review the language and have directed everyone to this thread. What better place to check out the pros and cons of a programming language than a hardcore ML Reddit thread?

Rather than reading one perspective, you get access to multiple reviews, each adding a unique point of view. What I liked about this discussion was that plenty of existing Julia users have added their two cents. The consensus seems to be that it is showing a lot of promise (especially the latest release, Julia 1.0), but it has a while to go before it catches up with Python.

 

Data Leakage Stories in Real-World ML Projects

We are all so caught up in trying to solve real-world problems that we tend to forget the issues that might crop up in existing projects. You might be surprised at the kind of stories people have told here, including one where they had duplicate entries for one row, and that was making the model overfit the training data massively. There are some useful links as well for further reading on the kind of data leakage problems that have come up in the industry.
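To see why duplicate rows are such a sneaky form of leakage, here is a minimal sketch in plain Python. It is not taken from the Reddit thread; the toy dataset, the function names, and the "memorizer" model are all made up for illustration. The idea is simply that when the same record appears several times, a random train/test split scatters its copies across both sides, so a model that only memorizes the training rows can look far better on the test set than it deserves.

```python
import random


def make_rows(n_unique=100, n_copies=5, seed=0):
    """Build a toy dataset in which every record appears n_copies times.

    Each row is (record_id, label). Labels are random noise, so there is
    nothing genuine to learn from record_id (hypothetical data).
    """
    rng = random.Random(seed)
    rows = []
    for record_id in range(n_unique):
        label = rng.randint(0, 1)
        rows.extend([(record_id, label)] * n_copies)
    return rows


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and cut them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = len(shuffled) - int(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


def leaked_fraction(train, test):
    """Share of test rows that also appear verbatim in the training set."""
    seen = set(train)
    return sum(row in seen for row in test) / len(test)


def memorizer_accuracy(train, test):
    """Accuracy of a 'model' that only looks up labels it saw in training.

    Unseen record ids fall back to predicting 0.
    """
    lookup = {record_id: label for record_id, label in train}
    hits = sum(lookup.get(record_id, 0) == label for record_id, label in test)
    return hits / len(test)


def deduplicate(rows):
    """Drop exact duplicate rows while keeping the original order."""
    return list(dict.fromkeys(rows))


if __name__ == "__main__":
    rows = make_rows()

    train, test = split(rows)
    print("with duplicates    - leaked:", leaked_fraction(train, test),
          "accuracy:", memorizer_accuracy(train, test))

    clean_train, clean_test = split(deduplicate(rows))
    print("after deduplicating - leaked:", leaked_fraction(clean_train, clean_test),
          "accuracy:", memorizer_accuracy(clean_train, clean_test))
```

Because the labels here are pure noise, any high score on the duplicated split comes from memorized copies rather than from learning. Deduplicating before splitting removes that overlap, and the memorizer's score falls back to what guessing would give.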

Have you ever been a victim of data leakage? Share your story in this Reddit thread and participate in the discussion!

 

End Notes

I thoroughly enjoy putting together this article every month as I trawl through hundreds of libraries and dozens of Reddit discussion threads to bring the best of them to you. In the process, I get to learn and try out tons of new techniques and tools.

Enjoy this month’s article and I hope you experiment with a few repositories mentioned above! In case you feel there are any other libraries or Reddit threads that the community should know about, let us know in the comments section below.

 

Pranav Dar 25 May 2020

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
