GitHub and Reddit both serve as interesting discovery platforms for me. I not only learn about some of the best applications of data science, but also see how they have been written, and I hope to contribute to some of these repositories in the near future.
GitHub was acquired by Microsoft recently in a multi-billion dollar deal. GitHub has been the ultimate platform for collaboration between developers and we have seen the data science and machine learning community embrace it with equal enthusiasm. We hope this continues under Microsoft’s umbrella as well.
As for Reddit, it continues to be a wonderful source of knowledge and opinion for data scientists. People share links to their own code and other people’s code, post general data science news and research papers, and ask for help and opinions, among other things. It’s a truly powerful community that continues to provide a solid platform for interacting with fellow data science enthusiasts.
We saw a few great Reddit discussions in May, including the role of data scientists in the next 3 years and a collection of the best ML papers ever written. In the GitHub community, Intel open sourced its NLP Architect library and Microsoft unveiled ML.NET to enable machine learning for .NET developers, among other releases.
Let’s dive into the list and look at the top repositories on GitHub and intriguing discussions on Reddit that occurred last month.
You can check out the top GitHub repositories and top Reddit discussions (from April onwards) for the last four months below:
ML.NET is an open source machine learning framework that aims to make ML accessible to, you guessed it, .NET developers. It enables them to develop their own models in .NET without requiring prior experience in building machine learning models. This is currently a preview release and includes basic classification and regression algorithms.
ML.NET was originally created by Microsoft and has been used across its wide range of products, like Windows, Excel, Access, Bing, etc. This release also comes bundled with .NET APIs for various model training tasks.
NLP Architect is an open source Python library that enables data scientists to explore state-of-the-art deep learning techniques in the fields of natural language processing (NLP) and natural language understanding (NLU). It has been developed and open sourced by researchers at Intel AI Lab.
One of my favorite components of this library is the visualization component, which shows the model’s annotations in a neat fashion. Check out our coverage of NLP Architect here.
- High processing speed
- There is no need for image preprocessing prior to detection
- There is no need for the computation of integral images, image pyramid, HOG pyramid or any other similar data structure
- The face detection is based on pixel intensity comparisons encoded in a tree structure stored in a binary file
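The last point above can be made concrete with a small sketch. The Python below is not taken from the repository; it is a minimal illustration, under my own assumptions, of how a decision tree built from pixel intensity comparisons might score a candidate window. The names `pixel_test` and `evaluate_tree`, the flat array layout of the tree, and the toy values are all hypothetical.

```python
from typing import List, Sequence, Tuple

# A binary test compares the intensities of two pixels. Each pixel is given as
# a (row, col) offset from the top-left corner of the candidate window.
PixelPair = Tuple[Tuple[int, int], Tuple[int, int]]


def pixel_test(image: Sequence[Sequence[int]], top: int, left: int,
               pair: PixelPair) -> bool:
    """Return True when the first pixel is no brighter than the second."""
    (r1, c1), (r2, c2) = pair
    return image[top + r1][left + c1] <= image[top + r2][left + c2]


def evaluate_tree(image: Sequence[Sequence[int]], top: int, left: int,
                  tests: List[PixelPair], leaves: List[float]) -> float:
    """Walk a complete binary tree stored as flat arrays (an assumed layout).

    tests[i] is the pixel pair at internal node i, whose children sit at
    indices 2*i + 1 (test False) and 2*i + 2 (test True). leaves holds one
    score per leaf, ordered left to right.
    """
    node = 0
    n_internal = len(tests)
    while node < n_internal:
        go_right = pixel_test(image, top, left, tests[node])
        node = 2 * node + 1 + int(go_right)
    return leaves[node - n_internal]


# Toy 2x2 grayscale window and a depth-2 tree (3 tests, 4 leaf scores).
window = [[10, 200],
          [50, 50]]
tests = [((0, 0), (0, 1)), ((0, 0), (1, 0)), ((1, 0), (1, 1))]
leaves = [-1.0, -0.5, 0.5, 1.0]
score = evaluate_tree(window, 0, 0, tests, leaves)
```

Because each node only reads two raw pixel values, a tree like this needs no preprocessing, integral images or pyramids, which is consistent with the speed claims in the list above.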
This one is for all the reinforcement learning (RL) enthusiasts. Deep learning has propelled RL to the point where an AI agent can play Atari games at a human expert level. This repository covers interesting new extensions to the policy gradient algorithm, one of the most popular default choices for solving RL problems. These extensions have led to improvements in training time as well as in the overall performance of reinforcement learning.
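For readers who have not met policy gradients before, here is a minimal sketch of the plain (vanilla) version of the idea on a two-armed bandit. It is not code from the repository and it does not implement any of the extensions mentioned above. The function name `train_bandit`, the reward values, the learning rate and the step count are all assumptions chosen for illustration.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Vanilla policy gradient (REINFORCE) on a bandit with fixed rewards.

    rewards[a] is the (assumed deterministic) reward for pulling arm a.
    """
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(prefs)
        # Sample an action from the current policy.
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # For a softmax policy, d log pi(action) / d prefs[k]
        # equals 1[k == action] - probs[k].
        for k in range(len(prefs)):
            grad_log_prob = (1.0 if k == action else 0.0) - probs[k]
            prefs[k] += lr * reward * grad_log_prob
    return softmax(prefs)


# Arm 1 pays 1.0 and arm 0 pays nothing, so the policy should favor arm 1.
final_probs = train_bandit(rewards=[0.0, 1.0])
```

The update nudges up the probability of actions that earned reward. Extensions to the algorithm generally aim to make this kind of update more stable or more sample-efficient.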
This thread took off as soon as the author posted the above concept in video form. It is a fascinating concept, and seeing it come alive using deep learning is a wonderful thing. It caught the attention of data scientists and ML enthusiasts, as you can tell from the number of questions in the thread. I encourage you to scroll through them all to get a very good idea of how this technology was implemented.
If you’re new to machine learning, or are looking for papers to read or refer to, this is a magnificent thread. It mentions some excellent machine learning research papers that every data scientist, aspiring or established, will hugely benefit from. The papers range from basic machine learning concepts like Gaussian models to advanced concepts like neural artistic style transfer and rapid object detection using a boosted cascade of simple features. This is essentially a MUST-READ.
Generalization in deep learning has been a topic of constant debate. As the author of this post mentioned, there are still quite a few scenarios where we struggle to achieve any generalization at all. This led to a deep discussion around the current state of generalization and why it has been so hard to understand in deep and reinforcement learning. The discussion includes lengthy posts which can get a little complex if you’re new to this field. However, I suggest reading through them anyway because these are the opinions of some highly experienced and knowledgeable data scientists.
This thread delves into the current state of machine learning specifically in the healthcare industry (and not in research areas). Data scientists in this industry have shared their experience and opinions about what they have seen in their work. Refer to this thread whenever anyone asks you anything about ML and DL in the life sciences domain!
This is a very pertinent question most people ask before getting into this field. With the rapid adoption of automated machine learning tools, will companies even need data scientists in a few years? This thread is a collection of opinions about where different people in data science see this role expanding or diversifying in the next few years. There is some excellent career advice in here so make sure you check this out.
Some of the above Reddit discussions were really insightful, like the healthcare industry and generalization threads. I personally love curating these GitHub repositories and Reddit discussion threads because it gives me a high-level overview of all that’s happening in the ML research community.
Which repositories and/or discussions did you find the most interesting out of these? Get involved and let us know in the comments section below!