Don’t Miss these 5 Data Science GitHub Projects and Reddit Discussions (April Edition)

Last Updated : 27 Apr, 2020

Introduction

Data science is an ever-evolving field. As data scientists, we need to have our finger on the pulse of the latest algorithms and frameworks coming up in the community.

I’ve found GitHub to be an excellent source of knowledge in that regard. The platform helps me stay current with trending data science topics. I can also look up and download code from leading data scientists and companies – what more could a data scientist ask for? So, if you’re a:

Data science enthusiast
Machine learning practitioner
Data science manager
Deep learning expert

or any mix of the above, this article is for you. I’ve taken away the pain of having to browse through multiple repositories by picking the top data science ones here. This month’s collection has a heavy emphasis on Natural Language Processing (NLP).

I have also picked out five in-depth data science-related Reddit discussions for you. Picking the brains of data science experts is a rare opportunity, but Reddit allows us to dive into their thought process. I strongly recommend going through these discussions to improve your knowledge and industry understanding.

Want to check out the top repositories from the first three months of 2019? We’ve got you covered:

Let’s get into it!

Data Science GitHub Repositories

Sparse Transformer by OpenAI – A Superb NLP Framework

What a year this is turning out to be for OpenAI’s NLP research. They captured our attention with the release of GPT-2 in February (more on that later) and have now come up with an NLP framework that builds on top of the popular Transformer architecture.

The Sparse Transformer is a deep neural network that predicts the next item in a sequence, whether that sequence is text, an image or even audio! OpenAI reports that the initial results set new records. The algorithm uses a sparse variant of the attention mechanism (quite popular in deep learning) to extract patterns from sequences 30 times longer than what was previously possible.

Got your attention, didn’t it? This repository contains the sparse attention components of this framework. You can clone or download the repository and start working on an NLP sequence prediction problem right now. Just make sure you use Google Colab and the free GPU they offer.
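To make the idea of sparse attention a little more concrete, here is a minimal sketch of a strided attention pattern written in plain Python. It is an illustration only and not the code from the repository, which relies on GPU kernels. The function names, the single merged mask and the toy sizes are assumptions made for this example. Each position attends to its most recent neighbours plus every stride-th earlier position, instead of to every earlier position.

```python
def strided_attention_mask(seq_len, stride):
    """Build a seq_len x seq_len boolean mask.

    mask[i][j] is True when position i is allowed to attend to position j.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):  # causal: only the current and earlier positions
            distance = i - j
            is_local = distance < stride          # the most recent positions
            is_strided = distance % stride == 0   # every stride-th earlier position
            mask[i][j] = is_local or is_strided
    return mask


def count_connections(mask):
    """Count how many (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    n, stride = 16, 4
    sparse = count_connections(strided_attention_mask(n, stride))
    dense = n * (n + 1) // 2  # full causal attention
    print(f"sparse connections: {sparse}, dense causal connections: {dense}")
```

The gap between the two counts widens as the sequence grows, which is what makes much longer sequences affordable.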

Read more about the Sparse Transformer on the below links:

OpenAI’s GPT-2 in a Few Lines of Code

Ah yes. OpenAI’s GPT-2. I haven’t seen such hype around a data science release before. They only released a much smaller version of their original model (owing to fear of malicious misuse), but even that mini version has shown us how powerful GPT-2 is for NLP tasks.

There have been many attempts to replicate GPT-2’s approach but most of them are too complex or long-winded. That’s why this repository caught my eye. It’s a simple Python package that allows us to fine-tune GPT-2’s text-generating model on any text of our choosing. Check out the text below, generated using the gpt2.generate() command:

You can install gpt-2-simple directly via pip (you’ll also need TensorFlow installed):

pip3 install gpt_2_simple
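Once the package is installed, a fine-tuning run looks roughly like the sketch below. Treat it as a starting point under a few assumptions: the wrapper function finetune_and_generate and the corpus file name are hypothetical, and the model name "124M" and the step count depend on the version of gpt-2-simple you have installed and on your hardware. The import sits inside the function so the file can be loaded even on a machine without TensorFlow.

```python
import os


def finetune_and_generate(corpus_path, model_name="124M", steps=1000):
    """Fine-tune GPT-2 on a plain-text file and return one generated sample."""
    # Imported lazily so this module loads even when TensorFlow is missing.
    import gpt_2_simple as gpt2

    # Download the pretrained weights once; they are stored under ./models/.
    if not os.path.isdir(os.path.join("models", model_name)):
        gpt2.download_gpt2(model_name=model_name)

    # Start a TensorFlow session and fine-tune on the given text file.
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, corpus_path, model_name=model_name, steps=steps)

    # return_as_list=True hands back the samples instead of printing them.
    return gpt2.generate(sess, return_as_list=True)[0]


if __name__ == "__main__":
    # "my_corpus.txt" is a placeholder for your own text file.
    print(finetune_and_generate("my_corpus.txt", steps=100))
```

Fine-tuning is slow on a CPU, so a GPU runtime is the practical choice for anything beyond a quick trial.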

NeuronBlocks – Impressive NLP Deep Learning Toolkit by Microsoft

Another NLP entry this month. It just goes to show the mind-boggling pace at which advancements in NLP are happening right now.

NeuronBlocks is an NLP toolkit developed by Microsoft that helps data science teams build end-to-end pipelines for neural networks. The idea behind NeuronBlocks is to reduce the cost it takes to build deep neural network models for NLP tasks.

There are two major components that make up NeuronBlocks (use the above image as a reference):

BlockZoo: This contains popular neural network components
ModelZoo: This is a suite of NLP models for performing various tasks

You know how costly applying deep learning solutions can get. So make sure you check out NeuronBlocks and see if it works for you or your organization. The full paper describing NeuronBlocks can be read here.

CenterNet – Computer Vision using Center Point Detection

I really like this approach to object detection. Generally, detection algorithms identify objects as axis-aligned boxes in the given image. These methods enumerate a large number of candidate object locations and classify each one. This sounds fair – that’s how everyone does it, right?

Well, this approach, called CenterNet, models an object as a single point. Basically, it uses keypoint estimation to find the center point of each object and then regresses the remaining properties, such as the size of the bounding box, from that point. The authors report that CenterNet is faster and more accurate than comparable bounding box based detectors.
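Here is a small sketch of the center point idea in plain Python. It is not the code from the CenterNet repository. It assumes a network has already produced a per-cell heatmap of "object center" scores and a per-cell (width, height) prediction, and the function names, the threshold and the toy data are made up for this example. A cell counts as a center when its score is at least as high as all of its neighbours, and the box is then rebuilt around that point.

```python
def find_center_peaks(heatmap, threshold=0.5):
    """Return (row, col, score) for cells that are local maxima of a 2D heatmap."""
    rows, cols = len(heatmap), len(heatmap[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            score = heatmap[r][c]
            if score < threshold:
                continue
            # Collect the up-to-8 surrounding cells, clipped at the borders.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(score >= n for n in neighbours):
                peaks.append((r, c, score))
    # Strongest detections first.
    return sorted(peaks, key=lambda p: p[2], reverse=True)


def centers_to_boxes(peaks, sizes):
    """Rebuild (x_min, y_min, x_max, y_max, score) from centers and predicted sizes."""
    boxes = []
    for r, c, score in peaks:
        width, height = sizes[r][c]
        boxes.append((c - width / 2, r - height / 2,
                      c + width / 2, r + height / 2, score))
    return boxes


if __name__ == "__main__":
    heatmap = [
        [0.0, 0.1, 0.0, 0.0, 0.0],
        [0.1, 0.9, 0.2, 0.0, 0.0],
        [0.0, 0.2, 0.1, 0.3, 0.0],
        [0.0, 0.0, 0.3, 0.7, 0.2],
        [0.0, 0.0, 0.0, 0.2, 0.1],
    ]
    sizes = [[(2.0, 4.0)] * 5 for _ in range(5)]  # same toy size everywhere
    for box in centers_to_boxes(find_center_peaks(heatmap), sizes):
        print(box)
```

Notice that no overlapping candidate boxes are generated and filtered afterwards: every surviving peak maps directly to one detection.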

Try it out next time you’re working on an object detection problem – you’ll love it! You can read the paper explaining CenterNet here.

BentoML – Toolkit for Deploying Models!

Understanding and learning how to deploy machine learning models is a MUST for any data scientist. In fact, more and more recruiters are starting to ask deployment-related questions during data scientist interviews. If you don’t know what model deployment involves, you need to brush up right now.

BentoML is a Python library that helps you package and deploy machine learning models. You can take your model from your notebook to the production API service in 5 minutes (approximately!). The BentoML service can easily be deployed with your favorite platforms, such as Kubernetes, Docker, Airflow, AWS, Azure, etc.

It’s a flexible library. It supports popular frameworks like TensorFlow, PyTorch, scikit-learn, XGBoost, etc. You can even deploy custom frameworks using BentoML. Sounds like too good an opportunity to pass up!

This GitHub repository contains the code to get you started, plus installation instructions and a couple of examples.

Data Science Reddit Discussions

What Role do Tools like Tableau and Alteryx Play in a Data Science Organization?

Are you working in a Business Intelligence/MIS/Reporting role? Do you often find yourself working with drag-and-drop tools like Tableau, Alteryx, or Power BI? If you’re reading this article, I’m assuming you are interested in transitioning to data science.

This discussion thread, started by a slightly frustrated data analyst, dives into the role a data analyst can play in a data science project. The discussion focuses on the skills a data analyst/BI professional needs to pick up to stand any chance of switching to data science.

Hint: Learning how to code well is the #1 piece of advice.

Also, check out our comprehensive and example-filled article on the 11 steps you should follow to transition into data science.

Lessons Learned During Move from Master’s Degree to the Industry

The biggest gripe data science hiring managers have is the lack of industry experience candidates bring. Bridging the gap between academia and industry has proven to be elusive for most data science enthusiasts. MOOCs, books, articles – all of these are excellent sources of knowledge – but they don’t provide industry exposure.

This discussion, starting from the author’s post, is gold for us. I like that the author has posted an exhaustive description of his interview experience. The comments include on-point questions that draw out more information on this transition.

When ML and Data Science are the Death of a Good Company: A Cautionary Tale

The consensus these days is you can use machine learning and artificial intelligence to improve your organization’s bottom line. That’s what management feeds leadership and that brings in investment.

But what happens when management doesn’t know how to build AI and ML solutions? And doesn’t invest in first setting up the infrastructure before even thinking about machine learning? That part is often overlooked during discussions and is often fatal to a company.

This discussion is about how a company, chugging along using older programming languages and tools, suddenly decides to replace its old architecture with flashy data science scripts and tools. A cautionary tale and one you should pay heed to as you enter this industry.

Have we hit the Limits of Deep Reinforcement Learning?

I’ve seen this question being asked on multiple forums recently. It’s an understandable thought. Apart from a few breakthroughs by a tech giant every few months, we haven’t seen a lot of progress in deep reinforcement learning.

But is this true? Is this really the limit? We’ve barely started to scratch the surface and are we already done? Most of us believe there’s a lot more to come. This discussion strikes the right balance between the technical details and the bigger picture.

You can apply the lessons learned from this discussion to deep learning as well. You’ll see the similarities when the talk turns to deep neural networks.

What do Data Scientists do on a Day-to-Day Basis?

Ever wondered what a data scientist spends most of their day on? Most aspiring professionals think they’ll be building model after model. That’s a trap you need to avoid at any cost.

I like the first comment in this discussion. The person equates being a data scientist to being a lawyer. That is, there are different kinds of roles depending on which domain you’re in. So there’s no straight answer to this question.

The other comments offer a nice perspective of what data scientists are doing these days. In short, there’s a broad range of tasks that will depend entirely on what kind of project you have and the size of your team. There’s some well-intentioned sarcasm as well – I always enjoy that!

End Notes

I loved putting together this month’s edition given the sheer scope of topics we have covered. While computer vision techniques have hit a ceiling (relatively speaking), NLP continues to break through barriers. Sparse Transformer by OpenAI seems like a great NLP project to try out next.

What did you think of this month’s collection? Any data science libraries or discussions I missed out on? Hit me up in the comments section below and let’s discuss!
