Pranav Dar — June 3, 2020

Overview

  • Utilize this time and work on your data science resume with these top open-source projects
  • From Facebook AI’s computer vision framework to OpenAI’s GPT-3 model, we cover a broad range of open source data science projects

 

Introduction

“How many data science projects have you completed so far?”

This is a very common question interviewers ask in data science interviews. I have conducted several of these interviews for both data analyst and data scientist roles and this is quite often the jackpot question. This is especially true if you’re a fresher or a relative newcomer to data science.

Just doing courses or attaining certifications isn’t good enough. Almost everyone I know holds certifications in various aspects of data science. They add little value to your resume if you don’t combine them with practical experience.

And that’s where open-source data science projects play such a key role. Interviewers love applicants who pick up these projects and come up with solutions. This shows your curiosity, passion, and enthusiasm for the field. Trust me, adding data science projects to your resume will improve your chances of getting hired.


But which data science projects should you pick up? I love collecting the best projects from the previous months and bringing them to you. In this month’s edition, we’ll cover a broad range of topics, from Facebook AI’s game-changing DEtection TRansformer (DETR) framework to OpenAI’s GPT-3.

You can check out our popular ‘Getting Started with GitHub’ guide if you’re new to this platform. Also make sure you go through our previous open-source data science projects (more than 100 projects!).

 

Open Source Data Science Projects to Enhance your Resume and Application


Facebook AI’s DEtection TRansformer (DETR)

DETR by Facebook AI is easily the most intriguing open-source project released in May. The fact that it has accumulated almost 3,000 stars within a week is quite telling.

DETR, short for DEtection TRansformer, could be a game changer in the computer vision space. This framework is an innovative and efficient approach to solving object detection problems. And DETR is supremely fast and extremely efficient – a dream for data science professionals!

As our resident data scientist Prateek Joshi puts it:

“The DETR model is quite simple and you don’t have to install any library to use it. DETR treats an object detection problem as a direct set prediction problem with the help of an encoder-decoder architecture based on transformers.”

We have covered DETR in detail here to help you understand how it works underneath and how you can use it for object detection tasks. You can also check out the Colab notebook the Facebook AI team has released to see the DETR model in action.
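
To make the "direct set prediction" idea from the quote above a little more concrete: during training, DETR pairs each ground-truth object with exactly one predicted box through bipartite matching. The snippet below is only a minimal sketch of that matching step, not Facebook AI's code. It assumes boxes are plain `(cx, cy, w, h)` tuples, uses L1 distance as the only cost (the real matching cost also involves class scores and a generalized IoU term), and brute-forces the assignment instead of using the Hungarian algorithm. The function names `box_l1` and `match_predictions` are made up for this illustration.

```python
from itertools import permutations


def box_l1(box_a, box_b):
    """L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_predictions(predicted, ground_truth):
    """Find the cheapest one-to-one pairing of ground-truth boxes to predictions.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction paired with ground-truth box i. Brute force, so only
    suitable for tiny illustrative inputs.
    """
    best_cost, best_assignment = float("inf"), None
    for candidate in permutations(range(len(predicted)), len(ground_truth)):
        cost = sum(
            box_l1(predicted[pred_idx], ground_truth[gt_idx])
            for gt_idx, pred_idx in enumerate(candidate)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, candidate
    return best_assignment, best_cost


# Three predicted boxes, two real objects (values are invented).
predicted = [(0.8, 0.8, 0.2, 0.2), (0.1, 0.1, 0.2, 0.2), (0.5, 0.5, 0.3, 0.3)]
ground_truth = [(0.12, 0.1, 0.2, 0.2), (0.5, 0.52, 0.3, 0.3)]

assignment, cost = match_predictions(predicted, ground_truth)
print(assignment)  # (1, 2): prediction 0 is left unmatched
```

Predictions that end up unmatched (index 0 here) are the ones a set-prediction model learns to label as "no object", which is why no separate duplicate-removal step is needed afterwards.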

 

Real-Time Image Animation

Another fascinating open-source computer vision project. This, as the name suggests, lets us perform image animation in real-time using OpenCV. Check out this example I’ve taken from the project’s GitHub repository:

[Image: real-time image animation example from the project’s GitHub repository]

The model mimics the expression of the person in front of the camera and changes the image accordingly. It’s a brilliant use of computer vision and a project we’ll be trying out internally for sure. This kind of project will have a ton of applications in the industry, from fashion and retail to marketing and advertising.

You would need to know how PyTorch works if you’re interested in implementing this on your own. Go ahead and read our getting started with PyTorch guide to dip your toes in the water. This, by the way, will add a lot of shine to your data science resume and impress your interviewers.

The original developer has been kind enough to open source the code as well as the Colab notebook. Go ahead and experiment to your heart’s desire. That’s the best way to learn!

 

OpenAI’s GPT-3 – A Massive NLP Release!

OpenAI has done it again! After releasing GPT-2 last year and whipping up a media frenzy around it, they have published the paper for their latest Natural Language Processing (NLP) model, GPT-3! Note that the accompanying GitHub repository holds only supporting data for the paper; the model and its code have not been open-sourced.

Simply put, GPT-3 is the largest NLP model of its kind. It has 175 billion parameters (yes, you read that correctly) and is HUGE in terms of size, almost 350GB. GPT-3 is also one of the costliest models in history (an estimated $12 million to train).
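
As a quick sanity check on that size figure, here is a back-of-the-envelope calculation. It assumes each parameter is stored as a 16-bit (2-byte) float, which is my assumption rather than something stated above; the helper name `model_size_gb` is purely illustrative.

```python
PARAMETERS = 175_000_000_000  # 175 billion, as quoted above
BYTES_PER_PARAMETER = 2       # assumption: half-precision (16-bit) weights


def model_size_gb(parameters, bytes_per_parameter):
    """Approximate storage needed for the weights alone, in gigabytes."""
    return parameters * bytes_per_parameter / 1e9


print(model_size_gb(PARAMETERS, BYTES_PER_PARAMETER))  # 350.0
```

Under that assumption the weights alone come to roughly 350GB; with 32-bit floats the same arithmetic gives about 700GB.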

It’s no secret that language models require a lot of data to train on tasks that humans can pick up in seconds. Step up – GPT-3. In the official paper that talks about how GPT-3 works under the hood, OpenAI shows how scaling up language models greatly improves task-agnostic, few-shot performance.
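
"Few-shot" here means the model is shown a handful of worked examples inside the prompt itself, with no retraining, and is then asked to complete a new case. The sketch below only builds such a prompt as a string. It does not call any OpenAI API, and the `Input:`/`Output:` layout and the function name `build_few_shot_prompt` are illustrative choices of mine, not a format prescribed by OpenAI.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a prompt with a task description, worked examples, and a new query."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final block is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
```

A language model given this text would be expected to continue after the last `Output:` with the translation, having inferred the task from just two demonstrations.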

Now here’s the part that might concern a lot of data science ethics folks – GPT-3 can easily generate samples of news articles that humans will struggle to identify as fake news. In today’s interconnected world, that could be potentially disastrous. To be fair to OpenAI, they have addressed this issue in their paper.

 

Real-Time Audio Analysis using PyAudio

This open-source data science project is a personal favorite. Created and released by Xander Steenbrugge, esteemed speaker at the previous two DataHack Summits, this Python library enables us to perform real-time audio analysis.


As Xander puts it in his GitHub repository, this is:

“A simple package to do realtime audio analysis in native Python, using PyAudio and Numpy to extract and visualize FFT features from a live audio stream.”

FFT here stands for Fast Fourier Transform. It is a brilliant tool to have in your data science skillset as it opens up a wide range of problems you can work on. I encourage you to check out more about FFT here.
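
If you are curious what an FFT actually computes, here is a minimal, self-contained sketch. It is not Xander’s code and does not touch a live audio stream: it runs a textbook recursive radix-2 FFT on a synthetic 5 Hz sine wave and reports the strongest frequency. The sample rate, signal, and function names are all invented for the illustration. In practice you would reach for NumPy’s `numpy.fft.rfft` rather than writing your own.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT. len(samples) must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("number of samples must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    spectrum = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        spectrum[k] = even[k] + twiddle
        spectrum[k + n // 2] = even[k] - twiddle
    return spectrum


def dominant_frequency(samples, sample_rate):
    """Frequency (in Hz) of the strongest non-DC component of a real signal."""
    spectrum = fft(samples)
    half = len(samples) // 2  # a real signal's spectrum is mirrored past this point
    magnitudes = [abs(value) for value in spectrum[:half]]
    peak_bin = max(range(1, half), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / len(samples)


SAMPLE_RATE = 64  # samples per second (illustrative)
tone = [math.sin(2 * math.pi * 5 * t / SAMPLE_RATE) for t in range(64)]

print(dominant_frequency(tone, SAMPLE_RATE))  # 5.0
```

A real-time analyzer does essentially this on each short chunk of microphone input, then visualizes how the magnitudes change from chunk to chunk.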

We’ll be trying out PyAudio and Xander’s work at Analytics Vidhya for sure. A lot of our data science members are heavy music listeners and they can’t wait to sink their teeth into this open-source project.

If you haven’t worked with audio data before, go through the below article to learn all about it:

 

TextShot – An Awesome Python Tool for Grabbing Text

Have you ever come across images or screenshots that had text but couldn’t quite extract that text? I’m aware of a few tools that exist for this purpose but I’d rather not install any additional software on my machine!

Now, we can simply use this Python tool to grab screenshots and extract text from them. Called TextShot (nice name), this is an excellent tool to quickly gather any text data we require for our data science projects. Here’s a demo of how TextShot works:

[Image: TextShot demo]

TextShot requires you to install Google’s Tesseract on your machine. You can check out the below tutorial to learn more about how Tesseract works:
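
To give a feel for the OCR step on its own, here is a rough sketch that hands an already-saved screenshot to the Tesseract command-line program and tidies up the result. This is not TextShot’s implementation. The helper names are hypothetical, and `screenshot.png` is a placeholder path. It assumes the `tesseract` executable is on your PATH and relies on its convention of writing to standard output when the output name is `stdout`.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, language="eng"):
    """Command line asking Tesseract to print recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", language]


def clean_ocr_text(raw_text):
    """Strip surrounding whitespace from each line and drop empty lines."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


def extract_text(image_path, language="eng"):
    """Run Tesseract on an image file and return the cleaned-up text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract is not installed or not on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, language),
        capture_output=True,
        text=True,
        check=True,
    )
    return clean_ocr_text(result.stdout)


# Example usage (needs Tesseract installed and an existing image file):
# print(extract_text("screenshot.png"))
```

Keeping the command construction and the text cleanup in separate small functions makes each piece easy to check without having Tesseract installed.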

 

Machine Learning Visuals – A Brilliant Way to Communicate for Data Science Professionals

I love this open-source repository by dair.ai. A lot of newcomers (and even experienced heads) often struggle with technical and scientific communication. There is a nuance to handling scientific communication that a lot of people miss.

ML Visuals is an open-source collaborative effort to help the data science community understand and improve technical communication. This brilliant repository provides a lot of visuals, templates, and figures to help you build a perfect presentation or research paper.

The best part of this project is that you can find everything under one umbrella on Google Slides. Check out a couple of visuals I’ve taken from these slides:

[Images: two sample visuals from the ML Visuals slides]

Pretty neat stuff! If you do use anything from this project, please give credit to the developers. You should also check out this excellent article on storytelling in analytics and data science to learn more about communication skills.

 

End Notes

A lot of intriguing open-source data science projects in this month’s collection! Our entire team at Analytics Vidhya is working on either Facebook AI’s DETR or OpenAI’s GPT-3. Both offer a lot of promise in their respective fields.

Is there any project you feel the community should know about? Highlight that in the comments section below and let’s get everyone together to solve it!

About the Author

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.


15 thoughts on "6 Open Source Data Science Projects to Impress your Interviewer"

Michael Bukatin says: June 03, 2020 at 1:27 pm
GPT-3 is not open-sourced. There is no model and no code in https://github.com/openai/gpt-3 That repo just has a bit of data relevant for the paper, but the important stuff remains private to OpenAI.
Pranav Dar says: June 03, 2020 at 3:58 pm
Hey Michael, Thanks for reading the article. Yes, but they have open sourced a few key things, enough to get a taste of what the model is. They can't open source the entire thing because, 1) They use it as a marketing strategy like they did for GPT-2, and 2) There is genuine concern that it will be used for malicious purposes. I think the entire point of putting this out is to help the NLP community understand where the research is heading and what OpenAI is doing toward that end. Thanks, Pranav
Gwyra Bebe Pimentel says: June 03, 2020 at 4:55 pm
Thank you!
Pranav Dar says: June 03, 2020 at 7:23 pm
Glad you found it useful, Gwyra.
Michael Bukatin says: June 03, 2020 at 8:31 pm
Right. It’s just that we can’t use it yet, the GPT-3 paper was published just to inform us. Whereas it’s easy to use GPT-2 as is, and not too difficult to fine-tune it, etc. In this sense, it’s GPT-2 which is more like the other projects listed in your article: ready for our use.
Pranal says: June 03, 2020 at 9:00 pm
All rich domain projects briefed here.. very nice..
Nidhi says: June 04, 2020 at 1:14 am
Where should a beginner start from, for making a data science project? Kindly help!
Praveen Sharma says: June 04, 2020 at 7:31 am
R u working on Facebook AI detection transformer?
Mahtab says: June 04, 2020 at 9:11 am
Very nice Post, great content to share with us thank you.
MEMO says: June 04, 2020 at 9:51 am
I watched your article published on Google read and I liked what you write so much the projects the fields the purposes of these tools to have a better experience in writing and making new project will be very useful for any one week to have a good interview after done working on these projects and ideas thank you very much for opening my mind to this kind of projects and fields I never read or even know about it at all before . I have never been having any experience in any of these projects fields before and I want you to show me and anyone reading these words and at same situation of having any experience before how could anyone join and start having good experience to work on projects you mentioned at this article that will be like a plan road for me and for anyone to getting started on the same steps you took to reach today's experience that help you at your practical life today and helped you to be a better person where ever you are I would like to hear a lot from you and thank you again for this article
Pranav Dar says: June 04, 2020 at 11:21 am
Thanks, Mahtab!
Pranav Dar says: June 04, 2020 at 11:22 am
Hi Praveen, Our team has already written an article about it and are exploring it currently. Link to the article is here: https://www.analyticsvidhya.com/blog/2020/05/facebook-detection-transformer-detr-a-transformer-based-object-detection-approach/
Pranav Dar says: June 04, 2020 at 11:23 am
Glad you found it useful, Pranal!
Pranav Dar says: June 04, 2020 at 11:24 am
Hi Nidhi, You can check out the courses we have here: https://courses.analyticsvidhya.com/ These are comprehensive end-to-end courses plus beginner friendly ones to get you started. This is a good starting point as well: https://courses.analyticsvidhya.com/courses/a-comprehensive-learning-path-to-become-a-data-scientist-in-2020 Thanks, Pranav
Kayala Viswanath Reddy says: June 09, 2020 at 8:40 pm
It's useful
