6 Open Source Data Science Projects for Boosting your Resume

Pranav Dar 02 Mar, 2020 • 6 min read

Overview

  • Open-source data science projects are a great way to boost your resume
  • Try your hand at these 6 open source projects ranging from computer vision tasks to building visualizations in R

 

Looking for open-source data science projects?

Projects play a HUGE part in cracking data science interviews. I’ve personally conducted over a hundred interviews in the last year and quite often, the final round comes down to the quality of the candidate’s data science projects. This is especially relevant for newcomers and freshers in data science.

What kind of projects have you picked up? How did you perform on these projects? Did you beat the benchmark model? Did you experiment with the source code and build something different?

These are critical questions that might make or break your data science interview. I always encourage folks to take up a diverse range of data science projects and learn as much as possible from them.

I will cover 6 such open-source data science projects in this article. I love putting this out at the start of every month (this is the 25th edition!). You’ll see a broad range of projects here, from performing computer vision tasks using MS Excel to drawing up a unique visualization in R.

You can check out the entire archive of open source data science projects here. And here’s the collection I picked out last month.

 

6 Open-Source Data Science Projects

Computer Vision using Deep Learning Open Source Projects

What’s the last MAJOR development you remember from the computer vision space? I’ve come across articles recently saying we’ve hit the proverbial deep learning wall – and there is no way up from there.

I respectfully disagree with this. There is a LOT more to uncover and unpack in deep learning (and computer vision in particular). If you’re wondering where I’m getting this level of confidence from, wait till you check out the below open-source computer vision using deep learning projects!

There are more jobs in deep learning and computer vision than ever before, and that trend looks set to continue in 2020. Time to get on board and polish up your computer vision skills!

 

Real-Time Person Removal using TensorFlow.js

Real-time object detection has really gathered pace in the last year or so. I love the different applications we can design using real-time object detection, such as tracking a football or a player during a game.

Now here’s a really cool Hollywood-level computer vision project: removing people from complex backgrounds in real-time using deep learning! The developers of this project have used TensorFlow.js to build their model.

Check out this example:

[Image: real-time person removal with TensorFlow.js]

This was done in real-time in a web browser! That’s the beauty of TensorFlow.js. The project’s GitHub repository contains the code to run it on your own machine.

 

Computer Vision Basics in Microsoft Excel

I love this open-source computer vision project! This one is for all the folks who have written off Excel as just a spreadsheet tool. The machine learning team at Amazon has come up with this rather cool project that shows us how to perform basic computer vision tasks in Microsoft Excel.

You can detect faces and find edges and lines using the tutorial provided in the project on GitHub. Here’s a quick look at what you’ll be building in Excel:

[Image: computer vision in Microsoft Excel]

You don’t need any background in computer vision to work on this project. You will, however, need to know at least how a weighted average is calculated (and knowledge of Excel is required, of course).
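To see why a weighted average is the only prerequisite, here is a minimal Python sketch of the idea (not the Excel project's own formulas): each output cell is a weighted sum of a pixel and its eight neighbours, which is the kind of calculation a spreadsheet cell can express. The tiny image and the Sobel-style weights below are assumptions chosen purely for illustration.

```python
def weighted_sum(image, kernel, row, col):
    """Weighted sum of the 3x3 neighbourhood centred at (row, col)."""
    total = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += image[row + dr][col + dc] * kernel[dr + 1][dc + 1]
    return total


def apply_kernel(image, kernel):
    """Slide the kernel over every interior pixel (border pixels are skipped)."""
    height, width = len(image), len(image[0])
    return [
        [weighted_sum(image, kernel, r, c) for c in range(1, width - 1)]
        for r in range(1, height - 1)
    ]


# Hypothetical 5x6 grayscale image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]

# Sobel-style weights that respond to left-to-right brightness changes.
vertical_edge_kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

edges = apply_kernel(image, vertical_edge_kernel)
for row in edges:
    print(row)  # large values mark the columns where dark meets bright
```

Flat regions produce zeros because the weights cancel out, while the boundary between the dark and bright halves produces large values. That is all an edge detector of this kind does.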

So whether you’re a newcomer in deep learning and computer vision, or are coming from a software development background, this project is for you! Go ahead and try it out on your own machine and let me know about the crazy applications you build.

 

Other Open Source Data Science Projects

Here are a few non-computer vision and non-deep learning projects I wanted to highlight. These cover a range of data science topics, from data visualization in R to the importance of software engineering in machine learning.

 

ggbump – Data Visualization in R!

An R project! It’s a miracle! I’m a heavy R user and I love working with the wonderful ggplot2 library, but there haven’t been many recent updates to report. So I was thrilled when I came across ggbump last month.

ggbump is an R visualization package for, you guessed it, creating bump charts. Here’s an example of what you can draw using ggbump:

[Image: example bump chart drawn with ggbump]

Bump charts are typically used to compare two dimensions against each other using one measure value (all you Tableau folks will understand this!). The majority of use cases focus on exploring the changes in the rank of a value over time (like the bump chart above).
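The data behind a bump chart is simply a rank per item per time period. As a rough illustration, sketched in Python rather than R and using made-up scores for three hypothetical teams, the ranks that each line of a bump chart would trace can be computed like this:

```python
# Hypothetical yearly scores for three teams (made-up numbers, no ties).
scores = {
    2017: {"A": 50, "B": 70, "C": 60},
    2018: {"A": 80, "B": 65, "C": 60},
    2019: {"A": 75, "B": 90, "C": 85},
}


def rank_by_year(scores):
    """Return {year: {name: rank}}, where rank 1 is the highest score."""
    ranks = {}
    for year, values in scores.items():
        ordered = sorted(values, key=values.get, reverse=True)
        ranks[year] = {name: pos for pos, name in enumerate(ordered, start=1)}
    return ranks


ranks = rank_by_year(scores)
for name in ("A", "B", "C"):
    # One line of the bump chart: this team's rank in each successive year.
    line = [ranks[year][name] for year in sorted(ranks)]
    print(name, line)
```

Each printed list is one line on the chart: the x-axis is the year, the y-axis is the rank, and a crossing between two lines means two items swapped places.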

ggbump isn’t on CRAN yet but you can install it directly in R using the below command:

devtools::install_github("davidsjoberg/ggbump")

 

The Goodreads Machine Learning Pipeline

I’m a bibliophile so naturally, Goodreads is my go-to platform for anything related to books. I rely on it heavily for recommendations, book reviews, and much more.

So imagine my joy when I came across this awesome project on GitHub! This is an end-to-end Goodreads data pipeline for building a data lake, a data warehouse, and an analytics platform.

The Goodreads pipeline consists of the following modules:

  • GoodReads Python Wrapper
  • ETL Jobs
  • Redshift Warehouse Module
  • Analytics Module
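To give a feel for how modules like these fit together, here is a toy sketch of the extract, transform, load, and analyze flow. It is not the project's code: the review records are invented, and an in-memory SQLite database stands in for the Redshift warehouse so that the example runs anywhere.

```python
import sqlite3

# Hypothetical raw records, standing in for what an API wrapper might return.
raw_reviews = [
    {"book": "Dune", "rating": "5", "user": "u1"},
    {"book": "Dune", "rating": "4", "user": "u2"},
    {"book": "Emma", "rating": "3", "user": "u1"},
    {"book": "Emma", "rating": None, "user": "u3"},  # missing rating
]


def transform(records):
    """ETL step: drop incomplete rows and cast ratings to integers."""
    return [
        (r["book"], r["user"], int(r["rating"]))
        for r in records
        if r["rating"] is not None
    ]


def load(connection, rows):
    """Warehouse step: write the cleaned rows into a table."""
    connection.execute(
        "CREATE TABLE reviews (book TEXT, user_id TEXT, rating INTEGER)"
    )
    connection.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    connection.commit()


def average_ratings(connection):
    """Analytics step: average rating per book."""
    query = "SELECT book, AVG(rating) FROM reviews GROUP BY book ORDER BY book"
    return connection.execute(query).fetchall()


# SQLite in memory is an assumption here, used as a stand-in for a warehouse.
connection = sqlite3.connect(":memory:")
load(connection, transform(raw_reviews))
report = average_ratings(connection)
print(report)
```

The real project presumably replaces each stage with something heavier, but the shape is the same: clean the raw data once, store it in a queryable form, and run the analytics against the stored copy.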

 

Graph Neural Networks in TensorFlow 2.0

This is a fascinating project. Graphs can appear to be daunting at first, but once you get an idea of how they work, you’ll love working with them.

Graph neural networks (GNN) are behind applications like social media network analysis, knowledge trees, recommendation systems, and much more.
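The core idea is less daunting than it sounds: each node repeatedly updates its own representation using those of its neighbours. Here is a minimal sketch of one such "message passing" step in plain Python, on a made-up four-node graph. Real GNN layers add learned weights and non-linearities on top of this, and the sketch does not use the tf2_gnn API.

```python
# Hypothetical undirected graph: node -> list of neighbours.
graph = {
    "a": ["b", "c"],
    "b": ["a"],
    "c": ["a", "d"],
    "d": ["c"],
}

# One made-up scalar feature per node.
features = {"a": 1.0, "b": 3.0, "c": 5.0, "d": 7.0}


def message_passing_step(graph, features):
    """Replace each node's feature with the mean over itself and its neighbours."""
    updated = {}
    for node, neighbours in graph.items():
        values = [features[node]] + [features[n] for n in neighbours]
        updated[node] = sum(values) / len(values)
    return updated


updated = message_passing_step(graph, features)
print(updated)
```

After one step every node has mixed in information from its direct neighbours. Stacking several steps lets information travel further across the graph.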

The project’s GitHub repository provides implementations of various flavors of graph neural networks in TensorFlow 2.0. It also includes a few example training scripts to get you on your way.

You can install the Python library from pip:

pip install tf2_gnn

 

Awesome Software Engineering for Machine Learning

Software engineering is a very underrated part of the machine learning pipeline. Experts don’t discuss it, courses don’t usually cover it, and data science aspirants don’t study it.

And yet, when you sit for a data science interview, you’ll inevitably face a ton of software engineering questions. How do you set up a machine learning pipeline? What is model deployment? And so on.

This wonderful repository offers a curated list of tutorials that cover software engineering best practices for building machine learning applications. Here’s what the repository currently covers:

  • Broad Overview of Software Engineering in Machine Learning
  • Data Management
  • Model Training
  • Deployment and Operation
  • Social aspects
  • Tooling

Trust me, software engineering is a must-have skill on your data science resume. You need to get on board with this and start picking up these skills.

 

End Notes

My pick of the above open-source projects:

  • Computer Vision using Microsoft Excel
  • ggbump in R

I’ve already started working on these two on my own and would be happy to share the progress and code with the community! Let me know in the comments section below which project you’ll be picking up this month.

Pranav Dar 02 Mar 2020

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
