6 Open Source Data Science Projects That Provide an Edge to Your Portfolio

Abhiraj Suresh Last Updated : 07 Dec, 2023


Introduction

“I understand the concepts well. Why should I focus on data science projects in my data science journey?”

I have been in the data science industry for more than a year now, and this is one of the questions I hear most often, especially from people at the beginning of their journey. Personally speaking, the existence of this question is plainly immoral.

In the 21st century, there is not a single domain in the world that does not expect the candidate to have some form of self-practice that portrays his/her interest, understanding, and skill. The same is true for data science.

Data science projects are the best way to showcase your understanding of the topic to the world. The projects you do are a manifestation of your programming skills, acquired knowledge, and structured thinking. And let me tell you a little secret: “The data science projects you do serve as the key to unlock the tricky door called the interview.”

With the importance of data science peaking like never before, we bring you 6 open source data science projects published last month that can give your portfolio an edge over the others.



Open Source Data Science Projects to Enhance your Portfolio

Let us divide the projects into categories.

Open Source Computer Vision Projects

FaceX-Zoo

FaceX-Zoo has to be one of the most impressive projects of the month. With face recognition becoming more and more relevant in the realm of computer vision, FaceX-Zoo is an open-source data science project you do not want to miss.

FaceX-Zoo is a face recognition toolbox built on PyTorch. Its training module offers a choice of supervisory heads and backbones aimed at state-of-the-art face recognition. Its standardized evaluation module lets you evaluate models on most of the popular benchmarks just by editing a simple configuration.

A simple yet fully functional face SDK is also provided for validating trained models and building first applications with them. In addition, FaceX-Zoo is designed to be easy to upgrade and extend as face-related domains develop.


Bottleneck Transformer – Pytorch

Another mind-blowing project in computer vision, Bottleneck Transformer looks like a very good project to add to your data science portfolio.

The paper says:

“It is simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection, and instance segmentation”

Baseline models see significant improvement by simply replacing the spatial convolutions with global self-attention in the last 3 bottleneck blocks of a ResNet, with no other changes. Sounds promising, doesn’t it?

The Bottleneck Transformer has all the potential to serve as a strong baseline for future research in self-attention models for vision.
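To make "self-attention over a feature map" concrete, here is a minimal sketch in plain Python. It is not the repository's code: it treats every spatial position of a small H x W x C feature map as a token and mixes the tokens with scaled dot-product attention. As a simplifying assumption, the queries, keys, and values are the features themselves; the real architecture uses learned projections, multiple heads, and relative position encodings. The function names are illustrative.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    tokens: list of equal-length feature vectors, one per spatial position.
    Returns one mixed vector per token (a weighted average of all tokens).
    """
    dim = len(tokens[0])
    mixed = []
    for query in tokens:
        # Similarity of this position to every position, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        # Numerically stable softmax over the scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the attention-weighted sum of the value vectors.
        mixed.append([sum(w * value[i] for w, value in zip(weights, tokens))
                      for i in range(dim)])
    return mixed


def attend_feature_map(fmap):
    """Apply global self-attention to an H x W x C feature map (nested lists)."""
    height, width = len(fmap), len(fmap[0])
    tokens = [fmap[y][x] for y in range(height) for x in range(width)]
    mixed = self_attention(tokens)
    # Restore the original H x W layout.
    return [[mixed[y * width + x] for x in range(width)] for y in range(height)]
```

Because every position attends to every other position, the cost grows quadratically with the number of positions, which is one reason to apply it only to the small feature maps in a network's final blocks.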


StyleGAN2-ADA — Official PyTorch implementation

When generative adversarial networks are trained on too little data, the discriminator tends to overfit, causing training to diverge. This project addresses the problem with an adaptive discriminator augmentation (ADA) mechanism that stabilizes training in limited-data regimes.
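The "adaptive" part of adaptive discriminator augmentation can be pictured as a small feedback loop. The sketch below is a simplified illustration of the idea as I understand it from the ADA paper, not the repository's API: it estimates overfitting from the sign of the discriminator's outputs on real images and nudges the augmentation probability `p` up or down accordingly. The function name, the parameter names, and the default values are assumptions chosen for illustration.

```python
def update_augment_p(p, real_logit_signs, target=0.6,
                     batch_size=64, interval=4, ramp_kimg=500):
    """One controller step for the augmentation probability p.

    real_logit_signs: +1/-1 signs of discriminator outputs on real images
    collected since the last update (hypothetical input format).
    """
    # Overfitting heuristic: the mean sign approaches 1 when the
    # discriminator becomes overly confident on the training images.
    r_t = sum(real_logit_signs) / len(real_logit_signs)
    # Fixed step sized so p could ramp from 0 to 1 over ramp_kimg * 1000 images.
    step = (batch_size * interval) / (ramp_kimg * 1000)
    # +1 if overfitting exceeds the target, -1 if below it, 0 if equal.
    direction = (r_t > target) - (r_t < target)
    # Keep p a valid probability (simplified clamping).
    return min(1.0, max(0.0, p + direction * step))
```

Called once every few minibatches, a loop like this strengthens augmentation only while the discriminator shows signs of overfitting and relaxes it otherwise.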

The project comes with several promises, including:

Full support for all primary training configurations.
Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

With its gains in speed and efficiency, StyleGAN2-ADA is a nice open-source project to add to your portfolio.

Open Source Natural Language Processing Projects

Trankit

The fascinating world of NLP is not far behind when it comes to impressive open-sourced data science projects. Trankit is another popular project released last month.

Trankit is a lightweight, Transformer-based Python toolkit for multilingual Natural Language Processing. Its two main components are:

A trainable pipeline for fundamental NLP tasks over 100 languages
90 downloadable pretrained pipelines for 56 languages

Another impressive thing about Trankit is that it beats the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) in many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages without losing efficiency in memory usage and speed, making it usable amongst a larger audience.
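For a feel of how the toolkit is used, here is a minimal sketch based on the usage pattern in the project's README as I recall it; treat the exact call signatures and the shape of the returned dictionary as assumptions to verify against the version you install. The helper names `run_pipeline` and `token_texts` are my own.

```python
def run_pipeline(text, lang="english"):
    """Run a pretrained Trankit pipeline on raw text and return its result."""
    # Imported lazily: building the pipeline downloads pretrained weights.
    from trankit import Pipeline

    pipeline = Pipeline(lang=lang)
    # Calling the pipeline on a string is assumed to run all supported tasks.
    return pipeline(text)


def token_texts(doc):
    """Flatten a pipeline result into a plain list of token strings.

    Assumes the result looks like
    {'sentences': [{'tokens': [{'text': ...}, ...]}, ...]}.
    """
    return [token["text"]
            for sentence in doc["sentences"]
            for token in sentence["tokens"]]
```

A typical call would be `token_texts(run_pipeline("Hello! This is Trankit."))`; the first run can take a while because the pretrained model has to be downloaded.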

EasyNMT – Easy to use, state-of-the-art Neural Machine Translation

With easy installation, simple usage, and automatic download of pre-trained machine translation models, EasyNMT will easily make your NLP portfolio stand out.

It supports translation between 150+ languages and automatic language detection for 170+ languages, and it handles both sentence and document translation.
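In practice, translating text takes only a few lines. The sketch below follows the usage shown in the project's README as I recall it (`EasyNMT('opus-mt')` and `translate(..., target_lang=...)`); the wrapper function `translate_all` and its `model` parameter are my own additions so that the model can be created once and reused.

```python
def translate_all(sentences, target_lang="en", model=None):
    """Translate each sentence into target_lang.

    The source language is left unspecified so the library can detect it.
    Pass an existing model to avoid reloading it on every call.
    """
    if model is None:
        # Imported lazily: creating the model downloads pretrained weights.
        from easynmt import EasyNMT

        model = EasyNMT("opus-mt")
    return [model.translate(sentence, target_lang=target_lang)
            for sentence in sentences]
```

For example, `translate_all(["Dies ist ein Satz in Deutsch."])` should return a one-element list with the English translation, once the model has been downloaded.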

At present, the project provides several pre-trained models; see its repository for the current list.

Open Source Machine Learning Project

SeaLion

SeaLion is a brilliant machine learning project created to teach the concepts in an easier way, using concise algorithms that still do their tasks efficiently.

SeaLion is designed to teach today’s aspiring ML engineers the popular machine learning concepts in a way that gives both intuition and ways of application.

It is beginner-friendly when it comes to working with standard datasets like iris, breast cancer, swiss roll, the moons dataset, MNIST, etc. The algorithms in SeaLion include:

Deep Neural Networks
Regression
Dimensionality Reduction
Unsupervised Clustering
Naive Bayes
Trees
Ensemble Learning
Nearest Neighbors
Utils

Conclusion

Wow, that’s a lot of projects. My aim, as always, was to keep the projects as diverse as possible so you can pick the ones that fit into your data science journey. If you’re just beginning, I would suggest starting with the SeaLion project: it is a great chance to get a head start.

I would love to hear your thoughts on which open source project you found the most useful. Or let me know if you want me to feature any other data science projects here or in next month’s edition.

Frequently Asked Questions

Q1. What are the top open-source projects?

Some well-known open-source projects are like big, shared puzzles that people worldwide work on together. Examples include Linux, Python, TensorFlow, and Apache Kafka.

Q2. How do I find open-source projects?

Finding open-source projects is like discovering cool stuff on the internet. You can look on websites like GitHub or GitLab. There, you can use search tools or explore what’s popular. Joining online groups where people talk about these projects is another way to find them.

Q3. Can we copy open-source projects?

Yes, you can copy open-source projects, but it’s like borrowing a book from the library. These projects have rules (licenses) that say how to use them. Usually, you can copy, change, and share the code, but you must follow the rules in the project’s license. Always give credit to the original creators!

