Five data science projects to learn data science

Kunal Jain Last Updated : 27 Aug, 2021


Overview

It is important to work on different kinds of data and projects while you learn data science concepts.
Some datasets are very popular, and many more are easily available on the web.

For a more exhaustive list of datasets and data science projects, please refer to this more recent article:

24 Ultimate Data Science Projects To Boost Your Knowledge and Skills

Nothing beats the learning which happens on the job!

Whether the challenge lies in collecting the data or in cleaning it up, you can only appreciate the effort involved once you have been through the process yourself.

Hence, the best way to learn Data Science is to do Data Science. There is no substitute for it.

It doesn’t matter whether you are using R, Python or Weka: the best approach to learning data science is to learn the basics of the tool you are using (for example, how data is stored, how to access specific data points and how to manipulate data) and then just start working on a data science problem or project.
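As a rough illustration of those three basics (storage, access, manipulation), here is a minimal sketch in plain Python. The sample rows and column names are made up for the example, and a library such as pandas would normally replace the list of dictionaries used here.

```python
import csv
import io

# Made-up sample data standing in for a real CSV file.
RAW = """name,age,city
Asha,34,Pune
Ben,27,Leeds
Chen,41,Pune
"""

# How is data stored? Here: a list of dictionaries, one per row.
rows = list(csv.DictReader(io.StringIO(RAW)))

# How can you access a specific data point? By row position and column name.
second_age = int(rows[1]["age"])

# How do you manipulate data? Filter rows, then derive a new value.
pune_rows = [r for r in rows if r["city"] == "Pune"]
mean_age_pune = sum(int(r["age"]) for r in pune_rows) / len(pune_rows)

print(second_age, len(pune_rows), mean_age_pune)
```

Once these three operations feel natural in your tool of choice, you know enough to start on a real dataset.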

To help you learn data science, I have listed some of the datasets I recommend, along with the reasons I have included them in the mix. All of these datasets are available for free on the internet and provide a glimpse of how data science is changing the world we live in.

These datasets should appeal to you whether you are a newbie or a pro. Here are 5 datasets and the reasons why I recommend them:

Titanic dataset from Kaggle: This is the first dataset I recommend to any starter, and for a good reason: the problem looks simple at the outset, yet it provides a good understanding of what a typical data science project involves. Starters can work on the dataset in Excel, while pros can use advanced tools to extract hidden information and algorithms to impute some of the missing values in the dataset. Another cool aspect is that you can rank yourself against other data scientists on Kaggle to see where you stand. This dataset is just the introduction you need before you delve into the world of Kaggle.
- References:
  - Performing exploratory analysis using Pandas
  - Data Munging using Pandas
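To make the missing-value step concrete, here is a small sketch that fills missing ages with the median and then compares survival rates by sex. The rows are hypothetical and only borrow the `Survived`, `Sex` and `Age` column names from Kaggle's training file. Median imputation is one simple choice among many, not the method used in the references above.

```python
from statistics import median

# Hypothetical rows shaped like Kaggle's train.csv (not real passenger records).
passengers = [
    {"Survived": 1, "Sex": "female", "Age": 29.0},
    {"Survived": 0, "Sex": "male", "Age": None},
    {"Survived": 1, "Sex": "female", "Age": None},
    {"Survived": 0, "Sex": "male", "Age": 40.0},
    {"Survived": 1, "Sex": "male", "Age": 22.0},
]


def impute_median_age(rows):
    """Return copies of rows with missing Age replaced by the median known Age."""
    known = [r["Age"] for r in rows if r["Age"] is not None]
    fill = median(known)
    return [{**r, "Age": fill if r["Age"] is None else r["Age"]} for r in rows]


def survival_rate_by(rows, key):
    """Share of survivors within each distinct value of the given column."""
    totals = {}
    for r in rows:
        survived, count = totals.get(r[key], (0, 0))
        totals[r[key]] = (survived + r["Survived"], count + 1)
    return {k: s / c for k, (s, c) in totals.items()}


cleaned = impute_median_age(passengers)
rates = survival_rate_by(cleaned, "Sex")
print(rates)
```

A natural next experiment is to impute the age per group (for example per sex or passenger class) and check whether your Kaggle score changes.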
Learning to mine Twitter on a topic: This project is included in the list so that beginners can relate to the power of data science. With the help of Twitter and a good data science tool, you can find out what the world is saying about a particular topic. I was mesmerized when I did this for the first time. Be it movie reviews, sentiment about elections or any topic hot off the press, you can find out for yourself what people are saying. This exercise not only helps you understand some of the challenges in mining social media (especially if you are interested in text mining), it also shows you how easy it is to integrate an API into your scripts to access the information available on social media.
- Reference: Who is the world cheering for?
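Once tweets on a topic have been collected, a first pass at "what are people saying" can be as simple as counting positive and negative words. The sketch below assumes the tweets are already in a list: the tweets and the two tiny word lists are invented for illustration, and the API call that would fetch real tweets is deliberately left out.

```python
import re
from collections import Counter

# Tiny hand-made lexicons; a real project would use a much larger word list.
POSITIVE = {"love", "great", "amazing", "win"}
NEGATIVE = {"hate", "boring", "awful", "lose"}


def score(text):
    """Positive-word count minus negative-word count for one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map a tweet's score to a coarse sentiment label."""
    s = score(text)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"


# Made-up tweets standing in for the results of an API search on one topic.
tweets = [
    "Love the new movie, amazing cast!",
    "So boring. I hate sequels.",
    "Watching the movie tonight",
    "Great soundtrack but awful ending",
]

summary = Counter(label(t) for t in tweets)
print(summary)
```

Word counting misses sarcasm, negation and slang, which is exactly the kind of text-mining challenge this project is meant to expose.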
Human activity recognition using smartphone dataset: This problem makes it into the list because it is a multi-class classification problem (different from the previous two problems) and there are various solutions available on the internet to aid your learning. It is an interesting application if you have ever wondered how your smartphone knows what you are doing right now. Another reason to solve this problem is that it helps you understand a different kind of dataset: one with no missing values (because the data is collected in an automated manner), so the focus is on data munging and learning.
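One way to approach activity recognition is to treat it as classification: each window of sensor readings is summarised as a few features and assigned to the closest known activity. The sketch below uses a nearest-centroid rule on two invented features per window. The numbers, the feature choice and the two activity labels are assumptions for illustration only, and the real dataset has far more features and activities.

```python
from math import dist

# Made-up (mean, spread) acceleration features per window, with an activity label.
train = [
    ((0.02, 0.01), "sitting"),
    ((0.03, 0.02), "sitting"),
    ((0.60, 0.35), "walking"),
    ((0.70, 0.40), "walking"),
]


def fit_centroids(samples):
    """Average the feature vectors of each activity into one centroid."""
    sums = {}
    for features, activity in samples:
        total, count = sums.get(activity, ([0.0] * len(features), 0))
        sums[activity] = ([t + f for t, f in zip(total, features)], count + 1)
    return {a: tuple(t / c for t in total) for a, (total, c) in sums.items()}


def predict(centroids, features):
    """Return the activity whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda a: dist(centroids[a], features))


centroids = fit_centroids(train)
prediction = predict(centroids, (0.55, 0.30))
print(prediction)
```

Swapping this rule for a library classifier and comparing accuracy on held-out windows is a good follow-up exercise.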
Hubway Visualization challenge: This problem focuses on data visualization rather than prediction or machine learning (although no one stops you from applying those). The questions posed in the challenge help you understand the kinds of problems a business can solve with the help of Business Intelligence tools. Again, there are a bunch of interesting visualizations available on the internet that show what some of the best minds have produced.
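Most visualizations start with an aggregation. As a minimal sketch, the code below counts trips per starting hour and prints a text bar chart. The timestamps are invented and their format is an assumption, not the actual Hubway schema, and a plotting library would replace the text chart in practice.

```python
from collections import Counter
from datetime import datetime

# Made-up trip start times; the layout is an assumption, not the Hubway schema.
start_times = [
    "2012-07-02 08:05:00",
    "2012-07-02 08:40:00",
    "2012-07-02 17:15:00",
    "2012-07-02 17:30:00",
    "2012-07-02 17:55:00",
    "2012-07-02 12:10:00",
]


def trips_per_hour(timestamps):
    """Count how many trips start in each hour of the day."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M:%S").hour for t in timestamps)
    return Counter(hours)


def text_bar_chart(counts):
    """Render one line per hour, with one '#' per trip."""
    return "\n".join(f"{hour:02d}:00 {'#' * counts[hour]}" for hour in sorted(counts))


counts = trips_per_hour(start_times)
chart = text_bar_chart(counts)
print(chart)
```

The same grouping idea extends to stations, weekdays or rider types, which is where the business questions in the challenge become interesting.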
Movielens data: I couldn’t have left this dataset out. It is bigger than some of the other datasets mentioned in the article, but it is a lot of fun. The dataset is sufficient to build a recommender system and to see which movies are liked by which kind of audience.
- References:
  - Readme file: http://files.grouplens.org/papers/ml-1m-README.txt
  - Dataset: http://www.grouplens.org/system/files/ml-1m.zip
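A first step towards "which movies are liked by what kind of audience" is to join ratings with user attributes and average per group. The sketch below assumes the `::`-separated layout described in the readme (`UserID::MovieID::Rating::Timestamp` for ratings and `UserID::Gender::Age::Occupation::Zip-code` for users). The sample lines are made up, and this is a simple aggregation rather than a full recommender system.

```python
from collections import defaultdict

# Made-up lines in the "::"-separated layout described in the ml-1m readme.
RATINGS = """1::10::5::978300760
2::10::3::978302109
3::10::4::978301968
1::20::2::978300275
3::20::5::978824291"""

USERS = """1::F::1::10::48067
2::M::56::16::70072
3::M::25::15::55117"""


def parse(block):
    """Split each line of a '::'-separated block into its fields."""
    return [line.split("::") for line in block.splitlines()]


# Look-up table from user ID to gender, built from the users block.
gender_of = {user_id: gender for user_id, gender, *_ in parse(USERS)}


def mean_rating_by_gender(ratings_block):
    """Average rating for every (movie ID, gender) pair that has ratings."""
    buckets = defaultdict(list)
    for user_id, movie_id, rating, _timestamp in parse(ratings_block):
        buckets[(movie_id, gender_of[user_id])].append(int(rating))
    return {key: sum(v) / len(v) for key, v in buckets.items()}


means = mean_rating_by_gender(RATINGS)
print(means)
```

Replacing gender with age group or occupation needs only a different look-up table, and comparing users by their rating patterns is the usual next step towards recommendations.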

These are the five datasets I recommend to people starting out in the industry. They provide a healthy mix of the different types of challenges you face as a data scientist. Each of these datasets provides a bunch of learning and will probably leave you wanting more.

If you are aware of other open datasets that you recommend to people starting their journey in data science, please feel free to suggest them, along with the reasons why they should be included. If the reason is good, I’ll include them in the list.


Kunal Jain

Kunal Jain is the Founder and CEO of Analytics Vidhya, one of the world's leading communities of AI professionals. With over 17 years of experience in the field, Kunal has been instrumental in shaping the global AI landscape. His expertise spans diverse markets, from developed economies like the UK to emerging ones like India, where he has successfully led and delivered complex data-driven solutions. As a recognized thought leader, Kunal has empowered countless individuals to realize their AI ambitions through his visionary approach to AI education and community building. Before founding Analytics Vidhya, Kunal earned both his undergraduate and postgraduate degrees from IIT Bombay and held key roles at Capital One and Aviva Life Insurance across multiple geographies. His passion lies at the intersection of analytics, AI, and fostering a thriving community of data science professionals.

Sudhindra

Awesome post...for any starter like me!!

Love Shah

Awesome stuff! Thanks for sharing.

atul

Hi Kunal, I am a big fan of your work and I am very grateful to you & Analytics Vidhya for sharing all these useful information. Also please start something on recommender system or if you have any music related content dataset to make a content recommending system. It will be great help. thanks and regards, Atul Rawat

Ankit Choudhary

Try Foundations of machine learning from coursera. One of the assignments has recommender system for music and artists along with dataset