Five data science projects to learn data science

Kunal Jain 27 Aug, 2021 • 4 min read

Overview

  • It is important to actually work on different kinds of data and projects along with learning the data science concepts
  • Some datasets are very popular and a lot more are easily available on the web

 

For a more exhaustive list of datasets and data science projects, please refer to this more recent article:

24 Ultimate Data Science Projects To Boost Your Knowledge and Skills

 

Nothing beats the learning which happens on the job!

Whether it is the challenge of collecting the data or of cleaning it up, you can only appreciate the effort once you have gone through the process yourself.

Hence, the best way to learn Data Science is to do Data Science. There is no substitute for it.

It doesn’t matter whether you are using R, Python or Weka: the best approach to learning data science is to learn the basics of the tool you are using (e.g. How is data stored? How can you access specific data points? How do you manipulate data?) and then just start working on a data science problem or project.
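
As a minimal sketch of those three basics in Python (using only the standard library), the snippet below stores a few rows, reads one data point and performs two simple manipulations. The rows and column names are hypothetical and invented for illustration; they are not taken from any of the datasets discussed in this article.

```python
# Hypothetical rows, invented for illustration only.
passengers = [
    {"name": "A", "age": 22, "fare": 7.25},
    {"name": "B", "age": 38, "fare": 71.28},
    {"name": "C", "age": 26, "fare": 7.92},
]

# How is data stored? Here, as a list of dictionaries (one dictionary per row).

# How can you access a specific data point? By row index and column name.
second_age = passengers[1]["age"]

# How do you manipulate data? Filter rows, then derive a new column.
older_than_25 = [p["name"] for p in passengers if p["age"] > 25]
for p in passengers:
    p["fare_rounded"] = round(p["fare"])

print(second_age)
print(older_than_25)
```

Dedicated tools (data frames in R or Python, for instance) answer the same three questions with their own structures, so it is worth repeating this small exercise in whichever tool you pick.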

In order to help you learn data science, I have listed some of the datasets I recommend, along with the reasons why I have included them in the mix. All these datasets are available for free on the internet and provide a glimpse of how data science is changing the world we live in.

These datasets should appeal to you whether you are a newbie or a pro. Here are 5 datasets and the reasons why I recommend them:

  1. Titanic dataset from Kaggle: This is the first dataset I recommend to any beginner, and for a good reason: the problem looks simple at the outset, yet it provides a good understanding of what a typical data science project involves. Beginners can work on the dataset in Excel, while pros can use advanced tools to extract hidden information and algorithms to substitute some of the missing values in the dataset. Another cool aspect is that you can rank yourself against other data scientists on Kaggle to see where you stand. This dataset is just the introduction you need before you delve into the world of Kaggle.
  2. Learning to mine Twitter on a topic: This project is included in the list so that beginners can relate to the power of data science. With the help of Twitter and a good data science tool, you can find out what the world is saying about a particular topic. I was mesmerized when I did this for the first time. Be it reviews of movies, sentiment about elections or any hot topic off the press, you can find out for yourself what people are saying. Performing this exercise not only helps you understand some of the challenges in mining social media (especially if you are interested in text mining), it also shows you how easy it is to integrate an API into your scripts to access the information available on social media.
  3. Human activity recognition using smartphone dataset: This problem makes it into the list because it is a segmentation problem (different from the previous two problems) and there are various solutions available on the internet to aid your learning. It is an interesting application if you have ever wondered how your smartphone knows what you are doing right now. Another reason to solve this problem is that it helps you understand a different kind of problem: one where there are no missing values (because the data is collected in an automated manner), so the focus is on data munging and learning.
  4. Hubway Visualization challenge: This problem focuses on data visualization and not explicitly on prediction or machine learning (no one stops you from applying those, though). The questions posed in the challenge help you understand the kinds of problems a business can solve with the help of Business Intelligence tools. Again, there are a bunch of interesting visualizations available on the internet that show what some of the best minds have produced.
  5. MovieLens data: I couldn’t have left this dataset out. It is bigger than some of the other datasets mentioned in this article, but it is a lot of fun. The dataset is sufficient to build a recommender system and see which movies are liked by what kind of audience.
    1. Readme file: http://files.grouplens.org/papers/ml-1m-README.txt
    2. Dataset: http://www.grouplens.org/system/files/ml-1m.zip
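
As a starting point for the MovieLens item, here is a minimal Python sketch that parses ratings in the `UserID::MovieID::Rating::Timestamp` layout described in the readme and ranks movies by their mean rating. This is only a popularity baseline, not a personalized recommender, and the function names (`parse_ratings`, `mean_rating_per_movie`, `top_movies`), the `min_count` parameter and the sample lines are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def parse_ratings(lines):
    """Parse 'UserID::MovieID::Rating::Timestamp' lines into (user, movie, rating) tuples."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, movie_id, rating, _timestamp = line.split("::")
        ratings.append((int(user_id), int(movie_id), int(rating)))
    return ratings


def mean_rating_per_movie(ratings, min_count=1):
    """Return {movie_id: mean rating} for movies with at least min_count ratings."""
    totals = defaultdict(lambda: [0, 0])  # movie_id -> [sum of ratings, number of ratings]
    for _user, movie, rating in ratings:
        totals[movie][0] += rating
        totals[movie][1] += 1
    return {m: s / c for m, (s, c) in totals.items() if c >= min_count}


def top_movies(ratings, n=3, min_count=1):
    """Return up to n movie ids, best mean rating first (ties broken by movie id)."""
    means = mean_rating_per_movie(ratings, min_count)
    return sorted(means, key=lambda m: (-means[m], m))[:n]


# Hypothetical sample lines in the same layout; not real MovieLens records.
sample = [
    "1::10::5::978300760",
    "1::20::3::978302109",
    "2::10::4::978301968",
    "2::30::5::978300275",
    "3::20::2::978824291",
]
ratings = parse_ratings(sample)
means = mean_rating_per_movie(ratings)
print(top_movies(ratings, n=2))
```

To run this on the real data, pass an open file handle for the ratings file from the unzipped archive to `parse_ratings` instead of `sample`. Raising `min_count` keeps a movie with a single five-star rating from topping the list, which is usually the first problem you notice with a plain average.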

These are the five datasets I recommend to people starting out in the industry. They provide a healthy mix of the different types of challenges you face as a data scientist. Each of these datasets provides plenty of learning and will probably leave you wanting more.
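
To make one of those challenges concrete, the missing-value substitution mentioned for the Titanic dataset can be sketched in a few lines of Python. This is a minimal example that assumes median imputation, which is only one of many possible strategies; the helper name `impute_median`, the column names and the rows are hypothetical and are not taken from the Kaggle files.

```python
from statistics import median


def impute_median(rows, column):
    """Return copies of rows where None in `column` is replaced by the median of the known values."""
    known = [r[column] for r in rows if r[column] is not None]
    fill = median(known)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical rows, invented for illustration only.
rows = [
    {"passenger": 1, "age": 22.0},
    {"passenger": 2, "age": None},
    {"passenger": 3, "age": 38.0},
    {"passenger": 4, "age": 26.0},
]
filled = impute_median(rows, "age")
print([r["age"] for r in filled])
```

The function returns new dictionaries rather than editing the input, so you can always compare the imputed data against the original and try a different strategy later.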

If you are aware of other open datasets that you recommend to people starting their journey in data science, please feel free to suggest them, along with the reasons why they should be included. If the reason is good, I’ll include them in the list.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

Kunal is a postgraduate in Aerospace Engineering from IIT Bombay. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked with various tools such as SAS, SPSS, QlikView, R, Python and MATLAB.

Responses From Readers

Sudhindra 11 Nov, 2014

Awesome post...for any starter like me!!

Love Shah 11 Nov, 2014

Awesome stuff! Thanks for sharing.

atul 14 Nov, 2014

Hi Kunal, I am a big fan of your work and I am very grateful to you and Analytics Vidhya for sharing all this useful information. Also, please start something on recommender systems, or share any music-related content dataset you have for building a content recommender system. It will be a great help. Thanks and regards, Atul Rawat

Akash 01 Apr, 2015

Can someone name the best universities in world that teach the BA program?

Jigar 23 Nov, 2015

Hi guys, I recently found Analytics Vidhya and immediately loved your articles, tutorials, and all the effort you are putting into educating people in the field of analytics. Another dataset that I'd recommend is the Fuel Economy dataset from the site below. This is a car lover's dream. This dataset (although a small one) lists 37000 cars/trucks/etc. with their emission rating, mpg value, drive, etc. between 1984 and 2016: http://fueleconomy.gov/feg/download.shtml A lot of correlation, exploration and visualization is possible, linking fuel type with mpg, performance improvement over various car models, emission ratings, etc. Check it out! Thanks, Jigar

nagarjuna 17 Jan, 2016

Are there any open datasets from Facebook? If so, please share.

Catherina 06 Feb, 2016

I am a beginner studying data science. For a college term project, roughly how much time will it take to build a mini movies/songs recommender system ? Please let me know of any other useful project ideas with the estimated completion time. Thanks

Deepti Nema 01 Oct, 2017

Good information kunal..

pardhu 17 Nov, 2017

I'm unable to download the Titanic file. Can anyone help me?

pardhu 17 Nov, 2017

Also, I am new to the analytics industry and I want to do some analytics projects. Can anyone guide me?

vkr 26 Mar, 2018

Thank you for sharing dataset
