25+ websites to find datasets for data science projects

Kunal Jain 22 Aug, 2020 • 7 min read

Introduction

If one sentence summarizes the essence of learning data science, it is this:

The best way to learn data science is to apply data science.

If you are a beginner, you improve tremendously with each new project you undertake. If you are an experienced data science professional, you already know what I am talking about.

However, when I give this advice to people, they usually ask something in return: where can I get datasets for practice? They don’t realize how many datasets are openly available, or how much they can learn from working on such projects to boost their careers.

If the situation above applies to you, don’t worry! You are in the right place. This article provides a list of websites and resources whose data you can use for your own (pet) projects or even to create your own products.

How can you use these sources?

There is no end to how you can use these data sources. Their application is limited only by your creativity.

The simplest way to use them is to create data stories and publish them on the web. This would improve not only your data and visualization skills, but also your structured thinking.

On the other hand, if you are thinking about or working on a data-based product, these datasets could add power to your product by providing additional or new input data.

So, go ahead, work on these projects and share them with the larger world to showcase your data prowess!

I have divided these sources into sections to help you categorize them by application. We start with simple, generic and easy-to-handle datasets and then move to huge, industry-relevant datasets. We then provide links to datasets for specific purposes – text mining, image classification, recommendation engines etc. This should give you a holistic list of data resources.

If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.

Simple & Generic datasets to get you started

data.gov – This is the home of the U.S. Government’s open data. The site contained more than 190,000 datasets at the time of publishing. These datasets cover climate, education, energy, finance and many more areas.

data.gov.in – This is the home of the Indian Government’s open data. Find data by industry, climate, health care etc. You can check out a few visualizations for inspiration here. Depending on your country of residence, you can also look for similar open data websites from a few other governments – check them out.

World Bank – The open data from the World Bank. The platform provides several tools such as the Open Data Catalog, World Development Indicators, education indices etc.

RBI – Data available from the Reserve Bank of India. This includes several metrics on money market operations, balance of payments, use of banking and several products. A must-visit site if you come from the BFSI domain in India.

FiveThirtyEight Datasets – Here is a link to the datasets used by FiveThirtyEight in their stories. Each dataset includes the data, a dictionary explaining the data and a link to the story published by FiveThirtyEight. If you want to learn how to create data stories, it can’t get better than this.

Huge Datasets – things are getting serious now!

Amazon Web Services (AWS) datasets – Amazon provides a few big datasets, which can be used on their platform or on your local computer. You can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Song Dataset and many more. More information can be found here.

Google datasets – Google provides a few datasets as part of its BigQuery tool. These include baby names, data from GitHub public repositories, all stories & comments from Hacker News etc.

YouTube labeled video dataset – A few months back, Google Research released a labeled YouTube dataset, which consists of 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.

Datasets for predictive modeling & machine learning:

UCI Machine Learning Repository – The UCI Machine Learning Repository is clearly the most famous data repository. It is usually the first place to go if you are looking for datasets related to machine learning. The collection ranges from popular datasets like Iris and Titanic survival to recent contributions like Air Quality and GPS trajectories. The repository contains more than 350 datasets with labels such as domain and type of problem (classification / regression). You can use these filters to identify good datasets for your needs.
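As a rough illustration of a first step with a dataset like Iris, here is a minimal Python sketch that reads comma-separated rows and averages one feature per class. It assumes the file has no header row and that columns are ordered as sepal length, sepal width, petal length, petal width, class label. The inline SAMPLE text and the function name are made up for this example; it is not the full dataset.

```python
import csv
import io
from collections import defaultdict

# A few rows in the assumed layout (no header row):
# sepal length, sepal width, petal length, petal width, class label.
# This inline sample is for illustration only.
SAMPLE = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""


def mean_petal_length_by_class(text):
    """Return {class label: mean petal length} for Iris-style CSV text."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(io.StringIO(text)):
        if not row:  # skip blank lines
            continue
        label = row[4]
        totals[label] += float(row[2])  # third column is petal length
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


summary = mean_petal_length_by_class(SAMPLE)
for label, value in sorted(summary.items()):
    print(f"{label}: {value:.2f}")
```

With a downloaded file, you would pass the file contents instead of SAMPLE. Even a simple per-class summary like this already hints at which features separate the classes.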

Kaggle – Kaggle has come up with a platform where people can donate datasets and other community members can vote and run kernels / scripts on them. They have more than 350 datasets in total, with more than 200 as Featured datasets. While some of the initial datasets were already available elsewhere, I have seen a few interesting datasets on the platform that are not present at other places. Along with new datasets, another benefit is that you can see scripts and questions from community members on the same interface.

Analytics Vidhya – You can participate in and download datasets from our practice problems and hackathon problems. The problem datasets are based on real-life industry problems and are relatively small, as they are meant for 2 – 7 day hackathons. While practice problems are always available, the hackathon problems become unavailable after the hackathons end. So you need to participate in a hackathon to get access to its datasets.

Quandl – Quandl provides financial, economic and alternative data from various sources through their website, their API or direct integration with a few tools. Their datasets are classified as Open or Premium. You can access all the open datasets for free, but you need to pay for the premium datasets. If you search, you can still find good datasets on the platform. For example, stock exchange data from India is available for free.

Past KDD Cups – KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. The archives include datasets and instructions. Winners are listed for most years.

Driven Data – Driven Data finds real-world challenges where data science can be used to create a positive social impact. They then run online modeling competitions for data scientists to develop the best models to solve them. If you are interested in the use of data science for social good, this is the place to be.

Image classification datasets

The MNIST Database – The most popular dataset for image recognition, built from hand-written digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. It is typically the first dataset people use to practice image recognition.
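The MNIST files are distributed in a simple binary layout known as IDX, so a small parser is a good warm-up exercise. The sketch below is a minimal reader, assuming the usual layout: big-endian 32-bit header fields, magic number 2051 for image files (followed by image count, rows, columns) and 2049 for label files (followed by label count), then one unsigned byte per pixel or label. The function names and the tiny synthetic byte strings are made up for this example, and real files usually need to be decompressed first.

```python
import struct


def parse_idx_images(data):
    """Parse IDX image bytes into a list of images (each a list of pixel rows)."""
    # Header: magic number, image count, rows, columns (big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = data[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data length does not match header")
    size = rows * cols
    images = []
    for i in range(count):
        flat = pixels[i * size:(i + 1) * size]
        images.append([list(flat[r * cols:(r + 1) * cols]) for r in range(rows)])
    return images


def parse_idx_labels(data):
    """Parse IDX label bytes into a list of integer labels."""
    # Header: magic number, label count (big-endian uint32).
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number {magic}")
    labels = list(data[8:])
    if len(labels) != count:
        raise ValueError("label count does not match header")
    return labels


# Synthetic stand-in for real files: two 2x2 "images" and two labels.
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(len(images), labels)
```

For the real data you would read the decompressed file contents in binary mode and pass the bytes to these functions. Most machine learning libraries also ship their own MNIST loaders, so treat this only as a way to understand the file format.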

Chars74K – Here is the next level of evolution once you are past hand-written digits. This dataset is meant for character recognition in natural images. It contains 74,000 images, hence the name.

Frontal Face Images – If you have worked on the previous two projects and are able to identify digits and characters, here is the next level of challenge in image recognition: frontal face images. The images were collected by CMU & MIT and are arranged in four folders.

ImageNet – Time to build something generic now. ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns). Each node of the hierarchy is depicted by hundreds of images. Currently, the collection has an average of over five hundred images per node (and increasing).

Text Classification datasets

Twitter Sentiment Analysis – The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets. Each row is marked as 1 for positive sentiment and 0 for negative sentiment. The data is in turn based on a Kaggle competition and analysis by Nick Sanders.
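Before training any classifier on labeled tweets, it helps to check how balanced the two classes are. Here is a minimal Python sketch of that check. The column names "sentiment" and "text" and the inline sample rows are assumptions made up for this example, so adjust them to match the header of the file you actually download.

```python
import csv
import io
from collections import Counter

# Made-up rows in an assumed layout: 1 = positive, 0 = negative.
SAMPLE = """sentiment,text
1,I love this phone
0,worst service ever
1,what a great day
0,I am so tired of this
1,love it
"""


def label_balance(text, label_column="sentiment"):
    """Return {label: share of rows} for CSV text with a 0/1 label column."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(text)):
        counts[int(row[label_column])] += 1
    total = sum(counts.values())
    return {label: counts[label] / total for label in counts}


balance = label_balance(SAMPLE)
for label, share in sorted(balance.items()):
    print(f"label {label}: {share:.0%} of rows")
```

If one label dominates heavily, plain accuracy becomes a misleading metric, which is worth knowing before you build a model.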

Datasets for Recommendation Engine

MovieLens – MovieLens is a website that helps people find movies to watch. It has hundreds of thousands of registered users. Its maintainers conduct online field experiments in MovieLens in the areas of automated content recommendation, recommendation interfaces and tagging-based recommenders. These datasets are available for download and can be used to create your own recommender systems.

Jester – Datasets from an online joke recommender system.
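A sensible first baseline for ratings data is not personalised at all: rank items by their average rating, ignoring items with too few ratings. Here is a minimal Python sketch of that baseline, assuming tab-separated lines of user id, item id, rating and timestamp (the layout used by the older 100K release). The sample rows, the function name and the min_ratings threshold are made up for this example.

```python
from collections import defaultdict

# Made-up rows: user id, item id, rating, timestamp (tab-separated).
SAMPLE = (
    "1\t10\t5\t881250949\n"
    "2\t10\t4\t881250950\n"
    "3\t10\t5\t881250951\n"
    "1\t20\t3\t881250952\n"
    "2\t20\t2\t881250953\n"
    "3\t30\t5\t881250954\n"
)


def top_items(text, min_ratings=2, k=2):
    """Return up to k (item id, mean rating) pairs, best first.

    Items with fewer than min_ratings ratings are ignored, so a single
    enthusiastic rating cannot push an item to the top.
    """
    ratings = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():  # skip blank lines
            continue
        _user_id, item_id, rating, _timestamp = line.split("\t")
        ratings[int(item_id)].append(float(rating))
    means = {
        item: sum(values) / len(values)
        for item, values in ratings.items()
        if len(values) >= min_ratings
    }
    # Sort by mean rating (descending), then by item id for stable ties.
    return sorted(means.items(), key=lambda pair: (-pair[1], pair[0]))[:k]


result = top_items(SAMPLE)
for item, mean in result:
    print(f"item {item}: mean rating {mean:.2f}")
```

Any collaborative filtering model you build later should beat this popularity baseline. If it does not, the extra complexity is not paying off.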

Websites which curate lists of datasets from various sources:

KDNuggets – The dataset page on KDNuggets has long been a reference point for people looking for datasets. The list is really comprehensive; however, some of the sources no longer provide the datasets. So you will need to use your own judgment about the datasets and the sources.

Awesome Public Datasets – A GitHub repository with a comprehensive list of datasets, neatly categorized by domain, which is very helpful. However, the repository itself has no descriptions of the datasets, which would have made it even more useful.

Reddit Datasets Subreddit – Since this is a community-driven forum, it might come across as a bit messy (compared to the previous two sources). However, you can sort datasets by popularity / votes to see the most popular ones. It also has some interesting datasets and discussions.

End Notes

I hope this list of resources proves extremely useful for people looking to do pet projects or side projects. For beginners, this is definitely a gold mine. Make sure you pick a few side projects and continue to work on them. If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.

Looking forward to hearing from you.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

Kunal is a postgraduate from IIT Bombay in Aerospace Engineering. He has spent more than 10 years in the field of Data Science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked with various tools like SAS, SPSS, Qlikview, R, Python and Matlab.

Responses From Readers

Krishna 24 Nov, 2016

Great post Kunal.

Kunal Jain 24 Nov, 2016

Thanks Krishna

Terpolilli 24 Nov, 2016

Hi Kunal, thanks for the article and all the sources :) You may want to check OpenDataSoft -> http://data.opendatasoft.com or https://opendatainception.io/ as other data sources. Nicolas

Kunal Jain 27 Nov, 2016

Thanks Terpolilli. Will check it out

Doumbia 24 Nov, 2016

Thanks a lot Kunal ! That is helpfull for us learners !

Kunal Jain 27 Nov, 2016

Glad that you liked it Doumbia

Saurabh 24 Nov, 2016

Very informative... :)

Adrian 24 Nov, 2016

thanks for long list :) My favourite is a kaggle because its platform allow community to share insights and scripts.

Diego Lima 24 Nov, 2016

I just want to tell you that this article has changed my life!!! I'm very busy at the moment but my vacations are very close. I will soon dive into some of these datasets like there is no tomorrow! This is amazing! Thank you so much! Keep up the good work!

Deepak Soni 24 Nov, 2016

Thanks Kunal, This is really helpful !!

Sanmati Jain 25 Nov, 2016

Superb list.

MANIK DEY 25 Nov, 2016

wow what a great source of information. Thank you so much for sharing the information.

Abhijit Ray 25 Nov, 2016

Great data information. Thanks for sharing it.

Md. Rayhanul Islam 26 Nov, 2016

It's really helpful article. Thanks all of your articles in Analytic Vidhya. I want to request you to add following link in your article for Demographic & Health Survey Data. http://dhsprogram.com/data/available-datasets.cfm I think that will help.

Selene 26 Nov, 2016

Thanks, Kunal! This is a very useful list. An popular resource you did miss is data.world data.world is a social network for data people where you can find datasets that would fall into many of the categories you listed. :)

M. Lindauer 22 Jun, 2017

OpenML (www.openml.org) is definitely missing in this list. It's a great website with many data sets and various ways to compare results of other users.

iq 07 Aug, 2017

Hey! I need a data set of models used in CBSD for component development. Anyone knows about the source? Please tell me.

omid 09 Sep, 2017

hi ! I need dataset about smart city . I am working one algoritm for semanatic sensor in smart city , CAN I help me?

Data Science Training In Hyderabad 23 Dec, 2017

Hi, Thanks for sharing such a great article with us on Data Science
