If one sentence summarizes the essence of learning data science, it is this:
The best way to learn data science is to apply data science.
If you are a beginner, you improve tremendously with each new project you undertake. If you are an experienced data science professional, you already know what I am talking about.
However, when I give this advice, people usually ask a question in return: where can I get datasets for practice? They don’t realize how many datasets are openly available, or how much they can learn from working on such projects and how much that can boost their careers.
If this applies to you, don’t worry! You are in the right place. This article provides a list of websites and resources whose data you can use for your own (pet) projects or even to create your own products.
How can you use these sources?
There is no end to how you can use these data sources. Their application is limited only by your creativity.
The simplest way to use them is to create data stories and publish them on the web. This would improve not only your data and visualization skills, but also your structured thinking.
On the other hand, if you are thinking about or working on a data-based product, these datasets could add power to your product by providing additional or new input data.
So, go ahead, work on these projects and share them with the larger world to showcase your data prowess!
I have divided these sources into sections based on application. We start with simple, generic and easy-to-handle datasets and then move to huge, industry-relevant datasets. We then provide links to datasets for specific purposes: text mining, image classification, recommendation engines etc. This should give you a holistic list of data resources.
If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.
Simple & Generic datasets to get you started
- data.gov – This is the home of the U.S. Government’s open data. The site contained more than 190,000 datasets at the time of publishing. These datasets cover climate, education, energy, finance and many more areas.
- data.gov.in – This is the home of the Indian Government’s open data. Find data by various industries, climate, health care etc. You can check out a few visualizations for inspiration here. Depending on your country of residence, you can also look for similar open data websites run by other governments.
- World Bank – The open data from the World Bank. The platform provides several tools and resources, such as the Open Data Catalog, world development indices, education indices etc.
- RBI – Data available from the Reserve Bank of India. This includes several metrics on money market operations, balance of payments, use of banking and several products. A must-visit site if you come from the BFSI (banking, financial services and insurance) domain in India.
- Five Thirty Eight Datasets – Here is a link to datasets used by Five Thirty Eight in their stories. Each dataset includes the data, a dictionary explaining the data and a link to the story published by Five Thirty Eight. If you want to learn how to create data stories, it can’t get better than this.
Huge Datasets – things are getting serious now!
- Amazon Web Services (AWS) datasets – Amazon provides a few big datasets, which can be used on their platform or on your local computer. You can also analyze the data in the cloud using EC2 and Hadoop via EMR. Popular datasets on Amazon include the full Enron email dataset, Google Books n-grams, NASA NEX datasets, the Million Song Dataset and many more. More information can be found here.
- Google datasets – Google provides a few datasets as part of its BigQuery tool. These include baby names, data from GitHub public repositories, all stories & comments from Hacker News etc.
- YouTube labeled video dataset – A few months back, the Google Research group released a labeled YouTube dataset, which consists of 8 million YouTube video IDs and associated labels from 4800 visual entities. It comes with pre-computed, state-of-the-art vision features from billions of frames.
Datasets for predictive modeling & machine learning:
- UCI Machine Learning Repository – The UCI Machine Learning Repository is arguably the most famous data repository. It is usually the first place to go if you are looking for datasets related to machine learning. They range from popular datasets like Iris and Titanic survival to recent contributions like Air Quality and GPS trajectories. The repository contains more than 350 datasets, labeled by attributes such as domain and type of problem (classification / regression). You can use these filters to identify good datasets for your needs.
- Kaggle – Kaggle has come up with a platform where people can donate datasets and other community members can vote on them and run kernels / scripts on them. They have more than 350 datasets in total, with more than 200 as Featured datasets. While some of the initial datasets were already available elsewhere, I have seen a few interesting datasets on the platform that are not available anywhere else. Along with new datasets, another benefit is that you can see scripts and questions from community members on the same interface.
- Analytics Vidhya – You can participate in our practice problems and hackathon problems and download their datasets. The problem datasets are based on real-life industry problems and are relatively small, as they are meant for 2 to 7 day hackathons. While practice problems are always available, the hackathon problems become unavailable after the hackathons end. So, you need to participate in a hackathon to get access to its datasets.
- Quandl – Quandl provides financial, economic and alternative data from various sources through their website, their API or direct integration with a few tools. Their datasets are classified as Open or Premium. You can access all the open datasets for free, but you need to pay for the premium datasets. If you search, you can still find good free datasets on the platform. E.g. stock exchange data from India is available for free.
- Past KDD Cups – KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining. The archives include datasets and instructions. Winners are listed for most years.
- Driven Data – Driven Data finds real-world challenges where data science can be used to create a positive social impact. They then run online modeling competitions for data scientists to develop the best models to solve them. If you are interested in the use of data science for social good, this is the place to be.
Image classification datasets
- The MNIST Database – The most popular dataset for image recognition, consisting of handwritten digits. It includes a training set of 60,000 examples and a test set of 10,000 examples. This is typically the first dataset used to practice image recognition.
- Chars74K – Here is the next level of evolution once you have mastered handwritten digits. This dataset covers character recognition in natural images. It contains 74,000 images, hence the name.
- Frontal Face Images – If you have worked on the previous two projects and are able to identify digits and characters, here is the next level of challenge in image recognition: frontal face images. The images were collected by CMU & MIT and are arranged in four folders.
- ImageNet – Time to build something generic now. ImageNet is an image database organised according to the WordNet hierarchy (currently only the nouns). Each node of the hierarchy is depicted by hundreds of images. Currently, the collection has an average of over five hundred images per node (and increasing).
Text Classification datasets
- Spam – Non Spam – An interesting problem with 1324 SMS messages (spam and non-spam). You need to build a classifier that labels each SMS as spam or non-spam.
- Twitter Sentiment Analysis – The Twitter Sentiment Analysis Dataset contains 1,578,627 classified tweets. Each row is marked as 1 for positive sentiment and 0 for negative sentiment. The data is in turn based on a Kaggle competition and analysis by Nick Sanders.
- Movie Review Data – This site provides collections of movie-review documents labeled by their overall sentiment polarity (positive or negative) or subjective rating (e.g., “two and a half stars”), and sentences labeled by their subjectivity status (subjective or objective) or polarity.
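To give a feel for what a first attempt at the spam / non-spam problem might look like, here is a minimal sketch of a multinomial naive Bayes classifier using only the Python standard library. Naive Bayes is my choice for illustration, not something the dataset prescribes. The four messages and their labels below are made-up toy data (1 = spam, 0 = non-spam), and the function names `train_naive_bayes` and `predict` are hypothetical. With the real data you would load the SMS file instead.

```python
import math
from collections import Counter

def train_naive_bayes(messages, labels):
    """Count words per class (0 = non-spam, 1 = spam)."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the label with the higher log-probability score."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        # Log prior for the class
        score = math.log(class_counts[label] / total)
        # Laplace (add-one) smoothing over the vocabulary
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Toy, made-up training data (assumption: not taken from the real dataset)
messages = [
    "win a free prize now",
    "free cash click now",
    "are we still meeting for lunch",
    "call me when you get home",
]
labels = [1, 1, 0, 0]

model = train_naive_bayes(messages, labels)
print(predict("free prize now", model))
print(predict("meeting for lunch", model))
```

The same 1 / 0 labeling scheme matches the Twitter sentiment data described above, so a sketch like this can be a starting point there too, before moving on to a proper library and a held-out test set.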
Datasets for Recommendation Engine
- MovieLens – MovieLens is a web site that helps people find movies to watch. It has hundreds of thousands of registered users. The team behind it conducts online field experiments in MovieLens in the areas of automated content recommendation, recommendation interfaces and tagging-based recommenders. These datasets are available for download and can be used to create your own recommender systems.
- Jester – Datasets from an online joke recommender system.
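If you have never built a recommender before, the sketch below shows one simple approach, user-based collaborative filtering with cosine similarity, in plain Python. It is only an illustration under stated assumptions: the `ratings` dictionary is made-up toy data (not MovieLens or Jester), and the helper names `cosine` and `recommend` are hypothetical. The real datasets would need to be loaded into a similar user-to-item-ratings structure first.

```python
import math

# Toy, made-up ratings: user -> {item: rating} (assumption, not real data)
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 5},
    "cat": {"A": 1, "C": 5, "E": 4},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings, top_n=1):
    """Rank unseen items by the similarity-weighted average rating of other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # skip items the user has already rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

print(recommend("ann", ratings, top_n=2))
```

On real data you would also want to handle rating scale differences between users and evaluate against held-out ratings. This sketch skips both to keep the idea visible.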
Websites which curate lists of datasets from various sources:
- KDNuggets – The dataset page on KDNuggets has long been a reference point for people looking for datasets. It is a really comprehensive list, but some of the sources no longer provide the datasets. So, you will need to use your own judgment about the datasets and the sources.
- Awesome Public Datasets – A GitHub repository with a comprehensive list of datasets, neatly categorized by domain, which is very helpful. However, the repository itself carries no descriptions of the datasets, which would have made it even more useful.
- Reddit Datasets Subreddit – Since this is a community-driven forum, it might come across as a bit messy (compared to the previous two sources). However, you can sort datasets by popularity / votes to see the most popular ones. It also has some interesting datasets and discussions.
I hope that this list of resources proves extremely useful for people looking for pet projects or side projects. For starters, this is definitely a gold mine. Make sure you pick a few side projects and continue to work on them. If you can think of any application of these datasets or know of any popular resources which I have missed, please feel free to share them with me in the comments below.
Looking forward to hearing from you.