- It is important to work on different kinds of data and projects while you learn data science concepts.
- Some datasets are very popular, and many more are easily available on the web.
For a more exhaustive list of datasets and data science projects, please refer to this more recent article:
Nothing beats the learning which happens on the job!
Whether the challenge is collecting the data or cleaning it up, you can only appreciate the effort involved once you have been through the process yourself.
Hence, the best way to learn Data Science is to do Data Science. There is no substitute for it.
It doesn’t matter whether you are using R, Python or Weka: the best approach to learning data science is to learn the basics of the tool you are using (e.g. How is data stored? How can you access specific data points? How do you manipulate data?) and then just start working on a data science problem or project.
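To make those three basics concrete, here is a minimal sketch in Python. It uses only the standard library so that it runs anywhere; in practice you would probably reach for a dedicated data library. The records, the column names and the helper functions (`load_rows`, `to_float`, `fill_missing_with_median`) are all invented for illustration and do not come from any real dataset. Filling gaps with the median is just one simple choice among many.

```python
import csv
import io
import statistics

# Invented toy records for illustration only; not taken from any real dataset.
RAW_CSV = """name,age,fare
Alice,29,72.5
Bob,,8.05
Carol,41,13.0
Dave,35,
"""


def load_rows(text):
    """How is data stored? Here: a list of dicts, one dict per CSV row."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(value):
    """Convert a CSV cell to float, treating an empty cell as missing (None)."""
    return float(value) if value.strip() else None


def fill_missing_with_median(rows, column):
    """A simple manipulation: replace missing values in `column` with the median."""
    parsed = [to_float(row[column]) for row in rows]
    median = statistics.median(v for v in parsed if v is not None)
    return [
        {**row, column: median if value is None else value}
        for row, value in zip(rows, parsed)
    ]


rows = load_rows(RAW_CSV)

# How can you access a specific data point? Row index first, then column name.
second_row_name = rows[1]["name"]

# Bob's age is missing in the raw data, so it gets the median of the known ages.
cleaned = fill_missing_with_median(rows, "age")
print(second_row_name, cleaned[1]["age"])
```

Once you can load, index and patch a small table like this, you know enough of the tool to start on a real problem.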
To help you learn data science, I have listed some of the datasets I recommend, along with the reasons why I have included them in the mix. All of these datasets are available for free on the internet and provide a glimpse of how data science is changing the world we live in.
These datasets should appeal to you whether you are a newbie or a pro. Here are 5 datasets and the reasons why I recommend them:
- Titanic dataset from Kaggle: This is the first dataset I recommend to any starter, and for a good reason: the problem looks simple at the outset, yet it provides a good understanding of what a typical data science project involves. Starters can work on the dataset in Excel, while pros can use advanced tools to extract hidden information and algorithms to impute some of the missing values in the dataset. Another cool aspect is that you can rank yourself against other data scientists on Kaggle to see where you stand. This dataset is just the introduction you need before you delve into the world of Kaggle.
- Learning to mine Twitter on a topic: This project is included in the list so that beginners can relate to the power of data science. With the help of Twitter and a good data science tool, you can find out what the world is saying about a particular topic. I was mesmerized when I did this for the first time. Be it reviews of movies, sentiment about elections or any topic hot off the press, you can find out for yourself what people are saying. Performing this exercise not only helps you understand some of the challenges in mining social media (especially if you are interested in text mining), it also shows you how easy it is to integrate an API into your scripts to access the information available on social media.
- Reference: Who is the world cheering for?
- Human activity recognition using smartphone dataset: This problem makes it into the list because it is a segmentation problem (different from the previous 2 problems) and because there are various solutions available on the internet to aid your learning. It is an interesting application if you have ever wondered how your smartphone knows what you are doing right now. Another reason to solve this problem is that it exposes you to a different kind of problem: one where there are no missing values (because the data is collected in an automated manner), so the focus is on data munging and learning.
- Hubway Visualization challenge: This problem focuses on data visualization rather than explicitly on prediction or machine learning (though no one stops you from applying those). The questions posed in the challenge help you understand the kinds of problems a business can solve with the help of Business Intelligence tools. Again, there are a bunch of interesting visualizations available on the internet that show what some of the best minds have produced.
- MovieLens data: I couldn’t have left this dataset out. It is bigger than some of the other datasets mentioned in this article, but it is a lot of fun. The dataset is sufficient to build a recommender system and see which movies are liked by which kinds of audience.
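To give a feel for the recommender idea mentioned for the movie ratings data, here is a minimal sketch of user-based collaborative filtering in Python. It is one simple approach among many, not the method used by any particular system. The ratings below are invented toy values, not MovieLens data, and the function names (`cosine_similarity`, `recommend`) are my own. On the real dataset you would first load the ratings file into a similar user-to-ratings mapping.

```python
from math import sqrt

# Invented toy ratings (user -> {movie: rating on a 1-5 scale}); not MovieLens data.
RATINGS = {
    "ann": {"Alien": 5, "Heat": 4, "Up": 1},
    "ben": {"Alien": 4, "Heat": 5, "Jaws": 5},
    "cat": {"Up": 5, "Coco": 4, "Alien": 1},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users over the movies both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank unseen movies by the similarity-weighted average rating of other users."""
    seen = ratings[user]
    weighted_sum, similarity_sum = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(seen, other_ratings)
        if sim <= 0:
            continue  # ignore users with no overlap in rated movies
        for movie, rating in other_ratings.items():
            if movie in seen:
                continue  # only suggest movies the user has not rated yet
            weighted_sum[movie] = weighted_sum.get(movie, 0.0) + sim * rating
            similarity_sum[movie] = similarity_sum.get(movie, 0.0) + sim
    scores = {m: weighted_sum[m] / similarity_sum[m] for m in weighted_sum}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


recs = recommend("ann", RATINGS)
for movie, score in recs:
    print(movie, round(score, 2))
```

In this toy example, "ann" rates movies much like "ben" and quite unlike "cat", and the suggestions come from whatever those two have rated that "ann" has not. Asking which movies are liked by which kind of audience comes down to looking at who ends up similar to whom.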
These are the five datasets I recommend to people starting out in the industry. They provide a healthy mix of the different types of challenges you face as a data scientist. Each of these datasets provides a bunch of learning and will probably leave you wanting more.
If you are aware of other open datasets that you recommend to people starting their data science journey, please feel free to suggest them, along with the reasons why they should be included. If the reason is good, I’ll include them in the list.