Interesting Kaggle Datasets Every Beginner in Data Science Should Try Out

Prateek Majumder 21 Apr, 2021 • 6 min read
This article was published as a part of the Data Science Blogathon.

Introduction

These days, Kaggle has indeed become one of the most important stepping stones for students and professionals venturing into Data Science. 

Kaggle has a lot of online resources that help one to get started with Data Science. It has thousands of Datasets, Data Science competitions, Code Submissions on the Datasets, Community chat, and even Beginner-friendly courses. The user also gets a shareable public user profile, which tracks and shows all of the user’s contributions and achievements.

The user profile shows whom the user follows, who follows the user, the user’s code and datasets, and other information. There are also various ranking methods. A Kaggle profile is a good way to publish shareable online projects that show your talent. Just as your HackerEarth or CodeChef profile shows your competitive coding skills, your Kaggle profile shows your Data Science skills.

To build a good Kaggle profile, one needs to work on the data, build high-quality Python or R notebooks in the form of projects, and tell a tale through the data. One can add various data plots, write markdown, and train models on Kaggle Notebooks. The best thing about Kaggle Notebooks is that the user doesn’t need to install Python or R on their computer to use them. Almost all major libraries can be imported directly. Kaggle also provides TPUs for free. Tensor Processing Units (TPUs) are hardware accelerators specialized in deep learning tasks. They are supported in TensorFlow 2.1 both through the Keras high-level API and, at a lower level, in models using a custom training loop.

So, working with Datasets on Kaggle is easy and convenient, and every beginner should try Kaggle to build up some skill and knowledge.

Here are some datasets every beginner can try out to build awesome projects:

1. Netflix Movies and TV Shows

Who doesn’t like Netflix? This dataset on Kaggle lists the TV shows and movies available on Netflix. One can create a good-quality Exploratory Data Analysis project with it. For example, one can find out what type of content is produced in which country, identify similar content from the description, and take on many more interesting tasks.
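As a minimal sketch of the "what type of content is produced in which country" question, the snippet below counts titles per country and content type using only the standard library. The sample rows are invented for illustration, and the column names `type` and `country` are assumptions about the CSV layout, so check them against the actual file first.

```python
from collections import Counter

# Hypothetical sample rows; in a notebook these would be read from the dataset's CSV.
rows = [
    {"title": "Show A", "type": "TV Show", "country": "India"},
    {"title": "Movie B", "type": "Movie", "country": "United States, India"},
    {"title": "Movie C", "type": "Movie", "country": "United States"},
    {"title": "Show D", "type": "TV Show", "country": ""},
]

def count_by_country(rows):
    """Count (country, type) pairs; a title listing several countries counts once for each."""
    counts = Counter()
    for row in rows:
        for country in row["country"].split(","):
            country = country.strip()
            if country:  # skip rows with no country recorded
                counts[(country, row["type"])] += 1
    return counts

counts = count_by_country(rows)
for (country, kind), n in sorted(counts.items()):
    print(f"{country:15} {kind:8} {n}")
```

Splitting on commas is a design choice for rows that list more than one country in a single field; if the real file stores one country per row, the split is harmless.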

  1. Link to Dataset

My favorite Notebooks-

  1. EDA on Netflix Notebook
  2. Netflix Data: Analysis and Visualization Notebook

2. Students Performance in Exams

This data is based on population demographics. It contains various features like the type of meal given to the student, test preparation level, parental level of education, and students’ performance in Math, Reading, and Writing. Using the data, various types of Regression and Classification problems can be solved. It can also be used to find which factors lead to better exam scores. Overall, it will be interesting to work on.
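One simple way to start looking for factors that go with better scores is to compare the average score across the levels of each factor. The sketch below does that with the standard library. The records and the key names (`test_prep`, `lunch`, `math`) are made up for illustration and are not the dataset's exact headers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records; key names are assumptions, not the real column headers.
students = [
    {"test_prep": "completed", "lunch": "standard", "math": 78},
    {"test_prep": "completed", "lunch": "free/reduced", "math": 82},
    {"test_prep": "none", "lunch": "standard", "math": 60},
    {"test_prep": "none", "lunch": "free/reduced", "math": 70},
]

def mean_score_by(records, factor, score):
    """Return the average of `score` for each level of `factor`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[factor]].append(record[score])
    return {level: mean(scores) for level, scores in groups.items()}

by_prep = mean_score_by(students, "test_prep", "math")
print(by_prep)
```

A gap between group averages is only a starting point: it shows an association in the data, not that the factor causes the better score.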

  1. Link to Dataset

My favorite Notebooks-

  1. Student Performance In Exams Notebook

3. Mobile Price Classification

The Mobile Price Classification dataset has many features, and their values follow a variety of distributions. There are categorical features, continuous numerical features, and even binary features. This variety gives one practice with different kinds of data and with the statistics and computations each kind calls for.

  1. Link to Dataset

My favorite Notebooks-

  1. Mobile Price Prediction Notebook
  2. Mobile Price Prediction #2

4. Dogs & Cats Images

This is the classic Dog vs Cat classification dataset. There are a lot of Dog and Cat images that can be used to train models and make predictions. This dataset is a must for students trying to get into Image Processing or Computer Vision. Also, you get to look at a lot of cute images of cats and dogs.

  1. Link to Dataset

My favorite Notebooks-

  1. Dogs and Cats Image-Classifier Notebook

5. Trip Advisor Hotel Reviews

Hotels are an important part of trips and vacations. Hotel reviews are text data, which can be processed using Natural Language Processing (NLP) methods. There are over 20,000 hotel reviews, each with a star rating from 1 to 5. The dataset can be used to train a classification model that predicts the star rating of a given test review. It can be a good stepping stone for getting into text analytics and NLP.
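To make the idea of predicting a star rating from review text concrete, here is a small sketch of one possible approach: a bag-of-words multinomial Naive Bayes classifier with add-one smoothing, written with the standard library. This is an illustrative choice, not the method used in any particular notebook, and the five reviews are invented. A real project would use the full dataset and usually a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real projects need better text cleaning."""
    return text.lower().split()

def train(reviews):
    """reviews: list of (text, rating). Collect the counts Naive Bayes needs."""
    rating_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, rating in reviews:
        rating_counts[rating] += 1
        for word in tokenize(text):
            word_counts[rating][word] += 1
            vocab.add(word)
    return rating_counts, word_counts, vocab

def predict(model, text):
    """Return the rating with the highest log-probability for the given text."""
    rating_counts, word_counts, vocab = model
    total = sum(rating_counts.values())
    best_rating, best_score = None, -math.inf
    for rating, n in rating_counts.items():
        score = math.log(n / total)  # log prior of this rating
        denom = sum(word_counts[rating].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[rating][word] + 1) / denom)
        if score > best_score:
            best_rating, best_score = rating, score
    return best_rating

# Hypothetical training reviews, not taken from the dataset.
reviews = [
    ("great room and friendly staff", 5),
    ("lovely stay great breakfast", 5),
    ("dirty room and rude staff", 1),
    ("terrible stay dirty bathroom", 1),
    ("average room okay staff", 3),
]
model = train(reviews)
print(predict(model, "great friendly staff"))
print(predict(model, "dirty bathroom"))
```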

  1. Link to Dataset

My favorite Notebooks-

  1. Hotel Reviews Sentiment prediction Notebook

6. Melbourne Housing Market

The Melbourne Housing Market dataset is an all-time favorite learning resource for data science beginners. It has a lot of features: numeric, categorical, and even geographic data (Latitude and Longitude). So it can also be used for geospatial analysis and clustering problems. Regression and classification tasks can be performed on it as well. There are also numerous code samples and guides available for this dataset, making it an ideal dataset for learners.
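A common first step in geospatial analysis is turning latitude and longitude into a distance feature, for example the distance of each property from the city centre. The sketch below uses the haversine formula. The property coordinates are hypothetical, and the city-centre coordinates are approximate values assumed for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates of central Melbourne (an assumption for this example).
CBD = (-37.8136, 144.9631)

# Hypothetical property location, not a row from the dataset.
property_location = (-37.7996, 144.9984)

distance = haversine_km(*CBD, *property_location)
print(f"Distance to city centre: {distance:.2f} km")
```

A distance column like this can then be used as an extra feature for regression or as an input to clustering.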

  1. Link to Dataset

My favorite Notebooks-

  1. Melbourne || Comprehensive Housing Market Analysis Notebook
  2. Melboune real estate market comprehensive analysis Notebook

7. Churn Modelling

Employee churn rate indicates how frequently a company’s employees quit their jobs within a given period. It is an important aspect of HR Analytics and corporate strategy. The data contains real-life features like age, gender, length of time with the company, and other important features. It can be used to create a classification model and explore interesting patterns in the data.
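Before training a classifier, it helps to compute the churn rate itself, since it also tells you how well a model would do by always predicting the majority class. The sketch below assumes each record has a boolean `left` field. That field name and the five records are hypothetical, not the dataset's actual columns.

```python
# Hypothetical records; `left` marks whether the person left during the period.
records = [
    {"age": 29, "gender": "F", "years_with_company": 1, "left": True},
    {"age": 41, "gender": "M", "years_with_company": 8, "left": False},
    {"age": 35, "gender": "F", "years_with_company": 5, "left": False},
    {"age": 52, "gender": "M", "years_with_company": 2, "left": True},
    {"age": 38, "gender": "M", "years_with_company": 10, "left": False},
]

def churn_rate(records):
    """Share of records where the person left (between 0 and 1)."""
    return sum(1 for r in records if r["left"]) / len(records)

def majority_baseline_accuracy(records):
    """Accuracy of a model that always predicts the more common outcome."""
    rate = churn_rate(records)
    return max(rate, 1 - rate)

rate = churn_rate(records)
baseline = majority_baseline_accuracy(records)
print(f"Churn rate: {rate:.0%}, baseline accuracy to beat: {baseline:.0%}")
```

Any classification model built on the data should beat this baseline accuracy to be worth keeping.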

  1. Link to Dataset

My favorite Notebooks-

  1. Churn-Classification Notebook

8. Amazon Top 50 Bestselling Books 2009 – 2019

A sales dataset is always interesting to work with and gain insights from. Features include Amazon user rating, number of reviews on Amazon, and others. This dataset can be used to create EDA projects and to perform regression analysis. It can also be used to build an interesting case study on the success of bestselling books.
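As a rough sketch of what a first regression analysis could look like, the snippet below fits a straight line by ordinary least squares between two numeric features. The numbers are invented so that the fit is easy to check by hand. They are not values from the dataset, and the choice of review count versus user rating is only an example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x. Returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical values: number of reviews (in thousands) and user rating per book.
reviews_k = [1, 2, 3, 4]
ratings = [4.0, 4.2, 4.4, 4.6]

slope, intercept = fit_line(reviews_k, ratings)
print(f"rating = {intercept:.2f} + {slope:.2f} * reviews_k")
```

On real data the points will not fall on a line, so it is worth plotting them and checking the residuals before reading much into the slope.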

  1. Link to Dataset

My favorite Notebooks-

  1. Amazon Top 50 Bestselling Books Notebook

9. Medical Cost Personal Dataset

This dataset is used to forecast insurance costs based on various features. Interesting features include BMI, number of children, and whether the person is a smoker. It also falls under the Demographics category and can be used to analyze a person’s insurance expenditure.

  1. Link to Dataset

My favorite Notebooks-

  1. Patient Charges || Clustering and Regression Notebook

10. Kepler Exoplanet Search Results

Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017, there were over 3000 confirmed exoplanets in total (using all detection methods, including ground-based ones). NASA retired the telescope in October 2018 after it ran out of fuel, so it no longer collects new data.

The data has various features, which might be a bit difficult to understand. A detailed guide explaining them can be found here.

  1. Link to Dataset

End Notes

There are a lot of Notebooks on the Kepler dataset. It might be a bit difficult for beginners, but a lot of work can be done on it.

There are a lot more datasets and challenges available on Kaggle, plenty for beginners to learn from. One can also use their Kaggle profile as a means to express their skills in Data Science.

The media shown in this article on Kaggle Datasets are not owned by Analytics Vidhya and are used at the Author’s discretion.

Prateek is a final-year engineering student at the Institute of Engineering and Management, Kolkata. He likes to code, study analytics and Data Science, and watch Science Fiction movies. His favourite Sci-Fi franchise is Star Wars. He is also an active Kaggler and part of many student communities in college.
