New to Kaggle? Here’s How you can Get Started with Kaggle Competitions

Aniruddha Bhandari Last Updated : 27 Oct, 2024

12 min read

Overview

Kaggle can often be intimidating for beginners, so here’s a guide to help you get started with data science competitions
We’ll use the House Prices prediction competition on Kaggle to walk you through how to solve Kaggle projects

Kaggle your way to the top of the Data Science World!

Kaggle is the market leader when it comes to data science hackathons. I started my own data science journey by combining my learning on both Analytics Vidhya and Kaggle – a combination that helped me augment my theoretical knowledge with practical hands-on coding.

Now, here’s the thing about Kaggle. It has a vast collection of datasets and data science competitions, but that can quickly become overwhelming for any beginner. I remember browsing through Kaggle during my initial data science days and thinking, “where do I even begin?” Given the expertise involved, it’s quite a daunting prospect for newcomers.

In this article, I am going to ease that transition for you.

We will understand how to make your first submission on Kaggle by working through their House Prices competition. We’ll go through the different steps you would need to take in order to ace these Kaggle competitions, such as feature engineering, dealing with outliers (data cleaning), and of course, model building.

You can also check out the DataHack platform which has some very interesting data science competitions as well.

Please note that I’m assuming you’re familiar with Python and linear regression. If these are new concepts to you, it is worth brushing up on them before continuing.

Get Familiar with Kaggle Notebooks
Importing the Dataset in Kaggle
Let’s Explore the Data
Performing Data Preprocessing Steps
- Handling any Outliers
- Feature Transformation
- Handling missing data
It’s Time for Feature Engineering
Preparing Data for Prediction
Let’s Make Some Predictions
- Linear regression model
- Ridge regression model
Make your First Submission to Kaggle
End Notes

Get Familiar with Kaggle Notebooks

Kaggle notebooks are one of the best things about the entire Kaggle experience. These are free Jupyter notebooks that run in the browser. They have plenty of processing power, which allows you to run most computationally hungry machine learning algorithms with ease!

Just check out the power of these notebooks (with the GPU on):

As I mentioned earlier, we will be working on the House Prices prediction challenge. You can follow the processes in this article by working alongside your own Kaggle notebook.

Just head to the House Prices competition page, join the competition, then head to the Notebooks tab and click Create New Notebook. You should see the following screen:

Here, you have to choose the coding language and accelerator settings you require and hit the Create button:

Your very own Kaggle notebook will load up with the basic libraries already imported for you. Additionally, you can access the training data directly from here and whatever changes you make here will be automatically saved. What more do you need?

Now let’s get cracking on that competition!

Importing the Dataset in Kaggle

Once we have our Kaggle notebook ready, we will load all the datasets in the notebook. In this competition, we are provided with two files – the training and test files. We will load these datasets using Pandas’ read_csv() function:

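import pandas as pd

# Load the competition files; adjust the paths to wherever the files live in your notebook
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Print the (rows, columns) of each dataframe
print(train.shape, test.shape)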

Let’s have a look at our dataset using the DataFrame.head() function which by default outputs the top 5 rows of the dataset:

The training dataset has 81 columns. The ‘SalePrice’ column is our target feature, which we will predict from the remaining columns. We can also observe that there is a mix of both categorical and continuous columns and that there are some missing values in the data. Let us explore the data in detail in the next section.

Let’s Explore the Data

The first step in data exploration is to have a look at the columns in the dataset and what values they represent. We can do this using the DataFrame.info() function:

Note: You can read about what these features represent in the data description file provided on the competition page.

You will notice that quite a few of the features contain missing values. Before the model building process, we will have to impute these missing values. That’s a preprocessing step and we will handle it in a later section.

But first, let us explore our target feature using the DataFrame.describe() function:

Here, 25%, 50%, and 75% denote the values at the 25th, 50th, and 75th percentiles respectively. So, from the output, we can make out that 75% of our values are below 214,000 whereas the maximum sale price of a house is 755,000. The large gap between these two suggests that the target variable has some outliers.
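To make the percentile reading concrete, here is a minimal sketch of describe() on a small made-up price column. The five values are illustrative assumptions standing in for train['SalePrice'], not the competition data.

```python
import pandas as pd

# Toy stand-in for train['SalePrice']; these values are made up for illustration.
prices = pd.Series([100000, 120000, 150000, 180000, 755000], name="SalePrice")

# describe() reports count, mean, std, min, the quartiles, and max.
summary = prices.describe()
print(summary)

# The same quartile can be read off directly with quantile().
q75 = prices.quantile(0.75)
print("75th percentile:", q75, "max:", prices.max())
```

A maximum that sits far above the 75th percentile, as in this toy column, is the same pattern that hints at outliers in the real target.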

Read more about percentiles here.

Let’s visualize the distribution of the SalePrice feature using the sns.distplot() function in Seaborn (recent Seaborn releases deprecate distplot() in favour of histplot() and displot()):

You can see that a lot of the sale prices are clustered in the 100,000 to 200,000 range. But, due to the high sale prices of a few houses, the distribution has a long tail on the right and is not symmetrical about its center. This asymmetry in a data distribution is called skewness. In our case, the data distribution is positively skewed (or right-skewed).

Note: You can read more about skewness here.

We can check the skewness in our data explicitly using the DataFrame.skew() function:

We have got a positive value here because our data distribution is skewed towards the right due to the high sale prices of some houses.
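As a quick sketch of what the sign of the skewness tells you, the snippet below calls skew() on a made-up price sample with one very expensive house. The values are assumptions for illustration only.

```python
import pandas as pd

# Toy right-skewed sample standing in for train['SalePrice'] (made-up values).
prices = pd.Series([100000, 110000, 120000, 130000, 140000, 150000, 700000])

# A long tail on the right gives a positive skewness.
skewness = prices.skew()
print("skewness:", skewness)

# Mirroring the sample flips the tail to the left, and the sign flips with it.
mirrored_skewness = (-prices).skew()
print("mirrored skewness:", mirrored_skewness)
```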

Our problem requires us to predict the sale price of houses – a regression problem. So, the first model that we will be fitting to our dataset is a linear regression model. But the skewness in our target feature poses a problem for a linear model because the few very high prices will have a disproportionate effect on the fit. Linear regression assumes that its errors (residuals) are normally distributed, and a heavily skewed target makes that assumption harder to satisfy. But we’ll handle this later when we are transforming our features.

For now, let’s have a look at how our features are correlated with each other using a heatmap in Seaborn:

Heatmaps are a great tool to quickly visualize how a feature correlates with the remaining features. Some striking correlations between features that I can see from the heatmap are:

GrLivArea and TotRmsAbvGrd
GarageYrBlt and YearBuilt
1stFlrSF and TotalBsmtSF
OverallQual and SalePrice
GarageArea and GarageCars

We can plot these features to understand the relationship between them:

It seems obvious that the total number of rooms above the ground should increase with increasing living area above ground:

This relationship is interesting because we can see some linear relationship forming between the Year the house was built and the Year the garage was built. Think about it – it seems intuitive that garages would have been built either simultaneously with the house or after it was constructed, and not before it. Therefore, you can see that most of the points stay on or below the linear line.

Again, we can see a linear relationship between these two features, and most of the dots lie below the line. Most houses have a basement area less than or equal to the first-floor area, although we can see some houses with a basement area larger than the first-floor area. What do you think the reason could be? I would love to read it in the comments below!

Again, the number of cars that can fit in a garage would increase with its area. You can do a lot more analysis and I encourage you to explore all the features and think of how to deal with them. While you’re at it, don’t forget to share your insights in the comments!

For now, let’s see how the features correlate with our target feature – SalePrice:

We can see that most of the features that we looked at above are also highly correlated with our target feature. So let’s try to visualize their relationship with the target feature.

I will save all of them in my “top_features” list for reference later on.
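One way to build such a list is sketched below, assuming a toy dataframe in place of the real training data. The column values and the choice to keep only two features are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the training data (made-up values).
train = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea": [1200, 1500, 1700, 2100, 2600, 1000],
    "YrSold": [2008, 2007, 2010, 2006, 2009, 2008],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Pave"],
    "SalePrice": [140000, 175000, 210000, 280000, 390000, 110000],
})

# Correlation is only defined for numeric columns, so select those first.
numeric = train.select_dtypes(include="number")
corr_with_target = numeric.corr()["SalePrice"].drop("SalePrice")

# Rank features by absolute correlation and keep the strongest ones.
ranked = corr_with_target.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()
print(ranked)
print(top_features)
```

Ranking by absolute value keeps strongly negative correlations in the running too, which a plain descending sort would push to the bottom.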

Ok, we have plotted these values, but what do you conclude?

Well, you must have noticed some points in most of these plots are out of their usual place and tend to break the pattern in the feature. These are called Outliers. Outliers affect the mean and standard deviation of the dataset which can affect our predicted values.

For example, in the feature GrLivArea, notice those two points in the bottom right? An above-ground living area of 4500 square feet for just 200,000 while those with 3000 square feet sell for upwards of 200,000! Seems a bit strange, doesn’t it?

Let’s take another example, this time of TotalBsmtSF. Notice the point in the bottom right? It doesn’t make sense.

These outlier values need to be dealt with or they will affect our predictions. We can deal with them in a number of different ways, and we’ll handle them in the preprocessing section next.

Note: You can read more about outliers here.

Performing Data Preprocessing Steps

Handling any Outliers

Right – we saw how there were a few outliers in our top correlated features above. Although there are a couple of ways to deal with outliers in data, I will be dropping them here.

Any value lying more than 1.5*IQR (interquartile range) below the first quartile or above the third quartile of a feature is considered an outlier. So we will use that to detect our outliers:
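Here is a minimal sketch of that rule, assuming a toy dataframe in place of the training data. The helper name drop_iqr_outliers is hypothetical and introduced only for this example.

```python
import pandas as pd

# Toy frame standing in for the training data (made-up values).
train = pd.DataFrame({
    "GrLivArea": [1200, 1400, 1500, 1600, 1700, 1800, 4700],
    "SalePrice": [140000, 160000, 170000, 185000, 200000, 215000, 180000],
})

def drop_iqr_outliers(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

cleaned = drop_iqr_outliers(train, "GrLivArea")
print(train.shape, "->", cleaned.shape)
```

Applying the helper to each of the top features in turn would drop any row that is an outlier in at least one of them.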

These were our top features containing outlier points. Since we have dropped these points, let’s have a look at how many rows we are left with:

(1327, 81)

We have dropped a few rows as they would have affected our predictions later on.

Feature Transformation

Before we start handling the missing values in the data, I am going to make a few tweaks to the train and test dataframes.

I am going to concatenate the train and test dataframes into a single dataframe. This will make it easier to manipulate their data. Along with that, I will make a few changes to each of them:

Store the number of rows in train dataframe to separate train and test dataframe later on
Drop Id from train and test because it is not relevant for predicting sale prices
Take the log transformation of target feature using np.log() to deal with the skewness in the data
Drop the target feature as it is not present in test dataframe
Concatenate train and test datasets
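The steps above can be sketched as follows. The tiny train and test frames are made-up stand-ins for the competition files, and keeping the test Ids in a separate variable is my own assumption, added because the submission file needs them later.

```python
import numpy as np
import pandas as pd

# Toy train/test frames standing in for the competition files (made-up values).
train = pd.DataFrame({
    "Id": [1, 2, 3],
    "GrLivArea": [1200, 1500, 2600],
    "SalePrice": [140000, 175000, 390000],
})
test = pd.DataFrame({"Id": [4, 5], "GrLivArea": [1300, 1900]})

# 1. Remember how many training rows there are so the frames can be split again.
n_train = train.shape[0]

# 2. Keep the test Ids for the submission (an assumption), then drop Id from both.
test_ids = test["Id"]
train = train.drop(columns="Id")
test = test.drop(columns="Id")

# 3. Log-transform the target to reduce its right skew.
y = np.log(train["SalePrice"])

# 4. Drop the target, which the test frame does not have.
train = train.drop(columns="SalePrice")

# 5. Stack train on top of test with a fresh index.
combined = pd.concat([train, test], axis=0).reset_index(drop=True)
print(combined.shape)
```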

Have a look at how the log transformation affected our target feature. The distribution now seems to be symmetrical and is more normally distributed:
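To see the effect numerically, here is a small sketch that compares the skewness before and after np.log(). It uses a synthetic log-normal sample as an assumed stand-in for the sale prices, so the exact numbers will differ from the real data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (an assumption, not the competition data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000))

raw_skew = prices.skew()
log_skew = np.log(prices).skew()

print("skewness before log:", raw_skew)
print("skewness after log: ", log_skew)
```

The log compresses the large values much more than the small ones, which pulls the long right tail in and leaves a far more symmetric distribution.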

Now it’s time to handle the missing data!

Handling missing data

Let’s have a look at how many missing values are present in our data:
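A minimal sketch of counting missing values per column, assuming a toy dataframe in place of the combined train and test data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (made-up values).
combined = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "GarageArea": [480.0, np.nan, 600.0, 0.0],
    "GrLivArea": [1200, 1500, 1700, 2100],
})

# Count nulls per column, keep only columns that have any, largest first.
missing = combined.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```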

There seem to be quite a few missing values in our dataset. What do you think could be the reason for this? Here’s a hint – take a look at the data description file and try to figure it out.

Some features use NA to mean that the house does not have that feature at all, rather than that the value is unknown. This is strange but let me show you why that’s the case:

For example, NA in PoolQC feature means no pool is present in the house! This is treated as a null (or np.nan) value by Pandas and similar values are present in quite a few categorical features.

I will replace the null values in categorical features with a ‘None’ value.

For ordinal features, however, I will replace the null values with 0 and the remaining values with an increasing set of numbers. This is called Label Encoding and is used to capture the trend in an ordinal feature.

The null values in nominal features will be handled by replacing them with ‘None’ value which will be treated during One-Hot Encoding of the dataset.
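The two treatments can be sketched as below, assuming a toy dataframe. The quality codes (Po, Fa, TA, Gd, Ex) come from the data description file, while the specific numbers assigned to them are my own illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (made-up values).
combined = pd.DataFrame({
    "PoolQC": [np.nan, "Ex", "Gd", np.nan],                 # ordinal: quality has an order
    "GarageType": ["Attchd", np.nan, "Detchd", "Attchd"],  # nominal: no order
})

# Ordinal feature: missing (no pool) becomes 0, better quality gets a larger number.
quality_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
combined["PoolQC"] = combined["PoolQC"].map(quality_scale).fillna(0).astype(int)

# Nominal feature: missing becomes the literal category 'None',
# which one-hot encoding will later turn into its own column.
combined["GarageType"] = combined["GarageType"].fillna("None")
print(combined)
```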

Finally, the missing values in numerical features will be treated by replacing them with either a 0 or some other statistical value:

A null value in Garage features means that there is no garage in the house. These values will be handled the same way as mentioned above:

A null value in basement features indicates an absence of the basement and will be handled as mentioned above:

Null values in the remaining features can also be handled in a similar fashion:
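For numerical features, a minimal sketch might look like this. The toy values are made up, and using the median for LotFrontage is an assumed choice of "some other statistical value", not necessarily the one used for the final model.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined data (made-up values).
combined = pd.DataFrame({
    "GarageArea": [480.0, np.nan, 600.0],   # NaN treated here as "no garage"
    "LotFrontage": [60.0, np.nan, 80.0],    # NaN treated here as "not recorded"
})

# No garage means zero garage area.
combined["GarageArea"] = combined["GarageArea"].fillna(0)

# An unrecorded measurement gets a typical value; the median is one common choice.
combined["LotFrontage"] = combined["LotFrontage"].fillna(combined["LotFrontage"].median())
print(combined)
```

Filling with 0 only makes sense when the absence itself is the information; for a genuinely unknown measurement, a typical value distorts the feature less.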

It’s Time for Feature Engineering

Now that we have dealt with the missing values, we can Label Encode a few other ordinal features to convert them to numerical values. This retains the order within each feature in a form that the regression model can use.

Honestly, feature engineering is perhaps THE most important aspect of Kaggle competitions. A quick glance at previous winning solutions will show you how important feature engineering is. It’s often the difference between a top 20 percentile finish and a mid-leaderboard position.

We can make new features from existing data in the dataset to capture some trends in the data that might not be explicit. This makes the already existing data more useful. For example, adding a new feature that indicates the total square feet of the house is important as a house with a greater area will sell for a higher price. Similarly, a feature telling whether the house is new or not will be important as new houses tend to sell for higher prices compared to older ones.

I have made some new features below. I encourage you to go through the data yourself and see if you can come up with other useful features.
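As a sketch of the two ideas mentioned above, the snippet below builds a total-area feature and a new-house flag on a toy dataframe. The source columns exist in the competition data, but the new names TotalSF and IsNew, and the rule "sold in the year it was built", are assumptions made for this example.

```python
import pandas as pd

# Toy frame standing in for the combined data (made-up values).
combined = pd.DataFrame({
    "TotalBsmtSF": [800, 0, 1200],
    "1stFlrSF": [900, 1100, 1300],
    "2ndFlrSF": [0, 600, 700],
    "YearBuilt": [2008, 1975, 2009],
    "YrSold": [2008, 2009, 2010],
})

# Total square feet across the basement and both floors (hypothetical feature name).
combined["TotalSF"] = combined["TotalBsmtSF"] + combined["1stFlrSF"] + combined["2ndFlrSF"]

# 1 if the house was sold in the year it was built, else 0 (hypothetical feature name).
combined["IsNew"] = (combined["YrSold"] == combined["YearBuilt"]).astype(int)
print(combined[["TotalSF", "IsNew"]])
```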

All these steps that I performed here are part of feature engineering. You can read more about them in detail in this article.

Preparing Data for Prediction

Since there are a lot of categorical features in the dataset, we need to apply One-Hot Encoding to our dataset. This will convert categorical data into numbers so that the regression model can understand which category the value belongs to:
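A minimal sketch of One-Hot Encoding with pandas’ get_dummies(), assuming a toy dataframe with a single categorical column:

```python
import pandas as pd

# Toy frame standing in for the combined data (made-up values).
combined = pd.DataFrame({
    "GarageType": ["Attchd", "None", "Detchd"],
    "GrLivArea": [1200, 1500, 1700],
})

# Each category becomes its own indicator column; numeric columns are left as they are.
encoded = pd.get_dummies(combined, columns=["GarageType"])
print(encoded.columns.tolist())
print(encoded)
```

Encoding the combined train and test data in one go guarantees that both end up with the same set of indicator columns.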

Because we had combined training and testing datasets into a single dataframe at the beginning, it is now time to separate the two:
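Splitting the combined dataframe back apart only needs the number of training rows. The sketch below assumes that count was saved in a variable, here called n_train (a hypothetical name), and uses made-up values.

```python
import pandas as pd

# Toy combined frame: the first 3 rows came from train, the last 2 from test (made-up values).
combined = pd.DataFrame({"GrLivArea": [1200, 1500, 2600, 1300, 1900]})
n_train = 3  # number of training rows, saved before concatenating

# Rows up to n_train are the training data; everything after is the test data.
train = combined.iloc[:n_train]
test = combined.iloc[n_train:].reset_index(drop=True)
print(train.shape, test.shape)
```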

Finally, I will split our train dataframe into training and validation datasets. This will allow us to train our model and validate its predictions without having to look at the testing dataset!

Let’s Make Some Predictions

Let’s try to predict the values using linear regression. It is the simplest regression model and you can read more about it in detail in this article.

Linear regression model

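We are looking at the RMSE score here because the competition page states that submissions are evaluated on the RMSE between the logarithm of the predicted price and the logarithm of the observed price. Since our target is already log-transformed, the RMSE computed on it matches that metric. We got a pretty decent RMSE score here without doing a lot. Now let’s see whether we can improve it using another classic machine learning technique.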
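Here is a minimal sketch of the split, fit, and RMSE steps with scikit-learn. It runs on synthetic data that stands in for the prepared features and the log-transformed target, so the score it prints says nothing about the competition; the 80/20 split and the random seeds are also assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and log-price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 12.0 + X @ np.array([0.3, 0.1, -0.2]) + rng.normal(scale=0.05, size=200)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# RMSE on the held-out rows, computed as the square root of the MSE.
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(f"Validation RMSE: {rmse:.4f}")
```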

Ridge regression model

Ridge regression is a type of linear regression model which allows the regularization of features to take place. Now, what is regularization?

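Regularization adds a penalty on large coefficients, shrinking them towards zero so that no single feature has an outsized effect on the predicted value. In Ridge regression, the strength of this penalty is controlled by the alpha parameter.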

You can study more about regularization in this article.
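One simple way to choose alpha is to try a handful of values and compare the validation RMSE, as in the sketch below. The synthetic data and the list of candidate alphas are assumptions for illustration, so the alpha that wins here is not meaningful for the competition.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared features and log-price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 12.0 + X @ np.array([0.3, 0.1, -0.2, 0.0, 0.05]) + rng.normal(scale=0.1, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit one Ridge model per candidate alpha and record its validation RMSE.
alphas = [0.1, 1, 3, 10, 30]
scores = {}
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    scores[alpha] = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

# The alpha with the lowest validation RMSE is the one to keep.
best_alpha = min(scores, key=scores.get)
print(scores)
print("best alpha:", best_alpha)
```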

We are getting the lowest RMSE score with an alpha value of 3. Since I got the lowest RMSE with Ridge regression, I will be using this model for my final submission:

But before submitting, we need to take the inverse of the log transformation that we did while training the model. This is done using the np.exp() function:

Now we can create a new dataframe for submitting the results:
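These last two steps can be sketched as follows. The Ids and log-scale predictions are made-up placeholders for the real test Ids and model output; the Id and SalePrice column names follow the competition’s submission format.

```python
import numpy as np
import pandas as pd

# Placeholder test Ids and log-scale predictions (made-up values).
test_ids = pd.Series([1461, 1462, 1463], name="Id")
log_predictions = np.array([11.7, 12.0, 12.3])

# Undo the log transformation so the predictions are prices again.
predictions = np.exp(log_predictions)

# One row per test house, with the two columns the competition expects.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": predictions})

# With no path, to_csv() returns the CSV text; pass a filename to write a file instead.
csv_text = submission.to_csv(index=False)
print(csv_text)
```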

Make your First Submission to Kaggle

Once you have created your submission file, it will appear in the output folder which you can access on the right-hand side panel as shown below:

You can download your submission file from here. Once you have done that, just drag and drop it in the upload space provided in the Submit Predictions tab on the competition page:

End Notes

And just like that, you have made your very first Kaggle submission. Congrats!

Going forward, I encourage you to get your hands dirty with this competition and try to improve on the RMSE score that we have achieved here. You can go on to explore feature engineering and employ ensemble learning for better results.

Now go on and Kaggle your way to becoming a data science master!

Aniruddha Bhandari

I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!

Prajwal Adhav

Hello Annirudh Sir I am Prajwal Adhav from Second year (Automobile engineering) but I wanted to switch to data science field Please tell is that possible to do so. And how to start this long journey

Siddharth Chi

Informative.. thank you

Finder lards

Hello, good job! Can you explain why is np.log required? It is not clear why it normalizes the distribution.

Hi! Log brings large values closer together. If we have data containing values like 10, 20, 50,... and then some values on the higher end like 1000, 2000, etc. On taking the log transformation we end up with values like 1, 1.3, 1.69, ..., and for the higher values we get 3, 3.3, etc. bringing all of them much closer to the median. This way we get a more normal distribution. I hope this helps.
