Pratik Nabriya — Published On May 30, 2022 and Last Modified On May 31st, 2022

This article was published as a part of the Data Science Blogathon.

Introduction on Machine Learning

Last month, I participated in a machine learning hackathon hosted on Analytics Vidhya’s Datahack platform. Over a weekend, more than 600 participants competed to build and improve their solutions and climb the leaderboard. In this article, I share my hackathon experience: what worked, what didn’t, and what I learned from it.

Objective of Machine Learning Approach

Forecasting the demand for car rentals on an hourly basis based on past data.

Loading and Exploring Data

Importing necessary libraries

%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from sklearn.metrics import mean_squared_error

Importing Train and Test Data 

train_df = pd.read_csv('train_E1GspfA.csv')
test_df = pd.read_csv('test_6QvDdzb.csv')

Data Overview 


We are provided with train data comprising hourly rental car demand from mid-Aug 2018 to Feb 2021 (approximately 31 months). Our job is to predict the demand for the next 13 months covered by the test data (i.e. from Mar 2021 to Mar 2022).

Data Cleaning and Pre-processing

Check for the datatype of the data in the columns
[Image: column datatypes and non-null counts for the train data]

This also tells us that there are no null values in the data.

Check for duplicate entries


As seen, no duplicate data is present.
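
The outputs of the two checks above did not survive in this copy of the article. Here is a minimal sketch of how such checks are commonly done in pandas. It uses a small made-up dataframe with the same ‘date’, ‘hour’ and ‘demand’ columns; the name toy_df and its values are assumptions for illustration, not the hackathon data.

```python
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only
toy_df = pd.DataFrame({
    "date": ["2018-08-18", "2018-08-18", "2018-08-19"],
    "hour": [9, 10, 9],
    "demand": [91, 21, 23],
})

# Datatype of each column
dtypes = toy_df.dtypes

# Number of missing values per column (all zeros means no nulls)
null_counts = toy_df.isnull().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy_df.duplicated().sum())

print(dtypes)
print(null_counts)
print("Duplicate rows:", n_duplicates)
```

Calling toy_df.info() prints the datatypes and non-null counts together in a single summary.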

From visual inspection, the train dataframe appears to be sorted already, but let’s sort it explicitly to be safe.
Sorting the dataframe by ‘date’ and ‘hour’:

train_df = train_df.sort_values(by = ['date', 'hour'])

Now that we have organised the data, let’s proceed to the next stage.

Feature Engineering

As seen, we don’t really have many features. We already have ‘hour’, so let’s see what information we can extract from the ‘date’ column.

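train_df['date'] = pd.to_datetime(train_df['date'])
# NOTE: the right-hand sides below were missing from the published text;
# these pandas .dt accessors are an assumed reconstruction based on the column names
train_df['Month'] = train_df['date'].dt.month
train_df['Week'] = train_df['date'].dt.isocalendar().week.astype(int)
train_df['Day'] = train_df['date'].dt.day
train_df['Quarter'] = train_df['date'].dt.quarter
train_df['Weekday'] = train_df['date'].dt.weekday
train_df['Dayofyear'] = train_df['date'].dt.dayofyear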

This is how our dataframe looks after adding newly engineered features

[Image: the train dataframe with the newly engineered features]

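Repeat the same feature engineering steps on the test data

test_df['date'] = pd.to_datetime(test_df['date'])
# NOTE: the right-hand sides below were missing from the published text;
# these pandas .dt accessors are an assumed reconstruction based on the column names
test_df['Month'] = test_df['date'].dt.month
test_df['Week'] = test_df['date'].dt.isocalendar().week.astype(int)
test_df['Day'] = test_df['date'].dt.day
test_df['Quarter'] = test_df['date'].dt.quarter
test_df['Weekday'] = test_df['date'].dt.weekday
test_df['Dayofyear'] = test_df['date'].dt.dayofyear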

Note that here ‘Day’ means what day of the month it is, while ‘Dayofyear’, as the name suggests, represents what day of the year it is.

Now that we have extracted additional features from ‘date’, can we do something about the ‘hour’ feature? Can we somehow aggregate it to form a new feature?

Generating new feature ‘time2’ based on what part of the day the ‘hour’ falls in

def time_day(t):
    if t in [12, 13, 14]:
        return 'Afternoon'
    elif t in [15, 16, 17]:
        return 'Evening'
    elif t in [18, 19, 20]:
        return 'Late_evening'
    elif t in [21, 22, 23]:
        return 'Night'
    elif t in [0, 1, 2]:
        return 'Late_night'
    elif t in [3, 4, 5]:
        return 'Dawn'
    elif t in [6, 7, 8]:
        return 'Early_morning'
    elif t in [9, 10, 11]:
        return 'Morning'
train_df['time2'] = train_df['hour'].apply(time_day)
test_df['time2'] = test_df['hour'].apply(time_day)

So far we have extracted the following features from ‘date’ and ‘hour’ data – ‘Month’, ‘Week’, ‘Day’, ‘Quarter’, ‘Weekday’, ‘Dayofyear’, ‘time2’.
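
Because every ‘time2’ bucket spans exactly three consecutive hours, the same bucketing can also be written as a lookup on hour // 3. The sketch below is an alternative formulation, not the code used in the hackathon; the names PART_OF_DAY and part_of_day are hypothetical.

```python
# Labels for consecutive three-hour blocks, starting at midnight
PART_OF_DAY = [
    "Late_night",     # hours 0-2
    "Dawn",           # hours 3-5
    "Early_morning",  # hours 6-8
    "Morning",        # hours 9-11
    "Afternoon",      # hours 12-14
    "Evening",        # hours 15-17
    "Late_evening",   # hours 18-20
    "Night",          # hours 21-23
]


def part_of_day(hour: int) -> str:
    """Return the three-hour bucket label for an hour in 0..23."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    return PART_OF_DAY[hour // 3]


# One label per hour of the day
labels = [part_of_day(h) for h in range(24)]
print(labels)
```

Unlike a chain of elif branches, this version raises an error for an out-of-range hour instead of silently returning None.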

From the beginning, we have been assuming that future demand can be forecast from past data. But can the past data really tell us about the future? Let’s test this hypothesis using exploratory analysis.

Hypothesis testing – Exploratory Data Analysis

Let’s first divide our train data into two one-year periods:

1. From March’19 to February’20
2. From March’20 to February’21

# Mar-19 to Feb-20 (ISO dates avoid day/month ambiguity: '01-03-2019' would be parsed as 3 January)
train_df_19 = train_df[(train_df['date'] >= '2019-03-01') & (train_df['date'] <= '2020-02-29')]
# Mar-20 to Feb-21
train_df_20 = train_df[(train_df['date'] >= '2020-03-01') & (train_df['date'] <= '2021-02-28')]
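
When slicing by date strings, the string format matters: pandas reads an ambiguous day-month string month-first by default. The sketch below shows this behaviour, assuming pandas’ default parsing (no dayfirst=True and no explicit format).

```python
import pandas as pd

# Ambiguous string: without a hint it is read month-first (3 January 2019)
ambiguous = pd.Timestamp("01-03-2019")

# A leading value above 12 cannot be a month, so this string of the
# same style is read day-first instead (29 February 2020)
forced_day_first = pd.Timestamp("29-02-2020")

# ISO 8601 strings (YYYY-MM-DD) have a single interpretation (1 March 2019)
iso = pd.Timestamp("2019-03-01")

print(ambiguous.date(), forced_day_first.date(), iso.date())
```

Writing the boundaries in ISO format keeps both ends of a date range on the same convention.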

Plotting demand by ‘Hour’, ‘Month’, ‘Week’, ‘Day’, ‘Quarter’, ‘Weekday’, ‘Dayofyear’, ‘time2’ —

[Images: eight line plots of demand by ‘Hour’, ‘Month’, ‘Week’, ‘Day’, ‘Quarter’, ‘Weekday’, ‘Dayofyear’ and ‘time2’, comparing the two periods]
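
The plotting code is not included in the article. A minimal sketch of one such comparison plot with matplotlib follows. It uses synthetic data: the helper toy_period and all demand values are assumptions for illustration. In the article’s setting the two frames would be train_df_19 and train_df_20, and the grouping column would be swapped for each feature in turn.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def toy_period(n_days, level):
    """Build a made-up hourly demand frame for one period (synthetic values only)."""
    hours = np.tile(np.arange(24), n_days)
    # A daily cycle plus random noise around the given level
    demand = level + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)
    return pd.DataFrame({"hour": hours, "demand": demand})


period_19 = toy_period(366, level=70)
period_20 = toy_period(365, level=75)

fig, ax = plt.subplots(figsize=(8, 4))
for label, frame in [("Mar-19 to Feb-20", period_19), ("Mar-20 to Feb-21", period_20)]:
    # Average demand for each value of the feature, one line per period
    profile = frame.groupby("hour")["demand"].mean()
    ax.plot(profile.index, profile.values, marker="o", label=label)

ax.set_xlabel("hour")
ax.set_ylabel("mean demand")
ax.legend()
```

In a notebook the figure is displayed inline; in a script, fig.savefig(...) would write it to a file.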

If you observe these line plots, the demand trends in the two periods (March 2019 to February 2020 and March 2020 to February 2021) are very similar. In particular, in the ‘Hour’, ‘Quarter’, ‘Weekday’, ‘Dayofyear’ and ‘time2’ plots, the peaks and troughs tend to coincide.
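
The visual similarity can also be expressed as a number by correlating the per-feature demand profiles of the two periods. This check is not part of the original analysis. The sketch below uses synthetic data, and the names toy_demand, period_a and period_b are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = np.tile(np.arange(24), 30)  # 30 synthetic days of hourly records


def toy_demand(level):
    # Shared daily cycle plus independent noise; synthetic values only
    return level + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)


period_a = pd.DataFrame({"hour": hours, "demand": toy_demand(70)})
period_b = pd.DataFrame({"hour": hours, "demand": toy_demand(75)})

# Mean demand per hour of day for each period
profile_a = period_a.groupby("hour")["demand"].mean()
profile_b = period_b.groupby("hour")["demand"].mean()

# Pearson correlation between the two 24-point profiles
similarity = profile_a.corr(profile_b)
print(f"Correlation between hourly profiles: {similarity:.3f}")
```

A correlation close to 1 means the peaks and troughs line up, even when the overall demand level differs between the periods.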

Now that we are convinced that the newly engineered features, along with the existing ones, can prove useful in predicting the demand for the subsequent year, let’s proceed to the modelling stage.

Model Building

Let’s observe our train and test data —

[Images: the train and test dataframes]

Drop redundant columns from train and test data —

# 'demand' stays in X_train for now because the mean encoding of 'time2' needs it
X_train = train_df.drop(['date'], axis = 1)
X_test = test_df.drop(['date'], axis = 1)
y_train = train_df['demand']

Split training data further into training and validation data

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.30, shuffle = False)
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
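
Passing shuffle = False matters for time-ordered data: the split then keeps the earliest rows for training and the most recent rows for validation, which mimics forecasting the future from the past. A small sketch on a made-up, already sorted dataframe (toy and its values are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten time-ordered rows; 't' stands in for the sorted date/hour order
toy = pd.DataFrame({
    "t": range(10),
    "demand": [5, 7, 6, 8, 9, 11, 10, 12, 13, 15],
})

# Without shuffling, the last 30% of the rows become the validation set
train_part, val_part = train_test_split(toy, test_size=0.30, shuffle=False)

print("train t:", train_part["t"].tolist())
print("validation t:", val_part["t"].tolist())
```

Every validation row comes after every training row, so no future information leaks into training.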

Before we proceed, recall that we handcrafted a new variable, ‘time2’. It is a categorical variable, and machine learning models in general cannot consume categorical variables directly, so it needs to be encoded. We could use label encoding here, but I usually avoid it unless the variable has some inherent ordering. So what other option do we have? Let’s do response encoding (also known as mean or target encoding)!

Mean-encoding ‘time2’ variable

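# learn the mean demand per 'time2' category from the training split only (avoids leakage)
agg_df = pd.DataFrame(X_train.groupby(['time2']).agg({'demand':'mean'})).reset_index()
agg_df['demand'] = round(agg_df['demand'], 2)
# dictionary of the form {category: mean demand}
agg_dict = dict(agg_df.values)
# apply the same training-derived mapping to both splits
X_train['time2'] = X_train['time2'].apply(lambda x:agg_dict[x])
X_val['time2'] = X_val['time2'].apply(lambda x:agg_dict[x])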
[Image: the agg_dict dictionary mapping each ‘time2’ category to its mean demand]

The dictionary above shows the numerical value that each category takes after encoding.
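
The idea behind mean encoding can be shown on a tiny made-up example. The sketch below is a variant of the approach rather than the hackathon code: it uses Series.map, and it falls back to the overall training mean for a category that never appeared in training. The fallback is an added assumption; in this dataset all eight categories occur in both splits.

```python
import pandas as pd

# Toy training and validation splits; values are illustrative only
train_part = pd.DataFrame({
    "time2": ["Morning", "Morning", "Night", "Night", "Dawn"],
    "demand": [100, 80, 40, 60, 20],
})
val_part = pd.DataFrame({"time2": ["Morning", "Night", "Evening"]})

# Means are learned from the training split only
category_means = train_part.groupby("time2")["demand"].mean()
global_mean = train_part["demand"].mean()

# Replace each category with its training mean
train_encoded = train_part["time2"].map(category_means)

# 'Evening' was never seen in training, so it gets the global mean
val_encoded = val_part["time2"].map(category_means).fillna(global_mean)

print(category_means.to_dict())
print(val_encoded.tolist())
```

Computing the means on the training split alone is what keeps validation targets from leaking into the features.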

Let’s train some tree-based ensembles and perform testing on the validation data in order to select the best model. The metric for scoring is Root mean-squared error (RMSE) —

from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.metrics import mean_squared_error
# 'demand' was only needed for the mean encoding; drop it from the feature matrices
X_train = X_train.drop(['demand'], axis = 1)
X_val = X_val.drop(['demand'], axis = 1)
models = [XGBRegressor(), LGBMRegressor(), CatBoostRegressor()]
for model in models:
    # fit each baseline model with its default hyperparameters
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)
    train_error = mean_squared_error(y_train, y_train_pred)
    validation_error = mean_squared_error(y_val, y_val_pred)
    print("Model: ", model)
    print("Train RMSE:", round(np.sqrt(train_error), 4))
    print("Validation RMSE:", round(np.sqrt(validation_error), 4))
Model                Train RMSE    Validation RMSE
XGBRegressor         25.5833       37.9044
LGBMRegressor        29.981        36.5979
CatBoostRegressor    28.8737       36.6496

(Here I am displaying the result of the three models in the tabular format for the sake of brevity).
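
For reference, RMSE is the square root of the mean of the squared differences between the actual and predicted values. A short worked sketch with made-up numbers (y_true and y_pred are illustrative, not hackathon data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Errors are -2, 2, -3, 3; their squares are 4, 4, 9, 9; the mean is 6.5
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's MSE followed by a square root
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(round(rmse_manual, 4), round(rmse_sklearn, 4))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.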

Comparing the models on the validation set, the LightGBM regressor has the lowest validation RMSE (36.5979), narrowly ahead of CatBoost (36.6496) and XGBoost (37.9044). So we select LGBMRegressor.

Now repeat all the above steps on the complete training dataset and make the prediction on the test dataset —

X_train = train_df.copy()
X_train = X_train.drop(['date'], axis = 1)
# encoding time2 variable
agg_df = pd.DataFrame(X_train.groupby(['time2']).agg({'demand':'mean'})).reset_index()
agg_df['demand'] = round(agg_df['demand'], 2)
agg_dict = dict(agg_df.values)
X_train['time2'] = X_train['time2'].apply(lambda x:agg_dict[x])
X_test['time2'] = X_test['time2'].apply(lambda x:agg_dict[x])
# lgbm regressor
y_train = X_train['demand']
X_train = X_train.drop(['demand'], axis = 1)
lgbm = LGBMRegressor()
lgbm.fit(X_train, y_train)
y_train_pred = lgbm.predict(X_train)
y_test_pred = lgbm.predict(X_test)
train_error = mean_squared_error(y_train, y_train_pred)
print("Train RMSE:", round(np.sqrt(train_error), 4))

Finally, make a submission with the predicted values —

submit_df = pd.read_csv('sample_4E0BhPN.csv')
submit_df['demand'] = y_test_pred
submit_df.to_csv('lgbm_baseline.csv', index = False)
# display the submission file
submit_df.head()

Post-Modelling Analysis

Let’s see which features contributed to predicting the rental car demand, and their relative importance.

feat_df = pd.DataFrame({'Features': X_train.columns, 'Relative importance': lgbm.feature_importances_})
imp_feat_df = feat_df.sort_values('Relative importance', ascending = False)
plt.figure(figsize = (8, 6))
sns.barplot(x = 'Relative importance', y = 'Features', data = imp_feat_df)
[Image: bar plot of the relative importance of each feature]

It turns out that ‘Dayofyear’ is the most important feature, followed by ‘Day’ and ‘hour’. The feature ‘Quarter’ contributes almost nothing to the prediction task, so it can be discarded.
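
By default, LGBMRegressor reports importance as the number of times a feature is used in a split, so the raw values are counts rather than shares. Converting them to shares makes it easier to spot negligible features. The sketch below uses hypothetical counts, not the values from the model above, and the 5% cut-off is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical split counts, NOT the values from the article's model
raw_importance = pd.Series(
    {"Dayofyear": 900, "Day": 600, "hour": 450, "Quarter": 50},
    name="splits",
)

# Share of all splits attributed to each feature, largest first
share = (raw_importance / raw_importance.sum()).sort_values(ascending=False)

# Features below an (arbitrary) 5% share are candidates for removal
negligible = share[share < 0.05].index.tolist()

print(share.round(3).to_dict())
print("Candidates to drop:", negligible)
```

Any feature flagged this way should still be checked by retraining without it and comparing the validation RMSE.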

Conclusion on Machine learning 

In the beginning, we were provided with hourly car rental data from mid-Aug 2018 to Feb 2021. We did some high-level analysis, followed by feature engineering. Using the newly extracted features, we tested our hypothesis by performing exploratory analysis. Finally, we trained a regression model to predict the hourly demand for rental cars from Mar 2021 to Mar 2022.

Key Takeaways on machine learning 

  • Feature engineering can drastically improve model performance, but overdoing it can lead to overfitting and poor performance on the test data.
  • Make sure you keep train and test (or validation) data separate while encoding categorical variables to avoid data leakage issues.
  • The rank on the public leaderboard is often deceptive. There your predictions are evaluated on only part of the test data, so tuning to the leaderboard may leave you with a sub-optimal model. Always test your models on validation data before deciding which one to finalize.
  • All models here were used as baselines with default hyperparameters. Once the best model is selected, we can try hyperparameter tuning to see if its performance improves further.
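
The last takeaway, hyperparameter tuning, could be sketched as a small grid search with time-ordered cross-validation. This is an illustration under stated assumptions, not what was done in the hackathon: scikit-learn's GradientBoostingRegressor stands in for LGBMRegressor, the data is synthetic, and the parameter grid is arbitrary.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic regression data standing in for the engineered features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# A small, arbitrary grid of candidate settings
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),         # each fold validates on later rows only
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises, hence the sign
)
search.fit(X, y)

best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 4))
```

TimeSeriesSplit keeps every validation fold later in time than its training fold, consistent with the unshuffled split used earlier.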

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

About the Author

Pratik Nabriya

Pratik Nabriya is a skilled data professional currently employed with an Analytics & AI firm based out of Noida. He is proficient in Machine learning, Deep learning, NLP, Time-Series Analysis, SQL, Python, Data analytics & visualization, and is familiar with working in a Cloud environment. In his spare time, he loves to compete in Hackathons, and write technical articles.
