Tutorial to data preparation for training machine learning model

Vidhi 18 Dec, 2020 • 3 min read

This article was published as a part of the Data Science Blogathon.

Introduction

It happens quite often that we do not have all the features/input variables in one file, but spread across multiple files. Not just that, it might even need external information to make appropriate joins to prepare the final data in the right format for model training. In this article, we will see how data preparation and feature engineering (specifically target variable engineering) are the most time-consuming step of modeling pipeline.

Let’s get started and see what is the objective?

We will first do the necessary imports.

We have 3 datasets namely, station data, trip data and weather data.

Station Data:

Let’s see how station data looks like:

We have 71 stations in total over 5 cities:

Trip data:

We have seen 2 files till now and understand that the target variable is not given to us as is. We need to create the target variable which is net rate of change in trips at a station in a given hour. For this, we need to create start and end hour features from start and end date respectively.

Once we have the net_rate feature created as our target variable, we need to start bringing all the relevant attributes into one dataframe.

So, we have merged station and trip data to create an intermediate df named as, df_station_netrate:

Let’s see how this intermediate df ‘df_station_netrate’ looks like:

Weather Data:

Note that we have yet not studied weather data, so let’s see how to bring weather information to our intermediate df ‘df_station_netrate’ created above.

Missing values in weather data:

External Data:

Note that we have zipcode in weather data and lat-long information in intermediate df ‘df_station_netrate’, so we need zipcode to lat-long mapping to be able to merge weather data with station data.

We resort to open-source data for this information: https://public.opendatasoft.com/explore/dataset/us-zip-code-latitude-and-longitude/export/

Now that we have the necessary mapping, we merge it with weather data:

Final Data:

Now, we can merge this weather information with df_station_netrate, as our final training data preparation step:

We shall also create date type features as part of the feature engineering process:

Finally, we have one flat file to train a model. So, we have seen that training data preparation and feature engineering are the most arduous task in creating a machine learning model.

Data and the source code is kept here.

blogathon

Vidhi 18 Dec 2020

Beginner Data Exploration Machine Learning Python Structured Data

Tutorial to data preparation for training machine learning model

Introduction

Let’s get started and see what is the objective?

Station Data:

Trip data:

Weather Data:

External Data:

Final Data:

Frequently Asked Questions

Responses From Readers

Write for us

Machine Learning

Tutorial to data preparation for training machine learning model

Introduction

Let’s get started and see what is the objective?

Station Data:

Trip data:

Weather Data:

External Data:

Final Data:

Frequently Asked Questions

Responses From Readers

Write for us

Machine Learning

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

NaÃ¯ve Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices