One of the most common pieces of advice machine learning experts give beginners is – **focus on feature engineering**. Whether you are a beginner building your first model or someone who has won Kaggle competitions, following this advice works wonders!

I have personally seen the predictive power of several models improve significantly with the application of feature engineering.

### What is Feature Engineering?

Feature engineering is the science (and art) of extracting more information from existing data. You are not adding any new data; you are making the data you already have more useful.

For example, let’s say you are trying to predict footfall in a shopping mall based on dates. If you try to use the dates directly, you may not be able to extract meaningful insights from the data. This is because footfall is less affected by the day of the month than by the day of the week. This information about the day of the week is implicit in your data; you need to bring it out to make your model better.

This exercise of bringing out information from data is known as feature engineering.

### Process of Feature Engineering

You perform feature engineering once you have completed the first five steps of data exploration – variable identification, univariate analysis, bivariate analysis, missing values imputation and outlier treatment. Feature engineering itself can be divided into two steps:

- Variable transformation
- Variable / Feature creation.

These two techniques are vital in data exploration and have a remarkable impact on predictive power. Let’s understand each of these steps in more detail.

**What is Variable Transformation?**

In data modelling, transformation refers to the replacement of a variable by a function of it. For instance, replacing a variable x by its square root, cube root or logarithm is a transformation. In other words, transformation is a process that changes the distribution of a variable or its relationship with others.

Let’s look at the situations when variable transformation is useful.

### When should we use Variable Transformation?

Below are the situations where variable transformation is required:

- When we want to **change the scale** of a variable or standardize its values for better understanding. This transformation is a must if you have data on different scales, but it does not change the shape of the variable’s distribution.

- When we can **transform complex non-linear relationships into linear relationships**. A linear relationship between variables is easier to comprehend than a non-linear or curved one, and these transformations can also improve prediction. A scatter plot can be used to inspect the relationship between two continuous variables. Log transformation is one of the techniques commonly used in these situations.

- When a **symmetric distribution is preferred over a skewed distribution**, as it is easier to interpret and generate inferences from. Some modeling techniques require a normal distribution of variables, so whenever we have a skewed distribution we can use transformations that reduce skewness. For a right-skewed distribution, we take the square root, cube root or logarithm of the variable; for a left-skewed one, we take the square, cube or exponential of the variable.

- Variable transformation is also done from an **implementation point of view** (human involvement). Let’s understand this more clearly. In one of my projects on employee performance, I found that age had a direct correlation with the performance of the employee, i.e. the higher the age, the better the performance. From an implementation standpoint, launching an age-based programme might present challenges. However, categorizing the sales agents into three age buckets of <30 years, 30-45 years and >45 years, and then formulating a different strategy for each group, is a judicious approach. This categorization technique is known as binning of variables.
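The age-bucket idea above can be sketched in Python with pandas. The ages below are made-up illustrative values, not the actual data from the project mentioned in the text:

```python
import pandas as pd

# Hypothetical ages of sales agents (illustrative values only)
ages = pd.Series([24, 29, 33, 41, 45, 52, 58])

# Bin into the three strategy groups from the text: <30, 30-45 and >45 years.
# With right-closed bins, edges [0, 29, 45, 120] give (0,29], (29,45], (45,120].
age_group = pd.cut(ages, bins=[0, 29, 45, 120], labels=["<30", "30-45", ">45"])

print(age_group.value_counts().sort_index())
```

Each agent now carries a categorical group label that a strategy can be attached to, instead of a raw age.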

### What are the common methods of Variable Transformation?

There are various methods used to transform variables. As discussed, these include square root, cube root, logarithm, binning, reciprocal and many others. Let’s look at these methods in detail, highlighting their pros and cons.

**Logarithm:** Taking the log of a variable is a common transformation used to change the shape of its distribution, generally to reduce right skewness. However, it cannot be applied to zero or negative values.
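A quick sketch of the skewness-reducing effect of the log, using a small made-up right-skewed sample (sample skewness computed by hand to stay dependency-light):

```python
import numpy as np

# Hypothetical right-skewed sample, e.g. daily sales with a long right tail
x = np.array([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    return ((a - a.mean()) ** 3).mean() / a.std() ** 3

# The log compresses the long right tail
# (use np.log1p instead when zeros may be present)
x_log = np.log(x)

print(round(skewness(x), 2), round(skewness(x_log), 2))
```

The skewness of the logged variable is far smaller than that of the raw variable, which is exactly the effect described above.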

**Square / Cube root:** The square and cube root of a variable also affect its distribution, though not as strongly as the logarithmic transformation. The cube root has its own advantage: it can be applied to negative values as well as zero. The square root can only be applied to positive values and zero.
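A minimal illustration of this domain difference, with arbitrary example values:

```python
import numpy as np

# Cube root is defined for negatives and zero, unlike the logarithm
# (positive only) and the square root (non-negative only)
vals = np.array([-8.0, -1.0, 0.0, 1.0, 27.0])

print(np.cbrt(vals))                        # handles negatives and zero
print(np.sqrt(np.array([0.0, 4.0, 9.0])))   # non-negative inputs only
```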

**Binning:** This is used to categorize variables. It can be performed on original values, percentiles or frequencies, and the choice of categorization technique is based on business understanding. For example, we can categorize income into three categories, namely: High, Average and Low.
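A percentile-based version of the income example can be sketched with pandas’ `qcut`, which puts roughly equal counts in each bucket. The income figures below are hypothetical:

```python
import pandas as pd

# Hypothetical annual incomes (in thousands)
income = pd.Series([18, 22, 25, 30, 40, 45, 60, 90, 120, 300])

# Percentile/frequency-based binning: each of the 3 buckets gets
# roughly the same number of observations
income_band = pd.qcut(income, q=3, labels=["Low", "Average", "High"])

print(income_band.value_counts().sort_index())
```

For fixed business-driven thresholds instead of percentiles, `pd.cut` with explicit bin edges would be the natural choice.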

**What is Feature / Variable Creation & its Benefits?**

Feature / variable creation is the process of generating new variables / features from existing variable(s). For example, say we have date (dd-mm-yy) as an input variable in a data set. We can generate new variables like day, month, year, week and weekday that may have a better relationship with the target variable. This step is used to highlight the hidden relationships in a variable.
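The date example above can be sketched in pandas using the `.dt` accessor; the dates below are arbitrary sample values:

```python
import pandas as pd

# Hypothetical data keyed by date (e.g. the mall footfall example)
df = pd.DataFrame({"date": pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-08"])})

# Derive calendar features that may relate better to the target than the raw date
df["day"] = df["date"].dt.day
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["weekday"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

print(df)
```

A model can now learn a day-of-week effect directly, which the raw date column hid.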

There are various techniques to create new features. Let’s look at some of the commonly used methods:

**Creating derived variables:** This refers to creating new variables from existing variable(s) using a set of functions or different methods. Let’s look at it through the “**Titanic – Kaggle competition**”. In this data set, the variable age has missing values. To predict the missing values, we used the salutation (Master, Mr, Miss, Mrs) from the name as a new variable. How do we decide which variable to create? Honestly, this depends on the analyst’s business understanding, their curiosity and the set of hypotheses they might have about the problem. Methods such as taking the log of variables, binning and other variable transformations can also be used to create new variables.

**Creating dummy variables:** One of the most common applications of dummy variables is to convert a categorical variable into numerical variables. Dummy variables are also called indicator variables, and they are useful for taking a categorical variable as a predictor in statistical models. A dummy variable takes the values 0 and 1. Take the variable ‘gender’: we can produce two variables, namely “**Var_Male**” with values 1 (male) and 0 (not male), and “**Var_Female**” with values 1 (female) and 0 (not female). For a categorical variable with more than two classes, we can similarly create n or n-1 dummy variables.
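The gender example can be sketched with pandas’ `get_dummies`, which also shows the n vs. n-1 choice raised in the comments below (dropping one column avoids perfect multicollinearity):

```python
import pandas as pd

# Hypothetical categorical predictor
gender = pd.Series(["Male", "Female", "Female", "Male"], name="gender")

# Full one-hot encoding: one indicator per class (Var_Female, Var_Male)
full = pd.get_dummies(gender, prefix="Var")

# n-1 encoding: drop the first class to avoid the dummy-variable trap
reduced = pd.get_dummies(gender, prefix="Var", drop_first=True)

print(full.columns.tolist())     # both indicators
print(reduced.columns.tolist())  # one indicator is enough for two classes
```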

### For further reading, here is a list of transformation / creation ideas which can be applied to your data.

### End Notes

As mentioned at the start, the quality of and effort put into feature engineering differentiate a good model from a bad one. In this article, we discussed the two steps involved in feature engineering – variable transformation and variable creation. We also looked at some common examples and applications of feature engineering and its benefits.

This ends our series on data exploration and preparation. In this series, we looked at the seven steps of data exploration in detail. The aim of this series was to provide an in-depth, step-by-step guide to an extremely important process in data science.

Personally, I enjoyed writing this series for you and learnt from your feedback. Did you find this series useful? I would appreciate your suggestions/feedback. Please feel free to ask your questions through comments below.

I wanted to know more on feature engineering and variable selection. Thanks for this article.

Nice Article Sunil. Thanks for sharing the information.

A very comprehensive and detailed series Sunil! This is really very useful.

I however have a doubt in the part Creating dummy variables. If we create dummy variables for all categories, say n, then multicollinearity will be introduced. Rather we should create dummy variables for n-1 categories.

The article is great, please add an example, it would be of great help.

Here is a simple example using the famous iris data in R. You can see that R takes care of the creation of dummy variables.

```r
data(iris)

## Check the data structure; note that it has 4 numerical variables
## and one categorical variable with 3 levels:
## "setosa", "versicolor" and "virginica".
str(iris)

## Create a linear model to predict Sepal.Length from all the remaining variables.
model1 <- lm(Sepal.Length ~ ., data = iris)

## Check the model coefficients
round(summary(model1)$coef, 3)

##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)          2.171      0.280   7.760    0.000
## Sepal.Width          0.496      0.086   5.761    0.000
## Petal.Length         0.829      0.069  12.101    0.000
## Petal.Width         -0.315      0.151  -2.084    0.039
## Speciesversicolor   -0.724      0.240  -3.013    0.003
## Speciesvirginica    -1.023      0.334  -3.067    0.003
```

From the summary, you can see that two dummy variables are created for the categorical variable with 3 levels. As noted by Bolaka, R creates n-1 = 2 dummy variables for handling three levels.
