Feature Engineering Techniques to follow in Machine Learning

pavan Last Updated : 14 Dec, 2023

This article was published as a part of the Data Science Blogathon

Feature Engineering Techniques — Photo by Firmbee.com on Unsplash

What is a feature, and why do we need to engineer it? In general, all machine learning algorithms use some form of input data to generate outputs. This input data consists of features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly. This is where the need for feature engineering arises.

I believe that feature engineering efforts are primarily motivated by two objectives:

Preparing input data that is compatible with the machine learning algorithm’s requirements.
Improving the performance of ML models.

The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
— Luca Massaron

According to a Forbes survey, data scientists spend 80% of their time preparing data.

This metric demonstrates the significance of feature engineering in data science. As a result, I decided to write this article, which summarises the main techniques of feature engineering and provides brief descriptions of each.

I also included some simple Python scripts for each technique. To use them, you must first import the pandas and NumPy libraries (the snippets refer to them by their usual aliases, pd and np).

Some of the techniques listed below may work better with specific algorithms or datasets, while others may be helpful in all cases. This post does not intend to delve too deeply into this topic. Each of the methods listed below could fill a post of its own, so I have tried to keep the explanations brief and informative.

Practising different techniques on different datasets and observing their effect on model performance is the best way to gain expertise in feature engineering.

Let’s dive into techniques:

Imputation

Missing values are one of the most common issues that arise when attempting to prepare data for machine learning. Human errors, interruptions in the data flow, privacy concerns, and other factors could be the reason for missing values. Missing values, for whatever reason, have an impact on the performance of machine learning models.

Some machine learning platforms automatically drop rows with missing values during the model training phase, which reduces model performance due to the reduced training size. On the other hand, most algorithms reject datasets with missing values and return an error.

The most straightforward way to deal with missing values is to remove the affected rows or the entire column. There is no optimal dropping threshold, but you can take 80% as an example and drop the rows and columns whose proportion of missing values exceeds it.

threshold_value = 0.8

#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold_value]]

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold_value]

Numerical Imputation

Imputation is preferable to dropping because it preserves the size of the data. However, what you replace the missing values with is an important choice. I recommend starting by considering a suitable default value for missing values in the column. For example, if you have a column that contains only 1 and NA, the NA rows likely correspond to 0. Similarly, if you have a column that shows the “customer visit count in the last month,” the missing values can be replaced with 0 as long as you think that is a sensible solution.

Another cause of missing values is joining tables of different sizes, and in this situation, imputing 0 may be reasonable as well.

Apart from cases where a default value exists, I believe the best imputation method is to use the column medians. Column averages are sensitive to outlier values, whereas medians are more stable in this regard.

#Filling all missing values with 0
data = data.fillna(0)
#Alternatively, filling missing values with the medians of the numeric columns
data = data.fillna(data.median(numeric_only=True))
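
To see why the median is the steadier choice, here is a minimal sketch on made-up numbers (the `visits` series is hypothetical and only for illustration). A single extreme value pulls the mean far away from the typical range, while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing entry
visits = pd.Series([10, 12, 11, 13, 1000, np.nan])

mean_value = visits.mean()      # NaN is skipped; the outlier drags this up
median_value = visits.median()  # NaN is skipped; barely affected by the outlier

filled_with_mean = visits.fillna(mean_value)
filled_with_median = visits.fillna(median_value)

print(mean_value, median_value)
```

Here the mean-imputed value lands far above every ordinary observation, which is usually not what you want a "typical" replacement to look like.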

Categorical Imputation

For categorical columns, replacing missing values with the most frequent value (the mode) is a good choice. However, if there is no dominant value and the values are roughly uniformly distributed, imputing a category like “unknown” is more sensible, because in that case imputing the mode is close to a random selection.

#Max fill function for categorical columns
data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())
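
The two options described above can be sketched side by side. The `color` column and the placeholder label "Unknown" are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
data = pd.DataFrame({"color": ["red", "blue", "red", np.nan, "green", np.nan]})

# Option 1: fill with the most frequent label (the mode)
most_frequent = data["color"].mode()[0]
data["color_mode"] = data["color"].fillna(most_frequent)

# Option 2: fill with an explicit placeholder category
data["color_unknown"] = data["color"].fillna("Unknown")

print(data)
```

The placeholder keeps the fact that a value was missing visible to the model, which the mode-based fill hides.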

Binning

Binning can be applied to both numerical and categorical data.

#Numerical Bin example
Value     Bin
0-30    -> Low
31-70   -> Mid
71-100  -> High
#Categorical Bin example
Value       Bin
Spain   -> Europe
Italy   -> Europe
Chile   -> South America

The main reason for binning is to make the model more robust and to prevent overfitting; however, it comes at a cost in terms of performance. Every time you bin something, you give up information and make your data more regular. (For more information, see regularisation in machine learning.)

The key point of the binning process is the trade-off between performance and overfitting. In my opinion, binning may be redundant for numerical columns with some types of algorithms, except in some obvious overfitting cases, because of its effect on model performance.

However, for categorical columns, labels with low frequencies are likely to harm the robustness of statistical models. Assigning a general category to these less frequent values therefore helps keep the model robust. For example, if your dataset contains 10,000 rows, it might be a good idea to group labels with a count of less than 100 into a new category such as “Other.”

#Numerical Binning Example
#include_lowest=True puts the value 0 itself into the first bin
data['bin'] = pd.cut(data['value'], bins=[0,30,70,100], labels=["Low", "Mid", "High"], include_lowest=True)
   value   bin
0      2   Low
1     45   Mid
2      7   Low
3     85  High
4     28   Low
#Categorical Binning Example 
     Country
0      Spain
1      Chile
2  Australia
3      Italy
4     Brazil
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]
choices = ['Europe', 'Europe', 'South America', 'South America']
data['Continent'] = np.select(conditions, choices, default='Other')
     Country      Continent
0      Spain         Europe
1      Chile  South America
2  Australia          Other
3      Italy         Europe
4     Brazil  South America
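
Grouping low-frequency labels into a catch-all category, as suggested for categorical columns above, could look roughly like this. The data, the threshold `min_count`, and the label "Other" are all assumptions for the sketch; pick a threshold that suits the size of your dataset.

```python
import pandas as pd

# Hypothetical data: two frequent labels and two rare ones
data = pd.DataFrame({"country": ["Spain"] * 6 + ["Italy"] * 5 + ["Chile"] + ["Peru"]})

min_count = 2  # assumed threshold; tune it to your dataset size
counts = data["country"].value_counts()
rare_labels = counts[counts < min_count].index

# Keep frequent labels and replace the rare ones with "Other"
data["country_grouped"] = data["country"].where(
    ~data["country"].isin(rare_labels), "Other"
)

print(data["country_grouped"].value_counts())
```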

Outliers Handling

Before discussing how to handle outliers, I’d like to point out that visualising the data is the best way to detect outliers. All other statistical methodologies are prone to error, whereas visualising outliers allows for a more precise decision.

Statistical methodologies, as previously stated, are less precise, but they have an advantage in that they are fast. In this section, I will discuss two approaches to dealing with outliers. These will detect them through the use of standard deviation and percentiles.

Outlier Detection with Standard Deviation

If a value’s distance from the average is more than x * standard deviation, it is considered an outlier. So, what should x be?

There is no simple solution for x, but a value between 2 and 4 seems reasonable.

#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

Alternatively, the z-score can be used instead of the formula above. The z-score (or standard score) standardises the distance between a value and the mean using the standard deviation.
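
As a minimal sketch of the z-score variant, assuming a hypothetical `column` with nineteen ordinary values and one extreme one, the same filter can be written as:

```python
import pandas as pd

# Hypothetical column: 19 ordinary values and one extreme value
data = pd.DataFrame({"column": [10, 11, 12, 13, 14] * 3 + [10, 12, 13, 14] + [500]})

factor = 3
z_scores = (data["column"] - data["column"].mean()) / data["column"].std()

# Keep rows whose absolute z-score is within the threshold
filtered = data[z_scores.abs() < factor]

print(len(data), len(filtered))
```

Note that the outlier itself inflates both the mean and the standard deviation, so on very small samples a z-score threshold may never be exceeded.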

Outlier Detection with Percentiles

The use of percentiles is another mathematical method for detecting outliers. You can treat a certain percentage of the values at the top or the bottom as outliers. The key point, once again, is choosing the percentage, which depends on the distribution of your data, as mentioned earlier.

Furthermore, a common mistake is to compute percentiles from the range of the data. In other words, if your data ranges from 0 to 100, your top 5% is not the values between 96 and 100. The top 5% means the values that lie above the 95th percentile of the data.

#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]
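
The difference between "95% of the range" and "the 95th percentile" is easy to see on skewed data. The series below is made up for illustration only.

```python
import pandas as pd

# Hypothetical skewed column: most values are small, one is large
values = pd.Series([1] * 90 + [10] * 9 + [100])

# Cutoff taken at 95% of the way through the value range
range_based_cutoff = values.min() + 0.95 * (values.max() - values.min())

# Cutoff taken at the 95th percentile of the observations
percentile_cutoff = values.quantile(0.95)

print(range_based_cutoff, percentile_cutoff)
```

The two cutoffs differ by almost an order of magnitude here, because the percentile follows where the observations actually are rather than where the extremes happen to fall.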

Log Transform

The logarithm transformation (or log transform) is one of the most commonly used mathematical transformations in feature engineering. Its advantages are:

It helps in handling skewed data, and after transformation, the distribution becomes closer to normal.
In most cases, the order of magnitude of the data changes within its range. For example, the difference between the ages of 20 and 25 is not equivalent to the difference between the ages of 60 and 65. In terms of years the differences are identical, but a 5-year difference at a young age represents a larger difference in magnitude. The log transform normalises magnitude differences like this.
The effect of outliers is decreased and the model becomes more robust.

Note: The data you apply the log transform to must contain only positive values; otherwise the result is undefined (NumPy returns NaN with a warning). You can also add 1 to your data before transforming it, which guarantees that the output is non-negative for non-negative inputs.

Log(X+1)

#Log Transform Example
data = pd.DataFrame({'value':[2,45, -23, 85, 28, 2, 35, -12]})
#Negative values produce NaN here because log is undefined for x+1 <= 0
data['log(x+1)'] = (data['value']+1).transform(np.log)
#Negative Values Handling
#Note that the values are different
data['log(x-min(x)+1)'] = (data['value']-data['value'].min()+1).transform(np.log)
    value  log(x+1)     log(x-min(x)+1)
0      2   1.09861          3.25810
1     45   3.82864          4.23411
2    -23       nan          0.00000
3     85   4.45435          4.69135
4     28   3.36730          3.95124
5      2   1.09861          3.25810
6     35   3.58352          4.07754
7    -12       nan          2.48491
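
To illustrate the effect on skewness, here is a small sketch on made-up, strictly positive data. Powers of two are chosen on purpose (an assumption of this example) because their logarithms are evenly spaced, which makes the effect easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data (powers of two)
values = pd.Series([1, 2, 4, 8, 16, 32, 64, 128, 256], dtype=float)

skew_before = values.skew()          # clearly positive: long right tail
skew_after = np.log(values).skew()   # close to zero: evenly spaced after the log

print(skew_before, skew_after)
```

Real data will rarely become perfectly symmetric, so it is worth checking the skewness (or a histogram) before and after the transform.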

One-Hot Encoding

One-hot encoding is one of the most common encoding methods in machine learning. It spreads the values of a column across multiple flag columns and assigns 0 or 1 to them. These binary values express the relationship between the grouped and the encoded columns.

Categorical data is hard for algorithms to work with. This encoding converts it to a numerical format and lets you group categorical data without losing any information.

If you have N unique values in the column, it is sufficient to map them to N-1 binary columns, because the value of the remaining column can be deduced from the others. If all of the N-1 columns are 0, the remaining one must be 1. This is why it is known as one-hot encoding: for every row, exactly one of the N flags is “hot.”

Here’s an example using the get_dummies function of pandas, which maps all the values in a column to multiple columns.

encoded = pd.get_dummies(data['column'])
data = data.join(encoded).drop('column', axis=1)
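
The N-1 variant described above can be sketched with the `drop_first` option of `get_dummies`. The `city` column, the `prefix`, and `dtype=int` (to get 0/1 instead of booleans) are choices made for this example.

```python
import pandas as pd

# Hypothetical column with three distinct labels
data = pd.DataFrame({"city": ["Roma", "Madrid", "Madrid", "Istanbul"]})

# drop_first=True keeps N-1 flag columns; the dropped label is implied
# whenever all remaining flags are 0
encoded = pd.get_dummies(data["city"], prefix="city", drop_first=True, dtype=int)
data = data.join(encoded).drop("city", axis=1)

print(data)
```

In this sketch the first label in sorted order ("Istanbul") is the one that gets dropped, so the last row contains only zeros.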

Splitting Feature

Splitting features is a good way to make them useful for machine learning. Datasets often contain string columns that violate tidy data principles. By extracting the usable parts of a column into new features:

We make it possible for machine learning algorithms to understand them.
We allow them to be binned and grouped.
We improve the model’s performance by exposing potential information.

Splitting features is a smart choice, but there is no one-size-fits-all solution. How to split the column is determined by the column’s attributes. Let’s start with a couple of examples. For starters, here’s a simple split method for a regular name column:

data.name
0  Luther N. Gonzalez
1    Charles M. Young
2        Terry Lawson
3        Taylor White
4      Thomas Logsdon
#Extracting first names
data.name.str.split(" ").map(lambda x: x[0])
0     Luther
1    Charles
2      Terry
3     Taylor
4     Thomas
#Extracting last names
data.name.str.split(" ").map(lambda x: x[-1])
0    Gonzalez
1       Young
2      Lawson
3       White
4     Logsdon

Because the example above takes only the first and last elements of the split, it also handles names with more than two words, which makes it robust to such corner cases. Keep this in mind whenever you manipulate strings like these.

The split method is also helpful for extracting a string segment between two characters. The following example handles such a case by using two split calls in a row.

#String extraction example
data.title.head()
0                      Toy Story (1995)
1                        Jumanji (1995)
2               Grumpier Old Men (1995)
3              Waiting to Exhale (1995)
4    Father of the Bride Part II (1995)
data.title.str.split("(", n=1, expand=True)[1].str.split(")", n=1, expand=True)[0]
0    1995
1    1995
2    1995
3    1995
4    1995
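
As an alternative to chaining two splits, a regular expression can do the same extraction in one step. This is a sketch of a swapped-in technique rather than the method used above; the pattern assumes the year is a four-digit number in parentheses at the end of the title.

```python
import pandas as pd

data = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Father of the Bride Part II (1995)",
]})

# Capture four digits enclosed in parentheses at the end of the title
data["year"] = data["title"].str.extract(r"\((\d{4})\)\s*$", expand=False)

# Remove the trailing "(year)" part to keep a clean title
data["clean_title"] = data["title"].str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(data)
```

Anchoring the pattern to the end of the string helps with titles that contain other parentheses earlier in the text.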

Grouping

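In most machine learning algorithms, every instance is represented by a row in the training dataset, and every column shows a different feature of that instance. This kind of data is known as tidy.

When an instance is spread over multiple rows, we group the data by instance so that each instance is represented by only one row.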

Photo by @thiszun from Pexels

The key decision in a group-by operation is choosing the aggregation functions for the features. Average and sum functions are usually convenient for numerical features, whereas categorical features are more complicated.

I suggest two ways of aggregating categorical columns:

The first option is to choose the label with the highest frequency. In other words, this is the max operation for categorical columns, but ordinary max functions generally do not return this value, so a lambda function is required.

data.groupby('id').agg(lambda x: x.value_counts().index[0])

The second option is to apply a group-by function after one-hot encoding. This technique keeps all of the data and, at the same time, converts the encoded column from categorical to numerical.
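
Both options, together with the usual numerical aggregations, can be sketched as follows. The `visits` table and its column names are hypothetical, and summing the one-hot columns is one reasonable choice of aggregation rather than the only one.

```python
import pandas as pd

# Hypothetical transaction-like data: several rows per user
visits = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "city": ["Roma", "Madrid", "Roma", "Istanbul", "Istanbul"],
    "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Option 1: most frequent label per user
top_city = visits.groupby("user")["city"].agg(lambda x: x.value_counts().index[0])

# Option 2: one-hot encode first, then sum the flags per user
encoded = pd.get_dummies(visits["city"], prefix="city", dtype=int)
city_counts = pd.concat([visits[["user"]], encoded], axis=1).groupby("user").sum()

# Numerical feature: sum and mean per user
amount_stats = visits.groupby("user")["amount"].agg(["sum", "mean"])

print(top_city, city_counts, amount_stats, sep="\n\n")
```

After grouping, each user occupies exactly one row in every result, which is the tidy shape described above.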

Scaling

In most cases, the numerical features of a dataset do not share a fixed range and differ from one another. In real life, it is unreasonable to expect the age and income columns to have the same range. But how can these two columns be compared from a machine learning point of view?

This issue is solved by scaling. After a scaling operation, the continuous features become similar in terms of range. Although this step is not a must for many algorithms, it’s still a good idea to do so. Distance-based algorithms like k-NN and k-Means, on the other hand, require scaled continuous features as model input.

Normalization

All values are scaled in a specified range between 0 and 1 via normalisation (or min-max normalisation). This modification does not influence the feature’s distribution, but it does exacerbate the effects of outliers due to lower standard deviations. As a result, it’s a good idea to deal with outliers before normalisation.

data = pd.DataFrame({'feature':[2, 45, -23, 85, 28, 2, 35, -12]})
data['normalized'] = (data['feature'] - data['feature'].min()) / (data['feature'].max() - data['feature'].min()) 
   feature  normalized
0      2        0.23
1     45        0.63
2    -23        0.00
3     85        1.00
4     28        0.47
5      2        0.23
6     35        0.54
7    -12        0.10

Standardization

Standardization (also known as z-score normalisation) scales the values while taking the standard deviation into account. If the standard deviations of features differ, their ranges will differ from one another as well. This reduces the effect of outliers in the features.

data = pd.DataFrame({'feature':[2,45, -23, 85, 28, 2, 35, -12]})
data['standardized'] = (data['feature'] - data['feature'].mean()) / data['feature'].std() 
   feature  standardized
0      2         -0.52
1     45          0.70
2    -23         -1.23
3     85          1.84
4     28          0.22
5      2         -0.52
6     35          0.42
7    -12         -0.92
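
One detail worth knowing when reproducing these numbers: pandas computes the sample standard deviation by default (`ddof=1`), while some other tools divide by n instead. The sketch below compares the two; the column names `z_sample` and `z_population` are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({"feature": [2, 45, -23, 85, 28, 2, 35, -12]})

mean = data["feature"].mean()
sample_std = data["feature"].std()            # pandas default: ddof=1
population_std = data["feature"].std(ddof=0)  # divides by n instead of n - 1

data["z_sample"] = (data["feature"] - mean) / sample_std
data["z_population"] = (data["feature"] - mean) / population_std

print(data.round(2))
```

The difference shrinks as the number of rows grows, but on a tiny table like this one it is visible in the second decimal place.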

Date Extraction

Even though date columns typically provide helpful information about the model target, they are often either ignored as an input or used in a nonsensical way by machine learning algorithms. This may be because dates come in a variety of formats, which makes them difficult for algorithms to interpret, even when they are simplified to a format like “01-01-2020.”

If you don’t manipulate the date columns, it’s very difficult for a machine learning algorithm to build an ordinal relationship between the values. Here are three forms of date preparation that I recommend:

Extracting the parts of the date into separate columns: year, month, day, and so forth.
Extracting the period between the current date and the date column in years, months, days, and other units.
Extracting specific information from the date, such as the name of the weekday, whether it’s a weekend or not, whether it’s a holiday or not, and so on.

When you convert the date column into the extracted columns, as shown above, the information contained inside them is revealed, and machine learning algorithms can readily comprehend it.

from datetime import date

data = pd.DataFrame({'date':['01-01-2017','04-12-2008','23-06-2010','25-08-2005','20-02-2020']})

#Transform string to date
data['date'] = pd.to_datetime(data.date, format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()

#The passed_years and passed_months values depend on the day the script is run
        date  year  month  passed_years  passed_months   day_name
0 2017-01-01  2017      1             4             54     Sunday
1 2008-12-04  2008     12            13            151   Thursday
2 2010-06-23  2010      6            11            133  Wednesday
3 2005-08-25  2005      8            16            191   Thursday
4 2020-02-20  2020      2             1             17   Thursday
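
The weekend flag and a period measured in days, both mentioned in the list above, can be sketched like this. The fixed `reference_date` is an assumption made so that the result does not change from run to run; in practice you might use today's date instead.

```python
import pandas as pd

data = pd.DataFrame({"date": ["01-01-2017", "04-12-2008", "23-06-2010", "25-08-2005", "20-02-2020"]})
data["date"] = pd.to_datetime(data["date"], format="%d-%m-%Y")

# Monday is 0 and Sunday is 6, so 5 and 6 mean Saturday and Sunday
data["is_weekend"] = data["date"].dt.dayofweek >= 5

# Days elapsed relative to a fixed, assumed reference date
reference_date = pd.Timestamp("2021-07-01")
data["passed_days"] = (reference_date - data["date"]).dt.days

print(data)
```

A holiday flag would need an external calendar for the relevant country, so it is left out of this sketch.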

Conclusion

These techniques aren’t magic, so try them out and extract the key information from your features that helps the model perform better.

I hope you’ve found this article useful and that it helps you in your feature engineering process.

Frequently Asked Questions

Q1. What is “binning” in feature engineering?

Binning in feature engineering is like sorting data into groups to make it easier for computers to understand.

Q2. What is feature engineering in image processing?

In image processing, feature engineering is about helping computers recognize important things in pictures, like edges, shapes, and colors. It’s like teaching computers to understand what’s in the images.

Q3. Are there tools for feature engineering?

Yes, there are tools like Featuretools and TPOT that make feature engineering faster and easier.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.




