# Simple Methods to deal with Categorical Variables in Predictive Modeling

Sunil 16 Jun, 2022

## Introduction

Categorical variables are known to hide and mask lots of interesting information in a data set, so it’s crucial to learn the methods of dealing with them. If you don’t, you’ll often miss out on finding the most important variables in a model. It happened to me: initially, I focused more on numerical variables and never actually got an accurate model. But later I discovered my flaws and learnt the art of dealing with such variables.

If you are a smart data scientist, you’d hunt down the categorical variables in the data set and dig out as much information as you can. Right? But if you are a beginner, you might not know the smart ways to tackle such situations. Don’t worry. I am here to help you out.

After receiving a lot of requests on this topic, I decided to write down a clear approach to help you improve your models using categorical variables.

Note: This article is written for beginners and newly minted predictive modelers. If you are an expert, you are welcome to share useful tips for dealing with categorical variables in the comments section below.

## What are the key challenges with categorical variables?

I’ve had nasty experiences dealing with categorical variables. I remember working on a data set where it took me more than 2 days just to understand the science of categorical variables. I’ve faced many instances where error messages didn’t let me move forward, and even my proven methods didn’t improve the situation.

But during this process, I learnt how to solve these challenges. I’d like to share all the challenges I faced while dealing with categorical variables. You’d find:

• A categorical variable has too many levels. This pulls down the performance of the model. For example, a categorical variable “zip code” would have numerous levels.
• A categorical variable has levels which rarely occur. Many of these levels have minimal chance of making a real impact on model fit. For example, a variable ‘disease’ might have some levels which rarely occur.
• There is one level which always occurs, i.e. for most of the observations in the data set there is only one level. Variables with such levels fail to make a positive impact on model performance due to very low variation.
• If the categorical variable is masked, it becomes a laborious task to decipher its meaning. Such situations are commonly found in data science competitions.
• You can’t fit categorical variables into a regression equation in their raw form. They must be treated.
• Most of the algorithms (or ML libraries) produce better results with numerical variables. In Python, the library “sklearn” requires features to be numerical arrays. For example, applying a random forest from sklearn on the Titanic data set, with only the two features sex and pclass as independent variables, returns an error because the feature “sex” is categorical and has not been converted to numerical form.

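To see this for yourself, here is a minimal sketch that reproduces the error, using a small hypothetical sample in place of the full Titanic data set:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical sample standing in for the Titanic data set
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 2, 3],
    "survived": [0, 1, 1, 0],
})

model = RandomForestClassifier(n_estimators=10, random_state=0)
try:
    # Fails: sklearn cannot fit on a raw string column
    model.fit(df[["sex", "pclass"]], df["survived"])
except ValueError as err:
    print("sklearn rejects the raw string feature:", err)
```

The `fit` call raises a `ValueError` because the string column “sex” cannot be converted to a numeric array. The methods below fix exactly this problem.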

## Proven methods to deal with Categorical Variables

Here are some methods I used to deal with categorical variable(s). A trick to get good results from these methods is ‘Iterations’. You must know that these methods may not improve results in all scenarios, but we should iterate our modeling process with different techniques and then evaluate the model performance. Below are the methods:

### Convert to Number

• Convert to number: As discussed above, some ML libraries do not take categorical variables as input. Thus, we convert them into numerical variables. Below are the methods to convert a categorical (string) input to numerical form:
• Label Encoder: It is used to transform non-numerical labels (or nominal categorical variables) to numerical labels. Numerical labels are always between 0 and n_classes-1. A common challenge with nominal categorical variables is that label encoding may decrease the performance of a model. For example: we have two features “age” (range: 0-80) and “city” (81 different levels). Now, when we apply a label encoder to the ‘city’ variable, it will represent ‘city’ with numeric values ranging from 0 to 80. The ‘city’ variable now looks similar to the ‘age’ variable, since both will have similar data points, which is certainly not the right approach.
• Convert numeric bins to number: Let’s say bins of a continuous variable are available in the data set, e.g. a variable “Age” with bins (0-17, 17-25, 26-35, …). We can convert these bins into definite numbers using the following methods:
• Using a label encoder for conversion. But these numerical bins will be treated the same as multiple levels of a non-numeric feature and hence won’t provide any additional information.
• Create a new feature using the mean or mode (most relevant value) of each age bucket. It adds an additional weight for the levels.
• Create two new features, one for the lower bound of the age bin and another for the upper bound. With this method, we obtain more information about these numerical bins compared to the earlier two methods.
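The first and last of these conversions can be sketched in a few lines. This is a minimal example with made-up city names and age bins:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "city":    ["Delhi", "Mumbai", "Pune", "Delhi"],
    "age_bin": ["0-17", "17-25", "26-35", "17-25"],
})

# 1) Label-encode a nominal variable: each level becomes an integer
#    in 0..n_classes-1 (sklearn assigns codes in alphabetical order)
le = LabelEncoder()
df["city_code"] = le.fit_transform(df["city"])

# 2) Split numeric bins into explicit lower/upper bound features
bounds = df["age_bin"].str.split("-", expand=True).astype(int)
df["age_lower"] = bounds[0]
df["age_upper"] = bounds[1]

print(df)
```

Here “Delhi”, “Mumbai”, “Pune” map to codes 0, 1, 2, and each age bin yields two numeric columns, so the model sees the actual age range instead of an arbitrary code.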

### Combine Levels

• Combine levels: To avoid redundant levels in a categorical variable and to deal with rare levels, we can simply combine the different levels. There are various methods of combining levels. Here are the commonly used ones:
• Using Business Logic: It is one of the most effective methods of combining levels. It makes sense to combine similar levels into similar groups based on domain or business experience. For example, we can combine levels of a variable “zip code” at the state or district level. This will reduce the number of levels and can also improve model performance.
• Using frequency or response rate: Combining levels based on business logic is effective, but we may not always have the domain knowledge. Imagine you are given a data set from the Aerospace Department, US Govt. How would you apply business logic here? In such cases, we combine levels by considering the frequency distribution or response rate.
• To combine levels using their frequency, we first look at the frequency distribution of each level and combine levels having a frequency less than 5% of total observations (5% is standard, but you can change it based on the distribution). This is an effective method to deal with rare levels.
• We can also combine levels by considering the response rate of each level. We can simply combine levels having a similar response rate into the same group.
• Finally, you can also look at both frequency and response rate to combine levels. You first combine levels based on response rate, then combine the rare levels into a relevant group.
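The frequency-based approach can be sketched with pandas. The column values and the 20% cutoff below are made up for illustration; in practice you would start from the 5% threshold mentioned above:

```python
import pandas as pd

# Hypothetical categorical column where 'WY' is a rare level
s = pd.Series(["NY", "NY", "CA", "CA", "CA", "TX", "TX", "WY"])

freq = s.value_counts(normalize=True)        # share of each level
rare = freq[freq < 0.20].index               # 20% cutoff for this tiny sample
combined = s.where(~s.isin(rare), "Other")   # fold rare levels into 'Other'

print(combined.tolist())
# → ['NY', 'NY', 'CA', 'CA', 'CA', 'TX', 'TX', 'Other']
```

Only ‘WY’ falls below the cutoff (1 of 8 observations), so it alone is merged into the catch-all ‘Other’ level.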

### Dummy Coding

• Dummy Coding: Dummy coding is a commonly used method for converting a categorical input variable into continuous variables. A ‘dummy’, as the name suggests, is a duplicate variable which represents one level of a categorical variable. Presence of the level is represented by 1 and absence by 0. For every level present, one dummy variable will be created. This method is also known as “One Hot Encoding”.
Note: Assume we have 500 levels in a categorical variable. Should we then create 500 dummy variables? If you can automate it, very well. Otherwise, I’d suggest you first reduce the levels using the combining methods above and then apply dummy coding. This would save you time.
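In pandas, dummy coding is a one-liner with `get_dummies`. A minimal sketch on a toy “sex” column:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "female"]})

# One 0/1 dummy column per level of 'sex'
dummies = pd.get_dummies(df["sex"], prefix="sex", dtype=int)
print(dummies)
#    sex_female  sex_male
# 0           0         1
# 1           1         0
# 2           1         0
```

Each row has exactly one 1 across the dummy columns, which is what lets sklearn consume the feature as a numerical array.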

## End Notes

In this article, we discussed the challenges you might face while dealing with categorical variables in modeling. We also discussed various methods to overcome those challenges and improve model performance. I’ve used Python for demonstration purposes and kept the focus of the article on beginners.

In order to keep the article simple and focused on beginners, I have not described advanced methods like “feature hashing”. I will take that up in a separate article in the future.

You must understand that these methods are subject to the data sets in question. I’ve seen even the most powerful methods fail to bring model improvement, whereas a basic approach can do wonders. Hence, you must evaluate the validity of these methods in the context of your own data set. If you still face any trouble, I shall help you out in the comments section below.