 Aman Preet Gulati — November 1, 2021

## Overview

In this article, we’ll discuss employee attrition prediction, i.e. predicting whether an employee will leave (resign from) their current company. We will do this with six machine learning algorithms, and the whole article is a step-by-step explanation. So let’s get started.

## Need for employee attrition prediction

1. Managing the workforce: If supervisors or HR know in advance which employees are planning to leave, they can get in touch with those employees and try to retain them, or plan the workforce by hiring replacements.
2. Smooth pipeline: If everyone on a project keeps working on it, the project pipeline stays smooth. If a key employee suddenly leaves the company, the workflow is disrupted.
3. Hiring management: If the HR team of a project knows which employees are likely to leave, it can plan the number of hires and bring in the right people when they are needed, keeping the work flowing efficiently.

## Table of contents

1. Importing libraries
2. Data exploration
3. Data cleaning
4. Splitting data (train test split)
5. Model development applying 6 ML algorithms
    1. Logistic Regression
    2. Decision tree
    3. KNN
    4. SVM
    5. Random Forest
    6. Naive Bayes
6. Saving model

## Importing libraries

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  # used later for the joint plots and heatmaps
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.metrics import accuracy_score
```

`attrdata = pd.read_csv("Table_1.csv")`

#### Let’s look at our dataset and plan to torture it!

`attrdata.head()`

## Data exploration

#### Dropping the first row (index label 0) and checking for null values

```attrdata.drop(0,inplace=True)
attrdata.isnull().sum()```

#### Output:

```table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       4
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           2
Stay/Left                0
dtype: int64```

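As we can see, there are null values in the `Experience (YY.MM)` (4) and `Job Role Match` (2) columns. Since at most six rows are affected, we simply drop them and then run `attrdata.isnull().sum()` again to confirm.

`attrdata.dropna(axis=0,inplace=True)`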

#### Output:

```table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       0
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           0
Stay/Left                0
dtype: int64```
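The null check and row drop above can be tried on a small frame first. This is a minimal sketch on made-up data: the `toy` frame and its values are assumptions for illustration, not rows from `Table_1.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the article's data) with two missing cells.
toy = pd.DataFrame({
    "Experience": [2.5, np.nan, 4.0, 1.2],
    "Job Role Match": ["Yes", "No", None, "Yes"],
    "Stay/Left": ["Stay", "Left", "Stay", "Left"],
})

null_counts = toy.isnull().sum()   # number of nulls per column
cleaned = toy.dropna(axis=0)       # drop every row that has at least one null

print(null_counts)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows is reasonable here because only a handful of rows are affected; with many missing values, imputing them would usually be worth considering instead.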

The shape of our dataset

`attrdata.shape`

Output:

`(895, 16)`

#### Let’s explore all the categorical values and visualize them

Now, we will use the value_counts() function to get each unique value of every categorical column along with how often it occurs.

#### Gender

```gender_dict = attrdata["Gender "].value_counts()
gender_dict```

Output:

```Male      655
Female    234
other       6
Name: Gender , dtype: int64```

#### Understanding the balancing of the Gender column visually

`attrdata['Gender '].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Count of different gender")`

Output:

Here, from the chart, it’s visible that the count of males is higher than that of any other gender category.

1. Male: 655
2. Female: 234
3. Other: 6

Now, let’s figure out how gender relates to whether employees leave the company or stay.

```
# Create a plot for the crosstab (rows: gender, columns: Stay/Left)

pd.crosstab(attrdata['Gender '],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Gender")
plt.xlabel("Gender")  # the x-axis holds the crosstab rows, i.e. gender
plt.ylabel("No. of people who left/stayed based on gender")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
```

Output:

Here, from the chart, it’s visible that males dominate both groups. We can also see that in every category (male, female, or other) more employees are staying with the company than leaving.
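Because the gender groups differ so much in size, raw counts can hide the share of leavers inside each group. A minimal sketch of a row-normalized crosstab, using a made-up `toy` frame (an assumption for illustration, not the article’s data):

```python
import pandas as pd

# Hypothetical toy data: 4 male and 2 female employees.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "Stay/Left": ["Stay", "Left", "Stay", "Stay", "Left", "Stay"],
})

# Absolute counts per gender and outcome.
counts = pd.crosstab(toy["Gender"], toy["Stay/Left"])

# normalize="index" turns each row into proportions that sum to 1.
rates = pd.crosstab(toy["Gender"], toy["Stay/Left"], normalize="index")

print(counts)
print(rates)
```

Plotting `rates` instead of `counts` with `.plot(kind="bar")` puts the groups on a comparable scale.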

#### Promotion (Promoted/ Non-Promoted)

```promoted_dict = attrdata["Promoted/Non Promoted"].value_counts()
promoted_dict```

Output:

```Promoted        457
Non Promoted    438
Name: Promoted/Non Promoted, dtype: int64```
`attrdata['Promoted/Non Promoted'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Promoted and Non Promoted")`

Output:

Now, from the above chart, we can see that Promoted (457) and Non Promoted (438) employees are quite balanced in number.

Now, let’s figure out how promotion relates to whether employees leave the company or stay.

```
# Create a plot for the crosstab (rows: promotion status, columns: Stay/Left)

pd.crosstab(attrdata['Promoted/Non Promoted'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Promoted/Non Promoted")
plt.xlabel("Promoted/Non Promoted")  # the x-axis holds the crosstab rows
plt.ylabel("No. of people who left/stayed based on promotion")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
```

Output:

Here, from the chart, it’s visible that employees who are not promoted leave the company more often than those who are promoted, which is what we would expect.

#### Function (Operation/ Support/ Sales)

```func_dict = attrdata["Function"].value_counts()
func_dict```

Output:

```Operation    831
Support       52
Sales         12
Name: Function, dtype: int64```
`attrdata['Function'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Functions in organization")`

Output:

Now, we can see that the majority of employees work in Operation, followed by Support, with Sales last.

Now, let’s figure out how function relates to whether employees leave the company or stay.

```
# Create a plot for the crosstab (rows: function, columns: Stay/Left)

pd.crosstab(attrdata['Function'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Function")
plt.xlabel("Function")  # the x-axis holds the crosstab rows
plt.ylabel("No. of people who left/stayed based on function of organization")
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
```

Output:

Here, in the chart, we can see that most employees are in the Operation function, and a large number of employees in that function are staying with the company.

#### Hiring (Direct/ Agency/ Employee referral)

```Hiring_dict = attrdata["Hiring Source"].value_counts()
Hiring_dict```

Output:

```Direct               708
Agency               116
Employee Referral     71
Name: Hiring Source, dtype: int64```

#### Marital Status (Single/ Married/ Separated/ Div./ NTBD)

```Marital_dict = attrdata["Marital Status"].value_counts()
print(Marital_dict)```

Output:

```Single    533
Marr.     356
Sep.        2
Div.        2
NTBD        2
Name: Marital Status, dtype: int64```

#### Employee group

```Emp_dict = attrdata["Emp. Group"].value_counts()
Emp_dict['other group'] = 1
print(Emp_dict)```

Output:

```B1             537
B2             275
B3              59
B0               8
B4               7
B5               4
B7               2
D2               1
C3               1
B6               1
other group      1
Name: Emp. Group, dtype: int64```

#### Job role match (Yes/ No)

```job_dict = attrdata["Job Role Match"].value_counts()
job_dict```

Output:

```Yes    480
No     415
Name: Job Role Match, dtype: int64```
`attrdata['Job Role Match'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Job Role Match")`

Output:

Now, we can see that the majority of employees (480 of 895) have a job role that matches.

```
# Create a plot for the crosstab (rows: job role match, columns: Stay/Left)

pd.crosstab(attrdata['Job Role Match'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Job Role Match")
plt.xlabel("Job Role Match")  # the x-axis holds the crosstab rows
plt.legend(["Left","Stay"])
plt.xticks(rotation=0)
```

Output:

Here, in the above chart, we can see that employees whose job role matches are more likely to stay with the company than those whose job role does not match.

#### Tenure group

```tenure_dict = attrdata["Tenure Grp."].value_counts()
print(tenure_dict)```

Output:

```> 1 & < =3    626
< =1          269
Name: Tenure Grp., dtype: int64```

#### Now let’s visualize some continuous data

```# Its Age vs stay/left
sns.jointplot(x='Stay/Left',y='Age in YY.',data=attrdata)```

Output:

In the above graph, we can see that older employees tend to stay with the company, while comparatively younger employees are more likely to leave.

`sns.jointplot(x='Stay/Left',y='Experience (YY.MM)',data=attrdata)`

Output:

Here in the above graph, we can see that employees with more experience tend to stay with the company, while those with comparatively less experience are more likely to leave.

Here, we first compute the correlation between variables on the unprocessed dataset. Most columns are still categorical, so the plot cannot show much yet; later in the article, we will see a better correlation plot built from the processed data.

```
# Let's make our correlation matrix visual
# numeric_only=True skips the text columns (needed in pandas 2.0+)
corr_matrix=attrdata.corr(numeric_only=True)
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
               annot=True,
               linewidths=0.5,
               fmt=".2f"
               )
```

Output:

## Data cleaning

#### Encoding the locations column (categorized)

Build a new dictionary (`location_dict_new`) to encode the location column. It maps each region name to an integer, since a machine learning model cannot interpret the raw names; any location that is not listed falls under 'other place'.

```
location_dict = attrdata["Location"].value_counts()
print(location_dict)

location_dict_new = {
    'Chennai':       7,
    'Noida':         6,
    'Bangalore':     5,
    'Hyderabad':     4,
    'Pune':          3,
    'Madurai':       2,
    'Lucknow':       1,
    'other place':   0,
}

print(location_dict_new)
```

Output:

```Chennai       255
Noida         236
Bangalore     210
Pune           55
Lucknow        20
Nagpur         14
Mumbai          4
Gurgaon         3
Kolkata         1
Name: Location, dtype: int64
{'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0}```

Now we will write a function that creates a new column holding the encoded location values, because our machine learning algorithms only understand int/float values.

```
def location(x):
    # Look up the integer code; unknown cities fall back to 'other place'
    if str(x) in location_dict_new.keys():
        return location_dict_new[str(x)]
    else:
        return location_dict_new['other place']

data_l = attrdata["Location"].apply(location)
attrdata['New Location'] = data_l
```

Output:
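The same dictionary-based encoding can also be sketched without a helper function, using `Series.map` followed by `fillna` for the fallback. The shortened `location_codes` dictionary and the `locations` series below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical, shortened code table (the article's dictionary has more cities).
location_codes = {"Chennai": 7, "Noida": 6, "Bangalore": 5, "other place": 0}

locations = pd.Series(["Chennai", "Nagpur", "Noida", "Mumbai"])

# map() returns NaN for cities missing from the dictionary;
# fillna() then assigns them the 'other place' code.
encoded = (
    locations.map(location_codes)
    .fillna(location_codes["other place"])
    .astype(int)
)

print(encoded.tolist())
```

Both approaches give the same result; the vectorized version is mainly shorter.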

### get_dummies()

The pandas get_dummies() function converts categorical values into dummy (one-hot) variables. We apply it to the following columns:

1. Function
2. Hiring Source
3. New Marital
4. New Gender
5. Tenure group

```
gen = pd.get_dummies(attrdata["Function"])
```

Output:

```
hr = pd.get_dummies(attrdata["Hiring Source"])
```

Output:
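To see what get_dummies() produces, here is a minimal sketch on a made-up column (the `toy` frame is an assumption for illustration). Each category becomes its own 0/1 column, and every row has exactly one 1.

```python
import pandas as pd

# Hypothetical toy column with three hiring sources.
toy = pd.DataFrame(
    {"Hiring Source": ["Direct", "Agency", "Direct", "Employee Referral"]}
)

# dtype=int gives 0/1 integers instead of booleans.
dummies = pd.get_dummies(toy["Hiring Source"], dtype=int)

print(dummies)
```

The new columns are named after the category values, which is why columns such as `Agency`, `Direct`, and `Employee Referral` appear in the processed dataset later on.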

#### Marital Status

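Here, the Mar() function uses the keys of the marital dictionary (`Marital_dict`): statuses that occur more than 100 times keep their label, and all rarer ones are grouped as 'other status', so the column has fewer categories before it is encoded for our ML models.

```
def Mar(x):
    # Keep frequent statuses; group rare ones (count <= 100) together
    if str(x) in Marital_dict.keys() and Marital_dict[str(x)] > 100:
        return str(x)
    else:
        return 'other status'

data_l = attrdata["Marital Status"].apply(Mar)
attrdata['New Marital'] = data_l
```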

Output:

Using the get_dummies() function on New Marital, we convert its categorical values into dummy variables.

```
Mr = pd.get_dummies(attrdata["New Marital"])
```

Output:

#### Promoted/Non Promoted

Here, with the help of the Promoted() function, we convert the Promoted and Non Promoted values into 1 and 0 respectively for encoding purposes.

```
def Promoted(x):
    if x == 'Promoted':
        return 1
    else:
        return 0

data_l = attrdata["Promoted/Non Promoted"].apply(Promoted)
attrdata['New Promotion'] = data_l
```

Output:

#### Employee Group

Here, we first create a dictionary for the employee group that tags the most common groups with integer values. The emp() function then keeps those groups and labels every other group as 'other group', similar to marital status.

```
Emp_dict_new = {
    'B1': 4,
    'B2': 3,
    'B3': 2,
    'other group': 1,
}

def emp(x):
    if str(x) in Emp_dict_new.keys():
        return str(x)
    else:
        return 'other group'

data_l = attrdata["Emp. Group"].apply(emp)
attrdata['New EMP'] = data_l

# Note: this rebinds the name emp from the function to the dummy-variable frame
emp = pd.get_dummies(attrdata["New EMP"])
```

Output:

#### Job Role Match

Here, the Job() function converts the categorical values Yes and No into the integers 1 and 0, and we store the result in the New Job Role Match column.

```
def Job(x):
    if x == 'Yes':
        return 1
    else:
        return 0

data_l = attrdata["Job Role Match"].apply(Job)
attrdata['New Job Role Match'] = data_l
```

Output:

#### Gender

Here, the Gen() function keeps every value that appears among the keys of gender_dict and labels anything else as 'other'; the result is stored in the New Gender column.

```
def Gen(x):
    if x in gender_dict.keys():
        return str(x)
    else:
        return 'other'

data_l = attrdata["Gender "].apply(Gen)
attrdata['New Gender'] = data_l
```

Output:

We use the get_dummies() function for the same purpose on the New Gender and Tenure Grp. columns.

```
gend = pd.get_dummies(attrdata["New Gender"])
```

Output:

```
tengrp = pd.get_dummies(attrdata["Tenure Grp."])
```

Output:

Now, let’s concatenate the original data with the dummy-variable columns we created to form the processed dataset.

```
dataset = pd.concat([attrdata, hr, Mr, emp, tengrp, gen, gend], axis = 1)
```

Output:

`dataset.columns`

Output:

```Index(['table id', 'name', 'phone number', 'Location', 'Emp. Group',
'Function', 'Gender ', 'Tenure', 'Tenure Grp.', 'Experience (YY.MM)',
'Marital Status', 'Age in YY.', 'Hiring Source',
'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left', 'New Location',
'New Marital', 'New Promotion', 'New EMP', 'New Job Role Match',
'New Gender', 'Agency', 'Direct', 'Employee Referral', 'Marr.',
'Single', 'other status', 'B1', 'B2', 'B3', 'other group', '< =1', '> 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male',
'other'],
dtype='object')```

Let’s drop the columns which are not needed anymore: identifiers and the original columns that we have already encoded.

```dataset.drop(["table id", "name", "Marital Status","Promoted/Non Promoted","Function","Emp. Group","Job Role Match","Location"
,"Hiring Source","Gender ", 'Tenure', 'New Gender', 'New Marital', 'New EMP'],axis=1,inplace=True)

dataset1 = dataset.drop(['Tenure Grp.', 'phone number'], axis = 1)
dataset1.columns```

Output:

```Index(['Experience (YY.MM)', 'Age in YY.', 'Stay/Left', 'New Location',
'New Promotion', 'New Job Role Match', 'Agency', 'Direct',
'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2',
'B3', 'other group', '< =1', '> 1 & < =3', 'Operation', 'Sales',
'Support', 'Female', 'Male', 'other'],
dtype='object')```

As mentioned earlier, this is the correlation plot on the processed dataset.

```
# Let's make our correlation matrix visual
# numeric_only=True skips the text target column 'Stay/Left' (needed in pandas 2.0+)
corr_matrix=dataset1.corr(numeric_only=True)
fig,ax=plt.subplots(figsize=(15,10))
ax=sns.heatmap(corr_matrix,
               annot=True,
               linewidths=0.5,
               fmt=".2f"
               )
```

Output:

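Let’s see our target column. The Target() function below, which would convert the labels to booleans, is left disabled inside a string literal, so the column keeps its 'Stay'/'Left' labels.

```
# Target
"""
def Target(x):
    if x in "Stay":
        return False
    else:
        return True

data_l = dataset1["Stay/Left"].apply(Target)
dataset1['Stay/Left'] = data_l
"""
dataset1['Stay/Left'].head()
```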

Output:

```1    Stay
2    Stay
3    Stay
4    Stay
5    Stay
Name: Stay/Left, dtype: object```

Saving the cleaned dataset into another CSV file

`dataset1.to_csv("processed table.csv")`

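Now we read the processed data back and separate the features from the target column. Because `to_csv` also wrote the index, the reloaded frame contains an extra `Unnamed: 0` column (passing `index=False` to `to_csv` would avoid it).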

```dataset = pd.read_csv("processed table.csv")
dataset = pd.DataFrame(dataset)
y = dataset["Stay/Left"]
X = dataset.drop("Stay/Left",axis=1)```

## Splitting data – Train test split

```
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4)
```

Output:
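With 895 rows and `test_size=0.2`, the test set holds 179 rows, which matches the support shown in the classification reports later. The sketch below illustrates the same split on synthetic data and adds `stratify`, an optional argument the article does not use, to keep the Stay/Left ratio identical in both parts. The arrays `X_toy` and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 70 "Stay" and 30 "Left".
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array(["Stay"] * 70 + ["Left"] * 30)

# stratify=y_toy preserves the 70/30 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print((y_te == "Left").sum(), "of", len(y_te), "test rows are 'Left'")
```

Stratifying is mostly useful when one class is much rarer than the other, so that a random split cannot leave the test set with too few examples of it.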

## Model Development

```from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm```

#### Initializing the models

1. Logistic Regression: C is the inverse of regularization strength (float), random_state is an int seed, and solver can be sag, saga, liblinear, etc. (here, we are using liblinear).
2. Decision trees: Default parameters
3. Random forest: Default parameters
4. Gaussian Naive Bayes: Default parameters
5. K-nearest neighbors: n_neighbors=3; other numbers of neighbors can be used too.
6. Support vector machines: the kernel can be linear, polynomial, RBF, or sigmoid. Here we are using a linear kernel function.

```
lr=LogisticRegression(C = 0.1, random_state = 42, solver = 'liblinear')
dt=DecisionTreeClassifier()
rm=RandomForestClassifier()
gnb=GaussianNB()
knn = KNeighborsClassifier(n_neighbors=3)
# Note: this rebinds the name svm from the sklearn module to the classifier
svm = svm.SVC(kernel='linear')
```

Now, from one block of code, we will check the accuracy of all the models.

```
for a,b in zip([lr,dt,knn,svm,rm,gnb],["Logistic Regression","Decision Tree","KNN","SVM","Random Forest","Naive Bayes"]):
    a.fit(X_train,y_train)
    prediction=a.predict(X_train)   # predictions on the training data
    y_pred=a.predict(X_test)        # predictions on the held-out test data
    score1=accuracy_score(y_train,prediction)
    score=accuracy_score(y_test,y_pred)
    msg1="[%s] training data accuracy is : %f" % (b,score1)
    msg2="[%s] test data accuracy is : %f" % (b,score)
    print(msg1)
    print(msg2)
```

Output:

```[Logistic Regression] training data accuracy is : 0.891061
[Logistic Regression] test data accuracy is : 0.877095
[Decision Tree] training data accuracy is : 1.000000
[Decision Tree] test data accuracy is : 0.849162
[KNN] training data accuracy is : 0.804469
[KNN] test data accuracy is : 0.586592
[SVM] training data accuracy is : 0.878492
[SVM] test data accuracy is : 0.865922
[Random Forest] training data accuracy is : 1.000000
[Random Forest] test data accuracy is : 0.888268
[Naive Bayes] training data accuracy is : 0.870112
[Naive Bayes] test data accuracy is : 0.826816```
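KNN scores far lower on the test data than the other models. One possible contributor, which we have not verified on this dataset, is that KNN is distance based, so a column with a very large range (such as an ID-like index column) can dominate the distance. The sketch below uses synthetic data only; `informative` and `id_like` are made-up features, and the comparison with `StandardScaler` is an illustration rather than the article’s method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one useful feature and one huge-scale noise feature.
y_toy = np.repeat([0, 1], 200)
informative = y_toy * 4.0 + rng.normal(size=400)   # separates the classes
id_like = rng.normal(size=400) * 1e6               # pure noise, huge scale
X_toy = np.column_stack([informative, id_like])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# KNN on raw features: distances are dominated by the noise column.
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)

# KNN after standardizing every feature to zero mean and unit variance.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print("raw: %.3f, scaled: %.3f" % (raw_acc, scaled_acc))
```

If scaling helps on the real data as well, the same pipeline object can be used in place of `knn` in the loop above.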

#### Model Scores (accuracy)

```model_scores={'Logistic Regression':lr.score(X_test,y_test),
'KNN classifier':knn.score(X_test,y_test),
'Support Vector Machine':svm.score(X_test,y_test),
'Random forest':rm.score(X_test,y_test),
'Decision tree':dt.score(X_test,y_test),
'Naive Bayes':gnb.score(X_test,y_test)
}
model_scores```

Output:

```{'Logistic Regression': 0.8770949720670391,
'KNN classifier': 0.5865921787709497,
'Support Vector Machine': 0.8659217877094972,
'Random forest': 0.888268156424581,
'Decision tree': 0.8491620111731844,
'Naive Bayes': 0.8268156424581006}```

Here, we can see that Logistic regression and Random forest have the best accuracy.

#### Classification Report of Random forest

```from sklearn.metrics import classification_report

rm_y_preds = rm.predict(X_test)

print(classification_report(y_test,rm_y_preds))```

Output:

```
              precision    recall  f1-score   support

        Left       0.81      0.77      0.79        61
        Stay       0.88      0.91      0.90       118

    accuracy                           0.86       179
   macro avg       0.85      0.84      0.84       179
weighted avg       0.86      0.86      0.86       179
```

#### Classification Report of Logistic Regression

```from sklearn.metrics import classification_report

lr_y_preds = lr.predict(X_test)

print(classification_report(y_test,lr_y_preds))```

Output:

```
              precision    recall  f1-score   support

        Left       0.81      0.84      0.82        61
        Stay       0.91      0.90      0.91       118

    accuracy                           0.88       179
   macro avg       0.86      0.87      0.86       179
weighted avg       0.88      0.88      0.88       179
```
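The precision and recall values in these reports come from the confusion matrix, which we imported at the start but have not used so far. Here is a minimal sketch on made-up labels (`y_true` and `y_pred` are assumptions, not the model’s predictions) showing how the “Left” row of a report is derived.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six employees.
y_true = ["Left", "Left", "Stay", "Stay", "Stay", "Left"]
y_pred = ["Left", "Stay", "Stay", "Stay", "Left", "Left"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["Left", "Stay"])

# Recall for "Left": correctly predicted leavers / all true leavers.
recall_left = cm[0, 0] / cm[0].sum()
# Precision for "Left": correctly predicted leavers / all predicted leavers.
precision_left = cm[0, 0] / cm[:, 0].sum()

print(cm)
print(round(precision_left, 2), round(recall_left, 2))
```

For attrition, recall on “Left” is often the number to watch, since a missed leaver is usually more costly than a false alarm.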

#### Model Comparison

Based on the accuracy

```model_compare=pd.DataFrame(model_scores,index=['accuracy'])
model_compare```

Output:

#### Visualize the accuracy of each model

`model_compare.T.plot(kind='bar') # (T is here for transpose)`

Output:

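Yes, we can see that Random Forest has about 1% better test accuracy than Logistic Regression, but its training accuracy is 1.0 against a test accuracy of about 0.89, which indicates overfitting. Logistic Regression scores about 0.89 on training data and 0.88 on test data, so we will select Logistic Regression.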
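A single train/test split can make a 1% difference look more meaningful than it is. One common way to compare models more robustly is k-fold cross-validation; the sketch below is an illustration on synthetic data from `make_classification`, not a rerun of the article’s experiment, and the names `X_toy`, `y_toy`, and `cv_means` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(solver="liblinear"),
    "Random Forest": RandomForestClassifier(random_state=0),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validation: each fold serves once as the test set.
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    cv_means[name] = scores.mean()
    print("%s: mean accuracy %.3f (std %.3f)" % (name, scores.mean(), scores.std()))
```

When the difference between two means is smaller than the spread across folds, choosing the simpler model, as we do here, is a reasonable call.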

#### Feature importance

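These coefficients (`coef_`) tell us how much, and in which direction, each feature contributed to predicting the target variable.

`# Logistic regression`
```
# lr.coef_ has shape (1, n_features), so take its first row
feature_dict=dict(zip(X.columns,lr.coef_[0]))
feature_dict
```

This is a type of model-driven exploratory data analysis. Note that the names must come from `X.columns`: `dataset.columns` still contains the target `Stay/Left`, so zipping it with the coefficients shifts every label after `Age in YY.` by one position. The output below was produced that way, which is why it lists a coefficient for `Stay/Left`.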

Output:

```{'Unnamed: 0': -0.000369638613255323,
'Experience (YY.MM)': 0.13976826898148667,
'Age in YY.': -0.01962203036690505,
'Stay/Left': 0.024627352503955716,
'New Location': 0.09666512057880872,
'New Promotion': 2.7533361395664873,
'New Job Role Match': -0.31312348489873837,
'Agency': -0.020270885128439962,
'Direct': 0.2047872081147744,
'Employee Referral': 0.38796318779035893,
'Marr.': -0.5042922677790985,
'Single': -0.012278081923656183,
'other status': -0.22031197156191443,
'B1': 0.17649315194124499,
'B2': -0.09367965851449515,
'B3': 0.008891316222766718,
'other group': -0.12993373252552307,
'> 1 & < =3': -0.05949484149777713,
'Operation': -0.018200464251483424,
'Sales': -0.05091185616314798,
'Support': 0.12249364800096194,
'Female': -0.1951606022872554,
'Male': -0.055940207626112265}```
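When pairing coefficients with feature names, it helps to take the names from the exact frame the model was fitted on and to sort by absolute value. The sketch below is a minimal illustration on synthetic data; the columns `strong` and `weak` are made up, and the target is constructed so that `strong` matters far more.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical features: one with a strong effect and one with a weak effect.
X_toy = pd.DataFrame({
    "strong": rng.normal(size=300),
    "weak": rng.normal(size=300),
})
noise = rng.normal(scale=0.5, size=300)
y_toy = (3 * X_toy["strong"] + 0.1 * X_toy["weak"] + noise > 0).astype(int)

model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem; index it by the
# columns of the frame that was actually used for fitting.
coefs = pd.Series(model.coef_[0], index=X_toy.columns)

# Rank features by the magnitude of their coefficient.
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

Coefficient magnitudes are only comparable between features on similar scales, so standardizing the inputs first makes such a ranking easier to trust.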

#### Visualize feature importance

```
# index=[0] is required because every dictionary value is a scalar
feature_df=pd.DataFrame(feature_dict,index=[0])
feature_df.T.plot(kind="bar",legend=False,title="Feature Importance")
```

Output:

As we can see, the largest coefficient by far (about 2.75) is the one listed under “New Promotion”.

#### Approach -1

We save the Logistic Regression model: it has nearly the best test accuracy and is neither overfitted nor underfitted.

```
import pickle

# Save the trained model as a pickle string.
saved_model = pickle.dumps(lr)

# Load the pickled model back from the string
lr_from_pickle = pickle.loads(saved_model)

# Use the loaded pickled model to make predictions
lr_from_pickle.predict(X_test)
```

Output:

```array(['Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
'Left', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
'Left', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Stay',
'Left', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay',
'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Left', 'Stay',
'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay',
'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left',
'Left', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left',
'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay',
'Stay', 'Stay', 'Stay'], dtype=object)```
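A quick way to check that pickling preserved the model is to compare the predictions of the original and the restored object. This is a minimal sketch on a tiny made-up dataset (`X_toy`, `y_toy`), not on the article’s data.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset with string labels like the article's target.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(["Left", "Left", "Stay", "Stay"])

model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

blob = pickle.dumps(model)      # serialize the fitted model to bytes
restored = pickle.loads(blob)   # rebuild an equivalent model from the bytes

# The restored model should predict exactly what the original predicts.
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(type(blob).__name__, same)
```

Keep in mind that pickles should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.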

#### Approach – 2

```
# loading dependency
import joblib

# saving our model - model: lr, filename: model_lr
joblib.dump(lr , 'model_lr')

# opening the file - model_lr
m_jlib = joblib.load('model_lr')

# check prediction
m_jlib.predict(X_test) # similar output
```

Output:
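The joblib round trip can be sketched the same way. The example below writes to a temporary directory so it leaves no file behind; the file name `model_lr.joblib` and the toy data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature dataset.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array(["Left", "Left", "Stay", "Stay"])

model = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_lr.joblib")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back

# Compare predictions of the original and the reloaded model.
same = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same)
```

Unlike `pickle.dumps`, which keeps the model in memory as bytes, `joblib.dump` writes straight to a file, which is convenient for larger models.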

Okay, so that’s a wrap from my side!

#### Endnotes

Thank you for reading my article 🙂

I hope you guys will like this step-by-step learning of Employee Attrition prediction using machine learning.

Here you can access my other articles which are published on Analytics Vidhya -Blogathon (link)

If you got stuck somewhere you can connect with me on LinkedIn, refer to this link  