Aman Preet Gulati — Updated On November 17th, 2021
Beginner Guide Machine Learning Project Python
This article was published as a part of the Data Science Blogathon


In this article, we’re going to discuss employee attrition prediction i.e. predicting that employee will leave the current company (or will resign from the current company) and we will do this using several machine learning algorithms (basically 6 ML algorithms) but this article is gonna be completely step by step explanation. So let’s get started.

Employee attrition prediction
Image Source: Google Images

Need of Employee Attrition prediction

  1. Managing workforce: If the supervisors or HR came to know about some employees that they will be planning to leave the company then they could get in touch with those employees which can help them to stay back or they can manage the workforce by hiring the new alternative of those employees.
  2. Smooth pipeline: If all the employees in the current project are working continuously on a project then the pipeline of that project will be smooth but if suppose one efficient asset of the project(employee) suddenly leave that company then the workflow will be not so smooth
  3. Hiring Management: If HR of one particular project came to know about the employee who is willing to leave the company then he/she can manage the number of hiring and they can get the valuable asset whenever they need so for the efficient flow of work.

Table of content

  1. Importing libraries
  2. Data exploration
  3. Data cleaning
  4. Splitting data (train test split)
  5. Model development applying 6-ML algorithms
    1. Logistic Regression
    2. Decision tree
    3. KNN
    4. SVM
    5. Random Forest
    6. Naive Bayes
  6. Saving model

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn import datasets
from sklearn.metrics import accuracy_score

Reading the dataset

attrdata = pd.read_csv("Table_1.csv")

Let’s look at our dataset and plan to torture it!




looking at our dataset for employee attrition prediction

Data exploration

Dropping the index column



table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       4
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           2
Stay/Left                0
dtype: int64

As we can see that there are null values in the Experience and Job role so we have to drop them.



table id                 0
name                     0
phone number             0
Location                 0
Emp. Group               0
Function                 0
Gender                   0
Tenure                   0
Tenure Grp.              0
Experience (YY.MM)       0
Marital Status           0
Age in YY.               0
Hiring Source            0
Promoted/Non Promoted    0
Job Role Match           0
Stay/Left                0
dtype: int64

The shape of our dataset



(895, 16)

Let’s explore all the categorical values and visualize them

Now, we will use the value_counts function so that we can get the unique values from every categorical type of data.


gender_dict = attrdata["Gender "].value_counts()


Male      655
Female    234
other       6
Name: Gender , dtype: int64

Understanding the balancing of the Gender column visually

attrdata['Gender '].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Count of different gender")



Understanding Gender column for employee attrition prediction

Here, from the chart, it’s visible that the count of males is more than another category of the gender.

  1. Male: 655
  2. Female: 234
  3. Other: 6

Now, let’s figure out that how gender could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

pd.crosstab(attrdata['Gender '],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Gender")
plt.ylabel("No of people who left based on gender")



Crosstab for Gender column

Here, from the chart it’s visible that it heavily depends on males, also we can see that it’s either male, female or others but more number of them are staying in the company.

Promotion (Promoted/ Non-Promoted)

promoted_dict = attrdata["Promoted/Non Promoted"].value_counts()


Promoted        457
Non Promoted    438
Name: Promoted/Non Promoted, dtype: int64
attrdata['Promoted/Non Promoted'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Promoted and Non Promoted")



Plot for Promoted/ Non-Promoted for employee attrition prediction

Now, from the above chart, we can see that when it comes to Promoted and Non-Promoted employees it’s quiet in balanced number.

Now, let’s figure out that how promotion could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

pd.crosstab(attrdata['Promoted/Non Promoted'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Promoted/Non Promoted")
plt.ylabel("No. of people who left/stay based on promotion")



Plot showing how promotion could be the reason for employees to leave the company or to stay in.

Here, from the chart, it’s visible that the ones who are not promoted are leaving the company more as compared to the ones who are promoted which is also an obvious thing likely to happen.

Function (Operation/ Support/ Sales)

func_dict = attrdata["Function"].value_counts()


Operation    831
Support       52
Sales         12
Name: Function, dtype: int64
attrdata['Function'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Functions in organization")



Now, we can see that majority of the function performed by employees are Operation itself then support and at the last it’s sales.

Now, let’s figure out that how function could be the reason for employees to leave the company or to stay in.

#Create a plot for crosstab

plt.title("Stay/Left vs Function")
plt.ylabel("No. of people who left/stay based on function of organization")



Plot showing No. of people who have left/stayed based on function

Here, in the chart, we can see that the maximum number of employees are in the operation section and a high number of employees in the same section are staying in the company.

Hiring (Direct/ Agency/ Employee referral)

Hiring_dict = attrdata["Hiring Source"].value_counts()


Direct               708
Agency               116
Employee Referral     71
Name: Hiring Source, dtype: int64

Marital Status (Single/ Married/ Seperated/ Div./ NTBD)

Marital_dict = attrdata["Marital Status"].value_counts()


Single    533
Marr.     356
Sep.        2
Div.        2
NTBD        2
Name: Marital Status, dtype: int64

Employee group

Emp_dict = attrdata["Emp. Group"].value_counts()
Emp_dict['other group'] = 1


B1             537
B2             275
B3              59
B0               8
B4               7
B5               4
B7               2
D2               1
C3               1
B6               1
other group      1
Name: Emp. Group, dtype: int64

Job role match (Yes/ No)

job_dict = attrdata["Job Role Match"].value_counts()


Yes    480
No     415
Name: Job Role Match, dtype: int64
attrdata['Job Role Match'].value_counts().plot(kind='bar',color=['salmon','lightblue'],title="Job Role Match")



Job role plot for employee attrition prediction

Now, we can see that majority of the employees have their correct role in Job.

#Create a plot for crosstab

pd.crosstab(attrdata['Job Role Match'],attrdata['Stay/Left']).plot(kind="bar",figsize=(10,6))
plt.title("Stay/Left vs Job Role Match")



employee stayed/left based on Job role match for employee attrition prediction

Here, in the above chart, we can see that the number of employees who got the correct job role is staying in the company rather than the ones who don’t have their right job role.

Tenure group

tenure_dict = attrdata["Tenure Grp."].value_counts()


> 1 & < =3    626
< =1          269
Name: Tenure Grp., dtype: int64

Now let’s visualize some continuous data

# Its Age vs stay/left
sns.jointplot(x='Stay/Left',y='Age in YY.',data=attrdata)



Visualizing Age vs stay/left

In the above graph, we can see that the ones who are having more age are staying back in the company rather than the ones who have comparatively less age.

sns.jointplot(x='Stay/Left',y='Experience (YY.MM)',data=attrdata)



Visualizing Experience vs Stay/Left for employee attrition prediction

Here in the above graph, we can see that the employees who have got more experience will be staying back in the company rather than the ones who have comparatively less experience.

Here, first, we are trying to get the correlation between variables where the dataset is not processed that’s why we are not able to see the results in the manner we want to, but in the latter half of the article, we will see the better correlation plot with the help of processed data.

# Let's make our correlation matrix visual



Correltion Matrix plot for employee attrition prediction

Data cleaning

Encoding the locations column (categorized)

Build a new dictionary (location) to be used to categorize data columns after values are encoded. Here, in location_dict_new we are using integer values instead of the actual region name so that our machine learning model could interpret it.

location_dict = attrdata["Location"].value_counts()

location_dict_new = {
    'Chennai':       7,
    'Noida':         6,
    'Bangalore':     5,
    'Hyderabad':     4,
    'Pune':          3,
    'Madurai':       2,
    'Lucknow':       1,
    'other place':   0,



Chennai       255
Noida         236
Bangalore     210
Hyderabad      62
Pune           55
Madurai        29
Lucknow        20
Nagpur         14
Vijayawada      6
Mumbai          4
Gurgaon         3
Kolkata         1
Name: Location, dtype: int64
{'Chennai': 7, 'Noida': 6, 'Bangalore': 5, 'Hyderabad': 4, 'Pune': 3, 'Madurai': 2, 'Lucknow': 1, 'other place': 0}

Now we will make a function for the location column to make a new column where encoded location values will be there because our machine learning algorithm will only understand int/float values.

def location(x):
    if str(x) in location_dict_new.keys():
        return location_dict_new[str(x)]
        return location_dict_new['other place']
data_l = attrdata["Location"].apply(location)
attrdata['New Location'] = data_l



Adding a new location column
A new location column has been added


Pandas get_dummies() function is used for manipulating data, this function is used to convert the categorical values to dummy variables and the same thing has been done with:

  1. Function
  2. Hiring Source
  3. New Marital
  4. New Gender
  5. Tenure group
gen = pd.get_dummies(attrdata["Function"])



manipulating data
hr = pd.get_dummies(attrdata["Hiring Source"])



manipulating data output

Marital Status

Here, in Mar() function we are using Maritial dictionary keys to convert those categorical values into acceptable type values for our ML models.

def Mar(x):
    if str(x) in Marital_dict.keys() and Marital_dict[str(x)] > 100:
        return str(x)
        return 'other status'
data_l = attrdata["Marital Status"].apply(Mar)
attrdata['New Marital'] = data_l



Output after manipulating data

Using the get_dummies to function for New Marital we are converting categorical values into dummy variables

Mr = pd.get_dummies(attrdata["New Marital"])



manipulating data

Promoted/Not Promoted

Here, with the help of Promoted function, we are converting Promoted and Non promoted values into 1 and 0 respectively for encoding purposes.

def Promoted(x):
    if x == 'Promoted':
        return int(1)
        return int(0)

data_l = attrdata["Promoted/Non Promoted"].apply(Promoted)
attrdata['New Promotion'] = data_l


Viewing first few rows of attrdata

Employee Group

Here first, we are creating a dictionary for the employee group and tagging each group to the respective integer values, later we are creating an emp() function where the encoding of the categorical values is done – similar to marital status.

Emp_dict_new = {
    'B1': 4,
    'B2': 3,
    'B3': 2,
    'other group': 1,
def emp(x):
    if str(x) in Emp_dict_new.keys():
        return str(x)
        return 'other group'
data_l = attrdata["Emp. Group"].apply(emp)
attrdata['New EMP'] = data_l

emp = pd.get_dummies(attrdata["New EMP"])



Output after Variable encoding

Job Role Match

Here, we are using the Job() function where categorical values are Yes and No which needs to be converted into integer values i.e. 1/0 then we are assigning the New Job Role Match.

def Job(x):
    if x == 'Yes':
        return int(1)
        return int(0)
data_l = attrdata["Job Role Match"].apply(Job)
attrdata['New Job Role Match'] = data_l



output after manipulating data


Here, we are using the Gen() function using gender_dict (dictionary) which will be encoded first using the dictionary keys, and then the changes will be applied to the dataset based on changes that are done.

def Gen(x):
    if x in gender_dict.keys():
        return str(x)
        return 'other'

data_l = attrdata["Gender "].apply(Gen)
attrdata['New Gender'] = data_l



Viewing data after manipulation

get_dummies() function for the same purposes for New gender and Tenure groups.

gend = pd.get_dummies(attrdata["New Gender"])



Manipulation output for employee attrition prediction
tengrp = pd.get_dummies(attrdata["Tenure Grp."])



Viewing the first few rows of tengrp

Now, let’s concatenate the columns which are being cleaned, sorted, and manipulated by us as processed data.

dataset = pd.concat([attrdata, hr, Mr, emp, tengrp, gen, gend], axis = 1)



concatenate the columns which are being cleaned, sorted, and manipulated for employee attrition prediction


Index(['table id', 'name', 'phone number', 'Location', 'Emp. Group',
       'Function', 'Gender ', 'Tenure', 'Tenure Grp.', 'Experience (YY.MM)',
       'Marital Status', 'Age in YY.', 'Hiring Source',
       'Promoted/Non Promoted', 'Job Role Match', 'Stay/Left', 'New Location',
       'New Marital', 'New Promotion', 'New EMP', 'New Job Role Match',
       'New Gender', 'Agency', 'Direct', 'Employee Referral', 'Marr.',
       'Single', 'other status', 'B1', 'B2', 'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales', 'Support', 'Female', 'Male',

Let’s drop the columns which are not important anymore

dataset.drop(["table id", "name", "Marital Status","Promoted/Non Promoted","Function","Emp. Group","Job Role Match","Location"
              ,"Hiring Source","Gender ", 'Tenure', 'New Gender', 'New Marital', 'New EMP'],axis=1,inplace=True)

dataset1 = dataset.drop(['Tenure Grp.', 'phone number'], axis = 1)


Index(['Experience (YY.MM)', 'Age in YY.', 'Stay/Left', 'New Location',
       'New Promotion', 'New Job Role Match', 'Agency', 'Direct',
       'Employee Referral', 'Marr.', 'Single', 'other status', 'B1', 'B2',
       'B3', 'other group', ' 1 & < =3', 'Operation', 'Sales',
       'Support', 'Female', 'Male', 'other'],

As I mentioned, this is the correlation plot on the processed dataset.

# Let's make our correlation matrix visual



correlation plot on the processed dataset for employee attrition prediction

Let’s see our target column

# Target 
def Target(x):
    if x in "Stay":
        return False
        return True
data_l = dataset1["Stay/Left"].apply(Target)
dataset1['Stay/Left'] = data_l


1    Stay
2    Stay
3    Stay
4    Stay
5    Stay
Name: Stay/Left, dtype: object

Saving the cleaned dataset into another CSV file

dataset1.to_csv("processed table.csv")

Now, from the processed data we have to separate the features and target column again.

dataset = pd.read_csv("processed table.csv")
dataset = pd.DataFrame(dataset)
y = dataset["Stay/Left"]
X = dataset.drop("Stay/Left",axis=1)

Splitting data – Train test split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=4)



Splitting data - Train test split | employee attrition prediction

Model Development

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

Initializing the models

  1. Logistic Regression : C: Inverse of regularization strength (float), random state: (int), solver: sag,saga,liblinear (Here, we are using liblinear).
  2. Decision trees: Default parameters
  3. Random forest: Default parameters
  4. Gaussian Naive Bayes: Default parameters
  5. K-nearest neighbors: n_neighbors=3 – we can have another number of neighbors too.
  6. Support vector machines: kernel can be linear, polynomial, RBF, sigmoid. Here we are using a linear kernel function.
lr=LogisticRegression(C = 0.1, random_state = 42, solver = 'liblinear')
knn = KNeighborsClassifier(n_neighbors=3)
svm = svm.SVC(kernel='linear')

Now, from one block of code, we will check the accuracy of all the model

for a,b in zip([lr,dt,knn,svm,rm,gnb],["Logistic Regression","Decision Tree","KNN","SVM","Random Forest","Naive Bayes"]):,y_train)
    msg1="[%s] training data accuracy is : %f" % (b,score1)
    msg2="[%s] test data accuracy is : %f" % (b,score)


[Logistic Regression] training data accuracy is : 0.891061
[Logistic Regression] test data accuracy is : 0.877095
[Decision Tree] training data accuracy is : 1.000000
[Decision Tree] test data accuracy is : 0.849162
[KNN] training data accuracy is : 0.804469
[KNN] test data accuracy is : 0.586592
[SVM] training data accuracy is : 0.878492
[SVM] test data accuracy is : 0.865922
[Random Forest] training data accuracy is : 1.000000
[Random Forest] test data accuracy is : 0.888268
[Naive Bayes] training data accuracy is : 0.870112
[Naive Bayes] test data accuracy is : 0.826816

Model Scores (accuracy)

model_scores={'Logistic Regression':lr.score(X_test,y_test),
             'KNN classifier':knn.score(X_test,y_test),
             'Support Vector Machine':svm.score(X_test,y_test),
             'Random forest':rm.score(X_test,y_test),
              'Decision tree':dt.score(X_test,y_test),
              'Naive Bayes':gnb.score(X_test,y_test)


{'Logistic Regression': 0.8770949720670391,
 'KNN classifier': 0.5865921787709497,
 'Support Vector Machine': 0.8659217877094972,
 'Random forest': 0.888268156424581,
 'Decision tree': 0.8491620111731844,
 'Naive Bayes': 0.8268156424581006}

Here, we can see that Logistic regression and Random forest have the best accuracy.

Classification Report of Random forest

from sklearn.metrics import classification_report

rm_y_preds = rm.predict(X_test)



              precision    recall  f1-score   support

        Left       0.81      0.77      0.79        61
        Stay       0.88      0.91      0.90       118

    accuracy                           0.86       179
   macro avg       0.85      0.84      0.84       179
weighted avg       0.86      0.86      0.86       179

Classification Report of Logistic Regression

from sklearn.metrics import classification_report

lr_y_preds = lr.predict(X_test)



              precision    recall  f1-score   support

        Left       0.81      0.84      0.82        61
        Stay       0.91      0.90      0.91       118

    accuracy                           0.88       179
   macro avg       0.86      0.87      0.86       179
weighted avg       0.88      0.88      0.88       179


Model Comparison

Based on the accuracy




Model Comparison for employee attrition prediction

Visualize the accuracy of each model

model_compare.T.plot(kind='bar') # (T is here for transpose)



Visualizing the accuracy of each model | employee attrition rate

Yes, we can see that Random Forest has 1% better accuracy than Logistic regression but Random Forest is an overfitted model hence we will select Logistic regression.

Feature importance

These “coef’s” tell how much and in what way did each one of them contribute to predicting the target variable

# Logistic regression

This is a type of Model-driven Exploratory data analysis.


{'Unnamed: 0': -0.000369638613255323,
 'Experience (YY.MM)': 0.13976826898148667,
 'Age in YY.': -0.01962203036690505,
 'Stay/Left': 0.024627352503955716,
 'New Location': 0.09666512057880872,
 'New Promotion': 2.7533361395664873,
 'New Job Role Match': -0.31312348489873837,
 'Agency': -0.020270885128439962,
 'Direct': 0.2047872081147744,
 'Employee Referral': 0.38796318779035893,
 'Marr.': -0.5042922677790985,
 'Single': -0.012278081923656183,
 'other status': -0.22031197156191443,
 'B1': 0.17649315194124499,
 'B2': -0.09367965851449515,
 'B3': 0.008891316222766718,
 'other group': -0.12993373252552307,
 ' 1 & < =3': -0.05949484149777713,
 'Operation': -0.018200464251483424,
 'Sales': -0.05091185616314798,
 'Support': 0.12249364800096194,
 'Female': -0.1951606022872554,
 'Male': -0.055940207626112265}

Visualize feature importance

feature_df.T.plot(kind="bar",legend=False,title="Feature Importance")



Visualizing the feature importance

As we can see that “New promotion” column has the highest feature importance.

Saving the best model

Approach -1

Logistic Regression model because it has the best accuracy as well it is neither overfitted nor under fitted

import pickle

# Save the trained model as a pickle string.
saved_model = pickle.dumps(lr)

# Load the pickled model
lr_from_pickle = pickle.loads(saved_model)

# Use the loaded pickled model to make predictions


array(['Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Left',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
       'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
       'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
       'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left',
       'Left', 'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
       'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Left',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
       'Left', 'Left', 'Left', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Stay', 'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Stay',
       'Left', 'Stay', 'Stay', 'Left', 'Left', 'Left', 'Stay', 'Stay',
       'Left', 'Stay', 'Left', 'Stay', 'Left', 'Left', 'Left', 'Stay',
       'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left', 'Stay',
       'Stay', 'Left', 'Left', 'Stay', 'Stay', 'Left', 'Stay', 'Stay',
       'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay',
       'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay', 'Stay',
       'Left', 'Stay', 'Stay', 'Stay', 'Stay', 'Left', 'Left', 'Left',
       'Left', 'Left', 'Stay', 'Stay', 'Left', 'Left', 'Stay', 'Left',
       'Stay', 'Stay', 'Stay', 'Left', 'Stay', 'Stay', 'Stay', 'Stay',
       'Stay', 'Stay', 'Stay'], dtype=object)

Approach – 2

# loading dependency
import joblib

# saving our model - model - model , filename - model_lr
joblib.dump(lr , 'model_lr')

# opening the file- model_jlib
m_jlib = joblib.load('model_lr')

# check prediction
m_jlib.predict(X_test) # similar output



employee attrition prediction array

Okay, so that’s a wrap from my side!


Thank you for reading my article 🙂

I hope you guys will like this step-by-step learning of Employee Attrition prediction using machine learning.

Here’s the repo link to this article.

Here you can access my other articles which are published on Analytics Vidhya -Blogathon (link)

If you got stuck somewhere you can connect with me on LinkedIn, refer to this link

About me

Greeting to everyone, I’m currently working in TCS and previously I worked as a Data Science Associate Analyst in Zorba Consulting India. Along with full-time work, I’ve got an immense interest in the same field i.e. Data Science along with its other subsets of Artificial Intelligence such as, Computer Vision, Machine learning, and Deep learning feel free to collaborate with me on any project on the above-mentioned domains (LinkedIn).

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

About the Author

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *