In this blog, we’ll go over everything you need to know about Logistic Regression to get started and build a model in Python. If you’re new to machine learning and have never built a model before, don’t worry; after reading this, I’m confident you’ll be able to do so.

For those who are new to this, let’s start with a basic understanding of machine learning before moving on to Logistic Regression.

**This article was published as a part of the Data Science Blogathon.**

- What is Machine Learning?
- Supervised Learning
- What is Logistic Regression?
- Logistic/Sigmoid Function
- Step by step Implementation of Logistic Regression Model in Python
- Importing Libraries
- Loading Dataset
- Understanding the Data for Logistic Regression
- Data Cleaning
- Exploratory Data Analysis before creating a Logistic Regression Model
- Feature Engineering
- Feature Scaling/Normalization
- Class Imbalance
- Build and Train Logistic Regression model in Python
- Model Evaluation Metrics
- Hyperparameter Optimization for the Logistic Regression Model
- Build Model using Optimal Values of Hyperparameters
- Model Evaluation

- Frequently Asked Questions

In simple terms, a machine learning model uses algorithms that learn from data, much as humans learn from their experiences. Machine learning allows computers to find hidden insights without being explicitly programmed.

Based on the output type and task done, machine learning models are classified into the following types

Logistic Regression falls under the Supervised Learning type. Let’s learn more about it.

It’s a type of Machine Learning that uses labelled data from the past. Models are trained using already labelled samples.

**Example**: You have past data from the football Premier League, and based on that data and previous match results, you predict which team will win the next game.

Supervised learning is further divided into two types:

- **Regression**: target/output variable is continuous.
- **Classification**: target/output variable is categorical.

Logistic Regression is a Classification model. It helps to make predictions where the output variable is categorical. With this let’s understand Logistic Regression in detail.

As previously stated, Logistic Regression is used to solve classification problems. **Models are trained on historical labelled datasets** and aim to predict which category new observations will belong to.

Below are a few examples of binary classification problems that can be solved using logistic regression:

- The probability of a political candidate winning or losing the next election.
- Whether a machine in manufacturing will stop running in a few days or not.
- Filtering email as spam or not spam.

Logistic regression is well suited when we need to predict a binary answer (only 2 possible values like yes or no).

The term logistic regression comes from the “**Logistic Function**”, which is also known as the “**Sigmoid Function**”. Let us learn more about it.

The sigmoid function, also known as the logistic function, predicts the likelihood of a binary outcome occurring. **The function takes any real value and converts it to a number between 0 and 1**: f(x) = 1 / (1 + e^(-x)). The sigmoid function is also used as an activation function to introduce non-linearity into a machine learning model.

When we plot the above equation, we get an **S-shaped** curve like below.

The key point from the above graph is that no matter what value of x we use in the logistic or sigmoid function, the **output** along the vertical axis will always be **between 0 and 1**.

When the result of the sigmoid function is greater than 0.5, we classify the label as class 1 or positive class; if it’s less than 0.5, we can classify it as a negative class or 0.
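As a minimal sketch of the idea above, the sigmoid and the 0.5 threshold can be written in a few lines of plain Python. The function names `sigmoid` and `classify` are illustrative and not part of the original article, and treating an output of exactly 0.5 as class 0 is an assumption, since the text only covers the greater-than and less-than cases.

```python
import math

def sigmoid(x):
    """Map any real number to a value between 0 and 1."""
    # Use a numerically stable form so exp() never overflows for large |x|
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def classify(x, threshold=0.5):
    """Return 1 (positive class) if sigmoid(x) is above the threshold, else 0."""
    # Assumption: a value exactly equal to the threshold is treated as class 0
    return 1 if sigmoid(x) > threshold else 0

for value in (-5, 0, 5):
    print(value, round(sigmoid(value), 4), classify(value))
```

Large negative inputs give values close to 0, large positive inputs give values close to 1, and an input of 0 gives exactly 0.5.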

In Logistic Regression, iterative optimization algorithms like Gradient Descent or probabilistic methods like Maximum Likelihood are used to get the “best fit” S curve.

Logistic regression is derived from Linear regression by passing its output value to the sigmoid function. The equation for Linear Regression is:

y = mx + c

In Linear Regression we try to find the best-fit line by changing the m and c values from the above equation, and y (output) can take any value from -infinity to +infinity. But Logistic regression predicts the probability of an outcome, which must be between 0 and 1. So, to convert those values to the range 0 to 1, we use the sigmoid function.
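To make the roles of m, c, and the sigmoid concrete, here is a small sketch that fits p = sigmoid(m*x + c) to toy data with batch gradient descent on the average log loss. The function name `fit_logistic`, the learning rate, the number of epochs, and the toy data are all illustrative assumptions; scikit-learn uses its own solvers, so this is not how the library fits the model internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p = sigmoid(m*x + c) to binary labels with batch gradient descent."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Predicted probabilities for the current m and c
        preds = [sigmoid(m * x + c) for x in xs]
        # Gradients of the average log loss with respect to m and c
        grad_m = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_c = sum(p - y for p, y in zip(preds, ys)) / n
        # Step against the gradient to reduce the loss
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

# Toy data: small x values belong to class 0, large x values to class 1
xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 1, 1, 1]

m, c = fit_logistic(xs, ys)
print("m =", round(m, 3), "c =", round(c, 3))
print("P(y=1 | x=2)  =", round(sigmoid(m * 2 + c), 3))
print("P(y=1 | x=-2) =", round(sigmoid(m * -2 + c), 3))
```

After training, the predicted probability is high for large x and low for small x, which is the S curve described earlier.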

After getting our output value, we need to see how well our model works. For that, we calculate the loss function, which tells us how much our predicted output differs from the actual output. A good model should have a low loss value. Let’s see how to calculate the loss function.

When y=1, the predicted y value should be close to 1 to reduce the loss. Now let’s see what happens when our actual output value is 0.

When y=0, the predicted y value should be close to 0 to reduce the loss.
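The behaviour described above can be checked numerically. The sketch below assumes the standard log loss (binary cross-entropy) for a single observation, which is the usual loss for logistic regression and matches the two cases described here. The function name `log_loss_single` and the clipping constant `eps` are illustrative choices.

```python
import math

def log_loss_single(y_true, y_pred, eps=1e-15):
    """Binary cross-entropy (log loss) for one observation."""
    # Clip the prediction so that log(0) is never evaluated
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# Actual label is 1: the loss is small when the prediction is close to 1
print(round(log_loss_single(1, 0.9), 4))  # 0.1054
print(round(log_loss_single(1, 0.1), 4))  # 2.3026

# Actual label is 0: the loss is small when the prediction is close to 0
print(round(log_loss_single(0, 0.1), 4))  # 0.1054
print(round(log_loss_single(0, 0.9), 4))  # 2.3026
```

A confident wrong prediction is penalized far more than a confident correct one, which is what pushes the model towards well-calibrated probabilities.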

Let’s move on to the implementation of the Logistic Regression model now that we’ve covered the basics.

Based on parameters in the dataset, we will build a **Logistic Regression model** in Python to **predict whether an employee will be promoted or not**.

For everyone, promotion or appraisal cycles are the most exciting times of the year. Final promotions are only disclosed after employees have been evaluated on a variety of criteria, which causes a delay in transitioning to new responsibilities. We will build a Machine Learning model to predict who is qualified for promotion to speed up the process.

You can get more understanding of the problem statement and **download the dataset from Supervised learning.**

We’ll begin by loading the necessary libraries for creating a Logistic Regression model.

```
import numpy as np
import pandas as pd
#Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#We will use sklearn for building logistic regression model
from sklearn.linear_model import LogisticRegression
```

We’ll use the HR Analytics dataset from the link above. We’ll start by loading the dataset from the downloaded CSV file with the code below.

**Python Code:**
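The loading step might look like the sketch below. The file name `train.csv` and the helper `load_dataset` are assumptions made for illustration; use the actual path of the CSV you downloaded. The later snippets in this article expect the resulting DataFrame to be called `data`.

```python
import os
import pandas as pd

# Hypothetical file name: replace with the path of the CSV you downloaded
DATA_PATH = "train.csv"

def load_dataset(csv_path):
    """Read the HR Analytics CSV file into a pandas DataFrame."""
    return pd.read_csv(csv_path)

if os.path.exists(DATA_PATH):
    data = load_dataset(DATA_PATH)
    # Show the first few rows to confirm that the file loaded correctly
    print(data.head())
else:
    print(f"{DATA_PATH} not found; download the dataset first")
```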

It’s always a good idea to learn more about data after loading it, such as the shape of the data and statistical information about the columns in a dataset. We can achieve all of this with the code below :

```
#shape of dataset
print("shape of dataframe is : ", data.shape)
# summary of data
data.info()
#Get Statistical details of data
data.describe()
```

There are a total of 14 variables in this dataset, with a total of 54808 observations. “**is_promoted**” is our **Target Variable**, which has two categories encoded as 1 (promoted) and 0 (not promoted); all the remaining columns are input features. In addition, we can observe that our dataset contains both numerical and categorical features.

Data cleaning is a crucial stage in the data preprocessing process. We’ll **remove columns with only one unique value** because their variance will be 0 and they won’t help us predict anything.

Let’s see whether there are any columns that only have one unique value.

```
#Checking the unique value counts in columns
featureValues={}
for d in data.columns.tolist():
    count=data[d].nunique()
    if count==1:
        featureValues[d]=count
# List of columns having same 1 unique value
cols_to_drop= list(featureValues.keys())
print("Columns having 1 unique value are :\n",cols_to_drop)
```

This signifies that there isn’t any column having only 1 unique value.

We’ll now **drop the employee_id column because it’s merely a unique identifier**, and then check each field in the dataset for its percentage of null values.

```
#Drop employee_id column as it is just a unique id
data.drop("employee_id",inplace=True,axis=1)
#Checking null percentage
data.isnull().mean()*100
```

Both **previous_year_rating** and **education** have null values. We will impute those null values instead of dropping the rows. After examining those columns, we found that:

- For rows with a null **previous_year_rating**, we can see that their length of service is 1, which could be why they don’t have a previous year rating. As a result, **we’ll use 0 to impute null values**.
- For the **education** column, we will **impute null values with the mode**.

```
#fill missing value
data["previous_year_rating"]= data["previous_year_rating"].fillna(0)
#change type to int
data["previous_year_rating"]= data["previous_year_rating"].astype("int")
#Find out mode value for education
data["education"].mode()
#fill missing value with mode
data["education"]= data["education"].fillna("Bachelor's")
```

**Now, we do not have any null or missing values** in our data. So, let’s proceed to our next step.

Getting insights from data and visualizing them is an important stage in machine learning since it provides us with a better view of features and their relationships.

Let’s look at the target variable’s distribution in the dataset.

```
# charts for distribution of target variable
fig= plt.figure(figsize=(10,3) )
fig.add_subplot(1,2,1)
a= data["is_promoted"].value_counts(normalize=True).plot.pie()
fig.add_subplot(1,2,2)
churnchart=sns.countplot(x=data["is_promoted"])
plt.tight_layout()
plt.show()
```

We can observe from the above charts that there are fewer promoted employees than non-promoted employees, indicating that **there is a class imbalance**: class 0 has more data points or observations than class 1.

Let’s visualize if there is any relationship between the target variable and other variables.

```
# Visualize relationship between promoted and other features
fig= plt.figure(figsize=(10,5) )
fig.add_subplot(1,3,1)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["length_of_service"])
fig.add_subplot(1,3,2)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["avg_training_score"])
fig.add_subplot(1,3,3)
ar_6=sns.boxplot(x=data["is_promoted"],y=data["previous_year_rating"])
plt.tight_layout()
plt.show()
```

If an employee’s avg_training_score is higher, their chances of getting promoted are higher.

We will plot correlations between different variables using a heatmap.

```
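#correlation between features
#numeric_only=True (pandas 1.5+) skips the categorical columns, which cannot be correlated
corr_plot = sns.heatmap(data.corr(numeric_only=True),annot = True,linewidths=3 )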
plt.title("Correlation plot")
plt.show()
```

No pair of features is highly correlated, except age and length of service.

In feature engineering, we apply domain expertise to produce new features from raw data, or we convert or encode features. We’ll encode categorical features or **make dummy features** out of them in this section.

```
#Converting Categorical columns into one hot encoding
data["gender"]=data["gender"].apply(lambda x: 1 if x=="m" else 0)
#list of columns
cols = data.select_dtypes(["object"]).columns
#Create dummy variables
ds=pd.get_dummies(data[cols],drop_first=True)
ds
#concat newly created columns with original dataframe
data=pd.concat([data,ds],axis=1)
#Drop original columns
data.drop(cols,axis=1,inplace=True)
```

We will divide the dataset into two subsets: train and test. To perform the train-test split, we’ll use the Scikit-learn machine learning library.

- **Train subset**: we will use this subset to fit/train the model
- **Test subset**: we will use this subset to evaluate our model

```
from sklearn.model_selection import train_test_split
#split data into independent variables (X) and the dependent variable (y) that we would predict
y = data.pop("is_promoted")
X = data
#Let’s split X and y using Train test split
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=42,train_size=0.8)
#get shape of train and test data
print("train size X : ",X_train.shape)
print("train size y : ",y_train.shape)
print("test size X : ",X_test.shape)
print("test size y : ",y_test.shape)
```

After splitting the dataset, we have 43846 observations in the training subset and 10962 in the test subset.

After dividing the dataset, let’s move on to the next phase: feature scaling.

**Why is feature scaling important?**

As previously stated, Logistic Regression uses Gradient Descent as one of the approaches for obtaining the best result, and feature scaling helps to **speed up** **the Gradient Descent** **convergence process**. When we have features that vary greatly in magnitude, the algorithm assumes that features with a large magnitude are more relevant than those with a small magnitude. As a result, when we train the model, those characteristics become more important.

To put all features into the same range, regardless of their relevance, we need feature scaling.

**Feature Scaling Techniques**

We bring all the features into the same range using feature scaling. There are many ways to do feature scaling like normalization, standardization, robust scaling, min-max scaling, etc. But here we will discuss the Standardization technique that we are going to apply to our features.

In standardization, we scale features to have a mean of 0 and a standard deviation of 1. It does not scale to a preset range. Features are scaled using the formula below:

z = (x – u) / s

where **u** **is the mean** of the training samples and **s** **is a standard deviation** of the training samples.
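As a quick check of the formula, the sketch below standardizes a small list by hand using only the standard library. The helper name `standardize` and the sample values are illustrative and not taken from the dataset. It uses the population standard deviation, which is also what scikit-learn's `StandardScaler` uses.

```python
import statistics

def standardize(values):
    """Apply z = (x - u) / s with the mean and population standard deviation."""
    u = statistics.fmean(values)
    s = statistics.pstdev(values)
    return [(x - u) / s for x in values]

# Illustrative values, not taken from the HR dataset
ages = [20, 30, 40, 50, 60]
z = standardize(ages)

print([round(v, 3) for v in z])
# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(abs(statistics.fmean(z)) < 1e-9, abs(statistics.pstdev(z) - 1) < 1e-9)
```

In a real pipeline the mean and standard deviation should come from the training subset only and then be reused on the test subset.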

Let’s see how to do feature scaling in python using Scikit-learn.

```
#Feature scaling
from sklearn.preprocessing import StandardScaler
scale=StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)
```

**What is class imbalance?**

When a dataset exhibits a class imbalance problem, it means one category has more data points than another. This imbalance skews the distribution of class labels. Let’s examine if our dataset faces a class imbalance issue.

```
#check for distribution of labels
y_train.value_counts(normalize=True)
```

We can observe that the majority of the labels are from class 0 and only a few are from class 1.

If **we use this distribution** to develop our model, it may become biased towards predicting the majority class, since there will be insufficient data to learn minority class patterns. **The model may start predicting every new observation as 0**, the **majority class** (in our problem, the employee is not promoted). We’ll get high accuracy here, but it won’t be a good model because it **won’t predict** class 1, the **minority class**, which is the crucial class.

As a result, we must consider class imbalance when developing our Logistic Regression model.

**How to Handle Class Imbalance?**

There are a variety of approaches to dealing with class imbalance, such as **increasing minority class samples** or **decreasing majority class samples** to ensure that both classes have the same distribution.

Because we’re using the Scikit-learn machine learning library to create the model, we can rely on its **logistic regression implementation, which supports class weighting**. We will use the built-in parameter “**class_weight**” while creating an instance of the Logistic Regression model.

Both the majority and minority classes will be given separate weights. During the training phase, the weight differences will influence the classification of the classes.

The purpose of adding class weights is to penalize misclassification of the minority class more heavily by giving it a higher class weight, while decreasing the weight for the majority class.
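This article uses hand-picked weights, but one common heuristic for a starting point is to weight each class inversely to its frequency: n_samples / (n_classes * class_count). This is the formula scikit-learn documents for `class_weight="balanced"`. The sketch below computes it in plain Python; the helper name `balanced_class_weights` and the label counts are illustrative assumptions, not values from the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels: 90 not promoted (0) and 10 promoted (1)
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rarer class receives the larger weight, so each misclassified minority sample contributes more to the loss during training.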

To implement Logistic Regression, we will use the Scikit-learn library. We’ll start by building a base model with default parameters, then look at how to improve it with Hyperparameter Tuning.

As previously stated, we will use the **“class_weight” parameter to address the problem of class imbalance**. Let’s start by creating our base model with the code below.

```
#import library
from sklearn.linear_model import LogisticRegression
#make instance of model with default parameters except class weight
#as we will add class weights due to class imbalance problem
lr_basemodel =LogisticRegression(class_weight={0:0.1,1:0.9})
# train model to learn relationships between input and output variables
lr_basemodel.fit(X_train,y_train)
# predict labels for the test dataset (used below to compute the f1 score)
y_pred_basemodel = lr_basemodel.predict(X_test)
```

After training our model on the training dataset, we used our model to predict values for the test dataset and recorded them in the y_pred_basemodel variable.

Let’s look at which metrics to use and how to evaluate our base model.

To evaluate the performance of our model, **we will use the “f1 score” because this is a class imbalance problem**. Accuracy is not a good performance metric here, and the f1 score is a common go-to metric when we have a class imbalance problem. The formula for calculating the F1 score is as follows:

F1 Score = 2*(Recall * Precision) / (Recall + Precision)

**Precision** is the ratio of accurately predicted positive observations to the total predicted positive observations.

Precision = TP / (TP + FP)

**Recall** is the ratio of accurately predicted positive observations to all observations in the actual positive class.

Recall = TP / (TP + FN)

**F1 Score** is the harmonic mean of Precision and Recall. Therefore, this score **takes both false positives and false negatives into account.**
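To see how the three formulas fit together, here is a small sketch that computes them from raw counts. The helper name `precision_recall_f1` and the counts are illustrative and are not results from the model in this article.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    # Guard against division by zero when there are no predicted or actual positives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Illustrative counts: 40 true positives, 60 false positives, 10 false negatives
p, r, f1 = precision_recall_f1(tp=40, fp=60, fn=10)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.4 0.8 0.533
```

Here recall is high but precision is low, and the F1 score lands closer to the lower of the two values.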

Let’s evaluate our base model using the f1 score.

```
from sklearn.metrics import f1_score
print("f1 score for base model is : " , f1_score(y_test,y_pred_basemodel))
```

We got a 0.37 f1 score on our base model created using default parameters.

Up to this point, we saw how to create a logistic regression model using default parameters.

Now let’s increase model performance and evaluate it again after tuning hyperparameters of the model.

The model learns parameters like weight and bias from data, while hyperparameters dictate the model’s structure. Hyperparameter tuning, the process of optimizing fit or architecture, controls overfitting or underfitting. Algorithms like Grid Search or Random Search are employed for hyperparameter tuning.

We will use Grid Search which is the most basic method of searching optimal values for hyperparameters. To tune hyperparameters, follow the steps below:

- Create a model instance of the Logistic Regression class
- Specify hyperparameters with all possible values
- Define performance evaluation metrics
- Apply cross-validation
- Train the model using the training dataset
- Determine the best values for the hyperparameters given.

We can use the below code to implement hyperparameter tuning in python using the Grid Search method.

```
#Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# define model/create instance
lr=LogisticRegression()
#tuning weight for majority class (0); weight for minority class (1) will be 1 - that weight
#Setting the range for class weights
weights = np.linspace(0.0,0.99,500)
#specifying all hyperparameters with possible values
#note: 'l1' needs a solver that supports it (e.g. liblinear or saga); with the default
#lbfgs solver those candidates are expected to fail to fit and be scored as NaN
param= {'C': [0.1, 0.5, 1,10,15,20], 'penalty': ['l1', 'l2'],"class_weight":[{0:x ,1:1.0 -x} for x in weights]}
# create 5 folds
folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)
#Gridsearch for hyperparam tuning
model= GridSearchCV(estimator= lr,param_grid=param,scoring="f1",cv=folds,return_train_score=True)
#train model to learn relationships between x and y
model.fit(X_train,y_train)
```

After fitting the model, we will extract the best fit values for all specified hyperparameters.

```
# print best hyperparameters
print("Best F1 score: ", model.best_score_)
print("Best hyperparameters: ", model.best_params_)
```

We will now build our Logistic Regression model using the above values we got by tuning Hyperparameters.

Let’s use the below code to build our model again.

```
#Building Model again with best params
lr2=LogisticRegression(class_weight={0:0.27,1:0.73},C=20,penalty="l2")
lr2.fit(X_train,y_train)
```

After training our final model it’s time to evaluate our Logistic Regression model using chosen metrics.

We will evaluate our model on Test Dataset. First, we will predict values on the Test dataset.

**We chose “f1 score” as our performance metric** above, but let’s look at the scores for all of the metrics, including the confusion matrix, precision, recall, ROC-AUC score, and ultimately f1 score, for learning purposes.

Then, we’ll compare our final model’s f1 score to our base model to see if it’s improved.

We’ll use the code below to calculate the score for various metrics:

```
# import the evaluation metrics used below
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_score, recall_score, f1_score
# predict probabilities on Test and take probability for class 1 ([:, 1])
y_pred_prob_test = lr2.predict_proba(X_test)[:, 1]
#predict labels on test dataset
y_pred_test = lr2.predict(X_test)
# create confusion matrix
cm = confusion_matrix(y_test, y_pred_test)
print("confusion Matrix is :\n\n",cm)
print("\n")
# ROC- AUC score
print("ROC-AUC score test dataset: \t", roc_auc_score(y_test,y_pred_prob_test))
#Precision score
print("precision score test dataset: \t", precision_score(y_test,y_pred_test))
#Recall Score
print("Recall score test dataset: \t", recall_score(y_test,y_pred_test))
#f1 score
print("f1 score test dataset : \t", f1_score(y_test,y_pred_test))
```

We can see that by tuning hyperparameters, we were able to improve the performance of our model, since the F1 score for the final model (0.43) is higher than that of the base model (0.37). After hyperparameter tuning, the model got a ROC-AUC score of 0.88.

With this, we were able to construct our logistic regression model and test it on the Test dataset. More feature engineering, hyperparameter optimization, and cross-validation techniques can improve its performance even more.

We began our learning journey by understanding the basics of machine learning and logistic regression. Then we moved on to the implementation of a Logistic Regression model in Python. We learned key steps in Building a Logistic Regression model like Data cleaning, EDA, Feature engineering, feature scaling, handling class imbalance problems, training, prediction, and evaluation of model on the test dataset. Apart from that, we learned how to use Hyperparameter Tuning to improve the performance of our model and avoid overfitting and underfitting.

I hope you find this information useful and will try it out.

Connect with me on LinkedIn.

A. Logistic regression involves defining a logistic function to model the probability of an event occurring. Parameters are estimated using maximum likelihood estimation, and predictions are based on the probability threshold.

A. The first step in logistic regression is collecting and preparing the data. This includes defining the dependent variable, selecting independent variables, handling missing data, and scaling features.

A. Logistic regression learns by iteratively updating parameters to maximize the likelihood of observed outcomes. The process involves calculating probabilities, comparing predictions to actual outcomes, and adjusting parameters using optimization algorithms like gradient descent.

A. Stepwise logistic regression is a feature selection method where variables are added or removed stepwise based on statistical criteria. It helps identify the most relevant predictors for the logistic regression model.

**The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.**


Sorry, this code has a problem, maybe there is some missing code? The line `print("f1 score for base model is : " , f1_score(y_test,y_pred_basemodel))` raises `NameError: name 'y_pred_basemodel' is not defined`. Otherwise a nice presentation, thank you!

Maybe some code is missing? The line `folds = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 42)` raises `NameError: name 'StratifiedKFold' is not defined`.