Binary Cross Entropy aka Log LossThe cost function used in Logistic Regression
This article was published as a part of the Data Science Blogathon.
Overview

Challenges if we use the Linear Regression model to solve a classification problem.

Why is MSE not used as a cost function in Logistic Regression?

This article will cover the mathematics behind the Log Loss function with a simple example.
Prerequisites for this article:

Linear Regression

Logistic Regression

Gradient Descent
INTRODUCTION
`Winter is here`. Let’s welcome winters with a warm data science problem 😉
Let’s take a case study of a clothing company that manufactures jackets and cardigans. They want to have a model that can predict whether the customer will buy a jacket (class 1) or a cardigan(class 0) from their historical behavioral pattern so that they can give specific offers according to the customer’s needs. As a data scientist, you need to help them to build a predictive model.
When we start Machine Learning algorithms, the first algorithm we learn about is `Linear Regression` in which we predict a continuous target variable.
If we use Linear Regression in our classification problem, we will get a bestfit line like this:
Z = ßX + b
Problem with the linear line:
When you extend this line, you will have values greater than 1 and less than 0, which do not make much sense in our classification problem. It will make a model interpretation a challenge. That is where `Logistic Regression` comes in. If we needed to predict sales for an outlet, then this model could be helpful. But here we need to classify customers.
We need a function to transform this straight line in such a way that values will be between 0 and 1:
Ŷ = Q (Z)
Q (Z) =1/1+ e^{z } (Sigmoid Function)
Ŷ =1/1+ e^{z}
After transformation, we will get a line that remains between 0 and 1. Another advantage of this function is all the continuous values we will get will be between 0 and 1 which we can use as a probability for making predictions. For example, if the predicted value is on the extreme right, the probability will be close to 1 and if the predicted value is on the extreme left, the probability will be close to 0.
Selecting the right model is not enough. You need a function that measures the performance of a Machine Learning model for given data. Cost Function quantifies the error between predicted values and expected values.
`If you can’t measure it, you can’t improve it.`
Another thing that will change with this transformation is Cost Function. In Linear Regression, we use `Mean Squared Error` for cost function given by:
and when this error function is plotted with respect to weight parameters of the Linear Regression Model, it forms a convex curve which makes it eligible to apply Gradient Descent Optimization Algorithm to minimize the error by finding global minima and adjust weights.
Why don’t we use `Mean Squared Error as a cost function in Logistic Regression?
In Logistic Regression Ŷi is a nonlinear function(Ŷ=1/1+ e^{z}), if we put this in the above MSE equation it will give a nonconvex function as shown:

When we try to optimize values using gradient descent it will create complications to find global minima.

Another reason is in classification problems, we have target values like 0/1, So (ŶY)^{2} will always be in between 01 which can make it very difficult to keep track of the errors and it is difficult to store high precision floating numbers.
The cost function used in Logistic Regression is Log Loss.
What is Log Loss?
Log Loss is the most important classification metric based on probabilities. It’s hard to interpret raw logloss values, but logloss is still a good metric for comparing models. For any given problem, a lower log loss value means better predictions.
Mathematical interpretation:
Log Loss is the negative average of the log of corrected predicted probabilities for each instance.
Let us understand it with an example:
The model is giving predicted probabilities as shown above.
What are the corrected probabilities?
> By default, the output of the logistics regression model is the probability of the sample being positive(indicated by 1) i.e if a logistic regression model is trained to classify on a `company dataset` then the predicted probability column says What is the probability that the person has bought jacket. Here in the above data set the probability that a person with ID6 will buy a jacket is 0.94.
In the same way, the probability that a person with ID5 will buy a jacket (i.e. belong to class 1) is 0.1 but the actual class for ID5 is 0, so the probability for the class is (10.1)=0.9. 0.9 is the correct probability for ID5.
We will find a log of corrected probabilities for each instance.
As you can see these log values are negative. To deal with the negative sign, we take the negative average of these values, to maintain a common convention that lower loss scores are better.
In short, there are three steps to find Log Loss:

To find corrected probabilities.

Take a log of corrected probabilities.

Take the negative average of the values we get in the 2nd step.
If we summarize all the above steps, we can use the formula:
Here Yi represents the actual class and log(p(yi)is the probability of that class.

p(yi) is the probability of 1.

1p(yi) is the probability of 0.
Now Let’s see how the above formula is working in two cases:

When the actual class is 1: second term in the formula would be 0 and we will left with first term i.e. yi.log(p(yi)) and (11).log(1p(yi) this will be 0.

When the actual class is 0: Firstterm would be 0 and will be left with the second term i.e (1yi).log(1p(yi)) and 0.log(p(yi)) will be 0.
wow!! we got back to the original formula for binary crossentropy/log loss 🙂
The benefits of taking logarithm reveal themselves when you look at the cost function graphs for actual class 1 and 0 :

The Red line represents 1 class. As we can see, when the predicted probability (xaxis) is close to 1, the loss is less and when the predicted probability is close to 0, loss approaches infinity.

The Black line represents 0 class. As we can see, when the predicted probability (xaxis) is close to 0, the loss is less and when the predicted probability is close to 1, loss approaches infinity.
ENDNOTES
From this article, I truly hope you
Get the intuition behind the `Log Loss` function.
Know the reasons why we are using `Log Loss` in Logistic Regression instead of MSE.
Happy Learning 🙂
3 thoughts on "Binary Cross Entropy aka Log LossThe cost function used in Logistic Regression"
Rushil Nandan Dubey says: November 10, 2020 at 2:08 pm
Worth reading and great content.Aashish says: November 10, 2020 at 5:06 pm
Very well written blog. Short, crisp and equally insightfulDeepika bhalla says: November 10, 2020 at 8:17 pm
That was thoughtful and nicely explained .