Introduction to Softmax for Neural Networks

Shipra Saxena 24 May, 2024 • 8 min read


  • The activation function is one of the building blocks of a neural network
  • Understand how the Softmax activation works in a multiclass classification problem


The activation function is an integral part of a neural network. Without an activation function, a neural network reduces to a simple linear model, no matter how many layers it has. The activation function is what introduces non-linearity into the network.
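To make the linearity claim concrete, here is a minimal sketch in plain Python. The weights `W1` and `W2` and the input `x` are made-up values, not taken from this article, and biases are left out for brevity. It shows that two stacked layers without an activation give the same result as one layer whose weight matrix is the product of the two.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Hypothetical weights for two stacked layers with no activation and no bias.
W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[2.0, 0.0], [1.0, 3.0]]
x = [3.0, -2.0]

two_layers = matvec(W2, matvec(W1, x))  # layer 2 applied to the output of layer 1
one_layer = matvec(matmul(W2, W1), x)   # a single layer with weights W2 * W1

print(two_layers, one_layer)
```

Both results agree, so the extra layer adds no expressive power until a non-linear activation is placed between the two.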


If you want to dig deeper, I recommend the following article:

Fundamentals of Deep Learning – Activation Functions and When to Use Them?

In this article, we will discuss the Softmax activation function, which is widely used for multiclass classification problems. Let’s first understand the neural network architecture for a multiclass classification problem and why other activation functions cannot be used in this case.


Suppose we have the following dataset. Every observation has five features, FeatureX1 to FeatureX5, and the target variable has three classes.

Softmax data

Now let’s create a simple neural network for this problem. The input layer has five neurons, one for each feature in the dataset. Next comes one hidden layer with four neurons. Each of these neurons combines its inputs, weights, and bias to calculate a value, represented here as Zij.

neural network softmax

For example, the first neuron of the first layer is represented as Z11. Similarly, the second neuron of the first layer is represented as Z12, and so on.

We then apply an activation function to these values, say tanh, and send the results to the output layer.
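The hidden-layer computation described above can be sketched as follows. The feature values, weights, and biases are invented purely for illustration, and `dense` is a hypothetical helper name rather than a function from any library.

```python
import math


def dense(inputs, weights, biases):
    """Compute z_j = sum_i(w_ji * x_i) + b_j for every neuron j in the layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hypothetical observation with five features (FeatureX1 to FeatureX5).
x = [0.5, -1.2, 3.0, 0.0, 1.5]

# Hypothetical parameters: four hidden neurons, each with five weights and one bias.
W1 = [
    [0.1, 0.2, -0.1, 0.4, 0.0],
    [-0.3, 0.1, 0.2, 0.0, 0.5],
    [0.0, -0.2, 0.1, 0.3, -0.1],
    [0.2, 0.0, 0.0, -0.4, 0.3],
]
b1 = [0.0, 0.1, -0.1, 0.2]

z1 = dense(x, W1, b1)            # the values Z11 to Z14
a1 = [math.tanh(z) for z in z1]  # tanh activations sent on to the output layer

print(z1)
print(a1)
```

Each entry of `a1` lies strictly between -1 and 1, which is the range of tanh.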

The number of neurons in the output layer depends on the number of classes in the dataset. Since we have three classes in the dataset we will have three neurons in the output layer. Each of these neurons will give the probability of individual classes. This means the first neuron will give you the probability that the data point belongs to class 1. Similarly, the second neuron will give you the probability that the data point belongs to class 2 and so on.

Why is softmax used in the last layer?

Here’s how the softmax function works in the last layer of a neural network:

Input: The softmax function takes a vector of real numbers (z) as input. These values are the raw scores computed by the final layer of the network, one per class, before any activation is applied.

Exponentiation: Each element of the input vector z is exponentiated using the mathematical constant e (approximately 2.718). This step ensures all the values become positive. The derivative of this step is needed during backpropagation.

Normalization: After exponentiation, all the elements are summed. This sum is the key to ensuring that the probabilities add up to 1.

Probability Calculation: Each exponentiated value is then divided by the sum from the previous step. This normalizes the values, forcing each of them to lie between 0 and 1. The cross-entropy loss function typically uses these probabilities to measure the performance of a classifier.

Output: The result is a new vector with the same size as the input vector z, in which each element is a probability between 0 and 1. The argmax function is typically used to select the index of the highest probability, which determines the predicted class.
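The steps above can be sketched in a few lines of plain Python. The function name `softmax_steps` and the input scores are illustrative assumptions, not part of any library.

```python
import math


def softmax_steps(z):
    """Return softmax probabilities for a list of real-valued scores z."""
    exps = [math.exp(v) for v in z]   # exponentiation: every value becomes positive
    total = sum(exps)                 # sum of the exponentiated values
    return [e / total for e in exps]  # division: each value now lies between 0 and 1


logits = [1.0, 2.0, 3.0]  # hypothetical scores for three classes
probs = softmax_steps(logits)

# argmax: index of the largest probability, i.e. the predicted class
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs, predicted)
```

The output vector has the same length as the input, its entries sum to 1, and the predicted index is the position of the largest input score.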

Why Not Sigmoid?

Suppose we calculate the Z values using the weights and biases of the output layer and apply the sigmoid activation function to them. We know that the sigmoid activation function returns a value between 0 and 1. Suppose these are the values we get as output.

sigmoid softmax

There are two problems in this case:

First, if we apply a threshold of, say, 0.5, this network says the input data point belongs to two classes. Second, these probability values are independent of each other. That means the probability that the data point belongs to class 1 does not take into account the probabilities of the other two classes.

This is the reason the sigmoid activation function is not preferred in multi-class classification problems.
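Both problems are easy to reproduce. In the sketch below the three output scores are made-up numbers, not the values from the figure above; they are chosen only so that two sigmoid outputs exceed 0.5.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


scores = [1.2, 0.4, -2.0]  # hypothetical output-layer values for classes 1 to 3
outputs = [sigmoid(z) for z in scores]

# Problem 1: more than one class can pass the 0.5 threshold.
above_threshold = [i + 1 for i, p in enumerate(outputs) if p > 0.5]

# Problem 2: the outputs are independent, so they need not sum to 1,
# and changing one score leaves the other outputs untouched.
total = sum(outputs)
changed = [sigmoid(z) for z in [1.2, 0.4, 5.0]]

print(outputs, above_threshold, total)
print(changed[0] == outputs[0])
```

Here classes 1 and 2 both pass the threshold, the outputs sum to more than 1, and the class 1 output stays the same when the class 3 score changes.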

Softmax Activation

Instead of using sigmoid, we will use the Softmax activation function in the output layer in the above example. The Softmax activation function calculates relative probabilities. That means it uses the values of Z21, Z22, and Z23 together to determine each final probability value.

Let’s see how the softmax activation function actually works. Similar to the sigmoid activation function, the Softmax function returns the probability of each class. Here is the equation for the Softmax activation function.

\mathrm{softmax}(Z_i) = \frac{e^{Z_i}}{\sum_{j=1}^{K} e^{Z_j}}, \quad i = 1, \dots, K \quad (K = \text{number of classes})

Here, Z represents the values from the neurons of the output layer. The exponential acts as the non-linear function. These values are then divided by the sum of the exponentials, which normalizes them and converts them into probabilities.

Note that when the number of classes is two, Softmax becomes equivalent to the sigmoid activation function: the probability of the first class is the sigmoid of the difference between the two scores. In other words, sigmoid is a special case of the Softmax function. If you want to learn more about this concept, refer to this link.
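The two-class case can be checked numerically. Dividing the numerator and denominator of e^{Z1} / (e^{Z1} + e^{Z2}) by e^{Z1} gives 1 / (1 + e^{-(Z1 - Z2)}), which is the sigmoid of Z1 - Z2. The scores below are arbitrary illustrative values.

```python
import math


def softmax(z):
    """Softmax over a list of scores."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


z1, z2 = 1.7, -0.4  # hypothetical scores for two classes

p_softmax = softmax([z1, z2])[0]  # probability of class 1 from softmax
p_sigmoid = sigmoid(z1 - z2)      # sigmoid applied to the score difference

print(p_softmax, p_sigmoid)
```

The two printed values match up to floating-point rounding.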

Let’s understand how softmax works with a simple example. We have the following neural network.

SoftMax Activation Multiclass problem

Suppose the values of Z21, Z22, and Z23 come out to be 2.33, -1.46, and 0.56, respectively. The Softmax activation function is applied to these values, and the following probabilities are generated.

Class 1: 0.838, Class 2: 0.019, Class 3: 0.143

These are the probabilities that the data point belongs to the respective classes. Note that the sum of the probabilities, in this case, is equal to 1.


In this case it is clear that the input belongs to class 1. Because every output shares the same denominator, if the score of any of these classes changes, the probability of the first class changes as well.
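The worked example can be reproduced with a short script. The scores 2.33, -1.46, and 0.56 come from the example above; the modified score 2.56 in the second call is an arbitrary value, used only to show how the outputs are coupled.

```python
import math


def softmax(z):
    """Softmax over a list of scores."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.33, -1.46, 0.56])
print([round(p, 3) for p in probs])  # [0.838, 0.019, 0.143]

# Raise only the third score: all three probabilities change,
# including the probability of class 1.
shifted = softmax([2.33, -1.46, 2.56])
print([round(p, 3) for p in shifted])
```

Only one input was changed in the second call, yet the probability of class 1 drops, because every output is divided by the same sum.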

Why is Softmax useful in CNNs?

  • Softmax allows CNNs to output a probability distribution over the possible classes. This is important because it makes the scores for different classes directly comparable and lets them be read as confidence levels.
  • Softmax works by exponentiating each number in the input vector and then dividing it by the sum of all of the exponentiated numbers. This results in a vector of probabilities that sums to 1, where each probability is between 0 and 1 and represents the probability that the input belongs to a particular class.
  • The probability distribution output by the softmax function can then be used to predict the class of an input image. For example, if the CNN is predicting whether an image contains a cat or a dog, the probability distribution indicates how likely it is that the image contains a cat and how likely it is that the image contains a dog.

When to use Softmax vs ReLU

Softmax is typically used in the last layer of a neural network to predict the class of an input image. It is also used in other applications, such as natural language processing and machine translation.

ReLU is typically used in the hidden layers of a neural network to add non-linearity. It is very efficient and can help neural networks learn more complex relationships between the input and output data.
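As a rough illustration of the two roles, the sketch below applies ReLU to a set of hidden-layer values and softmax to a set of output scores. All numbers are hypothetical and the two lists are unrelated; the point is only the different shape of the two outputs.

```python
import math


def relu(values):
    """ReLU: keep positive values, replace negative values with 0."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Softmax: turn scores into probabilities that sum to 1."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


hidden_pre = [-1.5, 0.0, 2.0]     # hypothetical hidden-layer values
hidden_act = relu(hidden_pre)     # element-wise, each value handled on its own

output_scores = [0.3, -0.8, 1.1]  # hypothetical output-layer scores
class_probs = softmax(output_scores)  # coupled across classes, sums to 1

print(hidden_act)
print(class_probs)
```

ReLU treats every neuron independently and its outputs are unbounded above, which suits hidden layers. Softmax couples all outputs into one distribution, which suits the final classification layer.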

How is Softmax used in a CNN?

Here is how Softmax is used in a CNN:

CNN Processes Image: The CNN takes an image as input and performs various convolutional and pooling operations to extract features.

Final Layer Generates Logits: After processing, the final layer of the CNN outputs a set of numbers called logits. These logits represent the raw scores or activation levels for each class the CNN can classify. There will be one logit for each class.

Softmax Takes Over: The Softmax function takes these logits as input.

Exponentiation: Softmax applies the exponential function, e^x, to each logit value. This emphasizes the differences between the logits, making the higher-scoring classes stand out more.

Normalization: Softmax then divides each exponentiated value by the sum of all the exponentiated values. This ensures the final outputs add up to 1.

Probability Distribution: The result is a vector of numbers between 0 and 1, representing probabilities. Each value corresponds to the probability of the image belonging to a specific class.

Decision and Interpretation: The class with the highest probability value is considered the predicted class by the CNN. This probability value also reflects the CNN’s confidence level in its prediction.

In practice, softmax is straightforward to implement in Python, for example with NumPy. Through exponentiation and normalization, it transforms the logits into a probability distribution. These probabilities feed the loss function, typically cross-entropy, during model training and optimization. The CNN’s ability to make precise predictions hinges on these fundamental principles.
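A minimal sketch of such an implementation follows, written with the standard `math` module so that it runs without extra dependencies. Subtracting the largest logit before exponentiating is a common numerical-stability technique that this article does not cover; it leaves the result unchanged because the factor cancels in the division. The function names, the logits, and the choice of true class are assumptions made for illustration.

```python
import math


def stable_softmax(logits):
    """Softmax that subtracts the largest logit first to avoid overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy loss for one example: -log of the true class probability."""
    return -math.log(probs[true_index])


# Hypothetical large logits: math.exp(1000.0) alone would raise OverflowError.
logits = [1000.0, 999.0, 995.0]
probs = stable_softmax(logits)

# Assume, for illustration, that class 0 is the correct label.
loss = cross_entropy(probs, true_index=0)

print(probs, loss)
```

The loss is small when the true class receives a high probability and grows as that probability approaches 0, which is what drives the weight updates during training.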

End Notes

That is all about the Softmax activation function in this article. We saw why activation functions like sigmoid or a simple threshold should not be used in multiclass classification problems, and we worked through an example of how the softmax function turns a vector of input values into an output vector of probabilities, where the largest value indicates the predicted class. Wikipedia also covers the softmax function, and the same idea carries over directly to code, whether you write your own softmax function in Python or build it on NumPy.

Frequently Asked Questions

Q1. What is the softmax function?

The softmax function is a mathematical function that converts a vector of real numbers into a probability distribution. It exponentiates each element, making them positive, and then normalizes them by dividing by the sum of all exponentiated values. This ensures that the output probabilities add up to one, making it suitable for multiclass classification tasks.

Q2. What is the difference between sigmoid and softmax functions?

The sigmoid function is used for binary classification, mapping any real value to a range between 0 and 1. It is suitable for independent predictions. The softmax function, on the other hand, converts a vector of real numbers into a probability distribution for multiclass classification tasks, ensuring that the sum of the probabilities is equal to one.


If you have any queries let me know in the comments below!
