Nilanjan Sengupta — September 2, 2021

This article was published as a part of the Data Science Blogathon


Sentence classification is one of the simplest NLP tasks, with a wide range of applications including document classification, spam filtering, and sentiment analysis. In sentence classification, each sentence is assigned to a class. This article uses a question database in which each question is labeled by what it asks about. For example, “Who was Abraham Lincoln” is a question and its label is “person”.

We will use this dataset-

CNN Fundamentals

Let’s look at the fundamental idea behind a CNN without going into too much technical detail. A CNN is a stack of layers such as convolution layers, pooling layers, and fully connected layers. We will discuss each of these to understand its role in the CNN. First, the input is connected to a set of convolution layers. These convolution layers slide a patch of weights over the input and produce an output by means of the convolution operation. Convolution layers use a small number of weights, organized to cover only a small patch of the input at a time, and these weights are shared across certain dimensions (for example, the width and height dimensions of an image): the same small set of weights is slid along the desired dimensions to form the output. The result of this convolution operation is shown in Image 1. If the pattern encoded in a convolution filter is present in a patch of the image, the convolution outputs a high value for that location; otherwise, it outputs a low value. By convolving the whole image, we get a matrix that indicates whether the pattern was present at each location:

Image 1 – NLP with TensorFlow by Thushan Ganegedara
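To make the "high value where the pattern is present" idea concrete, here is a minimal pure-Python sketch. The 4x4 image, the 2x2 diagonal pattern, and the helper `convolve2d` are made up for illustration and are not taken from the book; the sliding sum of products is the operation deep learning libraries call convolution.

```python
# Hypothetical 4x4 binary "image" and a 2x2 diagonal pattern to look for.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
pattern = [
    [1, 0],
    [0, 1],
]

def convolve2d(img, filt):
    """Slide filt over img (no padding, stride 1), summing elementwise products."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(img) - fh + 1
    out_w = len(img[0]) - fw + 1
    return [
        [
            sum(img[r + i][c + j] * filt[i][j]
                for i in range(fh) for j in range(fw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

response = convolve2d(image, pattern)
for row in response:
    print(row)
# [2, 0, 0]
# [0, 1, 1]
# [0, 1, 0]
```

The largest response (2) appears in the top-left corner, the only place where the full diagonal pattern occurs; partial matches give 1 and non-matches give 0.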


CNN Structure

We will perform the following operations on the text document:

  • Transform the sentences into a format that CNNs can easily handle.
  • Perform convolution and pooling operations for sentence classification.

Data Transformation

Let’s consider this example for a better understanding:

  • Bob and Mary are friends.
  • Bob plays Soccer.
  • Mary likes to sing in the choir.

The third sentence has the most words, so n=7. Now, let’s perform a one-hot encoding of these words. There are 13 distinct words (k=13).

  • Bob – 1,0,0,0,0,0,0,0,0,0,0,0,0
  • and – 0,1,0,0,0,0,0,0,0,0,0,0,0
  • Mary – 0,0,1,0,0,0,0,0,0,0,0,0,0

Similarly, for the 3 sentences, we will have a three-dimensional matrix of size 3*7*13, with the shorter sentences padded to 7 words.

Image 2 – NLP with TensorFlow by Thushan Ganegedara
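The transformation above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: words are split on whitespace, vocabulary indices are assigned in order of first appearance, and padding positions are all-zero vectors. The names `vocab`, `one_hot`, and `batch` are hypothetical.

```python
sentences = [
    "Bob and Mary are friends",
    "Bob plays Soccer",
    "Mary likes to sing in the choir",
]
tokenized = [s.split() for s in sentences]

# n: length of the longest sentence; every sentence is padded to this length.
n = max(len(words) for words in tokenized)

# Assign each distinct word an index in order of first appearance.
vocab = {}
for words in tokenized:
    for word in words:
        vocab.setdefault(word, len(vocab))
k = len(vocab)

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

# batch[s][t] is the one-hot vector of word t in sentence s (zeros if padding).
batch = [
    [one_hot(vocab[w], k) for w in words]
    + [[0] * k for _ in range(n - len(words))]
    for words in tokenized
]

print(len(batch), len(batch[0]), len(batch[0][0]))  # 3 7 13
```

With this ordering, "Bob", "and", and "Mary" get indices 0, 1, and 2, matching the vectors listed above.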

The Convolution operation

Suppose we process only one sentence at a time. The input is then an n*k matrix, where n is the number of words per sentence after padding and k is the dimension of a single word vector. In the above example, this would be 7*13.

Now we define a weight matrix W of size m*k, where m is the filter size for a one-dimensional convolution operation. By convolving the input x of size n*k with the weight matrix W of size m*k, we produce an output h of size 1*n as follows:

Image 3 – NLP with TensorFlow by Thushan Ganegedara 

Here, $w_{i,j}$ is the $(i,j)$th element of W, and we pad x with zeros so that h is of size 1*n.


Here, * denotes the convolution operation (along with padding) and an additional scalar bias b is added.

Image 4 – NLP with TensorFlow by Thushan Ganegedara 

For a rich set of features, parallel layers with different convolution filter sizes are used. Each convolution layer gives a hidden vector of size 1*n, and these outputs are concatenated to form the input to the next layer, of size q*n, where q is the number of parallel layers. A larger value of q generally gives better performance.
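The following pure-Python sketch illustrates a one-dimensional convolution over a sentence with zero padding, and how parallel filters stack into a q-row output with n values per row. It is an illustration only: the toy input, the two filters, and the helper `conv1d_same` are hypothetical, and the split of the m-1 padding rows (extra row on the right) is an assumption chosen to mimic 'SAME' padding with stride 1.

```python
def conv1d_same(x, w, b=0.0):
    """Convolve sentence x (n rows of k values) with filter w (m rows of k values).

    x is zero-padded with m-1 rows in total so the output has n values.
    """
    n, m, k = len(x), len(w), len(x[0])
    pad_left = (m - 1) // 2
    zero = [0] * k
    padded = [zero] * pad_left + x + [zero] * (m - 1 - pad_left)
    return [
        sum(w[j][d] * padded[i + j][d] for j in range(m) for d in range(k)) + b
        for i in range(n)
    ]

# Toy sentence: n = 4 words, one-hot vectors of dimension k = 3.
x = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]

# Two parallel filters (q = 2) with different sizes m.
bigram_filter = [[1, 0, 0], [0, 1, 0]]   # m = 2: word 0 followed by word 1
count_filter = [[1, 1, 1]] * 3           # m = 3: counts real words in the window

h = [conv1d_same(x, w) for w in (bigram_filter, count_filter)]
for row in h:
    print(row)
# [2.0, 0.0, 0.0, 1.0]
# [2.0, 3.0, 3.0, 2.0]
```

The first filter responds most strongly at position 0, where its two-word pattern occurs in full; the second gives lower values at the edges because part of its window falls on padding.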

The Pooling Operation

The purpose of the pooling operation is to subsample the outputs of the parallel convolution layers discussed above. Assume the output of the last layer, h, is of size q*n. The pooling over time layer then produces an output h’ of size q*1 by taking the maximum of each of the q rows.

Image 4 – NLP with TensorFlow by Thushan Ganegedara
Image 5 – NLP with TensorFlow by Thushan Ganegedara 
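Max pooling over time reduces each row of h to its largest value. A minimal sketch with made-up numbers (q = 3 parallel layers, n = 4 positions):

```python
# Hypothetical convolution outputs: one row per parallel layer.
h = [
    [0.1, 0.9, 0.3, 0.0],
    [0.4, 0.2, 0.7, 0.5],
    [0.0, 0.0, 0.6, 0.1],
]

# Keep only the strongest response of each layer, wherever it occurs.
h_prime = [max(row) for row in h]
print(h_prime)  # [0.9, 0.7, 0.6]
```

Because only the maximum is kept, the result does not depend on where in the sentence the feature was detected, and its size does not depend on the sentence length n.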

Combining these operations, we finally get this architecture:

Image 6 – NLP with TensorFlow by Thushan Ganegedara

Implementation with code

First, we will define the inputs and outputs. Our input will be a batch of sentences, with the words represented by one-hot-encoded word vectors.

sent_inputs = tf.placeholder(shape=
sent_labels = tf.placeholder(shape=


Then, we will define three one-dimensional convolution layers with three different filter sizes, along with their respective biases:


w1 =
b1 =
w2 =
b2 =
w3 =
b3 =

Then, we will calculate three outputs, one for each convolution layer. We will use a stride of 1 and zero padding to make sure that the outputs have the same size as the input:

h1_1 =
ing='SAME') + b1)
h1_2 =
ing='SAME') + b2)
h1_3 =
ing='SAME') + b3)

TensorFlow does not have a built-in function for max pooling over time, so we need to write it using elementary functions. We will calculate the maximum value of the hidden output produced by each convolution layer. This gives a single scalar per layer for each sentence:

h2_1 = tf.reduce_max(h1_1,axis=1)
h2_2 = tf.reduce_max(h1_2,axis=1)
h2_3 = tf.reduce_max(h1_3,axis=1)

Then we will concatenate these outputs on axis 1 to give an output of size batchsize*q:

h2 = tf.concat([h2_1,h2_2,h2_3],axis=1)
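To see what the reduction and concatenation do to the shapes, here is a pure-Python stand-in for the two steps. It assumes each convolution output has shape (batch, n, 1); the numbers and the helpers `max_over_time` and `concat_axis1` are hypothetical and only mimic reducing and concatenating along axis 1 on nested lists.

```python
# Hypothetical conv outputs: batch of 2 sentences, n = 3 positions, 1 channel.
h1_1 = [[[0.2], [0.8], [0.1]], [[0.5], [0.3], [0.4]]]
h1_2 = [[[0.6], [0.0], [0.3]], [[0.1], [0.9], [0.2]]]
h1_3 = [[[0.4], [0.4], [0.7]], [[0.0], [0.2], [0.1]]]

def max_over_time(t):
    """Max along axis 1: (batch, n, channels) -> (batch, channels)."""
    return [
        [max(step[c] for step in sent) for c in range(len(sent[0]))]
        for sent in t
    ]

def concat_axis1(tensors):
    """Join (batch, c_i) lists along axis 1 -> (batch, sum of c_i)."""
    return [
        [v for t in tensors for v in t[b]]
        for b in range(len(tensors[0]))
    ]

h2 = concat_axis1([max_over_time(h) for h in (h1_1, h1_2, h1_3)])
print(h2)  # [[0.8, 0.6, 0.7], [0.5, 0.9, 0.2]]
```

Each sentence in the batch ends up with one number per parallel layer, so the result has shape batchsize*q with q = 3 here.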


Then, we will define the fully connected layer, which is connected to the entire output produced by the pooling over time layer. In this case there is a single fully connected layer, and it is also our output layer:

w_fc1 = tf.Variable(tf.truncated_normal([len(filter_sizes),n
b_fc1 =

The following operation produces the logits, which will then be used to calculate the loss of the network:

logits = tf.matmul(h2,w_fc1) + b_fc1
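The logits are just a matrix product plus a bias. A small pure-Python sketch, assuming q = 3 pooled features and 2 classes; the weights, the bias values, and the helper `matmul` are invented for illustration.

```python
def matmul(a, b):
    """Plain matrix product of a (r x t) and b (t x c), as nested lists."""
    return [
        [sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

h2 = [[1.0, 0.0, 2.0]]                          # batch of 1, q = 3
w_fc1 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]   # q x number of classes
b_fc1 = [0.1, 0.2]                              # one bias per class

# One row of logits per sentence, one column per class.
logits = [
    [value + bias for value, bias in zip(row, b_fc1)]
    for row in matmul(h2, w_fc1)
]
print(logits)
```

The first dimension of `w_fc1` must equal q, the number of pooled features, which is why the article sizes it with `len(filter_sizes)`.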

Then, by applying the softmax activation to the logits, we will get the predictions:

predictions =

Then, we will define the loss function, which is the cross-entropy loss:

loss =
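For reference, softmax and cross-entropy can be written out by hand as below. This is a sketch of the standard formulas for a single example, not the article's TensorFlow code; the logits and the one-hot label are made up.

```python
import math

def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot_label):
    """Negative log-probability assigned to the true class."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

probs = softmax([2.0, 1.0, 0.1])        # hypothetical logits for 3 classes
loss = cross_entropy(probs, [1, 0, 0])  # the true class is class 0
predicted_class = probs.index(max(probs))
print(probs, loss, predicted_class)
```

The loss is small when the probability of the true class is close to 1 and grows as that probability shrinks, which is what the optimizer pushes against during training.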

optimizer =

To optimize the model, we use MomentumOptimizer, a built-in TensorFlow optimizer.
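The idea behind a momentum update can be sketched on a one-variable toy problem. The learning rate, the momentum value, the toy loss w^2, and the helper `momentum_step` are assumptions for illustration; the article does not state the hyperparameters it used.

```python
def momentum_step(w, grad, velocity, learning_rate=0.1, momentum=0.9):
    """One momentum update: accumulate gradients, then step along the accumulation."""
    velocity = momentum * velocity + grad
    w = w - learning_rate * velocity
    return w, velocity

# Toy problem: minimize f(w) = w**2, whose gradient is 2 * w.
w, velocity = 5.0, 0.0
for _ in range(300):
    w, velocity = momentum_step(w, 2 * w, velocity)

print(w)  # very close to 0, the minimum of f
```

The accumulated velocity lets the update keep moving in a consistent direction across steps instead of reacting only to the current gradient.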

Performing these operations to optimize the CNN and then evaluating it on the test data gives a test accuracy of approximately 90% (500 test sentences) on this sentence classification task.

Ending Notes

In this article, we discussed the following:

  • A combination of one-dimensional convolution operations and pooling over time can be used to implement a CNN-based sentence classifier.
  • How TensorFlow can be used to implement such a CNN, and how it performs.
  • In real life, a sentence classifier is useful when, for example, we want to find information about Julius Caesar in a large document on the history of Rome without reading the whole document.
  • Sentence classification can also be used for other tasks, such as classifying movie reviews and automating movie ratings.


The idea for writing this article is taken from NLP with TensorFlow by Thushan Ganegedara.

Nilanjan Sengupta 


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
