**This article was published as a part of the Data Science Blogathon**

Sentence classification is one of the simplest NLP tasks and has a wide range of applications, including document classification, spam filtering, and sentiment analysis. In sentence classification, each sentence is assigned to one class. This article uses a question database in which each question is labeled with what the question is about. For example, “Who was Abraham Lincoln” is a question, and its label is “person”.

We will use this dataset: http://cogcomp.org/Data/QA/QC/
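
As a rough sketch of how the labeled questions could be read, the snippet below assumes that each line of the data files has the form `COARSE:fine question text`. That layout is an assumption about the files, `parse_question_line` is a hypothetical helper, and the two sample lines are made up for illustration rather than copied from the dataset.

```python
def parse_question_line(line):
    """Split one 'COARSE:fine question text' line into (coarse, fine, words)."""
    label, _, question = line.strip().partition(" ")
    coarse, _, fine = label.partition(":")
    return coarse, fine, question.lower().split()

# Hypothetical sample lines in the assumed format (not taken from the dataset).
sample_lines = [
    "HUM:ind Who was Abraham Lincoln ?",
    "LOC:city What is the capital of France ?",
]

parsed = [parse_question_line(line) for line in sample_lines]
for coarse, fine, words in parsed:
    print(coarse, fine, words)
```

The coarse label is what a classifier like the one in this article would predict; the word list is what gets encoded as input.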

Let’s look at the fundamental idea behind a CNN without going into too much technical detail. A CNN is a stack of layers, such as convolution layers, pooling layers, and fully connected layers. We will discuss each of these to understand its role in the CNN. First, the input is connected to a set of convolution layers. A convolution layer slides a small patch of weights (a filter) over the input and produces its output by means of the convolution operation. Each filter covers only a small patch of the input at a time, and the same weights are shared across positions along some dimensions (for example, the width and height dimensions of an image), so a convolution layer needs only a small number of weights. If the pattern encoded in a convolution filter is present in a patch of the image, the convolution outputs a high value for that location; otherwise, it outputs a low value. By convolving the whole image, we get a matrix that indicates whether the pattern was present at each location.
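
To make the idea of "high output where the pattern is present" concrete, here is a minimal sketch in plain Python. The image, the filter, and the helper `convolve2d_valid` are made up for illustration. It computes the sliding dot product without flipping the filter, which is what deep learning libraries usually call convolution.

```python
def convolve2d_valid(image, kernel):
    """Slide kernel over image (no padding) and return the matrix of responses."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 4x4 image containing a short diagonal stroke.
image = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
# Hypothetical 2x2 filter that encodes the same diagonal pattern.
kernel = [
    [1, 0],
    [0, 1],
]

response = convolve2d_valid(image, kernel)
for row in response:
    print(row)
# [1, 0, 0]
# [0, 2, 0]
# [0, 0, 1]
```

The largest value (2) appears exactly where the full diagonal pattern lies under the filter.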

We will perform the following operations on the text document:

- Transform the sentences into a format that a CNN can easily handle.
- Perform convolution and pooling operations for sentence classification.

Let’s consider this example for a better understanding:

- Bob and Mary are friends.
- Bob plays Soccer.
- Mary likes to sing in the choir.

The third sentence has the most words (7), so n=7, and the shorter sentences are padded to this length. Now, let’s perform a one-hot encoding of these words. There are 13 distinct words, so k=13.

- Bob – 1,0,0,0,0,0,0,0,0,0,0,0,0
- and – 0,1,0,0,0,0,0,0,0,0,0,0,0
- Mary – 0,0,1,0,0,0,0,0,0,0,0,0,0

Encoding every word in this way, the 3 sentences give a three-dimensional matrix of size 3*7*13.

If we process only one sentence at a time, the input is an n*k matrix, where n is the number of words per sentence after padding and k is the dimension of a single word vector. In the above example, this would be 7*13.
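
The encoding of the three example sentences can be sketched in plain Python as follows. Representing padding as all-zero rows is an assumption made here for illustration, and the helper names (`one_hot`, `encode`) are hypothetical.

```python
sentences = [
    "Bob and Mary are friends",
    "Bob plays Soccer",
    "Mary likes to sing in the choir",
]
tokenized = [s.split() for s in sentences]

# Assign each distinct word an index in order of first appearance.
vocab = {}
for words in tokenized:
    for w in words:
        vocab.setdefault(w, len(vocab))

n = max(len(words) for words in tokenized)  # longest sentence
k = len(vocab)                              # number of distinct words

def one_hot(index, size):
    """Return a vector of zeros with a single 1 at position index."""
    return [1 if i == index else 0 for i in range(size)]

def encode(words):
    """Encode one sentence as an n*k matrix, padding with all-zero rows."""
    rows = [one_hot(vocab[w], k) for w in words]
    rows += [[0] * k for _ in range(n - len(rows))]
    return rows

batch = [encode(words) for words in tokenized]
print(len(batch), len(batch[0]), len(batch[0][0]))  # 3 7 13
```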

Now we define a weight matrix W of size m*k, where m is the filter size for a one-dimensional convolution operation.

By convolving the input x of size n*k with the weight matrix W of size m*k, we produce an output h of size 1*n as follows:

$$h_{i} = \sum_{j=1}^{m} \sum_{c=1}^{k} w_{j,c} \, x_{i+j-1,\,c}$$

Here, $w_{j,c}$ is the $(j,c)^{th}$ element of W, and we pad x with zeros so that h is of size 1*n.

In compact form:

h=W*x+b

Here, * denotes the convolution operation (along with padding) and b is an additional scalar bias.
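
The operation h=W*x+b can be sketched in plain Python as below. This is an illustration under stated assumptions, not the TensorFlow implementation: zeros are appended only at the end of the sentence so that there is one output value per word position, whereas TensorFlow's `'SAME'` padding may split the zeros between both ends. The function name `conv1d_same` and the toy values are hypothetical.

```python
def conv1d_same(x, W, b=0.0):
    """One-dimensional convolution of an n*k input with an m*k filter.

    Returns n values: h[i] = sum over j, c of W[j][c] * x[i + j][c], plus b,
    where x is padded with m - 1 all-zero rows at the end.
    """
    n, k, m = len(x), len(x[0]), len(W)
    padded = x + [[0.0] * k for _ in range(m - 1)]
    return [
        sum(W[j][c] * padded[i + j][c] for j in range(m) for c in range(k)) + b
        for i in range(n)
    ]

# Toy input: 4 word positions, vocabulary of 2 words (one-hot rows).
x = [
    [1, 0],  # word 0
    [0, 1],  # word 1
    [1, 0],  # word 0
    [0, 0],  # padding
]
# Filter of size m=2 that responds to "word 0 followed by word 1".
W = [
    [1, 0],
    [0, 1],
]

h = conv1d_same(x, W)
print(h)  # [2.0, 0.0, 1.0, 0.0]
```

The output is largest at position 0, where the two-word pattern encoded by the filter occurs in full.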

For a rich set of features, parallel layers with different convolution filter sizes are used. Each convolution layer gives a hidden vector of size 1*n, and these outputs are concatenated to form the input to the next layer, of size q*n, where q is the number of parallel layers. A larger value of q generally gives a richer set of features and is preferred for better performance.

The purpose of the pooling operation is to subsample the outputs of the parallel convolution layers discussed above. Assume the output h of the last layer is of size q*n. The pooling-over-time layer then takes the maximum of each row over the n positions and gives an output h’ of size q*1.
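
A minimal sketch of pooling over time, assuming max pooling (as used later in the TensorFlow code) and hypothetical values for the outputs of three parallel convolution layers:

```python
def max_pool_over_time(h):
    """Reduce each convolution output (one row per layer) to its maximum value."""
    return [max(row) for row in h]

# Hypothetical outputs of q=3 parallel convolution layers over n=4 positions.
h = [
    [2.0, 0.0, 1.0, 0.0],
    [0.5, 3.0, 0.0, 1.5],
    [0.0, 0.0, 0.2, 0.1],
]

h_pooled = max_pool_over_time(h)
print(h_pooled)  # [2.0, 3.0, 0.2]
```

Each layer contributes a single number, so the result no longer depends on the sentence length.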

Combining these operations gives the final architecture: one-hot-encoded input, parallel convolution layers, pooling over time, and a fully connected output layer.

First, we define the inputs and outputs. Our input is a batch of sentences, in which the words are represented by one-hot-encoded word vectors.

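```
import tensorflow as tf  # TensorFlow 1.x graph API

# One-hot-encoded sentences: [batch, words per sentence, vocabulary size]
sent_inputs = tf.placeholder(
    shape=[batch_size, sent_length, vocabulary_size],
    dtype=tf.float32, name='sentence_inputs')
# Class labels, one row per sentence: [batch, number of classes]
sent_labels = tf.placeholder(
    shape=[batch_size, num_classes],
    dtype=tf.float32, name='sentence_labels')
```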

Next, we define three one-dimensional convolution layers, each with a different filter size and its own bias:

    # Filter shape: [filter width, input channels (vocabulary size), output channels]
    w1 = tf.Variable(
        tf.truncated_normal([filter_sizes[0], vocabulary_size, 1],
                            stddev=0.02, dtype=tf.float32),
        name='weights_1')
    b1 = tf.Variable(tf.random_uniform([1], 0, 0.01, dtype=tf.float32), name='bias_1')

    w2 = tf.Variable(
        tf.truncated_normal([filter_sizes[1], vocabulary_size, 1],
                            stddev=0.02, dtype=tf.float32),
        name='weights_2')
    b2 = tf.Variable(tf.random_uniform([1], 0, 0.01, dtype=tf.float32), name='bias_2')

    w3 = tf.Variable(
        tf.truncated_normal([filter_sizes[2], vocabulary_size, 1],
                            stddev=0.02, dtype=tf.float32),
        name='weights_3')
    b3 = tf.Variable(tf.random_uniform([1], 0, 0.01, dtype=tf.float32), name='bias_3')

Then, we calculate three outputs, one for each convolution layer. We use a stride of 1 and zero padding so that each output has the same length as the input:

    h1_1 = tf.nn.relu(tf.nn.conv1d(sent_inputs, w1, stride=1, padding='SAME') + b1)
    h1_2 = tf.nn.relu(tf.nn.conv1d(sent_inputs, w2, stride=1, padding='SAME') + b2)
    h1_3 = tf.nn.relu(tf.nn.conv1d(sent_inputs, w3, stride=1, padding='SAME') + b3)
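
To see why `'SAME'` padding with a stride of 1 keeps the sentence length, the sketch below applies the output-length rules that TensorFlow documents for its two padding modes. The helper `conv_output_length` and the filter sizes 3, 5, and 7 are hypothetical; the article does not state the values in `filter_sizes`.

```python
import math

def conv_output_length(input_length, filter_size, stride, padding):
    """Output length of a 1-D convolution for 'SAME' or 'VALID' padding."""
    if padding == "SAME":
        return math.ceil(input_length / stride)
    if padding == "VALID":
        return math.ceil((input_length - filter_size + 1) / stride)
    raise ValueError("padding must be 'SAME' or 'VALID'")

sent_length = 7  # longest sentence in the running example
for m in (3, 5, 7):  # hypothetical filter sizes
    same = conv_output_length(sent_length, m, 1, "SAME")
    valid = conv_output_length(sent_length, m, 1, "VALID")
    print(m, same, valid)
# 3 7 5
# 5 7 3
# 7 7 1
```

With `'SAME'` padding all three layers give outputs of length 7 regardless of filter size, whereas `'VALID'` padding would give a different length for each filter.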

Next, we calculate max pooling over time. As TensorFlow does not have a built-in function for this, we write it with elementary functions: we take the maximum value of each hidden output produced by each convolution layer. This gives a single scalar per sentence for each layer:

    h2_1 = tf.reduce_max(h1_1, axis=1)
    h2_2 = tf.reduce_max(h1_2, axis=1)
    h2_3 = tf.reduce_max(h1_3, axis=1)

Then we concatenate these outputs on axis 1 to give an output of size batch_size*q:

    h2 = tf.concat([h2_1, h2_2, h2_3], axis=1)

Then, we define the fully connected layer, which is connected to the entire output produced by the pooling-over-time layer. In this case there is a single fully connected layer, and it is also our output layer:

    w_fc1 = tf.Variable(
        tf.truncated_normal([len(filter_sizes), num_classes],
                            stddev=0.5, dtype=tf.float32),
        name='weights_fulcon_1')
    b_fc1 = tf.Variable(
        tf.random_uniform([num_classes], 0, 0.01, dtype=tf.float32),
        name='bias_fulcon_1')

The following operation produces the logits, which are then used to calculate the loss of the network:

    logits = tf.matmul(h2, w_fc1) + b_fc1
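
The shapes involved in this step can be checked with a small plain-Python sketch. All numbers are hypothetical: one sentence in the batch, q=3 pooled features, and 2 classes. The helper `dense` is only an illustration of a matrix product plus bias.

```python
def dense(batch, weights, bias):
    """Compute batch @ weights + bias for lists of lists."""
    return [
        [
            sum(row[i] * weights[i][j] for i in range(len(row))) + bias[j]
            for j in range(len(bias))
        ]
        for row in batch
    ]

h2 = [[2.0, 3.0, 0.0]]       # [batch_size, q] pooled features
w_fc = [[1.0, 0.0],
        [0.0, 0.5],
        [1.0, 1.0]]          # [q, num_classes]
b_fc = [0.5, -0.5]           # [num_classes]

logits = dense(h2, w_fc, b_fc)
print(logits)  # [[2.5, 1.0]]
```

The result has one row per sentence and one column per class.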

Then, by applying the softmax activation to the logits and taking the index of the largest probability, we get the predictions:

    predictions = tf.argmax(tf.nn.softmax(logits), axis=1)
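
What softmax followed by argmax does for a single sentence can be sketched in plain Python. The logits are hypothetical. Since softmax preserves the ordering of its inputs, the predicted class is the same whether argmax is applied to the probabilities or directly to the logits.

```python
import math

def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.0, 3.0, 0.5]  # hypothetical logits for three classes
probs = softmax(logits)
predicted_class = argmax(probs)
print(predicted_class)  # 1
```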

Then, we define the loss function, which is the cross-entropy loss:

    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits_v2(labels=sent_labels, logits=logits))

To optimize the model, we use MomentumOptimizer, a TensorFlow built-in optimizer:

    optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9).minimize(loss)
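
The two ingredients of this step can be sketched in plain Python. The cross-entropy function works on a single one-hot label and a probability vector. The update rule assumes the classical (non-Nesterov) momentum form, where the gradient is accumulated into a velocity that is then scaled by the learning rate. The function names and the toy objective f(w) = w² are hypothetical.

```python
import math

def cross_entropy(one_hot_label, probs):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(y * math.log(p) for y, p in zip(one_hot_label, probs) if y > 0)

def momentum_step(weight, grad, velocity, learning_rate=0.01, momentum=0.9):
    """One momentum update: accumulate the gradient, then move the weight."""
    velocity = momentum * velocity + grad
    weight = weight - learning_rate * velocity
    return weight, velocity

# Loss for a sentence whose true class is 1 and predicted probability is 0.7.
loss = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
print(loss)

# Two momentum steps on the toy objective f(w) = w**2 (gradient 2*w).
w, v = 1.0, 0.0
for _ in range(2):
    w, v = momentum_step(w, 2 * w, v)
print(w, v)
```

The second step moves the weight further than the first because the velocity carries over part of the previous gradient.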

Performing these operations to optimize the CNN and evaluating it on the test data (500 test sentences) gives a test accuracy of approximately 90% in this sentence classification task.
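
Test accuracy is the fraction of test sentences whose predicted class matches the true class. A minimal sketch with hypothetical predictions and labels (not the article's results):

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted class equals the true class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

predicted = [0, 2, 1, 1, 3]  # hypothetical class indices from the model
actual = [0, 2, 1, 0, 3]     # hypothetical true class indices
print(accuracy(predicted, actual))  # 0.8
```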

In this article, we discussed the following:

- A combination of one-dimensional convolution operations and pooling over time can be used to implement a sentence classifier based on a CNN architecture.
- How to use TensorFlow to implement such a CNN, and how it performs.
- In real life, a sentence classifier is useful when, for example, we want to find information about Julius Caesar in a large document on the history of Rome without reading the whole document.
- Sentence classification can also be used for other tasks, such as classifying movie reviews and automating movie ratings.

*The idea for this article is taken from Natural Language Processing with TensorFlow by Thushan Ganegedara.*

**Nilanjan Sengupta**

- linkedin.com/in/nilanjan-sengupta-2529241b2

- Images 1 to 6: https://books.google.co.in/books/about/Natural_Language_Processing_with_TensorF.html?id=trhwswEACAAJ&redir_esc=y

**The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.**
