SMS Spam Detection Using LSTM – A Hands-On Guide!

Basil Saji 20 May, 2021 • 5 min read

This article was published as a part of the Data Science Blogathon   


Introduction

In today’s world, almost everyone uses a mobile phone, and all of us receive messages (SMS/email) on our phones daily. The problem is that many of these messages are spam, and only a few are ham, i.e., genuine messages.

In this article, we are going to create an SMS spam detection model which will help you to find whether an SMS is spam or not using LSTM.

About Dataset: Here we are using the SMS Spam Detection Dataset, which contains SMS text and its corresponding label (Spam or Ham).

Implementation

First of all, we are importing all the required libraries for data preprocessing

import pandas as pd
import numpy as np
import re
import collections
import contractions
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('dark_background')
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import warnings
warnings.simplefilter(action='ignore', category=Warning)
import keras
from keras.layers import Dense, Embedding, LSTM, Dropout
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pickle

Importing the SMS spam detection dataset

df = pd.read_csv("spam.csv", encoding='latin-1')
df.head()
df.shape # output - (5572, 5)

As you can see our data contains some columns which are not useful to us. So let’s drop those columns.

df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1, inplace=True)

Also, we are renaming the column names for our convenience.

df.columns = ["SpamHam","Tweet"]

Let’s plot the value counts of both spam and ham SMS.

sns.countplot(x=df["SpamHam"])

The number of ham messages is more than that of spam messages in the data.
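If you want the exact numbers behind the plot, value_counts gives them directly (the counts in the comment below are what this dataset typically contains):

df["SpamHam"].value_counts()
# typically around: ham 4825, spam 747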

Before doing the preprocessing techniques let’s plot the count of different words present in our dataset. For this, we are creating a function named word_count_plot.

def word_count_plot(data):
    # finding words along with count
    word_counter = collections.Counter([word for sentence in data for word in sentence.split()])
    most_count = word_counter.most_common(30)  # 30 most common words
    # sorted data frame
    most_count = pd.DataFrame(most_count, columns=["Word", "Count"]).sort_values(by="Count")
    most_count.plot.barh(x="Word", y="Count", color="green", figsize=(10, 15))

word_count_plot(df["Tweet"])

As you can see most of the words are stopwords. So let’s do some preprocessing techniques on the dataset.

lem = WordNetLemmatizer()  # requires nltk.download("wordnet") and nltk.download("stopwords") on first run

def preprocessing(data):
    sms = contractions.fix(data)  # converting shortened words to original (e.g. "I'm" to "I am")
    sms = sms.lower()  # lower casing the sms
    sms = re.sub(r'https?://\S+|www\.\S+', "", sms).strip()  # removing urls
    sms = re.sub("[^a-z ]", "", sms)  # removing symbols and numbers
    sms = sms.split()  # splitting into words
    # lemmatization and stopword removal
    sms = [lem.lemmatize(word) for word in sms if word not in set(stopwords.words("english"))]
    sms = " ".join(sms)
    return sms

X = df["Tweet"].apply(preprocessing)
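To sanity-check the function, try it on a single made-up message (the SMS below is just an illustration):

print(preprocessing("I'm sending you the link: https://example.com, call me at 9 PM!!"))
# should print something like: sending link call pm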

Yeah! We completed the data preprocessing. Now let’s plot the word counts once again to see the most frequent words.

word_count_plot(X)

Now we can see the most common words other than the stopwords. Let’s continue our preprocessing.

Since our output values (Spam or Ham) are categorical, we have to convert them into numerical form. We do this encoding with LabelEncoder.

from sklearn.preprocessing import LabelEncoder
lb_enc = LabelEncoder()
y = lb_enc.fit_transform(df["SpamHam"])
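LabelEncoder assigns integers in alphabetical order of the class names, so here ham becomes 0 and spam becomes 1. You can confirm the mapping from the fitted encoder:

print(lb_enc.classes_)  # ['ham' 'spam'] -> index 0 is ham, index 1 is spam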

We converted our output feature into numerical form; now the input feature needs the same treatment. Let’s convert it into numerical form using the Keras Tokenizer, followed by padding.

First, let’s tokenize our data and convert it into a numerical sequence using keras Tokenizer.

tokenizer = Tokenizer()  # initializing the tokenizer
tokenizer.fit_on_texts(X)  # fitting on the sms data
text_to_sequence = tokenizer.texts_to_sequences(X)  # creating the numerical sequences

Let’s look at some texts and their corresponding numerical sequences.

for i in range(5):
    print("Text               : ", X[i])
    print("Numerical Sequence : ", text_to_sequence[i])


We can also find the index number of the corresponding words.

tokenizer.index_word # this will output a dictionary of index and words

Output

{1: 'call',
 2: 'get',
 3: 'ur',
 4: 'go',
 5: 'free',
 6: 'ok',
 7: 'ltgt',
 8: 'know',
 9: 'day',
 10: 'got',
 11: 'want',
 12: 'come',
 13: 'like',
 14: 'love',
 15: 'good',
 16: 'time',
 17: 'going',
 18: 'text',
 19: 'send',
 20: 'need',
 21: 'one',
 22: 'today',
 23: 'txt',
 24: 'home',
 25: 'lor',
 26: 'see',
 27: 'sorry',
 28: 'stop',
 29: 'r',
 30: 'still', ...}

This dict contains 7774 entries, which means that our data contains 7774 unique words.
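You can verify the vocabulary size directly from the tokenizer (the exact count may shift if you change the preprocessing steps):

vocab_size = len(tokenizer.word_index)
print(vocab_size)  # 7774 in this run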

As you can see in text_to_sequence, the sequences all have different lengths, but the model needs fixed-length inputs for training. So we make every sequence the same length by padding with 0s.

max_length_sequence = max([len(i) for i in text_to_sequence])  # finding the length of the longest sequence
padded_sms_sequence = pad_sequences(text_to_sequence, maxlen=max_length_sequence,
                                    padding="pre")
padded_sms_sequence

Output

array([[   0,    0,    0, ...,   10, 3568,   68],
       [   0,    0,    0, ..., 1177,  330, 1542],
       [   0,    0,    0, ..., 2419,  263, 2420],
       ...,
       [   0,    0,    0, ..., 1028, 7773, 3565],
       [   0,    0,    0, ...,  792,   65,    5],
       [   0,    0,    0, ..., 2152,  367,  145]], dtype=int32)

The input data is now ready to be fed into the model. Let’s create the LSTM model for training.

TOT_SIZE = len(tokenizer.word_index) + 1  # vocabulary size + 1 for the padding index 0

def create_model():
    lstm_model = Sequential()
    # map each word index to a 32-dimensional dense vector
    lstm_model.add(Embedding(TOT_SIZE, 32, input_length=max_length_sequence))
    lstm_model.add(LSTM(100))  # 100 LSTM units read the embedded sequence
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation="relu"))
    lstm_model.add(Dropout(0.3))
    # single sigmoid unit: probability that the SMS is spam
    lstm_model.add(Dense(1, activation="sigmoid"))
    return lstm_model

lstm_model = create_model()
lstm_model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
lstm_model.summary()

We created our LSTM model, so, let’s train our model with the input and output features created earlier.

lstm_model.fit(padded_sms_sequence, y, epochs = 5, validation_split=0.2, batch_size=16)

Both the training accuracy (0.9986) and validation accuracy (0.9839) indicate that our model is very good at predicting spam and ham SMS.
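Since the dataset is imbalanced (far more ham than spam), accuracy alone can be flattering; precision and recall on a held-out split are also worth checking. Here is a minimal sketch, assuming you split the data before fitting (the split and variable names are illustrative, not part of the training run above):

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# illustrative hold-out split; fit the model on the train part first
X_train, X_test, y_train, y_test = train_test_split(
    padded_sms_sequence, y, test_size=0.2, random_state=42)
lstm_model.fit(X_train, y_train, epochs=5, batch_size=16)

preds = (lstm_model.predict(X_test) > 0.5).astype("int32").ravel()  # threshold the sigmoid output
print(classification_report(y_test, preds, target_names=lb_enc.classes_))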

We can save the tokenizer for future use as a pickle file; the model itself is best saved in Keras’s native format, since pickling a Keras model directly is unreliable.

pickle.dump(tokenizer, open("sms_spam_tokenizer.pkl", "wb"))
lstm_model.save("lstm_model.h5")  # Keras's own format stores architecture + weights
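To use the saved artifacts later, reload them, run the same preprocessing on the new message, tokenize, pad to the training length, and predict. A minimal sketch (the example SMS is made up, and note that max_length_sequence must also be saved or hard-coded at inference time):

from keras.models import load_model

tokenizer = pickle.load(open("sms_spam_tokenizer.pkl", "rb"))
model = load_model("lstm_model.h5")

new_sms = "Congratulations! You have won a free prize. Call now!"
cleaned = preprocessing(new_sms)                # same cleaning as training
seq = tokenizer.texts_to_sequences([cleaned])   # words -> indices
padded = pad_sequences(seq, maxlen=max_length_sequence, padding="pre")
prob = float(model.predict(padded)[0][0])       # sigmoid probability of spam
print("spam" if prob > 0.5 else "ham", prob)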

Conclusion

Through this article, you should now be able to understand and create a text classification model using the LSTM architecture. In future articles, we will look at other text classification techniques and other Natural Language Processing models.

Thank You!..

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

