Beginner Projects to Learn Natural Language Processing Using Python!
This article was published as a part of the Data Science Blogathon
Machines understanding language fascinates me, and I often ponder which algorithms Aristotle would have used to build a rhetorical analysis machine if he had had the chance. If you’re new to Data Science, getting into NLP can seem complicated, especially since there are so many recent advancements in the field. It’s hard to know where to begin.
Table of Contents
1. What can Machines Understand?
2. Project 1: Word Cloud
3. Project 2: Spam Detection
4. Project 3: Sentiment Analysis
What can Machines Understand?
While a computer can be quite good at finding patterns and summarizing documents, it must transform words into numbers before making sense of them. This transformation is necessary because math doesn’t work very well on words, and machines “learn” thanks to mathematics. Before the words are transformed into numbers, the data must be cleaned. Data cleaning includes removing special characters and punctuation, and modifying words into forms that make them more uniform and interpretable.
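As a minimal illustration of that idea (the sentence here is made up for the example), one simple way to clean a string and map each unique word to an integer index looks like this:

```python
# Lowercase the text, replace anything that isn't a letter, digit,
# or space, then assign each unique word an integer index.
sentence = "Machines learn from numbers, not words!"
cleaned = "".join(ch.lower() if ch.isalnum() or ch.isspace() else " " for ch in sentence)
vocab = {word: i for i, word in enumerate(sorted(set(cleaned.split())))}
print(vocab)
```

This is only a sketch of the word-to-number mapping; the projects below use libraries that handle it for us.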
Project 1: Word Cloud
1.Importing Dependencies and Data
Start by importing the dependencies and the data. The data is stored as a comma-separated values (CSV) file, so I will use pandas’ read_csv() function to load it into a DataFrame.
import pandas as pd
import sqlite3
import regex as re
import matplotlib.pyplot as plt
from wordcloud import WordCloud

#create dataframe from csv
df = pd.read_csv('emails.csv')
df.head()
To eliminate duplicate rows and establish some baseline counts, it is best to do a quick analysis of the data. Here we use pandas drop_duplicates to drop the duplicate rows.
print("spam count: " + str(len(df.loc[df.spam == 1])))
print("not spam count: " + str(len(df.loc[df.spam == 0])))
print(df.shape)

df['spam'] = df['spam'].astype(int)
df = df.drop_duplicates()
df = df.reset_index(inplace=False)[['text', 'spam']]
print(df.shape)
Counts and shape before/after duplicate removal
What is a Word Cloud?
Word clouds make it easier to understand word frequencies, so they are a useful way to visualize text data. Words that appear larger in the cloud are those that appear more frequently in the email text. Word clouds make it easy to identify “keywords.”
Word Cloud Examples
All the text in the word cloud image is lower case, and it does not contain any punctuation marks or special characters. The text is now considered clean and ready for analysis. With the help of regular expressions, it is easy to clean the text using a loop:
clean_desc = []
for w in range(len(df.text)):
    desc = df['text'][w].lower()
    #remove punctuation
    desc = re.sub('[^a-zA-Z]', ' ', desc)
    #remove tags
    desc = re.sub("</?.*?>", " <> ", desc)
    #remove digits and special chars
    desc = re.sub(r"(\d|\W)+", " ", desc)
    clean_desc.append(desc)

#assign the cleaned descriptions to the data frame
df['text'] = clean_desc
df.head(3)
Notice here we create an empty list clean_desc, then use a for loop to go through the text line by line, setting it to lower case, removing punctuation and special characters, and appending it to the list. Then we replace the text column with the data in the clean_desc list.
Stop words are the most common words like “the” and “of.” Removing them from the email text allows the more relevant frequent words to stand out. Removing stop words is a common technique! Some Python libraries like NLTK come pre-loaded with a list of stop words, but it’s easy to create one from scratch.
stop_words = ['is','you','your','and', 'the', 'to', 'from', 'or', 'I', 'for', 'do', 'get', 'not', 'here', 'in', 'im', 'have', 'on', 're', 'new', 'subject']
Notice I include some email-related words like “re” and “subject.” It’s up to the analyst to decide which words should be included or excluded. Sometimes it’s beneficial to include all words!
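A quick sketch of what filtering against such a list looks like (the sample sentence and shortened stop-word list are hypothetical):

```python
stop_words = ['is', 'you', 'your', 'and', 'the', 'to']
text = "the offer is here for you and your team"

# keep only the tokens that are not in the stop word list
filtered = " ".join(w for w in text.split() if w not in stop_words)
print(filtered)  # offer here for team
```

The wordcloud library used below does this filtering for us via its stopwords parameter.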
Construct the Word Cloud
Conveniently, there’s a Python library for creating word clouds. It can be installed using pip.
pip install wordcloud
When constructing the word cloud, it’s possible to set several parameters like height and width, stop words, and max words. It’s even possible to give it a shape rather than displaying the default rectangle.
wordcloud = WordCloud(width = 800, height = 800,
                      background_color = 'black',
                      stopwords = stop_words,
                      max_words = 1000,
                      min_font_size = 20).generate(str(df['text']))

#plot the word cloud
fig = plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
matplotlib and show() are used to save and display the word cloud. This is the result across all records, spam or not.
Push the exercise further by splitting the DataFrame and making two word clouds to help analyze the difference between the keywords used in spam email and non-spam email.
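A minimal sketch of that split, using a tiny made-up frame in place of the email data; each resulting string can then be passed to WordCloud().generate() separately:

```python
import pandas as pd

# toy stand-in for the email DataFrame (hypothetical rows)
demo = pd.DataFrame({"text": ["win free money now", "meeting at noon tomorrow"],
                     "spam": [1, 0]})

# join each class's text into one string per word cloud
spam_text = " ".join(demo.loc[demo.spam == 1, "text"])
ham_text = " ".join(demo.loc[demo.spam == 0, "text"])
print(spam_text)
print(ham_text)
```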
Project 2: Spam Detection
Consider it a binary classification problem, since an email can either be spam, denoted by “1”, or not spam, denoted by “0”. I would like to create a machine learning model that can identify whether or not an email is spam. I’m going to use the Python library Scikit-Learn to explore tokenization, vectorization, and statistical classification algorithms.
Import the Scikit-Learn functionality we need to transform and model the data. I’ll use CountVectorizer, train_test_split, ensemble models, and a couple of metrics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, accuracy_score
Transforming Text to Numbers
In project 1, the text was cleaned. When you look at a word cloud, notice it is primarily single words. The larger the word, the higher its frequency. To prevent the word cloud from outputting sentences, the text goes through a process called tokenization: breaking a sentence down into individual words. The individual words are called tokens.
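At its simplest, tokenization can be sketched with a plain split() on whitespace (real tokenizers, like the one inside CountVectorizer, also handle punctuation and casing):

```python
sentence = "the dog is white"
tokens = sentence.split()  # break the sentence into individual word tokens
print(tokens)  # ['the', 'dog', 'is', 'white']
```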
Using Scikit-Learn’s CountVectorizer(), it’s easy to transform the body of text into a sparse matrix of numbers that the computer can pass to machine learning algorithms. To simplify the concept of count vectorization, imagine you have two sentences:
The dog is white
The cat is black
Converting the sentences to a vector space model transforms them by looking at the words across all sentences and then representing each word in the sentence with a number.
The dog cat is white black
The dog is white = [1,1,0,1,1,0]
The cat is black = [1,0,1,1,0,1]

We can show this using code as well. I’ll add a third sentence to show that it counts the tokens.

#list of sentences
text = ["the dog is white", "the cat is black", "the cat and the dog are friends"]
#instantiate the class
cv = CountVectorizer()
#tokenize and build vocab
cv.fit(text)
print(cv.vocabulary_)
#transform the text
vector = cv.transform(text)
print(vector.toarray())
The sparse matrix of word counts.
Notice in the last vector, you can see a 2, since the word “the” appears twice. CountVectorizer counts the tokens and allows me to construct the sparse matrix containing the words transformed into numbers.
Bag of Words Method
Because the model doesn’t take word placement into account, and instead mixes the words up as if they were tiles in a Scrabble game, this is called the bag of words method. I’m going to create the sparse matrix, then split the data using Scikit-Learn’s train_test_split().
text_vec = CountVectorizer().fit_transform(df['text'])
X_train, X_test, y_train, y_test = train_test_split(text_vec, df['spam'],
                                                    test_size = 0.45,
                                                    random_state = 42,
                                                    shuffle = True)
Notice I set the sparse matrix text_vec to X and the df[‘spam’] column to Y. I shuffle and take a test size of 45%.
It is highly recommended to experiment with several classifiers and determine which one works best for this scenario. In this example, I’m using the GradientBoostingClassifier() model from the Scikit-Learn ensemble collection.
classifier = ensemble.GradientBoostingClassifier(
    n_estimators = 100,  #how many decision trees to build
    learning_rate = 0.5, #learning rate
    max_depth = 6
)
Each algorithm has its own set of parameters you can tweak; that’s called hyperparameter tuning. Go through the documentation to learn more about each of the parameters used in the models.
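One common way to tune hyperparameters is a grid search. This sketch uses Scikit-Learn’s GridSearchCV on synthetic data; the parameter values tried here are arbitrary, not the article’s recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data so the sketch is self-contained
X, y = make_classification(n_samples=200, random_state=42)

# try each combination of parameters with 3-fold cross-validation
grid = GridSearchCV(GradientBoostingClassifier(),
                    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6]},
                    cv=3)
grid.fit(X, y)
print(grid.best_params_)
```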
Finally, we fit the data, call predict, and generate the classification report. Using classification_report(), it’s easy to build a text report showing the main classification metrics.
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(classification_report(y_test, predictions))
Notice our model achieved 97% accuracy.
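accuracy_score, imported earlier but not used above, returns that single number directly; here is a quick check on hypothetical label arrays:

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]  # hypothetical true labels
y_pred = [0, 1, 0, 0, 1]  # hypothetical predictions (one mistake)
print(accuracy_score(y_true, y_pred))  # 0.8
```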
Project 3: Sentiment Analysis
Sentiment Analysis is also a classification problem of sorts. The text will basically reflect a positive, neutral, or negative sentiment, which is known as the polarity of the text. It’s also possible to determine and account for the subjectivity of the text! There are a lot of great resources that cover the theory behind sentiment analysis.
Instead of building another model, this project uses a straightforward, out-of-the-box tool to analyze sentiment called TextBlob. I’ll use TextBlob to add sentiment columns to the DataFrame so they can be analyzed.
What is TextBlob?
Built on top of NLTK and pattern, the TextBlob library for Python 2 and 3 tries to simplify several text processing tasks. It provides tools for classification, part-of-speech tagging, phrase extraction, sentiment analysis, and more. Install it using pip.
pip install -U textblob
python -m textblob.download_corpora
Using the sentiment property, TextBlob returns a named tuple of the form Sentiment(polarity, subjectivity). Polarity is a float in the range [-1.0, 1.0], where -1.0 is the most negative and 1.0 is the most positive. Subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
from textblob import TextBlob

blob = TextBlob("This is a good example of a TextBlob")
print(blob)
blob.sentiment
#Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
Using list comprehensions, it’s easy to load the text column as a TextBlob, then create two new columns to store the polarity and subjectivity.
#load the descriptions into textblob
email_blob = [TextBlob(text) for text in df['text']]

#add the sentiment metrics to the dataframe
df['tb_Pol'] = [b.sentiment.polarity for b in email_blob]
df['tb_Subj'] = [b.sentiment.subjectivity for b in email_blob]

#show dataframe
df.head(3)
TextBlob makes it super simple to generate a baseline sentiment score for polarity and subjectivity. To push this exercise further, see if you can add these new features to the spam detection model to increase the accuracy!
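One way to attempt that extension (a sketch with made-up emails and polarity scores, not the article’s exact pipeline) is to append the polarity column to the sparse count matrix with scipy’s hstack:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["free money now", "lunch at noon"]  # hypothetical emails
polarity = np.array([[0.4], [0.0]])          # hypothetical TextBlob polarity scores

word_counts = CountVectorizer().fit_transform(texts)
features = hstack([word_counts, polarity])   # add sentiment as an extra column
print(features.shape)
```

The combined matrix can then be passed to train_test_split and the classifier exactly as before.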
Even though Natural Language Processing can seem like an intimidating topic, the foundational pieces are not that hard to understand. Many libraries make it easy to start exploring data science and NLP. Completing these three projects is a great way to start building that foundation.