Natural Language Processing Step by Step Guide

This article was published as a part of the Data Science Blogathon

Overview

  • Build a basic understanding of Natural Language Processing.
  • Learn the various techniques used to implement NLP.
  • Understand how to use NLP for text mining.

Prerequisite

  • You must have a basic knowledge of Python.

Every piece of data carries some meaning in its context, and text data is generated every moment in many forms: reviews, SMS messages, emails, and more. The main purpose of this article is to explain the basic ideas of NLP and how it affects our day-to-day life. So let’s go.

Introduction

NLP stands for Natural Language Processing, a field at the intersection of computer science, linguistics (human language), and artificial intelligence. Computers use this technology to understand, analyze, manipulate, and interpret human languages.

NLP algorithms are used widely, in areas such as Gmail spam filtering, search engines, games, and many more.

 

Why is NLP so important?

  • Text data in a massive amount:

NLP helps machines interact with humans in their own language and perform related tasks such as reading text, understanding speech, and interpreting it in a well-structured form. Machines can now analyze far more data than humans can, and more efficiently. A huge amount of data is generated every day in fields such as the medical and pharma industries and on social media such as Facebook and Instagram. This data is not well structured (i.e. it is unstructured), so handling it manually is tedious; that is why we need NLP.

  • Unstructured data to structured:

Supervised learning, unsupervised learning, and deep learning are now used extensively to process human language, which is why we need a proper understanding of text. I am going to explain this understanding in this article. NLP is very important for getting exact or useful insights from text. Meaningful information is gathered

 

Components of NLP

NLP is divided into two components.

  • Natural Language Understanding
  • Natural Language Generation
Figure: Components of NLP (Venn diagram)

Natural Language Understanding:-

Natural Language Understanding (NLU) helps the machine understand and analyze human language by extracting information such as keywords, emotions, relations, and semantics from large amounts of text.

Let’s see what challenges a machine faces.

For example:

  • He is looking for a match.

What do you understand by the word ‘match’? Does it mean a partner, a cricket or football game, or something else?

This is lexical ambiguity. It happens when a word has more than one meaning. Lexical ambiguity can sometimes be resolved with part-of-speech (POS) tagging, when the candidate meanings belong to different parts of speech.
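To make lexical ambiguity concrete, here is a toy sketch that guesses a sense for ‘match’ by counting how many cue words from the surrounding sentence overlap with each sense. This is a much-simplified context-overlap heuristic, not POS tagging, and the sense inventory and cue words (`SENSES`) are invented purely for illustration.

```python
# Toy sense inventory: the senses and their cue words are made up for this sketch.
SENSES = {
    "sports game": {"cricket", "football", "team", "score", "stadium", "play"},
    "life partner": {"marriage", "bride", "groom", "partner", "family", "wedding"},
    "fire stick": {"light", "candle", "fire", "box", "strike", "flame"},
}

def guess_sense(sentence):
    """Return the sense whose cue words overlap most with the sentence, or None if nothing overlaps."""
    words = set(sentence.lower().replace(".", " ").split())
    scores = {sense: len(cues & words) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Without context the machine cannot decide; with context one sense wins.
print(guess_sense("He is looking for a match."))
print(guess_sense("He is looking for a match to light the candle."))
```

The first call has no helpful context, so the function declines to choose; that is exactly the problem a machine faces with the original sentence.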

  • The Fish is ready to eat.

What do you understand by the above example? Is the fish ready to eat its food, or is the fish ready for someone to eat? Confusing, right?

This is syntactic ambiguity, also called grammatical ambiguity: a sequence of words can be read with more than one meaning.

Natural Language Generation:-

It is the process of producing meaningful phrases and sentences in natural language from some internal representation of the data.

It consists of:

  • Text planning − It includes retrieving the relevant data from the domain.
  • Sentence planning − It involves selecting the important words, meaningful phrases, or sentences.
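The two steps above can be sketched with a tiny template-based generator. This is only an illustration under stated assumptions: the `record` dictionary, its field names, and the sentence template are all hypothetical, and real NLG systems are far more elaborate.

```python
# Hypothetical structured record; the field names are invented for this sketch.
record = {"product": "dress", "rating": 5, "fit": "true to size", "warehouse_id": 17}

def plan_text(data):
    """Text planning: keep only the facts worth mentioning."""
    relevant = ("product", "rating", "fit")
    return {key: data[key] for key in relevant if key in data}

def plan_sentence(facts):
    """Sentence planning: choose words and phrases for the selected facts."""
    opinion = "loved" if facts["rating"] >= 4 else "disliked"
    return f"The customer {opinion} the {facts['product']} and found it {facts['fit']}."

sentence = plan_sentence(plan_text(record))
print(sentence)
```

Note how `warehouse_id` is dropped during text planning because it is not relevant to the message being generated.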

 

Phases of NLP

 

Figure: Phases of NLP

-Lexical Analysis:

It involves identifying and analyzing the structure of words. The lexicon of a language is the collection of words and phrases in that language. Lexical analysis divides the text into paragraphs, sentences, and words. The words then need to be normalized, which is called lexicon normalization.
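As a rough sketch of this splitting step, the snippet below breaks a short made-up text into paragraphs, sentences, and words using only regular expressions. The splitting rules are simplifying assumptions (blank lines separate paragraphs; `.`, `!` or `?` ends a sentence) and will not handle abbreviations or other edge cases.

```python
import re

text = "NLP is fun. It has many phases!\n\nLexical analysis comes first."

# Paragraphs are separated by one or more blank lines
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

# Sentences end with '.', '!' or '?' followed by whitespace
sentences = [s for p in paragraphs for s in re.split(r"(?<=[.!?])\s+", p) if s]

# Words are runs of letters (apostrophes allowed)
words = [re.findall(r"[A-Za-z']+", s) for s in sentences]

print(len(paragraphs), "paragraphs")
print(sentences)
print(words)
```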

The most common lexicon normalization techniques are stemming and lemmatization:

  • Stemming: Stemming is the process of reducing derived words to their word stem, base, or root form, generally by stripping suffixes such as “ing”, “ly”, “es”, and “s”.
  • Lemmatization: Lemmatization is the process of reducing a group of words to their lemma, or dictionary form. Before reducing a word to its lemma, it takes into account things like the POS (part of speech), the meaning of the word in the sentence, and the meaning of the word in nearby sentences.
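The difference between the two techniques can be seen in a toy sketch. Neither function below is a real algorithm: `toy_stem` just strips the suffixes mentioned above, and `toy_lemma` looks words up in a tiny invented table (`LEMMAS`). Real stemmers and lemmatizers, such as those in NLTK, use much richer rules and dictionaries.

```python
SUFFIXES = ("ing", "ly", "es", "s")  # the suffixes named in the text above
LEMMAS = {"studies": "study", "better": "good", "went": "go", "mice": "mouse"}  # tiny invented lookup table

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["playing", "quickly", "studies", "went"]:
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemma(w))
```

Stemming turns “studies” into the non-word “studi”, while the dictionary-based lookup returns “study”; it also handles the irregular form “went”, which suffix stripping cannot.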

-Syntactic Analysis:

Syntactic Analysis is used to check grammar, arrangements of words, and the interrelationship between the words.

Example: Mumbai goes to the Sara

The sentence “Mumbai goes to the Sara” does not make any sense, so it is rejected by the syntactic analyzer.

Syntactic parsing analyzes the words in a sentence for grammar. Dependency grammar and part-of-speech (POS) tags are the important attributes of text syntax.
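Here is a minimal sketch of how POS tags can be used to accept or reject a sentence. Everything in it is an assumption made for illustration: the hand-written `LEXICON`, the tag names, and the two tag patterns in `ALLOWED`. A real syntactic analyzer uses a trained tagger and a full grammar or dependency parser.

```python
# Hand-written part-of-speech lexicon; a real system would use a trained tagger.
LEXICON = {
    "mumbai": "PROPN", "sara": "PROPN", "goes": "VERB",
    "to": "ADP", "the": "DET", "city": "NOUN",
}

# Toy grammar: the only tag sequences this sketch accepts.
ALLOWED = {
    ("PROPN", "VERB", "ADP", "PROPN"),
    ("PROPN", "VERB", "ADP", "DET", "NOUN"),
}

def tag(sentence):
    """Map each word to its tag, or 'UNK' if the word is not in the lexicon."""
    return tuple(LEXICON.get(w, "UNK") for w in sentence.lower().split())

def is_grammatical(sentence):
    """Accept the sentence only if its tag sequence is one of the allowed patterns."""
    return tag(sentence) in ALLOWED

for s in ["Sara goes to Mumbai", "Sara goes to the city", "Mumbai goes to the Sara"]:
    print(s, "->", "accepted" if is_grammatical(s) else "rejected")
```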

-Semantic Analysis:

It retrieves the possible meanings of a sentence that is clear and semantically correct. It is the process of retrieving meaningful insights from text.

-Discourse Integration:

It is a sense of context: the meaning of a sentence or word depends on the sentences or words that come before it, as with the use of proper nouns and pronouns.

For example: Ram wants it.

In the above statement, the word “it” does not make sense on its own, because it refers to something we don’t know. “It” depends on a previous sentence, which is not given. Once we know what “it” refers to, we can easily resolve the reference.
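A very naive sketch of this idea: replace “it” with the last known noun from the previous sentence. The noun list (`KNOWN_NOUNS`) and the “previous sentence” are invented, and the last-noun heuristic is only an assumption for illustration; real coreference resolution is much harder.

```python
import re

KNOWN_NOUNS = {"bicycle", "book", "phone"}  # invented mini-vocabulary

def resolve_it(previous_sentence, sentence):
    """Replace 'it' with the last known noun mentioned in the previous sentence."""
    mentioned = re.findall(r"[a-z]+", previous_sentence.lower())
    candidates = [w for w in mentioned if w in KNOWN_NOUNS]
    if not candidates:
        return sentence  # nothing to refer back to, so 'it' stays unresolved
    return re.sub(r"\bit\b", f"the {candidates[-1]}", sentence)

print(resolve_it("", "Ram wants it."))
print(resolve_it("Ram saw a red bicycle.", "Ram wants it."))
```

Without the previous sentence the pronoun stays unresolved; once the context is available, the reference can be filled in.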

-Pragmatic Analysis:

It is the study of the intended meaning of language in context, and the process of extracting such insights from the text. It covers things like the repetition of words and who said what to whom.

It looks at how people communicate with each other, in which context they are talking, and many other aspects.

Okay! At this point, we have covered all the basic concepts of NLP.

Now we will work through these points in practice, so let’s move on!

Implementation of NLP using Python

I am going to show you how to perform NLP using Python. Python is simple and easy to read and understand.

First, we will import all necessary libraries as shown below:

# Importing the libraries
import re

import nltk  # needed later for nltk.download('stopwords')
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In the above code, we have imported pandas to work with data frames/datasets, re for regular expressions, and nltk, the Natural Language Toolkit, from which we use stopwords (a list of very common words such as “the” and “is”) and PorterStemmer to reduce words to their root form.

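df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv', header=0, index_col=0)
df.head()
# Null entries per column
df.isna().sum()
# Assumption: the article never shows how df_temp (used below) is built. One plausible
# version, assuming the raw columns are named 'Review Text' and 'Recommended IND':
df_temp = df.rename(columns={'Review Text': 'Review', 'Recommended IND': 'Recommended'})
df_temp = df_temp.dropna(subset=['Review']).reset_index(drop=True)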

Here we have read the file “Womens Clothing E-Commerce Reviews.csv”, which is in CSV (comma-separated values) format, and checked it for null values.

You can find this dataset on this link:

import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of how many reviews gave each rating
sns.countplot(x='Rating', data=df_temp)
plt.title("Distribution of Rating")
plt.show()

Here we have visualized the data using matplotlib and seaborn, two widely used visualization libraries in Python. I have plotted only one graph; you can plot more to explore your data!

import nltk

# Download the stopword lists once, then load the English one
nltk.download('stopwords')
stops = stopwords.words("english")

From the nltk library, we download the English stopword list, which is used for text cleaning.

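# Keep only the review text and the recommendation flag
# (the reviews are assumed to contain no missing values)
review = df_temp[['Review', 'Recommended']].copy()

def tokens(words):
    """Keep letters only, lower-case the text and collapse whitespace."""
    words = re.sub("[^a-zA-Z]", " ", words)
    text = words.lower().split()
    return " ".join(text)

review['Review_clear'] = review['Review'].apply(tokens)
review.head()

# Build a stemmed, stopword-free corpus with one cleaned string per review
ps = PorterStemmer()
stop_set = set(stops)  # set lookup is faster than scanning the list
corpus = []
for text in df_temp["Review"]:
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    stemmed = [ps.stem(word) for word in words if word not in stop_set]
    corpus.append(" ".join(stemmed))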

The code above performs the data cleaning: it removes non-letter characters, lower-cases the text, removes stopwords, and applies stemming to get clean data.
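To see what this kind of cleaning does without loading the dataset, here is a small self-contained sketch on two made-up reviews. It uses a short hard-coded stopword set (`STOPS`) as a stand-in for NLTK's list so that it runs without any downloads, and it leaves out stemming; both are simplifications for illustration.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption for this sketch)
STOPS = {"i", "this", "the", "is", "it", "and", "a", "so", "but", "was", "too"}

def clean(text):
    """Keep letters only, lower-case, and drop stopwords."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return " ".join(w for w in words if w not in STOPS)

samples = [
    "I LOVE this dress!!! It fits perfectly :)",
    "The fabric was too thin, and it shrank.",
]
for s in samples:
    print(clean(s))
```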

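# Assumption: `positive` is not defined in the article; here it is taken to be
# the recommended reviews (Recommended == 1).
positive = review[review['Recommended'] == 1]

# Join all cleaned positive reviews into one long string
positive_words = ' '.join(positive.Review_clear)
positive_words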

Using the above code, we combine the text of all positive reviews from the dataset into a single string, so we can see which words they contain.

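# Assumption: `Negative` is not defined in the article; here it is taken to be
# the non-recommended reviews (Recommended == 0).
Negative = review[review['Recommended'] == 0]

# Join all cleaned negative reviews into one long string
negative_words = ' '.join(Negative.Review_clear)
negative_words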

In the same way, the above code combines the text of all negative reviews from the dataset into a single string.

# Library for WordCloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Upper bound on the number of words drawn: the word count of the positive text
wordcloud = WordCloud(background_color="white", max_words=len(positive_words.split()))
wordcloud.generate(positive_words)
plt.figure(figsize=(13, 13))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Using the above code, we can show a word cloud of the most common words in the positive reviews from the Reviews column of the dataset.
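A word cloud is essentially a picture of word frequencies. As a rough sketch of what sits underneath it, the snippet below counts words with `collections.Counter`; the `cleaned` list is made-up sample data standing in for the cleaned review text.

```python
from collections import Counter

# Made-up cleaned reviews standing in for the cleaned review column
cleaned = ["love this dress love the color", "great dress great fit", "love the fit"]

# Count how often each word occurs across all reviews
counts = Counter(" ".join(cleaned).split())

print(counts.most_common(1))  # the single most frequent word and its count
print(counts["dress"])        # frequency of one particular word
```

The more often a word occurs, the larger it is drawn in the cloud.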

So, finally, we have covered the concepts of NLP with both theory and implementation in Python!

Advantages of NLP

  • Removes unnecessary information.
  • NLP helps computers interact with humans in their own languages.

Disadvantages of NLP

  • NLP may not show full context.
  • NLP is unpredictable sometimes.

Everyday NLP examples

There are many common day-to-day life applications of NLP. Apart from virtual assistants like Alexa or Siri, here are a few more examples you can see.

  • Email filtering. Spam messages with malicious content are automatically filtered by Gmail and moved to the spam folder.
  • Autocorrection. In mobile chat applications or Google Search, a misspelled word or sentence is often corrected automatically. This is done with NLP techniques.
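As a small illustration of the autocorrection idea, the sketch below suggests the closest word from a vocabulary using Python's standard `difflib` module. The vocabulary and the similarity cutoff are assumptions chosen for this example; real autocorrect systems also use context and word frequencies.

```python
import difflib

VOCABULARY = ["language", "natural", "processing", "python", "review"]  # invented mini-dictionary

def autocorrect(word, cutoff=0.7):
    """Return the closest vocabulary word, or the input if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for typo in ["langauge", "pyhton", "xyz"]:
    print(typo, "->", autocorrect(typo))
```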

Conclusion

I hope you like my article. If you have any queries please comment below. Thank You!

The media shown in this article on Natural Language Processing are not owned by Analytics Vidhya and are used at the Author’s discretion.
