Understanding Text Classification in NLP with a Movie Review Example
This article was published as a part of the Data Science Blogathon.
Introduction
Artificial intelligence has improved tremendously without needing changes to the underlying hardware infrastructure: users can run an AI program even on an old computer system, and the benefits of machine learning are enormous. Natural Language Processing (NLP) is a branch of AI that gives machines the ability to read, understand, and derive meaning from human language. NLP has been very successful in healthcare, media, finance, and human resources.
The most common forms of unstructured data are text and speech. They are plentiful, but extracting useful information from them is hard, and doing so by hand would take a long time. Written text and speech contain rich information, because we, as intelligent beings, use writing and speaking as our primary forms of communication. NLP can analyze these data for us and perform tasks like sentiment analysis, cognitive assistance, spam filtering, fake news identification, and real-time language translation.
This article covers how NLP understands text and parts of speech, focusing on word and sequence analysis. That includes text classification, vector semantics and word embedding, probabilistic language models, sequence labeling, and speech recognition. We will then look at sentiment analysis of fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the IMDB site is positive or negative.
Topic List
- What is NLP?
- What is NLP used for?
- Words and Sequences
- Text classification
- Vector Semantics and Word Embedding
- Probabilistic Language Models
- Sequence labeling
- Parsers
- Semantics
- Performing sentiment analysis on IMDB movie review data (project)
NLP is widely used in cars, smartphones, speakers, computers, websites, and more. Google Translate uses machine translation, which is an NLP system: it takes written and spoken natural language and converts it into the language the user wants. NLP helps Google Translate understand words in context, remove extra noise, and build neural networks (CNNs) that understand native speech.
NLP is also popular in chatbots. Chatbots are very useful because they reduce the human work of asking what a customer needs. An NLP chatbot can ask sequential questions, such as what the user's problem is and where to find the solution. Apple and Amazon have robust chatbots in their systems. When the user asks a question, the chatbot converts it into understandable phrases in the internal system.
These phrases are called tokens. The tokens then go through the NLP system to work out what the user is asking. NLP is also used in information retrieval (IR). An IR system deals with storing, searching, and evaluating information from large text repositories, and retrieves only the relevant information. For example, it is used in Google voice detection to trim unnecessary words.
Application of NLP
- Machine translation, e.g., Google Translate
- Information retrieval
- Question answering, e.g., chatbots
- Summarization
- Sentiment Analysis
- Social Media Analysis
- Mining large data
Words and Sequences
An NLP system needs to understand text, signs, and semantics properly. Many methods help it do so: text classification, vector semantics, word embedding, probabilistic language models, sequence labeling, and speech recognition.
Text classification
Text classification is the process of categorizing text into groups. Using NLP, text classification can automatically analyze text and assign a set of predefined tags or categories based on its context. It is used for sentiment analysis, topic detection, and language detection. There are mainly three text classification approaches:
- Rule-based system
- Machine learning-based system
- Hybrid system
In the rule-based approach, texts are separated into organized groups using a set of handcrafted linguistic rules. These rules have users define lists of words that characterize each group. For example, words like Donald Trump and Boris Johnson would be categorized into politics, while people like LeBron James and Ronaldo would be categorized into sports.
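As a minimal sketch of such a rule-based classifier (the keyword lists mirror the example above and are assumptions for illustration only):

# A toy rule-based classifier: handcrafted keyword lists per category (assumed for illustration)
RULES = {
    'politics': ['donald trump', 'boris johnson'],
    'sports': ['lebron james', 'ronaldo'],
}

def classify(text):
    text = text.lower()
    for category, keywords in RULES.items():
        # Assign the first category whose keyword appears in the text
        if any(keyword in text for keyword in keywords):
            return category
    return 'unknown'

print(classify('Boris Johnson gave a speech today'))  # politics
print(classify('LeBron James scored 40 points'))      # sports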
A machine learning-based classifier learns to classify based on past observations from data sets. User data is pre-labeled as train and test data. The classifier learns a classification strategy from previous inputs and keeps learning continuously. Machine learning-based classifiers commonly use a bag of words for feature extraction.
In a bag of words, a vector represents the frequency of words from a predefined dictionary (a word list). We can perform this classification using machine learning algorithms such as Naïve Bayes, SVM, and deep learning.
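A minimal sketch of a machine learning classifier built on a bag of words, using scikit-learn's CountVectorizer and Naïve Bayes (the tiny labeled corpus is an assumption for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny pre-labeled corpus, assumed for illustration
texts = ['I loved this movie', 'great acting and story',
         'terrible plot', 'I hated every minute']
labels = ['positive', 'positive', 'negative', 'negative']

# Bag of words: each text becomes a vector of word frequencies
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(['what a great movie'])))  # expected: ['positive']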
The third approach to text classification is the hybrid approach, which combines the rule-based and machine learning-based approaches. It uses the rule-based system to create tags and uses machine learning to train the system and create rules. The machine-generated rule list is then compared with the handcrafted rule list; whenever tags do not match, humans improve the list manually. This is often the best way to implement text classification.
Vector Semantics
Vector semantics is another way to perform word and sequence analysis. It interprets word meaning to explain features such as similar words and opposite words. The main idea behind vector semantics is that two words are alike if they are used in similar contexts. Vector semantics places words in a multi-dimensional vector space and is useful in sentiment analysis.
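To make this concrete, here is a tiny sketch of comparing word vectors with cosine similarity; the three-dimensional vectors are made-up values for illustration only:

import numpy as np

# Toy word vectors, assumed for illustration
vectors = {
    'good':  np.array([0.9, 0.1, 0.2]),
    'great': np.array([0.8, 0.2, 0.1]),
    'bad':   np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    # Similar directions in vector space mean similar usage contexts
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vectors['good'], vectors['great']))  # high (~0.99): similar words
print(cosine_similarity(vectors['good'], vectors['bad']))    # low (~0.30): dissimilar words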
Word Embedding
Word embedding is another method of word and sequence analysis. An embedding translates sparse vectors into a low-dimensional space that preserves semantic relationships; it is a type of word representation that allows words with similar meanings to have similar representations. Two popular word embedding methods are:
- Word2Vec
- Doc2Vec
Word2Vec is a statistical method for efficiently learning standalone word embeddings from a text corpus.
Doc2Vec is similar to Word2Vec, but it analyzes groups of text, such as whole documents or pages.
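A minimal sketch of training word embeddings with the gensim library; the tiny corpus and parameters are assumptions for illustration, and the parameter names assume the gensim 4.x API:

from gensim.models import Word2Vec

# Tiny tokenized corpus, assumed for illustration
sentences = [
    ['the', 'movie', 'was', 'great'],
    ['the', 'film', 'was', 'great'],
    ['the', 'movie', 'was', 'terrible'],
]

# vector_size sets the embedding dimension; window is the context size
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv['movie'].shape)         # (50,): the learned embedding for "movie"
print(model.wv.most_similar('movie'))  # words that appear in similar contexts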
Probabilistic Language Model
Another approach to word and sequence analysis is the probabilistic language model. Its goal is to compute the probability of a sentence, i.e., of a sequence of words. For example, the probability of the word "a" occurring after the word "to" might be 0.00013131 percent.
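As a minimal sketch of the idea, the following estimates bigram probabilities, the probability of a word given the previous word, from raw counts on a toy corpus (the corpus and resulting numbers are assumptions for illustration):

from collections import Counter

# Toy corpus, assumed for illustration
corpus = 'i want to eat i want to sleep i like to eat'.split()

# Count single words and adjacent word pairs (bigrams)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    # P(word | prev_word) = count(prev_word, word) / count(prev_word)
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob('to', 'eat'))  # 2/3: "eat" follows "to" in two of the three cases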
Sequence Labeling
Sequence labeling is a typical NLP task that assigns a class or label to each token in a given input sequence. If someone says "play the movie by tom hanks", sequence labeling splits this into [play, movie, tom hanks]: play denotes an action, movie is the object of that action, and tom hanks is a search entity. The system divides the input into tokens and can use an LSTM to analyze them. There are two forms of sequence labeling: token labeling and span labeling.
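The LSTM labeler mentioned above is beyond the scope of a short snippet; as a simpler illustration of token labeling, here is a sketch using NLTK's part-of-speech tagger (the downloaded resource names assume a recent NLTK release, and the tags shown are indicative only):

import nltk

nltk.download('punkt')                        # tokenizer data
nltk.download('averaged_perceptron_tagger')   # POS tagger data

tokens = nltk.word_tokenize('play the movie by tom hanks')
print(nltk.pos_tag(tokens))
# e.g. [('play', 'VB'), ('the', 'DT'), ('movie', 'NN'), ('by', 'IN'), ('tom', 'NN'), ('hanks', 'NNS')]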
Parsing
Parsing is a phase of NLP in which the parser determines the syntactic structure of a text by analyzing its constituent words based on an underlying grammar. It divides a group of words into component parts and identifies the grammatical role of each. For example, "Tom ate an apple" is parsed into the proper noun Tom, the verb ate, the determiner an, and the noun apple. A well-known application is Amazon Alexa.
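To make this concrete, here is a minimal sketch of syntactic parsing with NLTK and a toy context-free grammar; the grammar rules cover only the example sentence and are assumptions for illustration, not a full English grammar:

import nltk

# A toy context-free grammar covering only the example sentence (assumed for illustration)
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> NNP | Det N
VP -> V NP
NNP -> 'Tom'
Det -> 'an'
N -> 'apple'
V -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    print(tree)  # (S (NP (NNP Tom)) (VP (V ate) (NP (Det an) (N apple))))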
Semantic
Semantic analysis determines the meaning of words and sentences in context; sentiment analysis is one such semantic task. So far, we have discussed how text is classified and how words and sequences are divided so that an algorithm can understand and categorize them. In this project, we will perform sentiment analysis on fifty thousand IMDB movie reviews. Our goal is to identify whether a review posted on the IMDB site is positive or negative.
This project covers text mining techniques like text embedding, bags of words, and word context. We will also introduce a bidirectional LSTM sentiment classifier and look at how to automatically import a labeled dataset from TensorFlow. The project also covers steps like data cleaning, text processing, balancing data through sampling, and training and testing a deep learning model to classify text.
Example Application
Here is the code sample:
Importing necessary libraries
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved
# as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Importing required libraries
import matplotlib.pyplot as plt
import nltk
import seaborn as sns
sns.set()
import scipy
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Downloading the necessary files
# this cell takes time, please run once
# Split the training set into 60% and 40%, so we'll end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
original_train_data, original_validation_data, original_test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)
Getting word index from Keras datasets
# Tokenizing with TensorFlow: get the word-to-index mapping for the IMDB dataset
word_index = tf.keras.datasets.imdb.get_word_index(
    path='imdb_word_index.json'
)
# Show the 19 most frequent words and their indices
{k: v for (k, v) in word_index.items() if v < 20}
{'with': 16, 'i': 10, 'as': 14, 'it': 9, 'is': 6, 'in': 8, 'but': 18, 'of': 4, 'this': 11, 'a': 3, 'for': 15, 'br': 7, 'the': 1, 'was': 13, 'and': 2, 'to': 5, 'film': 19, 'movie': 17, 'that': 12}
Positive and Negative Review Comparison
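The original chart is not reproduced here. As a stand-in, here is a minimal sketch of counting and plotting the positive and negative labels in the training split; the variable names follow the loading cell above, and the plotting details are assumptions:

import numpy as np
import matplotlib.pyplot as plt

# Count positive (1) and negative (0) labels in the training split
labels = np.array([int(label.numpy()) for _, label in original_train_data])
positive = int(labels.sum())
negative = int(len(labels) - positive)

plt.bar(['negative', 'positive'], [negative, positive])
plt.title('Positive and Negative Review Comparison')
plt.show()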

Creating Train, Test Data
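The article's exact preprocessing code is not shown here; below is a minimal sketch of one way to build train and test arrays from the datasets loaded above. The tokenizer choice, vocabulary size, and sequence length are assumptions:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Materialize (review, label) pairs from the TensorFlow datasets loaded above
def to_arrays(dataset):
    texts, labels = [], []
    for text, label in dataset:
        texts.append(text.numpy().decode('utf-8'))
        labels.append(int(label.numpy()))
    return texts, np.array(labels)

train_examples, train_labels = to_arrays(original_train_data)
test_examples, test_labels = to_arrays(original_test_data)

VOCAB_SIZE = 10000  # assumed vocabulary size
MAX_LEN = 256       # assumed maximum review length

# Turn each review into a padded sequence of word indices
tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token='<OOV>')
tokenizer.fit_on_texts(train_examples)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_examples),
                        maxlen=MAX_LEN, padding='post')
X_test = pad_sequences(tokenizer.texts_to_sequences(test_examples),
                       maxlen=MAX_LEN, padding='post')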

Model and Model Summary
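The article describes a bidirectional LSTM sentiment classifier; here is a minimal sketch of such a model, reusing the constants from the previous sketch (the layer sizes are assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

model = Sequential([
    Embedding(VOCAB_SIZE, 64, input_length=MAX_LEN),  # word embedding layer
    Bidirectional(LSTM(64)),                          # reads the sequence in both directions
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid'),                   # probability that the review is positive
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()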

Splitting data and fitting the model
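A minimal sketch of holding out a validation set and fitting the model (the split ratio, epoch count, and batch size are assumptions):

from sklearn.model_selection import train_test_split

# Hold out 40% of the training data for validation, echoing the 60/40 split above
X_tr, X_val, y_tr, y_val = train_test_split(X_train, train_labels,
                                            test_size=0.4, random_state=42)

history = model.fit(X_tr, y_tr,
                    epochs=10,
                    batch_size=512,
                    validation_data=(X_val, y_val))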

Model effect Overview
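A minimal sketch of reviewing the model's effect by plotting the training history (the metric key names assume the compile step above):

import matplotlib.pyplot as plt

# Training vs. validation accuracy over epochs
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()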

Confusion Matrix and Classification Report
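A minimal sketch of producing the confusion matrix and classification report on the test set (the 0.5 decision threshold is an assumption):

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns
import matplotlib.pyplot as plt

# Threshold the sigmoid outputs at 0.5 to get class predictions
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()

print(classification_report(test_labels, y_pred,
                            target_names=['negative', 'positive']))

sns.heatmap(confusion_matrix(test_labels, y_pred), annot=True, fmt='d')
plt.xlabel('predicted')
plt.ylabel('actual')
plt.show()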

Note: The data source and data for this model are publicly available and can be accessed using TensorFlow.
For the complete code and details, please follow this GitHub Repository.
In conclusion, NLP is a field full of opportunities. NLP has a tremendous effect on how we analyze text and speech, and it is getting better every day. Knowledge extraction from large data sets was impossible five years ago; the rise of NLP techniques has made it possible and easy. There are still many opportunities to discover in NLP.