This article was published as a part of the Data Science Blogathon.

Sentiment analysis, also known as “opinion mining,” has been attracting a lot of attention recently, and researchers worldwide keep developing newer techniques and architectures that make it easier to detect the underlying tone or sentiment of a text. With the exponential growth of social websites, blogging sites, and electronic media, the amount of opinionated content in the form of movie or product reviews, user messages or testimonials, messages in discussion forums, etc., has also increased. Timely discovery of these opinionated messages can bring huge advantages, the most crucial being monetization. A better understanding of the sentiments of the masses towards products or services allows for better analysis of market trends, contextual advertisements, and ad recommender systems. While many advanced machine learning techniques and pipelines exist today for sentiment analysis, it is always good to be aware of the basics. So the goal of this article is to provide a detailed mathematical idea of the intricacies of one of the oldest workhorses of machine learning: the Naive Bayes’ (NB) Classifier.

The NB Classifier stems from the fundamental concept of Bayesian statistics – the conditional probability and Bayes’ Theorem. *The article expects the readers to be foundationally strong with the basic definitions of probability. So it is advised to recall those definitions and formulas before proceeding further.*

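Conditional probability, the probability that event A occurs given that event B has occurred, is denoted by P(A|B). It is given by the joint probability P(A ∩ B) divided by the marginal probability P(B), provided P(B) > 0:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{|A \cap B| / |\Omega|}{|B| / |\Omega|} = \frac{|A \cap B|}{|B|}$$

where Ω denotes the sample space, and the last two forms assume that all outcomes in Ω are equally likely.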

Figure 2: Showing the intuition behind conditional probability

To think intuitively, as in Figure 2, what would be the probability of being in ellipse A, given that you are already in ellipse B? To also be in A, you must be in the intersection A ∩ B. Hence, the probability is equivalent to the number of elements in the intersection, |A ∩ B|, divided by the number of elements in B, |B|. Thus we have the formula for conditional probability as already stated.
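The counting argument can be checked on a small example. The sketch below assumes a hypothetical sample space (one roll of a fair die) and two events chosen only for illustration; none of these names come from the article.

```python
from fractions import Fraction

# Hypothetical sample space: the outcomes of one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}  # event A: the roll is even
B = {4, 5, 6}  # event B: the roll is greater than 3


def prob(event, space=omega):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))


def conditional(a, b):
    """P(a | b) = P(a ∩ b) / P(b), defined only when P(b) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)


p_a_given_b = conditional(A, B)
print(p_a_given_b)  # 2/3
```

Two of the three outcomes in B are also in A, so the ratio of probabilities agrees with the ratio of element counts.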

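Now, assuming A and B are non-empty sets, we can write for both A and B:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

We can rearrange the above two equations so that each isolates the joint probability, which gives

$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A)$$

Rearranging the above equation, we get the final form of Bayes’ theorem as

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$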

If the sample space can be divided into n mutually exclusive events $A_1, \dots, A_n$, and if B is an event with P(B) > 0 that is a subset of the union of all $A_i$, then for each $A_i$ the generalized Bayes’ formula is

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{n} P(B \mid A_j)\,P(A_j)}$$
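As a rough numerical illustration of the generalized formula, the sketch below computes the posterior over three mutually exclusive events. The priors and likelihoods are made-up numbers, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Return P(A_i | B) for every i, given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(B | A_i) * P(A_i)
    evidence = sum(joint)  # P(B), by the law of total probability
    if evidence == 0:
        raise ValueError("P(B) must be positive")
    return [j / evidence for j in joint]


# Made-up values for illustration only.
priors = [0.5, 0.3, 0.2]       # P(A_1), P(A_2), P(A_3), summing to 1
likelihoods = [0.1, 0.4, 0.5]  # P(B | A_1), P(B | A_2), P(B | A_3)

post = posterior(priors, likelihoods)
for i, p in enumerate(post, start=1):
    print(f"P(A_{i} | B) = {p:.3f}")
```

The denominator is the same for every $A_i$, so the posteriors always sum to one.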

The Naive Bayes’ (NB) Classifier is a well-known supervised classification model from the family of Bayesian network classifiers. It is a probabilistic classifier based on Bayes’ Theorem together with the naive independence assumption. It was introduced earlier, under a different name, in the NLP community, and it has remained a popular baseline model for text categorization, the problem of judging documents as belonging to one category or another. One advantage of NB is that it requires relatively little training data to estimate the parameters necessary for classification. With Bayes’ theorem and the conditional probability model as its driving formula, the NB model, despite its simplicity and strong assumptions, proves quite effective in many use cases, especially in the field of NLP, as we will see in the next section. Before that, it is worthwhile to look into the assumptions that NB makes for classification and how they affect the output from the perspective of NLP.

The NB classifier mainly relies on two assumptions that help reduce the model complexity.

1. It makes the independence assumption, whereby it assumes that the words in a corpus/text are independent of each other, and that there is no correlation between any two words or groups of words.

For example, consider the sentence “The Sahara desert is mostly sunny and hot.” Here, the words “sunny” and “hot” tend to depend on each other and are correlated to a certain extent with the word “desert.” Naive Bayes nevertheless assumes independence throughout. This naive assumption is not always accurate and is one of the causes of low performance of the NB model in some cases.

2. It is heavily dependent on the relative frequencies of the words in the corpus taken into consideration. This sometimes leads to model inaccuracies, as real-world datasets are usually noisy.
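In symbols, the independence assumption can be written roughly as follows. The notation $w_1, \dots, w_m$ for the words of a text is introduced here for illustration and is not taken from the original figures.

$$P(w_1, w_2, \dots, w_m \mid \text{class}) = \prod_{i=1}^{m} P(w_i \mid \text{class})$$

Combined with Bayes’ theorem, this gives

$$P(\text{class} \mid w_1, \dots, w_m) \propto P(\text{class}) \prod_{i=1}^{m} P(w_i \mid \text{class})$$

so the model only needs one probability per word and class, each estimated from relative frequencies, instead of a probability for every possible word sequence.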

Now that we are familiar with the basic concept of NB and its assumptions, we are ready to see an application of NB in sentiment analysis.

In this section of the article, we look into how the NB Classifier can be leveraged to carry out sentiment analysis. Figure 3 shows the steps to build the desired model. Each of the mentioned steps is explained in detail in the paragraphs that follow.

As the dataset, let us consider a corpus of four tweets, two labeled positive and two labeled negative. The numbers are kept small only to make the concepts easier to follow; real-world scenarios involve much more diverse situations. In the corpus, as shown in Figure 4, the sentences “*I am happy because I am playing chess*” and “*I am happy, not sad*” are labeled as positive sentiments, and the sentences “*I am sad, I am not playing chess*” and “*I am sad, not happy*” are labeled as negative sentiments.

Figure 4: Schematic representation of the corpus taken into consideration for this example walkthrough

This step is one of the most critical processes to carry out before getting started with the model. The methods used may vary with the use case, but in general you remove the punctuation marks and stop words, remove the handles and URLs in the text (if any), lowercase all the letters, and perform stemming on the entire text. The overall effect of these methods is that they greatly reduce the vocabulary and output a list of words, which is then used in the further steps.
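A minimal sketch of such a preprocessing function, using only the Python standard library, might look as follows. The stop-word list is a deliberately tiny, hypothetical one, and stemming is left out to keep the sketch dependency-free. A stemmer such as NLTK’s `PorterStemmer` could be applied to the returned tokens.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "and", "to", "of"}


def preprocess(tweet, stop_words=STOP_WORDS):
    """Lowercase a tweet, strip URLs, handles and punctuation, and drop stop words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs before stripping punctuation
    text = re.sub(r"@\w+", " ", text)          # remove handles such as @user
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [token for token in text.split() if token not in stop_words]


tokens = preprocess("@user I am happy, not sad! https://example.com")
print(tokens)  # ['i', 'am', 'happy', 'not', 'sad']
```

Note that the toy walkthrough in this article keeps words such as “I” and “am” in its counts, so the stop-word step is shown here only to illustrate the general recipe.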

In this step, we first get a total count of all the words in the positive and negative corpus and prepare a table as shown in Figure 5.
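The counting step can be sketched with `collections.Counter`. The counts below are derived directly from the four example tweets, with case and stop words kept as in the walkthrough. The helper names are hypothetical.

```python
from collections import Counter

positive_tweets = ["I am happy because I am playing chess", "I am happy, not sad"]
negative_tweets = ["I am sad, I am not playing chess", "I am sad, not happy"]


def tokenize(sentence):
    """Split on whitespace after dropping commas (enough for this toy corpus)."""
    return sentence.replace(",", "").split()


def count_words(sentences):
    """Count how often each word occurs across all sentences of one class."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


pos_counts = count_words(positive_tweets)
neg_counts = count_words(negative_tweets)
vocabulary = sorted(set(pos_counts) | set(neg_counts))

print(f"{'word':8s} {'pos':>3s} {'neg':>3s}")
for word in vocabulary:
    # A Counter returns 0 for words it has never seen.
    print(f"{word:8s} {pos_counts[word]:3d} {neg_counts[word]:3d}")
```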

This is the step where you finally use the formulas derived at the start of the article. Here we estimate the conditional probability of each word given a class, which Bayes’ Theorem then uses to classify a text.

From the word count table, we build the probability table, as shown in Figure 6. We divide the count of each word in a class by the total number of words in that class to get the probability of occurrence of that particular word. For example, the word count of “*I*” in positive tweets is 3 and the total number of words in the positive tweets is 13. So the corresponding value of “*I*” in the probability table would be 3/13 ≈ 0.23. In a similar fashion, the entries in the other cells are filled in to complete the table. To put it formally, we calculate the probability of a word given a class by the formula below.

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class})}{N_{\text{class}}}$$

Here $\text{freq}(w_i, \text{class})$ is the number of times the word $w_i$ occurs in the class and $N_{\text{class}}$ is the total number of words in the class.
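The table-filling step can be sketched as below. The count dictionaries are written out by hand from the four example tweets, and `word_probabilities` is a hypothetical helper name.

```python
# Word counts per class, derived from the four example tweets
# (case and stop words are kept, as in the walkthrough).
pos_counts = {"I": 3, "am": 3, "happy": 2, "because": 1,
              "playing": 1, "chess": 1, "not": 1, "sad": 1}
neg_counts = {"I": 3, "am": 3, "happy": 1, "because": 0,
              "playing": 1, "chess": 1, "not": 2, "sad": 2}


def word_probabilities(counts):
    """Divide each word count by the total number of words in the class."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


p_pos = word_probabilities(pos_counts)  # e.g. p_pos["I"] is 3/13
p_neg = word_probabilities(neg_counts)

for word in pos_counts:
    print(f"{word:8s} P(w|pos)={p_pos[word]:.2f}  P(w|neg)={p_neg[word]:.2f}")
```

Each column of the resulting table sums to one, which is a quick sanity check on the counts.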

An important concept to note here is that words like “*I*”, “*am*”, “*playing*”, and “*chess*” have the same probability in both classes, or in other words they are equiprobable. They do not add anything to the sentiment, as they cancel out while calculating the likelihood value, something we cover in the following steps. On the other hand, words like “*happy*”, “*sad*”, and “*not*” show significant differences between the two probabilities. These are also known as power words, and they carry a lot of weight in determining sentiment.

However, one flaw with the above method of assigning the conditional probabilities is that if a word does not appear in the training data of a class, then its probability of occurrence in that class is automatically equal to zero. To tackle this problem, the concept of **Laplacian Smoothing** is introduced, whereby the formula for the conditional probability is tweaked slightly to avoid getting a zero probability. The modified formula with Laplacian Smoothing thus becomes

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$$

where $\text{freq}(w_i, \text{class})$ is the count of the word $w_i$ in the class, $N_{\text{class}}$ is the total number of words in the class, and V denotes the number of unique words in the vocabulary. For the case considered, V is equal to 8.

The new probability table after considering the Laplacian smoothing thus becomes as shown in Figure 7. Note that the word “*because*” which had zero probability earlier in the negative class now has a non-zero probability.
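A small sketch of the add-one smoothing arithmetic for this toy corpus follows. The function name is hypothetical, and the totals (13 words per class, 8 unique words) are the values used in the walkthrough.

```python
def laplace_probability(word_count, class_total, vocab_size):
    """Add-one smoothed estimate of P(word | class): (count + 1) / (N_class + V)."""
    return (word_count + 1) / (class_total + vocab_size)


N_CLASS = 13  # total number of words in each class of the toy corpus
V = 8         # number of unique words in the vocabulary

p_because_neg = laplace_probability(0, N_CLASS, V)  # 1/21, no longer zero
p_i_pos = laplace_probability(3, N_CLASS, V)        # 4/21

print(f"P(because | neg) = {p_because_neg:.3f}")
print(f"P(I | pos)       = {p_i_pos:.3f}")
```

Because every one of the V words gains one extra count, the smoothed probabilities within a class still sum to one.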

Once the probabilities of each word are calculated, the last step that remains is to calculate a score that will allow us to decide whether a tweet is a positive or negative sentiment. An inference score greater than zero would denote a positive sentiment and a score less than zero would denote negative sentiment. A score equal to zero would denote neutral sentiment.

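So, to infer the sentiment of a tweet consisting of the words $w_1, \dots, w_m$, we calculate the following score:

$$\log\frac{P(\text{pos})}{P(\text{neg})} + \sum_{i=1}^{m} \log\frac{P(w_i \mid \text{pos})}{P(w_i \mid \text{neg})}$$

where the first term is referred to as the **log prior** and the second term as the **log-likelihood**.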

One thing to notice is that for a balanced dataset, where the numbers of positive and negative examples are equal, the log prior is zero. In that case, calculating the log-likelihood alone suffices for inference.
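Putting the steps together, the sketch below trains on the four example tweets and scores new text. It assumes labels 1 (positive) and 0 (negative), add-one smoothing, and that words never seen in training are simply skipped. These choices and the function names are illustrative, not taken from the article.

```python
import math
from collections import Counter

train = [
    ("I am happy because I am playing chess", 1),
    ("I am happy, not sad", 1),
    ("I am sad, I am not playing chess", 0),
    ("I am sad, not happy", 0),
]


def tokenize(text):
    """Split on whitespace after dropping commas (enough for this toy corpus)."""
    return text.replace(",", "").split()


def fit(examples):
    """Return the log prior and the per-word log-likelihood ratios."""
    counts = {1: Counter(), 0: Counter()}
    n_docs = Counter()
    for text, label in examples:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    vocab = set(counts[1]) | set(counts[0])
    v = len(vocab)
    n_pos = sum(counts[1].values())
    n_neg = sum(counts[0].values())
    log_prior = math.log(n_docs[1] / n_docs[0])
    log_likelihood = {
        w: math.log((counts[1][w] + 1) / (n_pos + v))
        - math.log((counts[0][w] + 1) / (n_neg + v))
        for w in vocab
    }
    return log_prior, log_likelihood


def score(text, log_prior, log_likelihood):
    """Log prior plus summed log-likelihood ratios; > 0 means positive."""
    # Words never seen in training are skipped (an assumption of this sketch).
    return log_prior + sum(log_likelihood.get(w, 0.0) for w in tokenize(text))


log_prior, log_likelihood = fit(train)
s = score("I am happy", log_prior, log_likelihood)
print("positive" if s > 0 else "negative" if s < 0 else "neutral")
```

Here the corpus is balanced, so the log prior is zero, and the equiprobable words “I” and “am” contribute nothing. Only “happy” moves the score.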

We now have successfully walked through the steps required to train a Naive Bayes’ Classifier model and are in a position to get started on implementing this.

To recapitulate, the key takeaways from the article are:

1. First we started with a brief discussion on Sentiment Analysis and the various algorithms that can be used to achieve the same.

2. Then we briefly discussed the Bayes’ Theorem, which is the principal equation governing the Naive Bayes’ Classifier.

3. The following section discussed extensively the Naive Bayes’ (NB) Classifier and the assumptions leading to the model.

4. The last section dealt with implementing an NB classifier and we learned the various steps to develop the model along with the practical implementation of the various calculations involved in building the NB Classifier.

I hope this article gave you a clear understanding of the NB classifier. I would recommend that you go ahead and implement the model on a real-world dataset to build a firmer understanding of it. Go ahead and try your hand at recommender systems with real-world datasets!

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**
