This article was published as a part of the Data Science Blogathon.

Consider the following scenarios. As a product manager, you want to sort customer feedback into two categories: favorable and unfavorable. As a loan manager, you want to know which loan applications are safe to lend to and which ones are risky. As a healthcare analyst, you want to forecast which patients are likely to develop diabetic complications. All of these are the same kind of problem: classifying items (reviews, loan applications, patients) into categories.

Naive Bayes is one of the simplest and fastest classification methods available, and it scales well to large amounts of data. It has proven effective in applications such as spam filtering, text classification, sentiment analysis, and recommender systems. It predicts the class of unseen instances using Bayes' theorem of probability.

In this article, we will go through Naive Bayes classification in Python with Sklearn. We will explain what the Naive Bayes algorithm is and then walk through an end-to-end example of implementing the Gaussian Naive Bayes classifier in Sklearn using a dataset.

Naive Bayes is a simple but effective probabilistic classification model in machine learning that is based on Bayes' theorem.

Bayes' theorem is a formula that gives the conditional probability of an event A occurring given that another event B has already occurred. Its mathematical formula is as follows:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Where

- A and B are two events
- P(A|B) is the probability of event A given that event B has already happened.
- P(B|A) is the probability of event B given that event A has already happened.
- P(A) is the marginal (prior) probability of A
- P(B) is the marginal probability of B
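As a quick numeric illustration of the formula, the sketch below applies Bayes' theorem to a made-up spam example. Event A is "the email is spam" and event B is "the email contains a given word". All three input probabilities are invented for illustration and do not come from any dataset.

```python
# Hypothetical numbers, chosen only to illustrate the formula.
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.6   # P(B|A): the word appears in a spam email
p_word_given_ham = 0.05   # P(B|not A): the word appears in a non-spam email

# P(B) via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_word, 4))             # 0.16
print(round(p_spam_given_word, 4))  # 0.75
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75 under these assumed numbers.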

Bayes' theorem can be used to build the following classification model:

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

Where

- X = x1, x2, x3, ..., xN is the list of independent predictors
- y is the class label
- P(y|X) is the probability of label y given the predictors X

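Assuming the predictors are independent of one another given the class, the above equation may be extended as follows:

$$P(y \mid x_1, \ldots, x_N) = \frac{P(y) \prod_{i=1}^{N} P(x_i \mid y)}{P(x_1, \ldots, x_N)}$$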

- The Naive Bayes method assumes that the predictors contribute equally and independently to determining the output class.
- Although the assumption that all predictors are independent of one another rarely holds in real-world data, the model still produces satisfactory results in the majority of cases.
- Naive Bayes is often used for text categorization, where the dimensionality of the data is frequently very large.
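To make the independence assumption concrete, here is a minimal from-scratch sketch that multiplies per-feature likelihoods by the class prior and then normalizes. The feature names, priors, and likelihoods are all hypothetical values picked by hand, and the helper `posterior` is our own illustration, not a library function.

```python
from math import prod

# Hypothetical, hand-picked probabilities for a two-class toy problem.
priors = {"spam": 0.4, "ham": 0.6}                       # P(y)
likelihoods = {                                           # P(x_i = 1 | y)
    "spam": {"contains_offer": 0.7, "contains_link": 0.8},
    "ham": {"contains_offer": 0.1, "contains_link": 0.3},
}

def posterior(observed):
    """Return P(y | X) for a dict of binary features, assuming independence."""
    scores = {}
    for label, prior in priors.items():
        # Naive assumption: P(X | y) is the product of per-feature terms.
        terms = [
            likelihoods[label][name] if value else 1 - likelihoods[label][name]
            for name, value in observed.items()
        ]
        scores[label] = prior * prod(terms)
    total = sum(scores.values())  # P(X), the normalizing constant
    return {label: score / total for label, score in scores.items()}

result = posterior({"contains_offer": 1, "contains_link": 1})
print(result)
```

The predicted class is simply the label with the largest posterior, so in practice the division by P(X) can be skipped when only the winning class is needed.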

Naive Bayes classifiers fall into three main categories:

**i) Gaussian Naive Bayes**

This classifier is used when the predictor values are continuous and are assumed to follow a Gaussian distribution.

**ii) Bernoulli Naive Bayes**

This classifier is used when the predictors are boolean and are assumed to follow a Bernoulli distribution.

**iii) Multinomial Naive Bayes**

This classifier uses a multinomial distribution and is often applied to document or text classification problems, where the features are counts such as word frequencies.
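The three variants map onto three classes in `sklearn.naive_bayes`. The sketch below fits each one on a tiny made-up dataset whose feature type matches the variant; the arrays are invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # toy class labels shared by all three examples

# Continuous features -> GaussianNB
X_cont = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.2, 3.9]])
gnb_pred = GaussianNB().fit(X_cont, y).predict([[1.1, 2.0]])

# Boolean (0/1) features -> BernoulliNB
X_bool = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
bnb_pred = BernoulliNB().fit(X_bool, y).predict([[1, 0, 0]])

# Count features (e.g. word counts) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])
mnb_pred = MultinomialNB().fit(X_counts, y).predict([[2, 0, 0]])

print(gnb_pred, bnb_pred, mnb_pred)
```

All three share the same `fit` and `predict` interface, so switching variants is mostly a matter of matching the class to the feature type.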

In this section, we will walk through an end-to-end demonstration of the Gaussian Naive Bayes classifier in Python Sklearn using a cancer dataset. For our example, we'll use Sklearn's Gaussian Naive Bayes class, GaussianNB.

We’ll begin by loading some basic libraries that will be used to import and view the dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Now, we'll load the cancer detection dataset from Kaggle that we will use for our Naive Bayes classification.

dataset = pd.read_csv("datasets/cancer.csv")
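If the CSV file is not at hand, scikit-learn bundles a breast cancer dataset with the same number of rows that can stand in for it. This is an alternative we are assuming here, not the file used in the article: its column names differ (for example "mean radius" rather than radius_mean), it has no id or Unnamed: 32 column, and its labels are encoded the other way round (0 = malignant, 1 = benign). The names `bunch`, `features`, and `target` are our own.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects (needs scikit-learn 0.23 or newer).
bunch = load_breast_cancer(as_frame=True)
features = bunch.data    # 569 rows x 30 numeric feature columns
target = bunch.target    # NOTE: here 0 = malignant, 1 = benign

print(features.shape)
print(list(bunch.target_names))
```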

Let's take a quick look at the dataset using the head() method.

**Input:**

dataset.head()

Following that, we'll inspect the columns in the dataset using the info() method.

**Input:**

dataset.info()

**Output:**

RangeIndex: 569 entries, 0 to 568
Data columns (total 33 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   id                       569 non-null    int64
 1   diagnosis                569 non-null    object
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
10   symmetry_mean            569 non-null    float64
11   fractal_dimension_mean   569 non-null    float64
12   radius_se                569 non-null    float64
13   texture_se               569 non-null    float64
14   perimeter_se             569 non-null    float64
15   area_se                  569 non-null    float64
16   smoothness_se            569 non-null    float64
17   compactness_se           569 non-null    float64
18   concavity_se             569 non-null    float64
19   concave points_se        569 non-null    float64
20   symmetry_se              569 non-null    float64
21   fractal_dimension_se     569 non-null    float64
22   radius_worst             569 non-null    float64
23   texture_worst            569 non-null    float64
24   perimeter_worst          569 non-null    float64
25   area_worst               569 non-null    float64
26   smoothness_worst         569 non-null    float64
27   compactness_worst        569 non-null    float64
28   concavity_worst          569 non-null    float64
29   concave points_worst     569 non-null    float64
30   symmetry_worst           569 non-null    float64
31   fractal_dimension_worst  569 non-null    float64
32   Unnamed: 32              0 non-null      float64
dtypes: float64(31), int64(1), object(1)
memory usage: 146.8+ KB

We can see from the information above that the id and Unnamed: 32 columns are not useful for prediction: id is only an identifier, and Unnamed: 32 contains no non-null values. We can therefore drop both.

**Input:**

dataset = dataset.drop(["id"], axis = 1)

**Input:**

dataset = dataset.drop(["Unnamed: 32"], axis = 1)
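The two drops can also be combined into one call. The sketch below shows this on a tiny stand-in dataframe (made-up rows, not the real file) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with the same kind of unwanted columns.
toy = pd.DataFrame({
    "id": [101, 102],
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
    "Unnamed: 32": [np.nan, np.nan],
})

# Drop both columns in one call; errors="ignore" skips names that are absent.
cleaned = toy.drop(columns=["id", "Unnamed: 32"], errors="ignore")
print(list(cleaned.columns))  # ['diagnosis', 'radius_mean']
```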

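We then split the rows into two dataframes by diagnosis, one for malignant and one for benign tumors.

__Malignant Tumor Dataframe__

**Input:**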

M = dataset[dataset.diagnosis == "M"]

__Benign Tumor Dataframe__

**Input:**

B = dataset[dataset.diagnosis == "B"]

We now compare malignant and benign tumors by plotting their mean radius against their mean texture.

**Input:**

plt.title("Malignant vs Benign Tumor")
plt.xlabel("Radius Mean")
plt.ylabel("Texture Mean")
plt.scatter(M.radius_mean, M.texture_mean, color = "red", label = "Malignant", alpha = 0.3)
plt.scatter(B.radius_mean, B.texture_mean, color = "lime", label = "Benign", alpha = 0.3)
plt.legend()
plt.show()

**Output:**

Now, malignant tumors will be assigned a value of ‘1’ and benign tumors will be assigned a value of ‘0’.

**Input:**

dataset.diagnosis = [1 if i == "M" else 0 for i in dataset.diagnosis]
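An equivalent way to encode the labels is an explicit dictionary passed to pandas' `Series.map`. The sketch below uses a small made-up series rather than the real column.

```python
import pandas as pd

labels = pd.Series(["M", "B", "B", "M"], name="diagnosis")  # toy labels

# Explicit mapping; any value outside the mapping becomes NaN,
# which makes unexpected labels easy to spot.
encoded = labels.map({"M": 1, "B": 0})
print(encoded.tolist())  # [1, 0, 0, 1]
```

Unlike the list comprehension, which silently turns every non-"M" value into 0, the mapping leaves unknown labels as missing values.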

We now split our dataframe into x and y components. The x variable holds all the independent predictor variables, whereas the y variable holds the diagnosis label to be predicted.

**Input:**

x = dataset.drop(["diagnosis"], axis = 1)
y = dataset.diagnosis.values

Before fitting the model, it is common practice to normalize the features to a common scale, here the range 0 to 1.

**Input:**

x = (x - x.min()) / (x.max() - x.min())  # min-max normalization, column by column
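The formula above takes the minimum and maximum from the whole dataset, test rows included. One possible variant, sketched below, learns the scaling from the training rows only using scikit-learn's `MinMaxScaler`. To keep the example self-contained it uses scikit-learn's bundled breast cancer data as a stand-in for the CSV (an assumption on our part), and the variable names are our own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: the bundled dataset, not the article's CSV file.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them on the test rows

print(X_train_scaled.min(), X_train_scaled.max())
```

With this approach the test features may fall slightly outside the 0 to 1 range, which is expected because their extremes were never seen during fitting.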

After that, we'll use the train_test_split function from sklearn.model_selection to divide the dataset into training and testing sets.

**Input:**

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
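When the classes are imbalanced, it can help to keep the class proportions the same in both parts of the split. The sketch below shows the `stratify` parameter of `train_test_split` on made-up labels; this is an optional extra, not something the split above uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80 zeros and 20 ones (made-up data).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Share of class 1 in each part; both should be close to the overall 0.2
print(y_tr.mean(), y_te.mean())
```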

Now we'll import the GaussianNB class from sklearn.naive_bayes and instantiate it. To fit the model, we pass x_train and y_train.

**Input:**

from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train, y_train)

**Output:**

GaussianNB()
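Once fitted, a GaussianNB model exposes the quantities it estimated. The sketch below looks at a few of them, again using scikit-learn's bundled breast cancer data as a stand-in for the CSV (our assumption), with `model` and `proba` as our own names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB

# Stand-in data: the bundled dataset, not the article's CSV file.
X, y = load_breast_cancer(return_X_y=True)
model = GaussianNB().fit(X, y)

print(model.classes_)       # class labels seen during fit
print(model.class_prior_)   # estimated P(y) for each class
print(model.theta_.shape)   # per-class feature means: (n_classes, n_features)

# Posterior probabilities P(y | X) for the first three rows
proba = model.predict_proba(X[:3])
print(proba.round(3))
```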

The following accuracy score shows how well our Sklearn Gaussian Naive Bayes model predicted the diagnosis on the test data.

**Input:**

print("Naive Bayes score: ",nb.score(x_test, y_test))

**Output:**

Naive Bayes score: 0.935672514619883
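Accuracy is a single number, and for a medical task it is often useful to see which kind of mistake the model makes. The sketch below prints a confusion matrix and a per-class report. It runs on scikit-learn's bundled breast cancer data without normalization (a stand-in we are assuming, with 0 = malignant and 1 = benign), so its numbers need not match the score above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Stand-in data: here 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted
print(cm)
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```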

A. To use the Naive Bayes classifier in Python using scikit-learn (sklearn), follow these steps:

1. Import the necessary libraries: `from sklearn.naive_bayes import GaussianNB`

2. Create an instance of the Naive Bayes classifier: `classifier = GaussianNB()`

3. Fit the classifier to your training data: `classifier.fit(X_train, y_train)`

4. Predict the target values for your test data: `y_pred = classifier.predict(X_test)`

5. Evaluate the performance of the classifier: `accuracy = classifier.score(X_test, y_test)`
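As a variation on the evaluation step, the score from a single train/test split can be replaced by cross-validation. The sketch below is one way to do that, assuming scikit-learn's bundled breast cancer data and a min-max scaling step; neither is prescribed by the steps above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold.
pipeline = make_pipeline(MinMaxScaler(), GaussianNB())
scores = cross_val_score(pipeline, X, y, cv=5)  # accuracy on 5 folds

print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate than one split, at the cost of fitting the model several times.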

A. No, Naive Bayes is not considered a lazy classifier. The term “lazy classifier” typically refers to algorithms that delay the learning process until the time of prediction. These algorithms store the training instances and use them directly during the prediction phase.

In contrast, Naive Bayes is an eager learner (and a generative model). It learns a probabilistic model from the training data during the training phase, and this model is then used to make predictions on new, unseen instances without requiring the original training data at prediction time.
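One way to see the eager behaviour is that the fitted parameters have the same size no matter how many training rows were used. The sketch below checks this on random synthetic data; the helper `fitted_shape` is our own illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)  # synthetic data, for illustration only

def fitted_shape(n_samples):
    """Fit GaussianNB on random data and return the shape of its stored means."""
    X = rng.normal(size=(n_samples, 4))
    y = rng.integers(0, 2, size=n_samples)
    return GaussianNB().fit(X, y).theta_.shape

small = fitted_shape(100)
large = fitted_shape(10_000)
print(small, large)  # both (2, 4): one row per class, one column per feature
```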

Naive Bayes is one of the simplest classification algorithms, yet it remains effective. Despite recent major breakthroughs in machine learning, it has kept its usefulness. It has been used in applications ranging from text analytics to recommendation systems.

After explaining Naive Bayes and demonstrating an end-to-end implementation of Gaussian Naive Bayes in Sklearn using the cancer dataset, we have reached the end of this article. Thank you for reading it! I hope you found this brief introduction informative.


**The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion**
