This article was published as a part of the Data Science Blogathon.

Hello Learners, Welcome!

In this article, we are going to implement the Principal Component Analysis (PCA) technique on the MNIST dataset from scratch. Before we apply PCA to the MNIST dataset, we will first learn what PCA is, look at its geometric interpretation and its mathematical formulation, and then move on to the implementation.

PCA is a dimensionality reduction technique. Dimensionality reduction is nothing but reducing n-dimensional data to n’-dimensional data, where n > n’. Many datasets have a large number of features, and each feature is one dimension of the data. Often, several of these features have little impact on the final result and only increase the processing time of machine learning models. Also, we humans can visualize data only in 2D and 3D; we can’t picture higher dimensions. To solve these visualization and optimization problems, there are many techniques available in machine learning, such as PCA, t-SNE, Random Forests (through feature importance), kernel PCA, Truncated SVD, etc.

The dataset we are going to use in this article is the MNIST dataset, which contains images of the handwritten digits 0 to 9. In this dataset, a single digit is stored as a 784*1 array, where each element represents one pixel of a 28*28 image. In the raw CSV file each pixel value is an integer from 0 to 255; after rescaling it varies from 0 to 1, where black is represented by 1, white by 0, and the values in between represent shades of grey.
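To make the 784-to-28*28 layout concrete, here is a minimal sketch. It does not use real MNIST data: `flat_row` is a hypothetical stand-in for one row of pixel columns, filled with random intensities, and it assumes the usual row-by-row pixel ordering.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 pixel intensities in 0..255.
rng = np.random.default_rng(0)
flat_row = rng.integers(0, 256, size=784)

# Row-major reshape: pixel k of the flat row lands at (k // 28, k % 28).
image = flat_row.reshape(28, 28)

k = 100
row, col = divmod(k, 28)
print("image shape:", image.shape)
print("pixel", k, "sits at row", row, "and column", col)
print("same value in both layouts:", image[row, col] == flat_row[k])
```

Reshaping does not change any pixel value; it only changes how the same 784 numbers are indexed.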

Basically, the job of PCA is to reduce the dimensions of a given dataset. That is, if we are given d-dimensional data, our task is to convert it into d’-dimensional data where d > d’. To understand the geometric interpretation of PCA, we will take an example of a 2D dataset and convert it into a 1D dataset, because we can’t picture data in more than 3 dimensions. But anything we learn from the 2D interpretation also applies to higher dimensions.

Now let’s take an example. Suppose we have a d*n dimensional dataset called X, where d = 2 and n = 20, and the two features of the dataset are f1 and f2.

If we make a scatter plot of this data, its distribution looks like the figure shown below,

After seeing the scatter plot, you can easily say that the variance of feature f1 is much larger than the variance of feature f2. The variability of f2 is unimportant compared to the variability of f1. If we had to choose one feature between f1 and f2, we could easily select f1. Now suppose that you cannot visualize 2D data and have to convert it into 1D data to visualize it. What do you do? The simple answer is that you keep the features that have the highest variance and remove the features that have less impact on the overall result. That is what PCA does internally.
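The idea of keeping the feature with the highest variance can be sketched in a few lines of NumPy. The data below is synthetic (an assumption, not the dataset from the figure), and the points are stored as rows, so X has shape (n, 2) rather than d*n.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20

# Synthetic 2D data: f1 is spread widely, f2 barely varies.
f1 = rng.normal(loc=0.0, scale=5.0, size=n)
f2 = rng.normal(loc=0.0, scale=0.5, size=n)
X = np.column_stack([f1, f2])      # shape (n, 2), one row per data point

variances = X.var(axis=0)          # variance of each feature (column)
keep = int(np.argmax(variances))   # index of the feature with the largest variance
X_1d = X[:, keep]                  # the 1D representation of the data

print("variances:", variances)
print("kept feature index:", keep)
```

This only works when one of the original axes already carries most of the spread, which is exactly the limitation the next example in the article runs into.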

First of all, we make sure that our data is standardized, because performing PCA on standardized data is much easier than on the original data.
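As a rough illustration of what column standardization does, here is a small NumPy sketch. The helper name `standardize` and the sample data are made up for this example; the guard for constant columns is an assumption about how one might handle them, which matters for MNIST because some border pixels never change.

```python
import numpy as np

def standardize(X):
    """Column-standardize X (rows are data points, columns are features)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0   # constant columns stay at zero instead of dividing by 0
    return (X - mean) / std

rng = np.random.default_rng(0)
# Two features on very different scales, plus one constant feature.
X = np.column_stack([
    rng.normal(loc=10.0, scale=4.0, size=20),
    rng.normal(loc=-3.0, scale=0.2, size=20),
    np.full(20, 7.0),
])

Z = standardize(X)
print("column means:", Z.mean(axis=0).round(6))
print("column stds :", Z.std(axis=0).round(6))
```

After this step every non-constant feature has mean 0 and variance 1, so no feature dominates just because of its units.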

Now let’s again suppose that we have a d*n dimensional dataset called X, where d = 2 and n = 20, and the two features of the dataset are f1 and f2. Remember that we have standardized the data. In this case, however, the scatter plot looks like this.

In this case, if we have to reduce the dimensions from 2D to 1D, we can’t clearly select feature f1 or f2, because this time the variance of both features is almost the same and both features seem important. So how does PCA do it?

In this situation, PCA tries to draw a vector (a line) in the direction where the variance of the data is highest. That is, instead of projecting the data onto the f1 or f2 axis and measuring the variance there, we measure the variance along the rotated directions f1′ and f2′, which makes much more sense for this data.

So PCA tries to find the direction of the vector along which the variance of the data is highest. The direction with the highest variance is called PC1 (Principal Component 1), the direction with the second-highest variance is called PC2, the third is PC3, and so on.
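A small numerical sketch of this situation, under the assumption of synthetic, positively correlated data: both standardized features have variance 1, yet the direction found from the eigenvectors of the covariance matrix carries far more variance than either axis. The use of `np.linalg.eigh` here is a choice made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic correlated data: f2 roughly follows f1.
f1 = rng.normal(size=n)
f2 = f1 + 0.3 * rng.normal(size=n)
X = np.column_stack([f1, f2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize both features

cov = X.T @ X / n                          # 2x2 covariance of the standardized data
eig_values, eig_vectors = np.linalg.eigh(cov)   # eigenvalues in ascending order
pc1 = eig_vectors[:, -1]                   # direction of highest variance
pc2 = eig_vectors[:, -2]                   # direction of second-highest variance

var_f1 = X[:, 0].var()                     # variance along the original f1 axis
var_pc1 = (X @ pc1).var()                  # variance of the data projected on PC1
var_pc2 = (X @ pc2).var()                  # variance of the data projected on PC2

print("variance along f1 :", var_f1)
print("variance along PC1:", var_pc1)
print("variance along PC2:", var_pc2)
print("PC1 direction     :", pc1)
```

For two standardized features, PC1 comes out along the diagonal, which corresponds to the rotated f1′ axis in the figure.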

So far we have seen the geometric intuition of PCA and how it reduces the dimensions of data: PCA simply finds the direction, and draws the vector, along which the variance of the data is highest. But you might wonder how PCA finds the right direction, that is, how it calculates the angle of the vector and gives us the correct slope. There are two equivalent ways to formulate this problem: variance maximization and distance minimization. Let’s look at each of them briefly.

**1. Variance Maximization:** In this method, we simply project all the data points on the unit vector u1 and find the variance of all projected data points. We select that direction where the variance of projected points is maximum.

Let’s assume that we have a two-dimensional dataset whose features are f1 and f2, where $x_i$ is a data point and $u_1$ is our unit vector. If we project the data point $x_i$ onto $u_1$, the projected point is $x_i'$:

$u_1$ = unit vector

$\|u_1\| = 1$ (length of the unit vector)

f1 and f2 = features of the dataset

$x_i$ = data point

$x_i'$ = projection of $x_i$ on $u_1$

Now assume that $D = \{x_i\}_{i=1}^{n}$ is our dataset

and $D' = \{x_i'\}_{i=1}^{n}$ is the dataset of the projections of the $x_i$ on $u_1$.

$x_i' = \dfrac{u_1 \cdot x_i}{\|u_1\|}$

Since $u_1$ is a unit vector, its length is $\|u_1\| = 1$, so

$\Rightarrow x_i' = u_1 \cdot x_i$

$\Rightarrow x_i' = u_1^T x_i$ …….(1)

Similarly, $\bar{x}' = u_1^T \bar{x}$ ……..(2) [ $\bar{x}$ = mean of the $x_i$ ]

So we want to find $u_1$ such that the variance of the projections of the $x_i$ on $u_1$ is maximum:

$\operatorname{var}\{u_1^T x_i\}_{i=1}^{n}$

If the data is column standardized, then each feature has mean = 0 and variance = 1,

so $\bar{x} = [0, 0, \dots, 0]$

$\Rightarrow u_1^T \bar{x} = 0$

We want to maximize the variance.
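Written out in full, the steps above lead to the following objective. The symbols $S$ (the covariance matrix of the standardized data) and $\lambda$ (a Lagrange multiplier) are introduced here for illustration and do not appear in the text above.

```latex
\begin{aligned}
\operatorname{var}\{u_1^T x_i\}_{i=1}^{n}
  &= \frac{1}{n}\sum_{i=1}^{n}\left(u_1^T x_i - u_1^T \bar{x}\right)^2
   = \frac{1}{n}\sum_{i=1}^{n}\left(u_1^T x_i\right)^2
   \qquad (\text{since } u_1^T \bar{x} = 0) \\
  &= u_1^T \left(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^T\right) u_1
   = u_1^T S\, u_1 \\[4pt]
\max_{u_1}\; u_1^T S\, u_1 \quad &\text{subject to} \quad u_1^T u_1 = 1 \\[4pt]
\frac{\partial}{\partial u_1}\left[u_1^T S\, u_1 - \lambda\left(u_1^T u_1 - 1\right)\right] = 0
  \quad &\Longrightarrow \quad S\, u_1 = \lambda\, u_1,
  \qquad u_1^T S\, u_1 = \lambda
\end{aligned}
```

So the maximizing $u_1$ is an eigenvector of $S$, and the variance it captures is the corresponding eigenvalue. This is why finding the top eigenvectors of the covariance matrix gives the principal components.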

**2. Distance Minimization:** In this formulation of PCA, we try to minimize the distance $d_i$ of each data point $x_i$ from the line along $u_1$ (a unit vector of length 1).

$\|x_i\|^2 = d_i^2 + (u_1^T x_i)^2$ [ Pythagoras’ theorem ]

$d_i^2 = \|x_i\|^2 - (u_1^T x_i)^2$

$\Rightarrow d_i^2 = x_i^T x_i - (u_1^T x_i)^2$

We want to minimize the sum of all squared distances.
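The two formulations pick the same direction, because the sum of the squared lengths of the data points is fixed, so making the projected variance larger makes the squared distances smaller by the same amount. The brute-force sketch below checks this on synthetic 2D data by trying one candidate direction per degree; the data and the variable names are assumptions made for this illustration, and real PCA does not search over angles.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200

# Synthetic correlated 2D data, standardized so that the mean is 0.
f1 = rng.normal(size=n)
f2 = 0.8 * f1 + 0.4 * rng.normal(size=n)
X = np.column_stack([f1, f2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Candidate unit vectors u1, one per degree from 0 to 179.
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
U = np.column_stack([np.cos(angles), np.sin(angles)])

proj = X @ U.T                                    # proj[i, k] = u_k^T x_i
variance = (proj ** 2).mean(axis=0)               # variance of the projections (mean is 0)
sq_norms = (X ** 2).sum(axis=1, keepdims=True)    # ||x_i||^2 for every point
sum_sq_dist = (sq_norms - proj ** 2).sum(axis=0)  # sum of d_i^2 for every direction

best_by_variance = angles[np.argmax(variance)]
best_by_distance = angles[np.argmin(sum_sq_dist)]

print("angle maximizing variance (degrees):", np.degrees(best_by_variance))
print("angle minimizing distance (degrees):", np.degrees(best_by_distance))
```

Both criteria select the same angle, which is the PC1 direction for this data.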

We talked about the MNIST dataset earlier, and we have just completed our understanding of PCA, so this is the best time to apply PCA to the MNIST dataset. The implementation will be from scratch, so without wasting any more time, let’s start.

First of all, we import the Python libraries required for the implementation of PCA.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

Now let’s load our MNIST dataset, which is stored on our computer in .csv format. We import only the first 20,000 rows for simplicity. You can download the MNIST dataset from this link: https://www.kaggle.com/c/digit-recognizer/data

    # Read only the first 20,000 rows to keep the computation light
    df = pd.read_csv('mnist_train.csv', nrows = 20000)
    print("the shape of data is :", df.shape)
    df.head()


Extracting the label column from the dataset

    # Keep the labels separately and leave only the 784 pixel columns in df
    label = df['label']
    df.drop('label', axis = 1, inplace = True)

Plotting a random sample data point from the dataset using the matplotlib imshow() method

    ind = np.random.randint(0, 20000)
    plt.figure(figsize = (20, 5))
    # Reshape the 784 pixel values of the chosen row into a 28 x 28 image
    grid_data = np.array(df.iloc[ind]).reshape(28,28)
    plt.imshow(grid_data, interpolation = None, cmap = 'gray')
    plt.show()
    print(label[ind])

We column-standardize our dataset using the StandardScaler class of the sklearn.preprocessing module. After column standardization, the mean of every feature becomes 0 (zero) and its variance becomes 1, so we perform PCA around the origin.

    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    std_df = scaler.fit_transform(df)
    std_df.shape

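Now we find the covariance matrix. For column-standardized data A, it is proportional to AT * A (the exact covariance divides this product by the number of rows, which rescales the eigenvalues but leaves the eigenvectors unchanged), so we compute AT * A using the NumPy matmul method. After multiplication, the dimensions of our covariance matrix are 784 * 784, because AT (784 * 20000) * A (20000 * 784).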

    covar_mat = np.matmul(std_df.T, std_df)
    covar_mat.shape
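To see how AT * A relates to the usual covariance matrix, the following sketch compares it with `np.cov` on a small synthetic matrix. The array `A` here is hypothetical data, not the MNIST matrix, and the 1/n scaling is chosen to match the population variance used in standardization.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(500, 4))               # hypothetical data: 500 points, 4 features
A[:, 1] += 0.8 * A[:, 0]                    # add some correlation between features
A = (A - A.mean(axis=0)) / A.std(axis=0)    # column-standardize

n = A.shape[0]
gram = A.T @ A                              # the product AT * A, shape (4, 4)
cov = np.cov(A, rowvar=False, bias=True)    # covariance matrix with a 1/n factor

# Dividing a symmetric matrix by n divides its eigenvalues by n as well.
w_gram = np.linalg.eigvalsh(gram)
w_cov = np.linalg.eigvalsh(cov)

print("AT * A / n equals the covariance:", np.allclose(gram / n, cov))
print("eigenvalues differ by factor n  :", np.allclose(w_gram / n, w_cov))
```

Since only the eigenvalues are rescaled, the top eigenvectors of AT * A can be used as the principal directions.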

Finding the top two eigenvalues and the corresponding eigenvectors for projecting onto a 2D surface. The eigh function returns the eigenvalues in ascending order, and the index range passed to it is given from low to high, so asking for indices 782 and 783 returns only the top 2 of the 784 eigenvalues. Recent SciPy versions take this range through the ‘subset_by_index’ parameter, which replaces the older ‘eigvals’ parameter.

Converting the eigenvectors into (2,d) form to make the further computations easier

    from scipy.linalg import eigh
    # subset_by_index = [782, 783] requests only the two largest eigenvalues
    values, vectors = eigh(covar_mat, subset_by_index = [782, 783])
    print("Dimensions of Eigen vector:", vectors.shape)
    # Transpose so that each row is one eigenvector
    vectors = vectors.T
    print("Dimensions of Eigen vector:", vectors.shape)

Here vectors[1] is the eigenvector of the largest eigenvalue, that is, the 1st principal eigenvector.

Here vectors[0] is the eigenvector of the second-largest eigenvalue, that is, the 2nd principal eigenvector.
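The ordering can be checked on a tiny matrix whose eigenvalues are known in advance. This is only an illustrative sketch: the diagonal matrix `M` is made up, and it assumes a SciPy version that supports the `subset_by_index` argument of `eigh`.

```python
import numpy as np
from scipy.linalg import eigh

# Small symmetric matrix with known eigenvalues 1, 2 and 5 (illustrative only).
M = np.diag([2.0, 5.0, 1.0])
d = M.shape[0]

# Ask only for the two largest eigenvalues (indices d-2 and d-1 in ascending order).
values, vectors = eigh(M, subset_by_index=[d - 2, d - 1])
vectors = vectors.T   # each row is now one eigenvector, shape (2, d)

print("eigenvalues (ascending):", values)
print("vectors[0] (2nd principal):", vectors[0])
print("vectors[1] (1st principal):", vectors[1])
```

The last row belongs to the largest eigenvalue, which matches the roles of vectors[1] and vectors[0] described above. The sign of each eigenvector is arbitrary.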

If we multiply the two top eigenvectors by the standardized data matrix, we get the projection of every data point onto the two principal components PC1 and PC2.

    # (2, 784) x (784, 20000) -> (2, 20000): one row of projections per eigenvector
    final_df = np.matmul(vectors, std_df.T)
    print("vectors:", vectors.shape, "\n", "std_df:", std_df.T.shape, "\n", "final_df:", final_df.shape)

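Now we vertically stack final_df and label and then transpose the result, which gives a NumPy array. With the help of pd.DataFrame we create a data frame of our two components together with the class labels. Note that, because the eigenvalues are in ascending order, the column named pca_1 holds the projection onto vectors[0] (the 2nd principal eigenvector) and pca_2 holds the projection onto vectors[1] (the 1st principal eigenvector).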

    final_dfT = np.vstack((final_df, label)).T
    dataFrame = pd.DataFrame(final_dfT, columns = ['pca_1', 'pca_2', 'label'])
    dataFrame

Now let’s visualize the final data with the help of the seaborn FacetGrid class.

    # 'height' replaces the 'size' argument used by older seaborn versions
    (sns.FacetGrid(dataFrame, hue = 'label', height = 8)
        .map(sns.scatterplot, 'pca_1', 'pca_2')
        .add_legend())
    plt.show()

So you can see that we have successfully converted our 20000*785 data to 20000*3 using PCA. This is how PCA is used to convert high-dimensional data into lower-dimensional data.
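For reference, the whole pipeline can be condensed into one small function and cross-checked against a singular value decomposition. This is a sketch under stated assumptions, not the code used above: the function name `pca_project` is hypothetical, it uses `np.linalg.eigh` in place of `scipy.linalg.eigh`, it returns the components ordered from the largest eigenvalue down, and it runs on synthetic data rather than MNIST.

```python
import numpy as np

def pca_project(X, k=2):
    """Project the rows of X onto the top-k principal directions.

    Returns (scores, components): scores has shape (n, k) and components
    has shape (k, d), both ordered from the largest eigenvalue down.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0                         # guard against constant columns
    Z = (X - mean) / std                        # column standardization
    values, vectors = np.linalg.eigh(Z.T @ Z)   # eigenvalues in ascending order
    components = vectors[:, ::-1][:, :k].T      # top-k eigenvectors, largest first
    return Z @ components.T, components

rng = np.random.default_rng(5)
# Hypothetical data: 300 points, 6 features, some of them correlated.
X = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])
X[:, 1] += 0.7 * X[:, 0]
X[:, 3] += 0.2 * X[:, 2]

scores, components = pca_project(X, k=2)

# Cross-check: the right singular vectors of the standardized data
# span the same directions, up to the sign of each vector.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)

print("scores shape:", scores.shape)
print("matches SVD up to sign:", np.allclose(np.abs(components), np.abs(Vt[:2]), atol=1e-6))
```

Each principal direction is only defined up to its sign, which is why the comparison uses absolute values.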

What did we learn in this article? We took a brief look at PCA and the mathematical intuition behind it, and then implemented it from scratch on the MNIST dataset. That was all from me; thank you for reading this article. I am currently pursuing a B.Tech in CSE, and I love to write articles on data science. I hope you liked this article.

Thank you.
