PCA (Principal Component Analysis) on MNIST Dataset

Mayur Badole 31 Mar, 2023

This article was published as a part of the Data Science Blogathon.

Introduction to PCA

Hello Learners, Welcome!

In this article, we are going to learn about Principal Component Analysis (PCA) and implement the technique from scratch on the MNIST dataset. Before we apply PCA to MNIST, we will first cover what PCA is, its geometric interpretation, and its mathematical formulation, and then walk through the implementation.

PCA is a dimensionality reduction technique. Dimensionality reduction means reducing n-dimensional data to n’-dimensional data, where n > n’. Many datasets have a large number of features, and the number of features is the dimensionality of the data. Often, several of those features have little impact on the final result but still increase the processing time of machine learning models. In addition, we humans can visualize data only in 2D and 3D; we cannot picture higher dimensions. To address these visualization and optimization problems, machine learning offers many techniques, such as PCA, t-SNE, Random Forests (through feature importance), kernel PCA, and Truncated SVD.

The dataset we are going to use in this article is the MNIST dataset, which contains images of handwritten digits from 0 to 9. Each digit is stored as a 784*1 array, where each element represents one pixel of a 28*28 image. Here the value of a single pixel varies from 0 to 1, where black is represented by 1, white by 0, and the values in between represent shades of grey. (In the Kaggle CSV used later in this article, the raw pixel values are integers from 0 to 255, which can be rescaled to this range by dividing by 255.)
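To make the 784*1 to 28*28 correspondence concrete, here is a minimal sketch. The array `flat_pixels` is a hypothetical stand-in for one row of the dataset (it is not real MNIST data), and the sketch assumes the usual row-major layout, where the first 28 values form the first row of the image.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 values (not real pixel data).
flat_pixels = np.arange(784)

# Row-major reshape: the first 28 values become the first image row.
image = flat_pixels.reshape(28, 28)

# The pixel at (row r, column c) sits at index r * 28 + c of the flat array.
r, c = 3, 5
same_pixel = image[r, c] == flat_pixels[r * 28 + c]
print(image.shape, same_pixel)
```

The same `reshape(28, 28)` call is what lets a row of the CSV be displayed as an image later in the article.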

Image from MNIST dataset

Geometric Interpretation of PCA:

The job of PCA is to reduce the number of dimensions of a given dataset. That is, if we are given d-dimensional data, our task is to convert it into d’-dimensional data, where d > d’. To understand the geometric interpretation of PCA, we will take a 2D dataset and convert it into a 1D dataset, because we cannot picture data in more than three dimensions. Everything we learn from the 2D case carries over to higher dimensions.

Let’s take an example. Suppose we have a d*n dimensional dataset called X, where d = 2 and n = 20, and the two features of the dataset are f1 and f2.

Suppose we make a scatter plot of this data and its distribution looks like the figure shown below.

Scatter Plot

Looking at the scatter plot, you can easily see that the variance of feature f1 is much larger than the variance of feature f2. The variability of f2 is unimportant compared to the variability of f1. If we had to choose one feature between f1 and f2, we would select f1. Now suppose that you cannot visualize 2D data and have to convert it into 1D data first. What do you do? The simple answer is that you keep the features that have the highest variance and remove the features that have less impact on the overall result. That is what PCA does internally.
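The "keep the feature with the highest variance" idea can be sketched in a few lines of NumPy. The data below is synthetic and purely illustrative (the spreads of 5.0 and 0.5 are assumptions chosen so that f1 clearly dominates), and the samples are stored as rows, so X has shape (n, d).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Synthetic features (assumed values): f1 has a wide spread, f2 a narrow one.
f1 = rng.normal(0.0, 5.0, n)
f2 = rng.normal(0.0, 0.5, n)
X = np.column_stack([f1, f2])          # shape (20, 2), one row per data point

# Variance of each feature (column).
variances = X.var(axis=0)

# Keep only the feature with the largest variance: 2D -> 1D.
keep = int(np.argmax(variances))
X_1d = X[:, keep]

print("variances:", variances)
print("kept feature index:", keep, "new shape:", X_1d.shape)
```

This only works when one original axis carries most of the spread; the sections that follow deal with the case where it does not.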

First of all, we make sure that our data is standardized, so that every feature has mean 0 and variance 1, because performing PCA on standardized data is much easier than on the original data.
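As a minimal sketch of what standardization does, the snippet below standardizes each column of a small synthetic array by hand. The means and spreads used to generate the data are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-feature data with very different means and scales (assumed values).
X = rng.normal(loc=[10.0, -3.0], scale=[4.0, 0.2], size=(20, 2))

# Column standardization: subtract each column's mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print("means after:", X_std.mean(axis=0))
print("std devs after:", X_std.std(axis=0))
```

After this step both features are centred on the origin and have the same unit spread, which is the setting assumed in the rest of the geometric discussion.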

Now suppose again that we have a d*n dimensional dataset called X, where d = 2 and n = 20, and the two features of the dataset are f1 and f2. Remember that we have standardized the data. In this case, the scatter plot looks like this.

PCA on MNIST dataset

In this case, if we have to reduce the dimensions from 2D to 1D, we cannot clearly select feature f1 or f2, because this time the variance of both features is almost the same and both features seem important. So how does PCA do it?

In this situation, PCA tries to draw a vector or line in the direction where the variance of the data is very high. Instead of projecting the data onto the f1 or f2 axis and measuring the variance there, what if we measure the variance along the rotated directions f1′ and f2′? Measuring the variance along f1′ or f2′ makes much more sense, because most of the spread of the data lies along one of them.

So PCA tries to find the direction of the vector or line along which the variance of the data is highest. That direction is called PC1 (Principal Component 1), the direction with the second-highest variance is called PC2, the third is PC3, and so on.
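The sketch below illustrates this on synthetic data: two standardized, strongly correlated features have the same variance along f1 and f2, yet the variance along PC1 is clearly larger. The data generation and the use of `np.linalg.eigh` on the covariance matrix are assumptions made for the illustration; how the direction is actually derived is the subject of the next section.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20

# Two strongly correlated synthetic features (coefficients are assumed values).
f1 = rng.normal(size=n)
f2 = 0.9 * f1 + 0.3 * rng.normal(size=n)
X = np.column_stack([f1, f2])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: both variances become 1

# Eigen-decomposition of the covariance matrix; eigh returns ascending eigenvalues.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, -1]    # direction of highest variance
pc2 = eigvecs[:, -2]    # direction of second-highest variance

var_f1 = X[:, 0].var()          # variance along the original f1 axis
var_pc1 = (X @ pc1).var()       # variance along PC1

print("variance along f1 :", var_f1)
print("variance along PC1:", var_pc1)
print("PC1 direction:", pc1, "PC2 direction:", pc2)
```

For two standardized features, PC1 and PC2 come out as the two diagonals of the f1/f2 plane, which matches the rotated axes f1′ and f2′ described above.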

Mathematical Formulation of PCA:

We have seen the geometric intuition of PCA and how PCA reduces the dimensions of data: it finds the direction along which the variance of the data is very high and draws a vector in that direction. You might wonder how PCA does this, that is, how it finds the right direction and gives us the accurate slope. PCA can be formulated in two ways to find this direction: Variance Maximization and Distance Minimization. Let’s look at each of them briefly.

1. Variance Maximization: In this method, we project all the data points onto the unit vector u1 and compute the variance of the projected points. We select the direction for which the variance of the projected points is maximum.

Variance Maximization for PCA on MNIST dataset

Let’s assume that we have a two-dimensional dataset whose features are f1 and f2, that xi is a data point, and that u1 is our unit vector. If we project the data point xi on u1, the projected point is xi’.

projecting the data point xi on u1 for PCA on MNIST dataset

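$u_1$ = unit vector, so $\|u_1\| = 1$ (length of the unit vector)

f1 and f2 = features of the dataset

$x_i$ = data point

$x_i'$ = projection of $x_i$ on $u_1$

Now assume that $D = \{x_i\}_{i=1}^{n}$ is our dataset and $D' = \{x_i'\}_{i=1}^{n}$ is the dataset of the projections of the points $x_i$ on $u_1$.

$$x_i' = \frac{u_1 \cdot x_i}{\|u_1\|}$$

Since $u_1$ is a unit vector, $\|u_1\| = 1$, so

$$x_i' = u_1 \cdot x_i = u_1^T x_i \quad \ldots (1)$$

In the same way, the mean of the projected points is

$$\bar{x}' = u_1^T \bar{x} \quad \ldots (2) \qquad [\,\bar{x} = \text{mean of the } x_i\,]$$

We want to find $u_1$ such that the variance of the projections of the $x_i$ on $u_1$ is maximum:

$$\mathrm{var}\{u_1^T x_i\}_{i=1}^{n} = \frac{1}{n} \sum_{i=1}^{n} \left( u_1^T x_i - u_1^T \bar{x} \right)^2$$

If the data is column standardized, then every feature has mean 0 and variance 1, so $\bar{x} = [0, 0, \ldots, 0]$ and therefore $u_1^T \bar{x} = 0$.

So we want to maximize the variance:

$$\max_{u_1} \; \frac{1}{n} \sum_{i=1}^{n} \left( u_1^T x_i \right)^2 \quad \text{subject to } u_1^T u_1 = 1$$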
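The variance-maximization idea can be checked numerically with a brute-force sketch: try many unit vectors u1, compute the variance of the projections for each, and keep the best one. The synthetic data, the 0.1-degree angle grid, and the comparison against `np.linalg.eigh` are all assumptions made for illustration; this is not how PCA is computed in practice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Synthetic correlated 2D data (coefficients are assumed values), column standardized.
f1 = rng.normal(size=n)
f2 = 0.8 * f1 + 0.6 * rng.normal(size=n)
X = np.column_stack([f1, f2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Candidate unit vectors u1 = (cos t, sin t) for t in [0, pi], in 0.1 degree steps.
angles = np.linspace(0.0, np.pi, 1801)
U = np.column_stack([np.cos(angles), np.sin(angles)])   # each row has length 1

# Variance of the projections u1^T x_i; the mean is 0 because X is standardized.
proj_var = ((X @ U.T) ** 2).mean(axis=0)
best = U[np.argmax(proj_var)]

# Reference: top eigenvector of the covariance matrix (eigh sorts ascending).
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
top = eigvecs[:, -1]

print("best direction from the sweep:", best)
print("top eigenvector (sign may differ):", top)
print("max projected variance:", proj_var.max(), "largest eigenvalue:", eigvals[-1])
```

The sweep and the eigenvector agree up to sign, which is why the implementation later in the article uses an eigen-decomposition instead of searching over directions.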

2. Distance Minimization: In this formulation of PCA, we try to minimize the distance of each data point from the line along u1 (the unit vector of length 1).

Distance Minimization

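With $d_i$ the distance of the data point $x_i$ from the line along $u_1$, the Pythagorean theorem gives

$$\|x_i\|^2 = d_i^2 + \left( u_1^T x_i \right)^2$$

$$d_i^2 = \|x_i\|^2 - \left( u_1^T x_i \right)^2 = x_i^T x_i - \left( u_1^T x_i \right)^2$$

We want to minimize the sum of all squared distances:

$$\min_{u_1} \; \sum_{i=1}^{n} \left( x_i^T x_i - \left( u_1^T x_i \right)^2 \right) \quad \text{subject to } u_1^T u_1 = 1$$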

minimizing the sum of all distance squared
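Because the term x_i^T x_i does not depend on u1, minimizing the summed squared distances is the same as maximizing the summed squared projections. The sketch below checks this numerically on synthetic data; the mixing matrix, the 0.5-degree angle grid, and the helper names `sum_sq_dist` and `sum_sq_proj` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic, centred 2D data (the mixing matrix is an assumed value).
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)

def sum_sq_dist(u):
    """Sum over i of d_i^2 = x_i^T x_i - (u^T x_i)^2 for a unit vector u."""
    return np.sum(np.sum(X ** 2, axis=1) - (X @ u) ** 2)

def sum_sq_proj(u):
    """Sum over i of the squared projections (u^T x_i)^2."""
    return np.sum((X @ u) ** 2)

# Sweep unit vectors over half a circle in 0.5 degree steps.
angles = np.linspace(0.0, np.pi, 361)
U = np.column_stack([np.cos(angles), np.sin(angles)])
dist = np.array([sum_sq_dist(u) for u in U])
proj = np.array([sum_sq_proj(u) for u in U])

# The two quantities always add up to the same constant: sum of ||x_i||^2.
total = np.sum(X ** 2)
print("dist + proj constant:", np.allclose(dist + proj, total))
print("angle of min distance  :", np.degrees(angles[np.argmin(dist)]))
print("angle of max projection:", np.degrees(angles[np.argmax(proj)]))
```

So both formulations lead to the same direction u1, and either one can be used to motivate the eigenvector computation.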

Implementing PCA on MNIST dataset:

We talked about the MNIST dataset earlier, and we have just completed our understanding of PCA, so this is the best time to apply PCA to the MNIST dataset. The implementation will be from scratch, so let’s start without wasting any more time.

First of all, we import the Python libraries that are required for the implementation of PCA.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Now let’s load the MNIST dataset, which is stored on our computer in .csv format. For simplicity, we import only 20,000 rows. You can download the MNIST dataset from this link: https://www.kaggle.com/c/digit-recognizer/data

df = pd.read_csv('mnist_train.csv', nrows = 20000)
print("the shape of data is :", df.shape)
df.head()

Implementing PCA on MNIST dataset

Extracting label column from the dataset

label = df['label']
df.drop('label', axis = 1, inplace = True)
ind = np.random.randint(0, 20000)
plt.figure(figsize = (20, 5))
grid_data = np.array(df.iloc[ind]).reshape(28,28)
plt.imshow(grid_data, interpolation = None, cmap = 'gray')
plt.show()
print(label[ind])

The code above extracts the label column and then plots a random sample data point from the dataset using the matplotlib imshow() method.

Plotting a random sample data point

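Next comes column standardization of our dataset using the StandardScaler class of the sklearn.preprocessing module. After column standardization, the mean of every feature becomes 0 (zero) and the variance becomes 1, so we perform PCA about the origin. (Pixels that are constant across all rows, such as always-blank border pixels, have zero variance and simply stay at 0.)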

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
std_df = scaler.fit_transform(df)
std_df.shape

Now we find the covariance matrix, which is AT * A (up to a constant factor of 1/n, which does not change the eigenvectors), using the NumPy matmul method. After multiplication, the dimensions of our covariance matrix are 784 * 784, because AT (784 * 20000) * A (20000 * 784).

covar_mat = np.matmul(std_df.T, std_df)
covar_mat.shape
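As a small sanity check on this step, the sketch below uses a synthetic standardized matrix `A` (a hypothetical stand-in for `std_df`, with only 4 features) to show that A^T A divided by n matches NumPy's own covariance estimate, and that leaving out the 1/n factor scales the eigenvalues but leaves the eigenvectors unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical stand-in for std_df: 50 samples, 4 standardized features.
A = rng.normal(size=(50, 4))
A = (A - A.mean(axis=0)) / A.std(axis=0)
n = A.shape[0]

gram = A.T @ A                                 # what np.matmul(std_df.T, std_df) computes
cov = gram / n                                 # covariance of zero-mean data
cov_np = np.cov(A, rowvar=False, bias=True)    # NumPy's estimate with the same 1/n factor

vals_g, vecs_g = np.linalg.eigh(gram)
vals_c, vecs_c = np.linalg.eigh(cov)

print("A^T A / n equals np.cov:", np.allclose(cov, cov_np))
print("eigenvalues scale by n:", np.allclose(vals_g, n * vals_c))
print("eigenvectors match up to sign:", np.allclose(np.abs(vecs_g), np.abs(vecs_c)))
```

This is why the article can work directly with A^T A: the principal directions are the same either way.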

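Next we find the top two eigenvalues and the corresponding eigenvectors for projecting onto a 2D surface. The eigh function returns the eigenvalues in ascending order, and the index range we pass (low index to high index) selects which of them are computed, so this code generates only the top 2 eigenvalues (indices 782 and 783). The ‘eigvals’ parameter originally used for this has been deprecated since SciPy 1.5 in favour of ‘subset_by_index’, which is used here.

Converting the eigenvectors into (2,d) form for ease of further computation

from scipy.linalg import eigh
# eigenvalues come back in ascending order; indices 782..783 are the two largest
values, vectors = eigh(covar_mat, subset_by_index = [782, 783])
print("Dimensions of Eigen vector:", vectors.shape)   # (784, 2)
# transpose so that each row is one eigenvector
vectors = vectors.T
print("Dimensions of Eigen vector:", vectors.shape)   # (2, 784)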

converting the eigenvectors into (2,d) form for easiness

Here vectors[1] is the eigenvector for the largest eigenvalue, that is, the direction of the 1st principal component.

Here vectors[0] is the eigenvector for the second-largest eigenvalue, that is, the direction of the 2nd principal component.

If we multiply the two top eigenvectors by the transposed standardized data matrix, we obtain our two principal components PC1 and PC2.

# project the data: (2, 784) x (784, 20000) -> (2, 20000)
final_df = np.matmul(vectors, std_df.T)
print("vectors:", vectors.shape, "\n", "std_df:", std_df.T.shape, "\n", "final_df:", final_df.shape)

our two principal components PC1 and PC2
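The projection step can be tried end to end on a small synthetic matrix. In the sketch below, `X` is a hypothetical stand-in for `std_df` (300 samples, 10 features), and `np.linalg.eigh` with slicing is swapped in for SciPy's `eigh` so that the example needs only NumPy. It also checks that the variance of each projected row equals the corresponding eigenvalue of X^T X divided by n.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 300, 10

# Hypothetical stand-in for std_df: correlated synthetic features, column standardized.
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of X^T X; eigenvalues are returned in ascending order.
gram = X.T @ X
values, vectors = np.linalg.eigh(gram)

# Keep the last two columns (two largest eigenvalues) and transpose to (2, d).
top2 = vectors[:, -2:].T        # row 0: second largest, row 1: largest

# Project the data: (2, d) x (d, n) -> (2, n).
projected = top2 @ X.T
proj_var = projected.var(axis=1)

print("projected shape:", projected.shape)
print("variance of each projected row:", proj_var)
print("matching eigenvalues / n      :", values[-2:] / n)
```

Note that, with this ordering, the second row of the projection carries the larger variance, which mirrors how vectors[1] corresponds to the 1st principal component in the article's code.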

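Now we vertically stack final_df and label and then transpose the result, which gives a NumPy array. With the help of pd.DataFrame, we create the data frame of our two components with class labels. Because the eigenvectors are in ascending order, the column named pca_1 holds the projection onto vectors[0] and the column named pca_2 holds the projection onto vectors[1].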

final_dfT = np.vstack((final_df, label)).T
dataFrame = pd.DataFrame(final_dfT, columns = ['pca_1', 'pca_2', 'label'])
dataFrame

creating the data frame of our two components with class labels

Now let’s visualize the final data with the help of seaborn’s FacetGrid.

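# 'height' replaces the older 'size' argument of FacetGrid (renamed in seaborn 0.9)
# the parentheses let the method chain continue across lines
(sns.FacetGrid(dataFrame, hue = 'label', height = 8)
  .map(sns.scatterplot, 'pca_1', 'pca_2')
  .add_legend())
plt.show()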

Visualizing the data for PCA on MNIST dataset

So you can see that we have successfully converted our 20000*785 data to 20000*3 using PCA (both counts include the label column). This is how PCA is used to convert high-dimensional data into lower-dimensional data.

EndNote

What did we learn in this article? We took a brief look at PCA, its geometric and mathematical intuition, and its implementation on the MNIST dataset. That is all from me; thank you for reading this article. I am currently pursuing a B.Tech in CSE, and I love to write articles on data science. I hope you liked this article.

Thank you.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.