Principal Component Analysis Introduction and Practice Problem

Surabhi S 19 Sep, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

“Machine intelligence is the last invention that humanity will ever need to make.” The quote suggests that machine learning is the future, with vast opportunities and benefits for all. Let this be a fresh start for you to learn a really interesting algorithm in machine learning.

As you know, machine learning tasks often involve storing and processing huge amounts of data, which is time-consuming and makes the data hard to interpret. Not every feature of the data is necessary for predictions, and noisy or redundant features can lead to poor performance and overfitting. In this article, let me introduce you to an unsupervised learning technique, PCA (Principal Component Analysis), that can help you deal with these issues to an extent and may improve prediction results.

PCA was invented at the beginning of the 20th century by Karl Pearson as an analogue of the principal axis theorem in mechanics, and it is widely used today. With this method, we transform the data into a new coordinate system in which the axis with the highest variance is the first principal component, the axis with the next highest variance is the second, and so on. This gives us a compact representation that keeps as much of the variance in the data as possible.

Gentle Overview

Data with numerous features may contain correlations and duplications. So once you get the data, the primary step is to clean it by removing irrelevant features and to apply feature engineering techniques, which may even give better results than the original features. Principal Component Analysis (PCA) is one such technique: it makes dimensionality reduction (through a linear transformation of the existing attributes) and multivariate analysis possible. Its advantages include a smaller data size (hence faster execution), better visualizations with fewer dimensions, retaining the directions of maximum variance, and a reduced risk of overfitting.

The principal components are a sequence of direction vectors, where each one is the direction of the line that best fits the data while being orthogonal to all the previous ones. Equivalently, these components are the eigenvectors of the covariance matrix of the data. We will look into that concept below.

How is this done? First, you find the principal components of the training data. From those, you keep the important components, that is, the ones that capture the most variance, and ignore the rest, thus reducing complexity. The principal components are uncorrelated with each other, and their number can be less than or equal to the total number of attributes.

Suppose two columns X and Y are the 2 features:

X    Y
1    4
2    3
3    4
4    6
5    8

Mean

X’ = 3, Y’ = 5

Covariance

cov(x, y) = Σ (Xi - X’)(Yi - Y’) / (n - 1), where i = 1 to n

C = [ cov(x,x)  cov(x,y) ]
    [ cov(y,x)  cov(y,y) ]
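
As a quick check of the formula, here is a minimal NumPy sketch that computes the covariance for the five rows of the X, Y table, once by hand and once with np.cov. The variable names are only illustrative.

```python
import numpy as np

# The five (X, Y) rows from the example table
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([4, 3, 4, 6, 8], dtype=float)

x_mean, y_mean = X.mean(), Y.mean()  # 3.0 and 5.0

# Sample covariance by hand, dividing by (n - 1)
n = len(X)
cov_xy = np.sum((X - x_mean) * (Y - y_mean)) / (n - 1)

# np.cov also divides by (n - 1) by default and returns the full 2x2 matrix
C = np.cov(X, Y)

print(cov_xy)  # 2.75
print(C)       # [[2.5, 2.75], [2.75, 4.0]]
```

The diagonal entries of C are the variances of X and Y, and the off-diagonal entries are cov(x,y) = cov(y,x).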

Similarly, for more features, we get a larger covariance matrix with one row and one column per feature. Computing the eigenvalues and eigenvectors of this matrix gives us the principal components. In practice, library implementations find the components for you, with no manual calculation. Note that the number of eigenvalue/eigenvector pairs equals the number of dimensions, and each eigenvalue is the amount of variance along its eigenvector.

Because large data has numerous principal components, we rank them by the variance they account for. The first principal component is the direction with the largest possible variance. Each following component is the direction with the largest remaining variance that is uncorrelated with (orthogonal to) the earlier ones, which amounts to sorting the eigenvalues in decreasing order. We then discard the components with the smallest eigenvalues, as they are the least significant.
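
To make the ordering concrete, the sketch below takes the 2x2 covariance matrix of the X, Y table (its values follow from the covariance formula) and sorts its eigenvalues in decreasing order. It uses np.linalg.eigh, which is meant for symmetric matrices; the variable names are only illustrative.

```python
import numpy as np

# Covariance matrix of the five-row X, Y example
C = np.array([[2.5, 2.75],
              [2.75, 4.0]])

# eigh returns the eigenvalues in ascending order,
# with the matching eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Reorder so that the largest eigenvalue (most variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each component
explained_ratio = eigenvalues / eigenvalues.sum()

print(eigenvalues)      # roughly [6.10, 0.40]
print(explained_ratio)  # roughly [0.94, 0.06]
```

Here the first component alone carries about 94% of the variance, so dropping the second one loses little information.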

In the last step, we use the feature vector, the matrix whose columns are the eigenvectors we decided to keep, to re-orient the data along the axes represented by the principal components (hence the name Principal Component Analysis). This is done by multiplying the transpose of the feature vector by the transpose of the mean-centered data set.
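
A minimal NumPy sketch of this projection step on the X, Y table, keeping only the first principal component. The names feature_vector, centered and k are illustrative choices, not part of any library.

```python
import numpy as np

# The five (X, Y) rows from the example table, one row per sample
data = np.array([[1, 4], [2, 3], [3, 4], [4, 6], [5, 8]], dtype=float)
centered = data - data.mean(axis=0)  # subtract the column means

# Eigen-decomposition of the covariance matrix, largest eigenvalue first
C = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]

k = 1  # number of components to keep (an arbitrary choice for this sketch)
feature_vector = eigenvectors[:, order[:k]]  # shape (2, k)

# Transposed form: rows are components, columns are samples
projected_T = feature_vector.T @ centered.T  # shape (k, 5)

# Same numbers with one row per sample
projected = centered @ feature_vector        # shape (5, k)

print(projected.ravel())
```

The variance of the projected values equals the largest eigenvalue, which is what "variance along a principal component" means.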

Cons of Using PCA/Disadvantages

Note that PCA is sensitive to the scale of the features, so data standardization (along with converting categorical variables to numerical ones) is needed before using it. After applying PCA, the features become less interpretable, because each principal component is a combination of the original features and is not readable on its own. You can also lose some information when you discard components.
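
To see why scale matters, here is a small sketch on made-up data with two unrelated features on very different scales. The height_m and income columns are hypothetical and only serve the illustration; StandardScaler is scikit-learn's standardization transformer.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features on very different scales
height_m = rng.normal(1.7, 0.1, size=200)
income = rng.normal(50_000, 10_000, size=200)
features = np.column_stack([height_m, income])

# Without scaling, the large-valued feature dominates the first component
raw_ratio = PCA().fit(features).explained_variance_ratio_

# After standardization (zero mean, unit variance) both features count equally
scaled = StandardScaler().fit_transform(features)
scaled_ratio = PCA().fit(scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the raw data almost all of the variance lands on the first component simply because income has bigger numbers; after scaling, the split between the two components is much more even.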

Practical Example

Now, let us go through how the algorithm is applied to a dataset. I’ll take you through each part of the code step by step.

Take a look at this dataset. This is the famous Iris flower dataset, which contains the features sepal length, sepal width, petal length, and petal width, and the target variable species. The target variable is the value/class you need to predict, which in this case is the species the flower belongs to.

[Image: the Iris dataset. Source: Wikipedia]

Importing Dataset and Basic Libraries

First of all, let us begin by importing the necessary libraries,

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

Next, load the data and display the feature and class names for your understanding.
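
One possible way to do this, assuming the data DataFrame used in the following steps holds the 4 feature columns followed by one species column (the column name 'species' is an assumption):

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # names of the 4 features
print(iris.target_names)   # names of the 3 classes

# Build a DataFrame with the 4 features and the species as the last column.
# The column name 'species' is an assumption made for this sketch.
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['species'] = iris.target_names[iris.target]

print(data.head())
```

With this layout, the features sit in the first four columns and the target in the fifth, which matches the column positions used when splitting into features and target.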

The code snippet below gives you a summary of the data: you get to know how many variables are categorical and how many are numerical. It also shows that all rows are non-null; if there were null values, the non-null count of each column would reveal where they are. This helps us decide on further preprocessing steps for cleaning the data.

data.info()

The data.describe() function gives a statistical overview of the dataset. These statistics are useful in many ways: for example, you can use them to fill in missing values or to create new features.

data.describe()

Here you split the data into the features and the target variable, as x and y respectively. Using the shape attribute, you can see that the data has 150 rows and 5 columns in total, out of which 1 column is your target variable and the 4 others are the features/attributes.

x = data.iloc[:,:4]  #features
y = data.iloc[:,4] #target
x.shape, y.shape

Out : ((150, 4), (150,))

Since all of the features are numerical, the data can be used for training directly. If the data contained categorical variables, we would first need to convert them to numerical values, as most models work only with numbers.

Importing PCA library

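from sklearn.decomposition import PCA
pca = PCA()  # keep all 4 components for now
X = pca.fit_transform(x)  # project the features onto the principal components
pca.get_covariance()  # covariance matrix estimated by the fitted PCA model

# fraction of the total variance explained by each component
explained_variance = pca.explained_variance_ratio_
explained_variance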

Visualizations

with plt.style.context('dark_background'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(4), explained_variance, alpha=0.5, align='center',
            label='individual explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()

From the plot you get the intuition that the first 3 components carry nearly all of the variance, hence we select 3 as the number of principal components.

pca = PCA(n_components=3)
X = pca.fit_transform(x)
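
Besides reading the bar chart, a common way to pick the number of components is to look at the cumulative explained variance. The sketch below is self-contained and loads the Iris features directly; the 0.99 threshold is a hypothetical cut-off chosen for illustration, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

features = load_iris().data  # the 4 numerical Iris features

# Explained variance ratio of every component, largest first
ratios = PCA().fit(features).explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(cumulative)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.99  # hypothetical cut-off for this sketch
n_keep = int(np.argmax(cumulative >= threshold) + 1)
print(n_keep)
```

A lower threshold would keep fewer components, so the right value depends on how much information you are willing to lose.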

Train Test Split

The train test split is a common training and evaluation method. Evaluating a model on the same data it was trained on gives overly optimistic results and hides overfitting, so the model may still perform badly on unknown data. By splitting the data into training and test sets, you train on one set and predict on the other, which gives a fairer estimate of performance on new data and helps you detect overfitting.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=20, stratify=y)

Model Training

Our aim is to identify the class/species to which a flower belongs, given some of its features. Hence this is a classification problem, and the model we use is K Nearest Neighbors.

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=7)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

Predictions

from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)  #confusion matrix
print(cm)
print(accuracy_score(y_test, y_pred))

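The confusion matrix shows, for each actual class (row), how many samples were predicted as each class (column). The diagonal holds the correct predictions, and the off-diagonal entries are the misclassifications, from which you can read the false positives and false negatives for each class.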

The accuracy score tells you how effective the model is at making predictions on new data. 97% is a very good score, and hence we can say that ours is a good model.
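
To see how a multi-class confusion matrix and the accuracy relate, here is a tiny sketch with made-up labels. The y_true and y_hat lists are hypothetical and are not the predictions of the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ['setosa', 'versicolor', 'virginica']

# Hypothetical labels for illustration only
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_hat = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'virginica']

# Rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Correct predictions sit on the diagonal
correct = np.trace(cm)
accuracy = correct / cm.sum()
print(accuracy)  # 5 correct out of 6
```

In this toy case one versicolor flower is predicted as virginica, which shows up as a 1 in the versicolor row and virginica column.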

You can view the complete code in the Google Colab notebook provided.

Conclusion

I hope you have gained some intuition about PCA and are now familiar with the example discussed above. It’s not that complex to digest; just stay focused. If you found this useful, read it once again and work through the algorithm by yourself for a better understanding.

Have a nice day !! : )

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.