# An Introductory Note on Principal Component Analysis

## Introduction

Principal Component Analysis (PCA) is one of the best-known techniques in data science, used most notably to tackle the curse of dimensionality. Beyond that core use, this article looks at several other problems PCA helps with. So, let’s start with the fundamentals. The article covers a layman’s explanation, some key theory, my handwritten manual calculation of PCA, and a Python approach.

This article was published as a part of the Data Science Blogathon.

## Principal Component Analysis (PCA)

- PCA is the abbreviation of Principal Component Analysis.
- PCA falls under the unsupervised machine learning category.
- The main goal of PCA is to reduce the number of variables in a dataset while retaining as much information as feasible. It is mainly used for dimensionality reduction, and it can also help identify which features matter most.
- It transforms correlated features into uncorrelated (linearly independent) components.

## What is Principal Component Analysis?

Technically, PCA explains the variance-covariance structure of a set of variables through a few linear combinations of the original variables. It can also be used to analyze how the observations (rows) are scattered and to identify properties of their distribution.

## Why do we need PCA?

Machine learning generally works well when a model is trained on a large, well-organized dataset. Having more data usually lets us build a more accurate prediction model, since there is more to learn from. But working with a huge dataset has its own drawbacks, and the biggest trap is the curse of dimensionality. Principal component analysis (PCA) is one of the techniques used to handle it.

The curse of dimensionality is not the title of an unreleased Harry Potter novel; it refers to what happens when your data has too many features and perhaps not enough data points. Dimensionality reduction is one way to escape the curse: 50 variables may be cut down to 40, 20, or even 10, and the largest reductions are where the technique has its strongest effect.

High-dimensional data tends to cause overfitting, and dimensionality reduction is used to address it. It also increases interpretability while minimizing information loss, helps locate the important features, and helps find useful linear combinations of the variables.

#### When to use PCA?

- Whenever we need features that are uncorrelated with each other
- Whenever we need to reduce a large number of features to a smaller set

## Dimensionality Reduction in a Real-World Application

Assume a survey contains 50 questions in total. The following three are among them, each to be rated between 1 and 5:

- I feel comfortable around people
- I easily make friends
- I like going out

These questions may look different at first. There is a catch, though: generally speaking, they aren’t. They all gauge how extroverted you are, so it makes sense to combine them, right? That’s where linear algebra and dimensionality reduction methods come in! Since we have far too many variables that aren’t all that different, we want to lessen the complexity of the problem by reducing the number of variables. That is the main idea behind dimensionality reduction, and PCA happens to be one of the most straightforward and popular techniques in this field. As a general guideline, retain at least 70–80 percent of the explained variance.
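To make the survey idea concrete, here is a minimal sketch with simulated data. Everything in it is an assumption for illustration: the latent `extroversion` score, the noise level of 0.3, and the 500 respondents are made up, and the answers are kept continuous rather than rounded to a 1–5 scale. It only shows that when three questions measure the same trait, a single component captures most of their variance.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical latent trait: how extroverted each of 500 respondents is
extroversion = rng.normal(size=500)

# Three made-up question scores that all mostly reflect that one trait
answers = np.column_stack(
    [extroversion + 0.3 * rng.normal(size=500) for _ in range(3)]
)

# Eigenvalues of the correlation matrix: variance captured by each component
eigenvalues = np.linalg.eigvalsh(np.corrcoef(answers, rowvar=False))[::-1]
explained_ratio = eigenvalues / eigenvalues.sum()

# The first entry dominates: one component summarizes all three questions
print(np.round(explained_ratio, 3))
```

With answers this strongly correlated, keeping only the first component would comfortably satisfy the 70–80 percent guideline.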

#### Intuition behind PCA

Let’s play a mind game. Consider the following table:

| Person | Height (cm) |
|---|---|
| A | 145 |
| B | 160 |
| C | 185 |

From the above table, we need to find the tallest person.

At a glance, I can see that person C is the tallest. Now change the scenario:

| Person | Height (cm) |
|---|---|
| D | 172 |
| E | 173 |
| F | 171 |

Can you guess who’s who? It’s tough when they are very similar in height.

Because their heights vary so much, we had no trouble telling a 185 cm person from a 160 cm and a 145 cm person in the first table. In the same way, our data contains more information when its variance is larger. This explains why PCA and maximum variance are so frequently mentioned together.
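The difference between the two groups can be put into numbers. The short sketch below uses only the heights from the two tables above and NumPy's population variance (`np.var` with its default settings); the variable names are my own.

```python
import numpy as np

spread_out = np.array([145, 160, 185])      # persons A, B, C
close_together = np.array([172, 173, 171])  # persons D, E, F

var_spread = np.var(spread_out)
var_close = np.var(close_together)

print(f"Variance of A, B, C: {var_spread:.2f}")  # 272.22
print(f"Variance of D, E, F: {var_close:.2f}")   # 0.67
```

The first group has a variance several hundred times larger, which is why its members are so much easier to tell apart.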

## Basic Terminologies of PCA

Before getting into PCA, we need to understand some basic terminology:

- **Variance** – measures how much the data varies (spreads out) along each dimension
- **Covariance** – measures the dependencies and relationships between features
- **Standardizing data** – scaling the features to a common range (typically mean 0 and variance 1) so that no feature dominates the result

Image Source: PCA Terminologies

**Covariance matrix** – Captures the interdependencies between the features or variables; it is the starting point for reducing them to improve performance

Source: https://www.exceldemy.com/calculate-covariance-matrix-in-excel/
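The terms introduced so far (variance, standardization, covariance matrix) can be computed in a few lines of NumPy. This is a minimal sketch on a made-up toy array; the numbers are illustrative only, and the sample (n − 1) convention is an assumption on my part.

```python
import numpy as np

# Hypothetical toy data: 5 samples (rows) of 2 features (columns)
X = np.array([[2.0, 1.0],
              [4.0, 3.0],
              [6.0, 5.0],
              [8.0, 6.0],
              [10.0, 10.0]])

# Variance of each feature (sample variance, dividing by n - 1)
variances = X.var(axis=0, ddof=1)

# Standardize: subtract the mean and divide by the standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance matrix of the standardized features (columns are variables)
cov_matrix = np.cov(X_std, rowvar=False)

print("Variances:", variances)
print("Covariance matrix of standardized data:\n", cov_matrix)
```

After standardization the diagonal of the covariance matrix is 1, and the off-diagonal entries are the correlations between the features.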

**Eigenvalues and eigenvectors** – The eigenvectors of the covariance matrix give the directions along which the data varies the most, and they are used to compute the principal components. Each eigenvalue is the magnitude associated with its eigenvector: it indicates the amount of variance in that direction. More generally, an eigenvector is a vector that a linear transformation only stretches or shrinks (by a factor equal to its eigenvalue) without altering its direction, as can be seen on an X-Y (2D) graph.

Source: https://byjus.com/maths/eigen-values/

In this shear mapping, the blue arrow changes direction whereas the pink arrow does not. The pink arrow is therefore an eigenvector of the mapping, because its orientation stays constant. Its length is also unaltered, so its eigenvalue is 1. Technically, a principal component (PC) is a straight line that captures the maximum variance (information) of the data. Each PC has a direction and a magnitude, and the PCs are perpendicular to each other.
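The shear example can be checked numerically. The sketch below assumes a simple horizontal shear matrix of my own choosing, not necessarily the one in the figure: the horizontal vector keeps its direction and length (eigenvalue 1), while the vertical vector is tilted.

```python
import numpy as np

# A horizontal shear: (x, y) -> (x + y, y)
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])

horizontal = np.array([1.0, 0.0])  # the arrow that keeps its direction
vertical = np.array([0.0, 1.0])    # the arrow that gets tilted

print(shear @ horizontal)  # [1. 0.] unchanged: an eigenvector with eigenvalue 1
print(shear @ vertical)    # [1. 1.] direction changed: not an eigenvector

# NumPy confirms that the only eigenvalue of this shear is 1
eigenvalues, eigenvectors = np.linalg.eig(shear)
print(eigenvalues)
```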

**Dimensionality reduction** – Multiply the transpose of the derived feature vector (the matrix of selected eigenvectors) by the transpose of the standardized original data. This reduces the number of features while losing as little information as possible.

Source: https://www.displayr.com/category/data-science/dimension-reduction/

## How does PCA work?

The steps involved in PCA are as follows:

- Start with the original data
- Standardize the original data (mean = 0, variance = 1)
- Calculate the covariance matrix
- Calculate the eigenvalues, eigenvectors, and normalized eigenvectors
- Calculate the principal components (PCs)
- Plot the PCs to check that they are orthogonal
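The steps listed above can be sketched directly in NumPy. This is an illustrative implementation under my own assumptions, not the handwritten calculation itself: the function name `pca_from_scratch` and the toy data are hypothetical, and the final orthogonality check is done numerically instead of with a plot.

```python
import numpy as np

def pca_from_scratch(X, n_components):
    """Return (scores, eigenvalues, eigenvectors) for X of shape (n_samples, n_features)."""
    # Standardize each feature to mean 0 and variance 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

    # Covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)

    # Eigen-decomposition; eigh suits symmetric matrices and returns
    # unit-length eigenvectors as columns, with eigenvalues ascending
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]  # largest variance first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

    # Project the data onto the leading eigenvectors
    scores = X_std @ eigenvectors[:, :n_components]
    return scores, eigenvalues, eigenvectors

# Correlated toy data (made up for this sketch)
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
scores, eigenvalues, eigenvectors = pca_from_scratch(X, n_components=2)

# Orthogonality check: this product is (approximately) the identity matrix
print(np.round(eigenvectors.T @ eigenvectors, 6))
# The PC scores are uncorrelated, with the eigenvalues on the diagonal
print(np.round(np.cov(scores, rowvar=False), 6))
```

The second printout shows why PCA turns correlated features into uncorrelated ones: the covariance matrix of the scores has no off-diagonal entries.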

I have worked through these steps by hand; the value of handwritten notes is that they reveal the crux behind the coding concepts.

First, we calculate the means and then the covariance matrix between the features.

After finding the covariance matrix, we calculate the eigenvalues, eigenvectors, and normalized eigenvectors.

Steps involved in finding the eigenvalues and eigenvectors in the manual approach:

From this, we calculate the PCs.

We then calculate the normalized eigenvectors.

Hence PCA is calculated, and visually we can see how the PCs are orthogonal to each other.

#### How many principal components are needed for any data?

This can be understood as follows:

The principal components with the most variance (information) are the ones worth selecting.

The eigenvalues tell us which principal components have the maximum variance.
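One common way to turn this into a rule is to keep the smallest number of components whose cumulative share of the eigenvalues reaches a chosen threshold. The helper below is a hypothetical sketch: the name `components_needed`, the 0.8 default, and the example eigenvalues are my own choices for illustration.

```python
import numpy as np

def components_needed(eigenvalues, threshold=0.8):
    """Smallest number of components whose cumulative explained variance reaches threshold."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # First position where the cumulative share is >= threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing k past the number of components
    return min(k, len(eigenvalues))

# Made-up eigenvalues: the cumulative shares are 0.4, 0.7, 0.9, 1.0
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(components_needed(example_eigenvalues, threshold=0.8))  # 3
```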

## Advantages of Principal Component Analysis

- Used for dimensionality reduction
- PCA helps remove correlation among features, a problem also known as multicollinearity.
- Because PCA reduces the number of features, the time required to train your model becomes substantially shorter.
- PCA helps counter overfitting by eliminating extraneous features from your dataset.

## Disadvantages of Principal Component Analysis

- Useful for quantitative data but not effective with qualitative data.
- Principal components are harder to interpret than the original features.

## Applications of Principal Component Analysis

- Computer vision
- Bioinformatics applications
- Image compression and resizing
- Discovering patterns in high-dimensional data
- Reduction of dimensions
- Visualization of multidimensional data

## Python Code for Principal Component Analysis

Before working with any dataset, let’s try it with some randomly generated data:

```
import numpy as np
import matplotlib.pyplot as plt

# generate 200 correlated two-dimensional points
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
plt.scatter(X[:, 0], X[:, 1])
plt.axis('equal')
plt.show()
```

```
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)
print(pca.explained_variance_)
```

```
def draw_vector(v0, v1, ax=None):
    """Draw an arrow from point v0 to point v1."""
    ax = ax or plt.gca()
    arrowprops = dict(arrowstyle='->', linewidth=2, shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal')
plt.show()
```


These vectors represent the *principal axes* of the data, and the length of each vector indicates how “important” that axis is in describing the distribution of the data; more precisely, it is a measure of the variance of the data when projected onto that axis. The projections of the data points onto the principal axes are the “principal components” of the data.

We can also use PCA to reduce dimensionality by keeping only the first principal component:

```
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:   ", X.shape)
print("transformed shape:", X_pca.shape)
```

The transformed data has been reduced to a single dimension. We can run the inverse transform on this reduced data and display it next to the original data to visualize the impact of this dimensionality reduction:

```
X_new = pca.inverse_transform(X_pca)
plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)
plt.axis('equal')
plt.show()
```

The light points are the original data, while the dark points are the projected data. This shows what a PCA dimensionality reduction means: the information along the least important principal axis or axes is removed, leaving only the component(s) of the data with the largest variance. The amount of “information” lost in this reduction is generally measured by the proportion of variance that is discarded.
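The link between discarded variance and lost information can be checked numerically. In this sketch the toy data is regenerated so that the block is self-contained, and the names `variance_lost` and `reconstruction_loss` are my own; it compares the unexplained variance ratio reported by scikit-learn with the relative squared error of the reconstruction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same kind of toy data as above, regenerated so the sketch stands alone
rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T

pca = PCA(n_components=1).fit(X)
X_new = pca.inverse_transform(pca.transform(X))

# Share of variance that the single kept component does not explain
variance_lost = 1 - pca.explained_variance_ratio_.sum()

# Share of the total squared spread around the mean lost in reconstruction
reconstruction_loss = np.sum((X - X_new) ** 2) / np.sum((X - X.mean(axis=0)) ** 2)

print(f"Variance discarded:  {variance_lost:.4f}")
print(f"Reconstruction loss: {reconstruction_loss:.4f}")
```

The two numbers agree, which is the sense in which the discarded variance measures the information lost.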

For a more realistic example, we now work with the breast cancer dataset that ships with scikit-learn.

```
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
```

```
# names and number of the features
print(breast_cancer.feature_names)
print(len(breast_cancer.feature_names))
```

```
import numpy as np

# labels, class names, and number of samples per class
print(breast_cancer.target)
print(breast_cancer.target_names)
print(np.array(np.unique(breast_cancer.target, return_counts=True)))
```

```
import numpy as np
import matplotlib.pyplot as plt

_, axes = plt.subplots(6, 5, figsize=(15, 15))
malignant = breast_cancer.data[breast_cancer.target == 0]
benign = breast_cancer.data[breast_cancer.target == 1]
ax = axes.ravel()  # flatten the 2D array
for i in range(30):  # for each of the 30 features
    bins = 40
    #---plot histogram for each feature---
    ax[i].hist(malignant[:, i], bins=bins, color='r', alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color='b', alpha=0.3)
    #---set the title---
    ax[i].set_title(breast_cancer.feature_names[i], fontsize=12)
    #---display the legend---
    ax[i].legend(['malignant', 'benign'], loc='best', fontsize=8)
plt.tight_layout()
plt.show()
```

```
import pandas as pd

df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
df['diagnosis'] = breast_cancer.target
df  # displays the DataFrame in a notebook
```

#### Training the Model Using All the Features

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1]

#---perform a split---
random_state = 12
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=random_state)

#---train the model using Logistic Regression---
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

#---evaluate the model---
log_reg.score(X_test, y_test)
```

#### Training the Model Using Reduced Features

```
df_corr = df.corr()['diagnosis'].abs().sort_values(ascending=False)
df_corr
```

```
# get all the features that have at least 0.6 correlation with the target
features = df_corr[df_corr > 0.6].index.to_list()[1:]
features  # without the 'diagnosis' column
```

#### Checking for Multicollinearity

```
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(df, features):
    vif, tolerance = {}, {}
    # all the features that you want to examine
    for feature in features:
        # extract all the other features you will regress against
        X = [f for f in features if f != feature]
        X, y = df[X], df[feature]
        # extract r-squared from the fit
        r2 = LinearRegression().fit(X, y).score(X, y)
        # calculate tolerance
        tolerance[feature] = 1 - r2
        # calculate VIF
        vif[feature] = 1 / (tolerance[feature])
    # return VIF DataFrame
    return pd.DataFrame({'VIF': vif, 'Tolerance': tolerance})

calculate_vif(df, features)
```

```
# try to reduce those features that have high VIF until each feature
# has VIF less than 5
features = [
    'worst concave points',
    'mean radius',
    'mean concavity',
]
calculate_vif(df, features)
```

#### Training the Model

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df.loc[:, features]  # get the reduced features in the dataframe
y = df.loc[:, 'diagnosis']

# perform a split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=random_state)

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
log_reg.score(X_test, y_test)
```

#### Training the Model Using Reduced Features (PCA)

```
# Performing standard scaling
from sklearn.preprocessing import StandardScaler

# get the features and label from the original dataframe
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# performing standardization
sc = StandardScaler()
X_scaled = sc.fit_transform(X)
```

#### Applying Principal Component Analysis (PCA)

```
from sklearn.decomposition import PCA

components = None
pca = PCA(n_components=components)

# perform PCA on the scaled data
pca.fit(X_scaled)
```

```
# print the explained variances
print("Variances (Percentage):")
print(pca.explained_variance_ratio_ * 100)
print()
print("Cumulative Variances (Percentage):")
print(pca.explained_variance_ratio_.cumsum() * 100)
print()
```

```
# plot a scree plot
components = len(pca.explained_variance_ratio_) if components is None else components
plt.plot(range(1, components + 1), np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
plt.show()
```

```
from sklearn.decomposition import PCA

# keep enough components to explain 85% of the variance
pca = PCA(n_components=0.85)
pca.fit(X_scaled)
print("Cumulative Variances (Percentage):")
print(np.cumsum(pca.explained_variance_ratio_ * 100))
components = len(pca.explained_variance_ratio_)
print(f'Number of components: {components}')

# Make the scree plot
plt.plot(range(1, components + 1), np.cumsum(pca.explained_variance_ratio_ * 100))
plt.xlabel("Number of components")
plt.ylabel("Explained variance (%)")
plt.show()
```

```
# absolute loadings of each original feature on each component
pca_components = abs(pca.components_)
print(pca_components)
```

```
print('Top 4 most important features in each component')
print('===============================================')
for row in range(pca_components.shape[0]):
    # get the indices of the top 4 values in each row
    temp = np.argpartition(-(pca_components[row]), 4)
    # sort the indices in descending order
    indices = temp[np.argsort((-pca_components[row])[temp])][:4]
    # print the top 4 feature names
    print(f'Component {row}: {df.columns[indices].to_list()}')
```

#### Transforming All the 30 Columns to the 6 Principal Components

```
X_pca = pca.transform(X_scaled)
print(X_pca.shape)
print(X_pca)
```

#### Creating a Machine Learning Pipeline

```
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

_sc = StandardScaler()
_pca = PCA(n_components=components)
_model = LogisticRegression()
log_regress_model = Pipeline([
    ('std_scaler', _sc),
    ('pca', _pca),
    ('regressor', _model)
])
```

```
# perform a split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=random_state)
```

```
# train the model using the PCA components
log_regress_model.fit(X_train,y_train)
log_regress_model.score(X_test,y_test)
```

## Conclusion

I hope learners now have some understanding of Principal Component Analysis, one of the most important methods in unsupervised machine learning. PCA is used for more than dimensionality reduction; it can also help identify key features and resolve multicollinearity issues. Although what I’ve covered here is important and useful for the projects we’ll be working on, there is still a lot more to understand, which I’ll cover in upcoming articles. Coding and theory by themselves don’t always make a topic easier to comprehend, which is why I also included my handwritten notes; they give the reader additional context. After reading this blog, learners should understand:

1. How to calculate PCA manually, without coding

2. The key concepts behind PCA, such as eigenvectors, eigenvalues, and principal components

3. How to approach PCA in Python

**The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.**