Demystifying the working of Principal Component Analysis!

Himanshu Kunwar 15 May, 2021 • 5 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Principal Component Analysis (PCA) is one of the prominent dimensionality reduction techniques. It is valuable when we need to reduce the dimension of the dataset while retaining maximum information.

In this article, we will learn why PCA is needed, how PCA works, the preprocessing steps required before applying it, and how to interpret principal components.

Why do we need PCA?

PCA is not required unless you have a dataset with a large number of attributes. However, real-world data is often huge and messy, with many attributes.

Applying a Machine Learning model to such a huge dataset without reducing its dimensionality is computationally expensive.

Therefore, we need PCA to reduce the dimensionality while retaining maximum information, since our objective is to deliver accurate ML models with lower time and space complexity.

 

PCA is needed when a dataset has a large number of attributes; for smaller datasets, we can avoid it.

Is there any preprocessing step required before applying PCA?

We need to keep the following points in mind before applying PCA:

  • PCA cannot be applied to a dataset with null values, so you need to treat them before proceeding. Common approaches include dropping the affected variables or imputing the missing values with the mean or median (a short sketch follows this list).
  • We shouldn’t apply PCA to a dataset whose attributes are on different scales; the variables need to be standardized first.
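
Here is a minimal sketch of these two steps. The file name is a placeholder, the sketch assumes all columns are numeric, and the choice of scaler is illustrative; any of the sklearn scalers mentioned later in this article would work.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("your_dataset.csv")  # placeholder path

# 1. Treat null values, e.g. impute numeric columns with their median
df = df.fillna(df.median(numeric_only=True))

# 2. Standardize the attributes so they are on a comparable scale
df_scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)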

Let us take the Facebook Metrics dataset as an example.

This dataset has 19 columns (or dimensions), and we will try to reduce them using PCA. Below you will find the Python code and its output. We have dropped the one categorical column for simplicity of analysis.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\Users\Himanshu\Downloads\Facebook_metrics\dataset_Facebook.csv", sep=';')
data.drop(columns='Type', inplace=True)  # drop the categorical column so all remaining data is numerical
data.head()

[Output: first few rows of the Facebook metrics dataset]

We will check the statistical summary of our dataset to find the scale of the different attributes. Below we can see that every attribute is on a different scale. Therefore, we cannot jump straight to PCA without rescaling the attributes.

[Output: statistical summary of the dataset]

We see that column “Post Weekday” has less variance and column “Lifetime Post Total Reach” has comparatively more variance.

Therefore, if we apply PCA without standardizing the data, more weight will be given to the “Lifetime Post Total Reach” column during the calculation of the eigenvectors and eigenvalues, and we will end up with biased principal components.
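
As a quick check (a minimal sketch; the column names are taken from the dataset shown above), we can compare the raw variances of these two columns directly. This is exactly the imbalance that standardization removes.

# Compare the raw variances of a narrow-range and a wide-range column
print(data[['Post Weekday', 'Lifetime Post Total Reach']].var())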

Now we will standardize the dataset using RobustScaler from the sklearn library. sklearn also provides other scalers, such as StandardScaler and MinMaxScaler, which can be chosen as per the requirement.

from sklearn.preprocessing import RobustScaler

rs = RobustScaler()
scaled = pd.DataFrame(rs.fit_transform(data), columns=data.columns)
scaled.head()

How do we decide the number of principal components?

Unless specified, the number of principal components will be equal to the number of attributes.

Our dataset has 18 attributes (after dropping the categorical column), hence we get 18 principal components. These components are new variables that are, in fact, linear combinations of the input variables.

Once we get the amount of variance explained by each principal component we can decide how many components we need for our model based on the amount of information we want to retain.

Principal components are uncorrelated with each other. The directions of the principal components are the eigenvectors of the covariance matrix, and the variance explained along each eigenvector is the corresponding eigenvalue.
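
To connect this terminology with the computation, here is a minimal sketch (assuming the standardized data from above) that recovers these quantities with numpy's eigendecomposition of the covariance matrix; the printed ratios should match the explained_variance_ratio_ reported by sklearn below.

import numpy as np

X = scaled.dropna()                       # PCA (and np.cov) cannot handle null values
cov = np.cov(X.T)                         # covariance matrix of the 18 attributes
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigenvectors = principal directions, eigenvalues = variances
eig_vals = np.sort(eig_vals)[::-1]        # largest variance first

# Each ratio is the share of total variance explained by one principal component
print(eig_vals / eig_vals.sum())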

Below we have applied PCA to the scaled dataset. If we want a predefined number of components, we can specify it using PCA(n_components).

from sklearn.decomposition import PCA

scaled_data = scaled.dropna()  # PCA cannot handle null values, so drop the remaining rows with NaNs
pca = PCA()  # set n_components to an integer if a predefined number of components is needed
pca.fit_transform(scaled_data)
print(pca.explained_variance_ratio_)

Here the output is the variance explained by each principal component. We have 18 attributes in our dataset and hence we get 18 principal components.

Always remember that the first principal component always holds the maximum variance.

You can observe this in the output: the first principal component explains the most variance, followed by the subsequent components in decreasing order.
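
To turn this output into a decision about how many components to keep, a small sketch like the following sums the ratios cumulatively and finds where they cross a chosen threshold (95% here). sklearn can also make this choice for us: passing a fraction such as PCA(n_components=0.95) keeps just enough components to explain at least that share of the variance.

# Find the smallest number of components that explains at least 95% of the variance
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = np.argmax(cum_var >= 0.95) + 1
print(f"{n_keep} components retain {cum_var[n_keep - 1]:.1%} of the variance")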

Interpretation of Principal Components

Now we have 18 principal components and we will try to find out how these components are influenced by each attribute.

We can check the influence of the top 3 attributes (both positive and negative) for the first principal component.

Below is the Python code to fetch the influence of attributes on any principal component by changing the number of features and the component number.

def feature_weight(pca, n_comp, n_feat):
    """Plot the n_feat most positively and negatively weighted attributes for principal component n_comp (1-indexed)."""
    comp = pd.DataFrame(np.round(pca.components_, 2), columns=scaled_data.columns).iloc[n_comp - 1]
    comp = comp.sort_values(ascending=False)
    comp = pd.concat([comp.head(n_feat), comp.tail(n_feat)])  # strongest positive and negative weights
    comp.plot(kind='bar', title='Top {} weighted attributes for PCA component {}'.format(n_feat, n_comp))
    plt.show()
    return comp

feature_weight(pca, 1, 3)  # first principal component, top 3 positive and negative weights

 

[Bar chart: top weighted attributes for the first principal component]

We can interpret from this that our first principal component is mostly influenced by engagement with the post (likes, comments, impressions, and reach).

Likewise, we can interpret the other principal components, based on our understanding of the data, using the above plot.
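
For example, the same helper can be pointed at the second component:

# Inspect the top 3 positive and negative weights for the second principal component
feature_weight(pca, 2, 3)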

Plot to visualize the variance explained by each principal component: the Scree Plot

Below you can see a scree plot that depicts the variance explained by each principal component.

Here we can see that the top 8 components account for more than 95% of the variance, so we can use these 8 principal components for our modelling purpose.

def screeplot(pca):
    """Bar chart of the variance explained by each component, with the cumulative curve overlaid."""
    var_pca = pca.explained_variance_ratio_
    indx = np.arange(len(var_pca))
    cum_var = np.cumsum(var_pca)
    plt.figure(figsize=(14, 8))
    ax = plt.subplot()
    ax.bar(indx, var_pca)   # variance explained by each individual component
    ax.plot(indx, cum_var)  # cumulative variance explained
    ax.set_xlabel("Principal Components")
    ax.set_ylabel("Proportion of Variance Explained")
    plt.title('Cumulative Variance Explained by Principal Components')
    plt.show()

screeplot(pca)
[Scree plot for the scaled data]

Finally, we reduce the number of attributes from the initial 18 to 8, while retaining about 95% of the information in our dataset. Voila!! 🙂
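
As a final step, here is a short sketch of what that reduction looks like in code (assuming we settle on 8 components, as suggested by the scree plot):

# Project the standardized data onto the first 8 principal components
pca_8 = PCA(n_components=8)
reduced = pd.DataFrame(pca_8.fit_transform(scaled_data),
                       columns=[f"PC{i}" for i in range(1, 9)])
print(reduced.shape)                          # (number of rows, 8)
print(pca_8.explained_variance_ratio_.sum())  # cumulative variance retained, expected to be just above 0.95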

Below is the scree plot for the unscaled data, just to check how different the principal components are from the scaled version. We can see that there is a huge difference in the principal components and the amount of variance explained: here, the first component alone explains around 85% of the variance.
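
This comparison can be reproduced with a couple of lines, reusing the screeplot helper defined above (a sketch; PCA is fit on the raw data after dropping rows with missing values):

# Scree plot for the unscaled data, for comparison with the scaled version
pca_unscaled = PCA()
pca_unscaled.fit(data.dropna())
screeplot(pca_unscaled)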

Similarly, you can check how each principal component of the unscaled data has been influenced by the attributes.

I hope this article helps you understand the basics of PCA.

If you like this article, I will share another one covering the basic mathematics behind PCA.

The media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.

