avcontentteam — Published On March 21, 2016 and Last Modified On April 26th, 2023

Introduction

Too much of anything is good for nothing!

Picture this: you are working on a large-scale data science project. What happens when the given data set has too many variables? There are a few situations you might come across. For instance, you find on analysis that most of the variables are correlated, you become indecisive about what to do, lose patience, and decide to run a model on the whole data. The model returns poor accuracy, you feel terrible, and you start thinking of a strategic way to find a few important variables. That’s where Principal Component Analysis (PCA) comes in.

Trust me, dealing with such situations isn’t as difficult as it sounds. This is a common scenario in machine learning projects. Statistical techniques such as factor analysis and principal component analysis (PCA) help overcome these difficulties. In this post, I’ve explained the concept of PCA, keeping the explanation simple and informative. I’ve also demonstrated the technique in R, with interpretations, for practical understanding.

Note: Understanding this concept requires prior knowledge of statistics.
Learning Objectives

  • You will learn, step by step, the widely used dimension reduction technique of Principal Component Analysis (PCA).
  • You will learn how to extract the important factors from the data with the help of PCA.
  • Finally, you will learn how to implement PCA in both R and Python.



What Is Principal Component Analysis?

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of a dataset while preserving maximum variation. It transforms the original variables into a new set of linearly uncorrelated variables called principal components.

PCA is commonly used in data exploration, visualization, and machine learning. It is a powerful tool for data visualization and interpretation, particularly in high-dimensional datasets.

PCA works by finding the directions of maximum variance in the data set and projecting the data onto these directions. The principal components are ordered by the amount of variance they explain and are used for feature selection, data compression, clustering, and classification.

PCA has several advantages over other dimensionality reduction techniques, such as linearity, computational efficiency, and the ability to handle large datasets. However, it also has some limitations, such as assumptions about normal distribution and linearity and the potential for information loss.

PCA is always performed on a symmetric correlation or covariance matrix, which means the input data must be numeric and should be standardized.

The covariance matrix describes the spread (variance) and the orientation (covariance) of the dataset. The directions of that spread are given by the eigenvectors of the matrix, and their magnitudes by the corresponding eigenvalues. The number of eigenvectors you retain equals the number of principal components you choose.
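As a toy illustration of this relationship, here is a minimal R sketch (on simulated data, not the article’s dataset) that computes the covariance matrix of standardized variables and extracts its eigenvectors and eigenvalues:

#toy example: eigen decomposition of the covariance matrix of standardized data
set.seed(1)
X <- matrix(rnorm(300 * 3), ncol = 3)        # 300 observations, 3 variables
X[, 2] <- X[, 1] + rnorm(300, sd = 0.3)      # make two of the variables correlated
C <- cov(scale(X))                           # covariance of standardized data
e <- eigen(C)
e$values    # variances along each principal direction, in descending order
e$vectors   # columns are the eigenvectors, i.e., the principal component directions

Base R’s prcomp(), used later in this article, performs the equivalent computation via singular value decomposition.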

Let’s understand it using an example:

Let’s say we have a data set of dimension 300 (n) × 50 (p). n represents the number of observations, and p represents the number of predictors. Since we have a large p = 50, there are p(p-1)/2 = 1,225 possible scatter plots for analyzing the variable relationships. Wouldn’t it be a tedious job to perform exploratory analysis on this data?

In this case, it would be a more lucid approach to derive a small set of new predictors (far fewer than 50) that captures as much information as possible, and then plot the observations in the resulting low-dimensional space.

The image below shows the transformation of high-dimensional data (3 dimensions) to low-dimensional data (2 dimensions) using PCA. Not to forget, each resulting dimension is a linear combination of the p original features.

PCA projections (Source: nlpca)

What Are Principal Components?

A principal component is a normalized linear combination of the original features in a data set. In the image above, PC1 and PC2 are the principal components. Let’s say we have a set of predictors as X¹, X²...,Xp

The principal component can be written as:

Z¹ = Φ¹¹X¹ + Φ²¹X² + Φ³¹X³ + .... +Φp¹Xp

where,

  • Z¹ is the first principal component
  • Φ¹¹, Φ²¹, ..., Φp¹ make up the loading vector of the first principal component. The loadings are constrained so that their sum of squares equals 1, because loadings of arbitrarily large magnitude could inflate the variance. The loading vector also defines the direction of the principal component (Z¹), along which the data vary the most. It results in a line in p-dimensional space that is closest to the n observations, where closeness is measured by average squared Euclidean distance.
  • X¹..Xp are normalized predictors. Normalized predictors have mean values equal to zero and standard deviations equal to one.

First Principal Component

The first principal component is the linear combination of the original predictor variables that captures the maximum variance in the data set. It determines the direction of highest variability in the data. The larger the variability captured by the first component, the more information it contains. No other component can have variability higher than the first principal component.

The first principal component results in a line that is closest to the data, i.e., it minimizes the sum of squared distance between a data point and the line.

Similarly, we can compute the second principal component also.

Second Principal Component (Z²)

The second principal component is also a linear combination of the original predictors; it captures the remaining variance in the data set and is uncorrelated with Z¹. In other words, the correlation between the first and second components should be zero. It can be represented as:

Z² = Φ¹²X¹ + Φ²²X² + Φ³²X³ + .... + Φp²Xp

If the two components are uncorrelated, their directions must be orthogonal (image below). The image is based on simulated data with 2 predictors. Notice that the directions of the components are, as expected, orthogonal, which indicates that the correlation between them is zero.

PCA : Orthogonality of Principal Components
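You can verify both properties, maximum variance in the first component and zero correlation between components, on simulated data. A small sketch, not tied to the article’s dataset:

#simulate 2 correlated predictors and check the properties of the components
set.seed(42)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + rnorm(200, sd = 0.5)
pc <- prcomp(cbind(x1, x2), scale. = TRUE)
var(pc$x[, 1]) > var(pc$x[, 2])   # TRUE: PC1 captures more variance than PC2
cor(pc$x[, 1], pc$x[, 2])         # effectively zero: the component scores are uncorrelated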

All succeeding principal components follow a similar concept: each captures the remaining variance while being uncorrelated with the previous components. In general, for n × p dimensional data, min(n-1, p) principal components can be constructed.

The directions of these components are identified in an unsupervised way; i.e., the response variable (Y) is not used to determine them. Therefore, PCA is an unsupervised approach.

Note: Partial least squares (PLS) is a supervised alternative to PCA. PLS assigns a higher weight to variables that are strongly related to the response variable when determining the components.

Why Is Normalization of Variables Necessary in PCA?

The principal components are computed from a normalized version of the original predictors because the original predictors may be measured on very different scales. For example, imagine a data set with variables measured in gallons, kilometers, light years, etc. The variances of these variables will obviously differ widely.

Performing PCA on un-normalized variables leads to very large loadings for the variables with high variance. In turn, the principal components end up depending mostly on those high-variance variables, which is undesirable.

As shown in the image below, PCA was run on the data set twice (first with unscaled and then with scaled predictors). This data set has ~40 variables. You can see that the variable Item_MRP dominates the first principal component and the variable Item_Weight dominates the second. This domination occurs because of the high variance associated with those variables. When the variables are scaled, we get a much better representation of the variables in 2D space.

PCA : Effect of Normalisation on PCA
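You can reproduce this effect without the Big Mart data by comparing loadings from unscaled and scaled PCA on any dataset with mixed units; a sketch using R’s built-in mtcars data:

#compare the first loading vector with and without scaling (mtcars mixes units)
unscaled <- prcomp(mtcars, scale. = FALSE)
scaled   <- prcomp(mtcars, scale. = TRUE)
round(unscaled$rotation[, 1], 3)   # dominated by disp and hp, the variables with the largest raw variances
round(scaled$rotation[, 1], 3)     # loadings spread far more evenly across the variables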

Implement PCA in R & Python (With Interpretation)

How many principal components should we choose from the original dataset? I could dive deep into theory, but it is better to answer this question practically.

For this demonstration, I’ll be using the data set from Big Mart Prediction Challenge III.

Remember, Principal Component Analysis can be applied only to numerical data. Therefore, if the data have categorical variables, they must be converted to numerical ones. Also, make sure you have done the basic data cleaning prior to implementing this technique. Let’s quickly finish with initial data loading and cleaning steps:

#directory path
> path <- ".../Data/Big_Mart_Sales"
#set working directory
> setwd(path)
#load train and test file
> train <- read.csv("train_Big.csv")
> test <- read.csv("test_Big.csv")
#add a column
> test$Item_Outlet_Sales <- 1
#combine the data set
> combi <- rbind(train, test)
#impute missing values with median
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)
#impute 0 with median
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
                                  median(combi$Item_Visibility),
                                  combi$Item_Visibility)
#inspect Outlet_Size against Outlet_Type, then recode the blank level as "Other"
> table(combi$Outlet_Size, combi$Outlet_Type)
> levels(combi$Outlet_Size)[1] <- "Other"

Till here, we’ve imputed the missing values. Now we are left with removing the dependent (response) variable and any identifier variables. As mentioned above, we are practicing an unsupervised learning technique; hence the response variable must be removed.

#remove the dependent and identifier variables
> my_data <- subset(combi, select = -c(Item_Outlet_Sales, Item_Identifier, Outlet_Identifier))

Let’s check the available variables (a.k.a. predictors) in the data set.

#check available variables
> colnames(my_data)

Since PCA works on numeric variables, let’s see if we have any variables other than numeric.

#check variable class
> str(my_data)
'data.frame': 14204 obs. of 9 variables:
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0.054 0.054 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : Factor w/ 4 levels "Other","High",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...

Sadly, 6 out of 9 variables are categorical in nature. We have some additional work to do now. We’ll convert these categorical variables into numeric ones using one-hot encoding.

#load library
> library(dummies)
#create a dummy data frame
> new_my_data <- dummy.data.frame(my_data, names = c("Item_Fat_Content","Item_Type",
                                "Outlet_Establishment_Year","Outlet_Size",
                                "Outlet_Location_Type","Outlet_Type"))

To check if we now have a data set of integer values, simply write:

#check the data set
> str(new_my_data)

And we now have all the numerical values. Let’s divide the data into test and train.

#divide the new data
> pca.train <- new_my_data[1:nrow(train),]
> pca.test <- new_my_data[-(1:nrow(train)),]

We can now go ahead with PCA.

The base R function prcomp() is used to perform PCA. By default, it centers the variables to have mean equal to zero. With the parameter scale. = T, we also normalize the variables to have standard deviation equal to 1.

#principal component analysis
> prin_comp <- prcomp(pca.train, scale. = T)
> names(prin_comp)
[1] "sdev"     "rotation" "center"   "scale"    "x"

The prcomp() function returns 5 useful measures:

1. center and scale refer to the mean and standard deviation, respectively, of the variables that are used for normalization prior to implementing PCA.

#outputs the mean of variables
prin_comp$center
#outputs the standard deviation of variables
prin_comp$scale

2. The rotation measure provides the principal component loadings. Each column of the rotation matrix contains one principal component loading vector. This is the most important measure we should be interested in.

> prin_comp$rotation

This returns 44 principal component loading vectors. Is that correct? Absolutely. The maximum number of principal components in a data set is min(n-1, p). Let’s look at the first 4 principal components and the first 5 rows.

> prin_comp$rotation[1:5,1:4]
                                PC1            PC2            PC3             PC4
Item_Weight                0.0054429225   -0.001285666   0.011246194   0.011887106
Item_Fat_ContentLF        -0.0021983314    0.003768557  -0.009790094  -0.016789483
Item_Fat_Contentlow fat   -0.0019042710    0.001866905  -0.003066415  -0.018396143
Item_Fat_ContentLow Fat    0.0027936467   -0.002234328   0.028309811   0.056822747
Item_Fat_Contentreg        0.0002936319    0.001120931   0.009033254  -0.001026615

3. To obtain the principal component score vectors, we don’t need to multiply the loadings with the data ourselves. The matrix x already holds the principal component score vectors, with dimension 8523 × 44.

> dim(prin_comp$x)
[1] 8523    44
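To convince yourself of this, the score matrix can be reproduced by projecting the centered and scaled training data onto the rotation matrix; a quick check, using the objects created above:

#the scores in prin_comp$x are the scaled data projected onto the loadings
> manual_scores <- as.matrix(scale(pca.train, center = prin_comp$center,
                                   scale = prin_comp$scale)) %*% prin_comp$rotation
> max(abs(manual_scores - prin_comp$x))   #effectively zero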

Let’s plot the resultant principal components.

> biplot(prin_comp, scale = 0)
2 most prominent principal components

The parameter scale = 0 ensures that the arrows are scaled to represent the loadings. To draw inferences from the image above, focus on the extreme ends of the graph (top, bottom, left, right).

We infer that the first principal component corresponds to measures of Outlet_TypeSupermarket and Outlet_Establishment_Year2007. Similarly, the second component corresponds to measures of Outlet_Location_TypeTier1 and Outlet_Sizeother. For the exact contribution of a variable to a component, look at the rotation matrix (above) again.
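If the biplot is too crowded to read, you can also list the variables with the largest absolute loadings on each component directly from the rotation matrix; a small sketch:

#variables contributing most to the first two components (largest absolute loadings)
> sort(abs(prin_comp$rotation[, "PC1"]), decreasing = TRUE)[1:5]
> sort(abs(prin_comp$rotation[, "PC2"]), decreasing = TRUE)[1:5]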

4. The prcomp() function also provides the facility to compute standard deviation of each principal component. sdev refers to the standard deviation of principal components.

#compute standard deviation of each principal component
> std_dev <- prin_comp$sdev
#compute variance
> pr_var <- std_dev^2
#check variance of first 10 components
> pr_var[1:10]
[1] 4.563615 3.217702 2.744726 2.541091 2.198152 2.015320 1.932076 1.256831
[9] 1.203791 1.168101

We aim to find the components that explain the maximum variance, because we want to retain as much information as possible using these components. The higher the explained variance, the more information is contained in those components.

5. To compute the proportion of variance explained by each component, we simply divide each component’s variance by the total variance. This results in:

#proportion of variance explained
> prop_varex <- pr_var/sum(pr_var)
> prop_varex[1:20]
[1] 0.10371853 0.07312958 0.06238014 0.05775207 0.04995800 0.04580274
[7] 0.04391081 0.02856433 0.02735888 0.02654774 0.02559876 0.02556797
[13] 0.02549516 0.02508831 0.02493932 0.02490938 0.02468313 0.02446016
[19] 0.02390367 0.02371118
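The same numbers are reported directly by summary(); a quick cross-check using the fitted object from above:

#summary() reports the standard deviation, proportion of variance, and cumulative proportion per component
> summary(prin_comp)$importance[, 1:6]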

This shows that the first principal component explains about 10.4% of the variance, the second component explains 7.3%, the third 6.2%, and so on. So, how do we decide how many components to select for the modeling stage?

The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors that explain most of the variability in the data. It shows the values in descending order.

#scree plot
> plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")
scree plot in R

The plot above shows that ~30 components explain around 98.4% of the variance in the data set. In other words, using PCA we have reduced 44 predictors to 30 without compromising much on explained variance. This is the power of PCA. Let’s do a confirmation check by plotting a cumulative variance plot, which will give us a clear picture of the number of components.

#cumulative scree plot
> plot(cumsum(prop_varex), xlab = "Principal Component",
              ylab = "Cumulative Proportion of Variance Explained",
              type = "b")
PCA : cumulative explained variance

This plot shows that 30 components result in a variance close to ~ 98%. Therefore, in this case, we’ll select the number of components as 30 [PC1 to PC30] and proceed to the modeling stage. This completes the steps to implement PCA on train data. For modeling, we’ll use these 30 components as predictor variables and follow the normal procedures.
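Rather than reading the cutoff off the plot, you can also pick it programmatically; a small sketch, assuming a 98% explained-variance target:

#smallest number of components whose cumulative explained variance reaches 98%
> num_comp <- which(cumsum(prop_varex) >= 0.98)[1]
> num_comp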


Predictive Modeling With PCA Components

PCA is used along with machine learning algorithms for predictive modeling.

After we’ve performed PCA on the training set, let’s now understand the process of predicting on test data using these components. The process is simple: just as we obtained principal component scores for the training set, we obtain scores for the test set by applying the same transformation, and then we train the model on the training components.

But, few important points to understand:

  1. We should not combine the train and test set to obtain PCA components of the whole data at once, as this would violate the assumption of generalization since the test data would get ‘leaked’ into the training set. In other words, the test data set would no longer remain ‘unseen’. Eventually, this will hammer down the generalization capability of the model.
  2. We should not perform PCA on test and train data sets separately because the resultant vectors from the train and test PCAs will have different directions (due to unequal variance). Due to this, we’ll end up comparing data registered on different axes. Therefore, the resulting train and test data vectors should have same axes.

So, what should we do?

We should apply exactly the same transformation to the test set as we did to the training set, including the centering and scaling. Let’s do it in R:

#add a training set with principal components
> train.data <- data.frame(Item_Outlet_Sales = train$Item_Outlet_Sales, prin_comp$x)
#keep the response plus the first 30 principal components (columns 1 to 31)
> train.data <- train.data[,1:31]
#run a decision tree
> install.packages("rpart")
> library(rpart)
> rpart.model <- rpart(Item_Outlet_Sales ~ .,data = train.data, method = "anova")
> rpart.model
#transform test into PCA
> test.data <- predict(prin_comp, newdata = pca.test)
> test.data <- as.data.frame(test.data)
#select the first 30 components
> test.data <- test.data[,1:30]
#make prediction on test data
> rpart.prediction <- predict(rpart.model, test.data)
#For fun, finally check your score on the leaderboard
> sample <- read.csv("SampleSubmission_TmnO39y.csv")
> final.sub <- data.frame(Item_Identifier = sample$Item_Identifier, Outlet_Identifier = sample$Outlet_Identifier, Item_Outlet_Sales = rpart.prediction)
> write.csv(final.sub, "pca.csv",row.names = F)

That’s the complete modeling process after PCA extraction. I’m sure you wouldn’t be happy with your leaderboard rank after you upload the solution. Try using random forest!
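If you want to take up that suggestion, here is a minimal sketch using the randomForest package on the same 30 components; it assumes the package is installed and is a starting point rather than a tuned model:

#a quick, untuned random forest on the same principal components
> library(randomForest)
> set.seed(123)
> rf.model <- randomForest(Item_Outlet_Sales ~ ., data = train.data, ntree = 200)
> rf.prediction <- predict(rf.model, test.data)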

For Python Users:
To implement PCA in Python, import PCA from the sklearn library. The interpretation remains the same as explained for R users above. Of course, the result is the same as that derived using R. The data set used for Python is a cleaned version where missing values have been imputed and categorical variables have been converted into numeric ones. The modeling process remains the same as explained for R users above.

import numpy as np
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
%matplotlib inline
#Load data set
data = pd.read_csv('Big_Mart_PCA.csv')
#convert it to numpy arrays
X=data.values
#Scaling the values
X = scale(X)
pca = PCA(n_components=44)
pca.fit(X)
#The amount of variance that each PC explains
var= pca.explained_variance_ratio_
#Cumulative Variance explains
var1=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=4)*100)
print(var1)
[  10.37   17.68   23.92   29.7    34.7    39.28   43.67   46.53   49.27
51.92   54.48   57.04   59.59   62.1    64.59   67.08   69.55   72.
74.39   76.76   79.1    81.44   83.77   86.06   88.33   90.59   92.7
94.76   96.78   98.44  100.01  100.01  100.01  100.01  100.01  100.01
100.01  100.01  100.01  100.01  100.01  100.01  100.01  100.01]
plt.plot(var1)

Cumulative variance explained by the principal components
#Looking at above plot I'm taking 30 variables
pca = PCA(n_components=30)
pca.fit(X)
X1=pca.fit_transform(X)
print(X1)

For more information on PCA in Python, visit the scikit-learn documentation.

Conclusion

This brings me to the end of this tutorial. Without delving deep into the mathematics, I’ve tried to make you familiar with the most important concepts required to use this technique. It’s simple, but it needs special attention when deciding the number of components. Practically, we should strive to retain only the first few k components. The idea behind PCA is to construct a small number of principal components (k << p) that satisfactorily explain most of the variability in the data and can then be related to the response variable.

Key Takeaways

  • Principal Component Analysis (PCA) is used to overcome feature redundancy in a data set. The resulting features form a lower-dimensional representation of the data. These features, a.k.a. components, are normalized linear combinations of the original predictor variables.
  • The first component captures the highest variance, followed by the second, third, and so on. The components must be uncorrelated (remember the orthogonal directions?). See above.
  • Normalizing the data becomes extremely important when the predictors are measured in different units. PCA works best on data sets having 3 or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the raw data cloud.

Frequently Asked Questions

Q1. What is PCA principal component analysis in python?

A. PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms your original dataset with n features into another dataset with m features, where m < n.

Q2. What are the steps for PCA in Python?

A. Consider a dataset (nxd) with n no. of observations and d no. of features. Let’s say we want to reduce the no. of features from d to k.
Here are the steps involved in PCA:
1. Compute the covariance matrix (dxd).
2. Derive the eigenvectors and corresponding eigenvalues.
3. Construct the projection matrix (dxk) from the top k eigenvectors. The number of eigenvectors to keep equals the number of principal components chosen.
4. Transform the original dataset (nxd) into another dataset (nxk) with the projection matrix (dxk).
5. The resulting matrix is our transformed dataset (nxk).
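These steps map directly onto a few lines of R; a from-scratch sketch on simulated data (keeping k = 2 components), shown for illustration rather than as a replacement for prcomp():

#PCA from scratch, following the steps above (simulated n x d data, keep k components)
set.seed(7)
n <- 100; d <- 5; k <- 2
X <- scale(matrix(rnorm(n * d), ncol = d))   # standardize the data
C <- cov(X)                                  # step 1: d x d covariance matrix
e <- eigen(C)                                # step 2: eigenvalues and eigenvectors
W <- e$vectors[, 1:k]                        # step 3: d x k projection matrix
Z <- X %*% W                                 # steps 4-5: n x k transformed dataset
dim(Z)                                       # 100 x 2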

Q3. What are the advantages of PCA?

A. The advantages of PCA are that it counters the Curse of Dimensionality, removes the unwanted noise present in the dataset, and preserves the signal required.
