Comparison of Pearson and Spearman correlation coefficients
In this article, we discuss two types of correlation coefficient, the Pearson correlation coefficient and the Spearman correlation coefficient, and check whether they report the same strength of association or whether the two deviate from each other.
Table of contents:
· What is Correlation
· Pearson vs Spearman correlation
· Practical application of correlation using R
What is Correlation?
Correlation is a statistical measure of the association between two variables. It describes how one variable tends to behave when the other variable changes.
If the two variables increase or decrease together, they have a positive correlation. If one variable increases while the other decreases, they have a negative correlation. If a change in one variable is not accompanied by any systematic change in the other, they have zero correlation.
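The three cases can be illustrated with a few made-up vectors. This is only a minimal sketch: the names x, y_up, y_down and y_flat and their values are invented for illustration and are not part of the trees example used later in the article.

```r
# Toy vectors invented for illustration (not taken from any dataset)
x      <- 1:8
y_up   <- 2 * x + 1                        # increases as x increases
y_down <- 10 - 3 * x                       # decreases as x increases
y_flat <- c(1, -1, -1, 1, 1, -1, -1, 1)    # no upward or downward trend in x

r_up   <- cor(x, y_up)     # positive correlation
r_down <- cor(x, y_down)   # negative correlation
r_flat <- cor(x, y_flat)   # correlation of (approximately) zero

print(c(positive = r_up, negative = r_down, zero = r_flat))
```

The sign of each coefficient gives the direction of the association, and its absolute value gives the strength.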
Pearson vs Spearman correlation
Both the Pearson and the Spearman coefficient measure correlation; the difference between them lies in the kind of relationship each one evaluates.
Pearson correlation: Pearson correlation evaluates the linear relationship between two continuous variables.
Spearman correlation: Spearman correlation evaluates the monotonic relationship. The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data.
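The difference between the two definitions shows up most clearly on data that is monotonic but not linear. The sketch below uses an invented exponential curve (the vectors x and y are assumptions made for illustration) and also checks that Spearman's coefficient matches Pearson's coefficient computed on the ranks.

```r
# Made-up data: y rises with x at every step, but along a curve, not a line
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")    # lowered by the curvature
spearman <- cor(x, y, method = "spearman")   # depends only on the ordering

# Spearman's coefficient is Pearson's coefficient computed on the ranks
spearman_from_ranks <- cor(rank(x), rank(y), method = "pearson")

print(c(pearson = pearson, spearman = spearman,
        spearman_from_ranks = spearman_from_ranks))
```

Because y always increases with x, the ranks line up perfectly and the Spearman coefficient is 1, while the Pearson coefficient stays below 1 because the points do not fall on a straight line.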
Practical application of correlation using R
Determining the association between the Girth and Height of black cherry trees. We use the built-in dataset “trees”, which ships with R and can be accessed by typing its name; a list of all available datasets can be displayed with the command data().
Below is the code to compute the correlation.
1. Loading the dataset
> data <- trees
> head(data, 3)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
2. Creating a scatter plot using the ggplot2 library
> library(ggplot2)
> ggplot(data, aes(x = Girth, y = Height)) + geom_point() +
+   geom_smooth(method = "lm", se = TRUE, color = 'red')
3. Testing the assumptions of correlation. Here the normality assumption is checked for both variables before computing the correlation: the Shapiro-Wilk test, whose null hypothesis is that the data come from a normal distribution, is applied to Girth and to Height.
> shapiro.test(data$Girth)

	Shapiro-Wilk normality test

data:  data$Girth
W = 0.94117, p-value = 0.08893

> shapiro.test(data$Height)

	Shapiro-Wilk normality test

data:  data$Height
W = 0.96545, p-value = 0.4034
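Both p-values are greater than 0.05, so we fail to reject the null hypothesis of normality and can treat Girth and Height as approximately normally distributed.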
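The same normality check can also be scripted so that the decision does not have to be read off the printed output. The following is a minimal sketch, assuming the 0.05 threshold used in this article; the names p_values, alpha and normality_ok are introduced here for illustration.

```r
# Run the Shapiro-Wilk test on both variables and keep only the p-values
data <- trees
p_values <- sapply(data[c("Girth", "Height")],
                   function(v) shapiro.test(v)$p.value)

alpha <- 0.05                      # significance level used in this article
normality_ok <- p_values > alpha   # TRUE means normality is not rejected

print(round(p_values, 5))
print(normality_ok)
```

A TRUE entry means only that the test found no evidence against normality for that variable, not that normality has been proven.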
4. Computing the correlation coefficients
> cor(data$Girth, data$Height, method = "pearson")
[1] 0.5192801
> cor(data$Girth, data$Height, method = "spearman")
[1] 0.4408387
5. Testing the significance of the correlation
> Pear <- cor.test(data$Girth, data$Height, method = 'pearson')
> Pear

	Pearson's product-moment correlation

data:  data$Girth and data$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2021327 0.7378538
sample estimates:
      cor 
0.5192801 

> Spear <- cor.test(data$Girth, data$Height, method = 'spearman')
> Spear

	Spearman's rank correlation rho

data:  data$Girth and data$Height
S = 2773.4, p-value = 0.01306
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.4408387 
Since both p-values are less than 0.05 (0.002758 for Pearson and 0.01306 for Spearman), we can conclude that the Girth and Height of the trees are significantly correlated under both coefficients, with values of 0.5192801 (Pearson) and 0.4408387 (Spearman).
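To compare the two tests side by side, the estimates and p-values can be pulled out of the objects returned by cor.test. This is a sketch rather than part of the original analysis: the names pear, spear and summary_table are introduced here, and exact = FALSE is an added assumption that asks for the asymptotic Spearman p-value directly, since the data contain tied values.

```r
data <- trees

pear  <- cor.test(data$Girth, data$Height, method = "pearson")
# exact = FALSE requests the asymptotic p-value; with tied values in the
# data, R cannot compute an exact Spearman p-value and would warn otherwise
spear <- cor.test(data$Girth, data$Height, method = "spearman", exact = FALSE)

# Collect the coefficient and p-value of each test in one table
summary_table <- data.frame(
  method   = c("pearson", "spearman"),
  estimate = c(unname(pear$estimate), unname(spear$estimate)),
  p_value  = c(pear$p.value, spear$p.value)
)
summary_table$significant <- summary_table$p_value < 0.05

print(summary_table)
```

Each row holds one coefficient, its p-value, and whether it falls below the 0.05 threshold.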
As we can see, both correlation coefficients give a positive value for the Girth and Height of the trees, but the values differ slightly. This is because the Pearson coefficient measures the linear relationship between the variables, while the Spearman coefficient measures only the monotonic relationship: one in which the variables tend to move in the same or opposite direction, but not necessarily at a constant rate, whereas in a linear relationship the rate is constant.