Pearson and Spearman correlation coefficients are two widely used statistical measures of the relationship between variables. The Pearson correlation coefficient assesses the linear relationship between variables, while the Spearman correlation coefficient evaluates the monotonic relationship. In this article, we will compare these two coefficients in detail: their calculation methods, interpretation, strengths, and limitations. Understanding the differences between them is crucial for selecting the appropriate measure based on the nature of the data and the research objectives. Let’s explore the differences between the Pearson and Spearman correlation coefficients!

Correlation is a statistical measure that tells us about the association between two variables. It describes how one variable behaves when the other variable changes.

If the two variables increase or decrease together, they have a positive correlation. If one variable increases while the other decreases, they have a negative correlation. If a change in one variable has no effect on the other, they have zero correlation.
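A quick sketch of the three cases, using numpy on made-up data (the variable names and the noise levels here are illustrative choices, not from the article):

```python
# Illustrating positive, negative, and near-zero correlation with numpy.
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(100, dtype=float)

pos = 2 * x + rng.normal(0, 5, 100)    # tends to move with x
neg = -3 * x + rng.normal(0, 5, 100)   # tends to move against x
noise = rng.normal(0, 5, 100)          # unrelated to x

print(np.corrcoef(x, pos)[0, 1])    # strong positive
print(np.corrcoef(x, neg)[0, 1])    # strong negative
print(np.corrcoef(x, noise)[0, 1])  # weak
```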

| | Pearson Correlation Coefficient | Spearman Correlation Coefficient |
|---|---|---|
| Purpose | Measures linear relationships | Measures monotonic relationships |
| Assumptions | Variables are normally distributed; linear relationship | Monotonic relationship; no assumptions on distribution |
| Calculation Method | Based on covariance and standard deviations | Based on ranked data |
| Range of Values | -1 to 1 | -1 to 1 |
| Interpretation | Strength and direction of linear relationship | Strength and direction of monotonic relationship |
| Sensitivity to Outliers | Sensitive to outliers | Less sensitive to outliers |
| Data Types | Appropriate for interval and ratio data | Appropriate for ordinal and non-normally distributed data |
| Usage | Assessing linear associations, parametric tests | Assessing monotonic associations, non-parametric tests |

The Pearson correlation coefficient is a statistical measure that quantifies the strength and direction of a linear relationship between two variables. It ranges from -1 to 1, with values close to -1 indicating a strong negative linear relationship, values close to 1 indicating a strong positive linear relationship, and 0 indicating no linear relationship.
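As a minimal sketch of that definition (on small made-up data), Pearson’s r is the covariance of the two variables divided by the product of their standard deviations, which we can cross-check against numpy’s built-in `corrcoef`:

```python
# Pearson r from its definition: covariance over the product of
# standard deviations, cross-checked against np.corrcoef.
import numpy as np

x = np.array([10.0, 8.0, 12.0, 6.0, 9.0])
y = np.array([75.0, 60.0, 85.0, 55.0, 70.0])

cov = ((x - x.mean()) * (y - y.mean())).mean()
r = cov / (x.std() * y.std())

# The normalization cancels, so this matches numpy's result exactly.
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
print(r)
```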

The Spearman correlation coefficient is a statistical measure that assesses the strength and direction of a monotonic relationship between two variables. It ranks the data rather than relying on their actual values, making it suitable for non-normally distributed or ordinal data. It ranges from -1 to 1, where values close to -1 or 1 indicate a strong monotonic relationship, and 0 indicates no monotonic relationship. Spearman correlation is valuable for detecting and quantifying associations when linear relationships are not assumed or when dealing with ranked or ordinal data.
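To see how the two coefficients differ in practice, consider a monotonic but nonlinear relationship. The sketch below uses made-up exponential data with scipy: the ordering of the points is perfectly preserved, so Spearman’s ρ is 1, while the relationship is far from a straight line, so Pearson’s r is noticeably lower:

```python
# Pearson vs Spearman on a monotonic but nonlinear relationship.
# y grows exponentially with x: same ordering (Spearman = 1),
# but not a straight line (Pearson < 1).
import math
from scipy.stats import pearsonr, spearmanr

x = list(range(1, 11))
y = [math.exp(v) for v in x]

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(r_pearson, rho_spearman)
```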

Spearman’s Rank Correlation:

Let’s say we want to determine the relationship between the study time (in hours) and the exam scores (out of 100) of a group of students. We have the following data for five students:

| Student | Study Time (hours) | Exam Score |
|---|---|---|
| A | 10 | 75 |
| B | 8 | 60 |
| C | 12 | 85 |
| D | 6 | 55 |
| E | 9 | 70 |

First, we rank the study time and exam scores separately:

| Student | Study Time (hours) | Rank (Study Time) | Exam Score | Rank (Exam Score) |
|---|---|---|---|---|
| A | 10 | 2 | 75 | 2 |
| B | 8 | 4 | 60 | 4 |
| C | 12 | 1 | 85 | 1 |
| D | 6 | 5 | 55 | 5 |
| E | 9 | 3 | 70 | 3 |

Now, we calculate the difference between the ranks for each data point:

*Di* = Rank of Study Time*i* − Rank of Exam Score*i*

Since the two rankings are identical here, every difference, and therefore every squared difference, is zero:

| Student | Di | Di² |
|---|---|---|
| A | 0 | 0 |
| B | 0 | 0 |
| C | 0 | 0 |
| D | 0 | 0 |
| E | 0 | 0 |

The sum ∑*Di*² is 0 + 0 + 0 + 0 + 0 = 0.

Finally, we use the Spearman’s Rank Correlation formula:

*ρ* = 1 − (6∑*Di*²) / (*n*(*n*² − 1))

Where:

- *n* is the number of data points (in this case, 5)
- ∑*Di*² is the sum of the squared rank differences

Plugging in the values:

*ρ* = 1 − (6 × 0) / (5(25 − 1)) = 1 − 0/120 = 1

So, the Spearman’s Rank Correlation coefficient (*ρ*) between study time and exam scores is 1, indicating a perfect positive monotonic relationship: ranking the students by study time gives exactly the same order as ranking them by exam score.
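A by-hand calculation like this can be cross-checked in Python with scipy (assuming scipy is installed): `rankdata` reproduces the ranking step and `spearmanr` the final coefficient.

```python
# Cross-check the by-hand Spearman calculation with scipy.
from scipy.stats import rankdata, spearmanr

study_time = [10, 8, 12, 6, 9]    # students A-E
exam_score = [75, 60, 85, 55, 70]

# rankdata ranks ascending; only the rank *differences* matter, so
# ranking both variables ascending gives the same result as descending.
rank_study = rankdata(study_time)
rank_exam = rankdata(exam_score)
d_squared = sum((r1 - r2) ** 2 for r1, r2 in zip(rank_study, rank_exam))

n = len(study_time)
rho_manual = 1 - (6 * d_squared) / (n * (n ** 2 - 1))
rho_scipy, p_value = spearmanr(study_time, exam_score)
print(rho_manual, rho_scipy)
```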

Next, we determine the association between the Girth and Height of black cherry trees, using the dataset “trees” that is built into R. It can be accessed simply by typing its name; a list of all available datasets can be seen with the command data().

Below is the code to compute the correlation:

```
> data <- trees
> head(data, 3)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
```

```
> library(ggplot2)
> ggplot(data, aes(x = Girth, y = Height)) + geom_point() +
+ geom_smooth(method = "lm", se = TRUE, color = "red")
```

Two assumptions need to be checked before performing the Pearson correlation: the scatter plot above checks for a roughly linear relationship, and the Shapiro-Wilk test checks whether each input variable (here, Girth and Height) follows a normal distribution.

```
> shapiro.test(data$Girth)
Shapiro-Wilk normality test
data: data$Girth
W = 0.94117, p-value = 0.08893
> shapiro.test(data$Height)
Shapiro-Wilk normality test
data: data$Height
W = 0.96545, p-value = 0.4034
```

Both p-values are greater than 0.05, so we fail to reject the null hypothesis of the Shapiro-Wilk test and can assume both variables are approximately normally distributed.

```
> cor(data$Girth,data$Height, method = "pearson")
[1] 0.5192801
> cor(data$Girth,data$Height, method = "spearman")
[1] 0.4408387
```

```
> Pear <- cor.test(data$Girth, data$Height, method = 'pearson')
> Pear
Pearson's product-moment correlation
data: data$Girth and data$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.2021327 0.7378538
sample estimates:
cor
0.5192801
```

```
> Spear <- cor.test(data$Girth, data$Height, method = 'spearman')
> Spear
Spearman's rank correlation rho
data: data$Girth and data$Height
S = 2773.4, p-value = 0.01306
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.4408387
```

Since the p-values are less than 0.05 (0.002758 for Pearson and 0.01306 for Spearman), we can conclude that the Girth and Height of the trees are significantly correlated under both tests, with coefficients of 0.5192801 (Pearson) and 0.4408387 (Spearman).

Both coefficients report a positive correlation between the Girth and Height of the trees, but their values differ slightly. This is because the Pearson correlation coefficient measures the linear relationship between the variables, while the Spearman correlation coefficient measures only the monotonic relationship: one in which the variables tend to move in the same (or opposite) direction, but not necessarily at a constant rate, as a linear relationship requires.

Q. What do the Pearson and Spearman correlation coefficients measure?

A. Both measure the strength and direction of the relationship between two variables. Pearson correlation assesses linear relationships, while Spearman correlation evaluates monotonic relationships.

Q. When should Spearman correlation be used?

A. Spearman correlation is useful when the relationship between variables is not strictly linear but can be described by a monotonic function. It is commonly used when dealing with ordinal or non-normally distributed data.

Q. Is Spearman correlation more powerful than Pearson correlation?

A. It is inaccurate to say that Spearman correlations are inherently more powerful than Pearson correlations. The choice between the two depends on the specific characteristics and assumptions of the data and the research question being addressed.

Q. Is Spearman correlation always higher than Pearson correlation?

A. No. The magnitude and direction of the correlation can differ between the two measures, especially when the relationship between variables is nonlinear or influenced by outliers. The choice between the two should be based on the data and the research objectives.
