Correlation Analysis Using R

Vipin.shrivastava 27 Jan, 2021


This article was published as a part of the Data Science Blogathon.

Introduction

Can you tell how the price of gold will change when the stock market goes up, or how gold prices are associated with the stock market? Correlation, one of the most common measures of association between two variables, helps answer questions like these. It is among the most widely used tools in analytics.

What is Correlation?

Correlation is a statistical measure of the relationship between two variables, that is, how the two variables are linked with each other. It describes how a change in one variable tends to go along with a change in the other.

If the two variables increase or decrease together, they have a positive correlation. If one variable increases while the other decreases, they have a negative correlation. If a change in one variable is not accompanied by any consistent linear change in the other, they have zero correlation.

It is used to identify the degree of the linear relationship between two variables. It is represented by ρ and calculated as:

ρ(x, y) = cov(x, y) / (σx × σy)

where

cov(x, y) = covariance of x and y

σx = standard deviation of x

σy = standard deviation of y

ρ(x, y) = correlation between x and y
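As a quick check of the formula, the coefficient can be computed by hand from the covariance and the two standard deviations and compared with R's built-in cor(). This is only a minimal sketch: the vectors x and y are made-up illustrative data, and the names rho_manual and rho_builtin are not part of the original article.

```r
# Made-up illustrative data (not from the article)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# rho(x, y) = cov(x, y) / (sd(x) * sd(y))
# cov() and sd() both use the sample (n - 1) denominator, so they are consistent
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation for comparison
rho_builtin <- cor(x, y)

print(rho_manual)
print(all.equal(rho_manual, rho_builtin))  # TRUE if both agree up to rounding
```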

The value of ρ(x, y) ranges from -1 to +1.

A positive value lies between 0 and 1, where ρ(x, y) = 1 indicates a perfect positive correlation between the variables.

A negative value lies between -1 and 0, where ρ(x, y) = -1 indicates a perfect negative correlation between the variables.

A value of ρ(x, y) = 0 indicates no linear correlation.
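The three boundary cases can be illustrated with a small made-up vector. In this sketch (all names and data are illustrative assumptions), the last case shows that a correlation of 0 only rules out a linear relationship: y is fully determined by x, yet the coefficient is zero.

```r
# Made-up illustrative data, symmetric around 0
x <- c(-2, -1, 0, 1, 2)

r_pos  <- cor(x, 2 * x + 3)   # increasing straight line: perfect positive correlation
r_neg  <- cor(x, -3 * x + 1)  # decreasing straight line: perfect negative correlation
r_zero <- cor(x, x^2)         # symmetric curve: no linear correlation

print(c(positive = r_pos, negative = r_neg, zero = r_zero))
```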

Practical application of correlation using R

Determining the association between Fertility and Infant Mortality rate (using R's built-in dataset “swiss”)

Below is the code to compute the correlation.

1. Loading the dataset

> data1<-swiss
> head(data1, 4)

             Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary        80.2        17.0          15        12     9.96             22.2
Delemont          83.1        45.1           6         9    84.84             22.2
Franches-Mnt      92.5        39.7           5         5    93.40             20.2
Moutier           85.8        36.5          12         7    33.77             20.3

2. Creating a scatter plot using ggplot2 library

> library(ggplot2)

> ggplot(data1, aes(x = Fertility, y = Infant.Mortality)) + geom_point() +
+   geom_smooth(method = "lm", se = TRUE, color = 'black')
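If ggplot2 is not installed, a similar picture can be drawn with base R graphics. The following is a rough sketch rather than an exact equivalent: it draws the fitted straight line but not the confidence band that se = TRUE adds, and the name fit is an illustrative assumption.

```r
# Least-squares line of Infant.Mortality on Fertility (same model as method = "lm")
fit <- lm(Infant.Mortality ~ Fertility, data = swiss)

# Scatter plot with the fitted line overlaid
plot(swiss$Fertility, swiss$Infant.Mortality,
     xlab = "Fertility", ylab = "Infant.Mortality")
abline(fit, col = "black")

# A positive slope means the fitted line rises from left to right
print(coef(fit)["Fertility"])
```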

3. Testing the assumptions (Linearity and Normality)

Linearity: Visible from the plot itself (the relationship appears approximately linear)

Normality: Using the Shapiro-Wilk test (a test of normality; here we check whether each variable is normally distributed)

> shapiro.test(data1$Fertility)

	Shapiro-Wilk normality test

data:  data1$Fertility
W = 0.97307, p-value = 0.3449

> shapiro.test(data1$Infant.Mortality)

	Shapiro-Wilk normality test

data:  data1$Infant.Mortality
W = 0.97762, p-value = 0.4978

Both p-values are greater than 0.05, so we fail to reject the null hypothesis of normality and can treat both variables as approximately normally distributed.
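The two Shapiro-Wilk tests above can also be run in one step, and a Q-Q plot gives a visual check alongside the formal test. This is a minimal sketch; the names vars and p_values are illustrative and not part of the original code.

```r
# Select the two variables of interest from the built-in swiss dataset
vars <- swiss[, c("Fertility", "Infant.Mortality")]

# Run shapiro.test() on each column and keep only the p-value
p_values <- sapply(vars, function(v) shapiro.test(v)$p.value)
print(round(p_values, 4))

# TRUE if normality is not rejected at the 5% level for either variable
print(all(p_values > 0.05))

# Visual check: points close to the reference line suggest approximate normality
qqnorm(swiss$Fertility, main = "Q-Q plot: Fertility")
qqline(swiss$Fertility)
```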

4. Correlation Coefficient

> cor(data1$Fertility,data1$Infant.Mortality)
[1] 0.416556

5. Checking for the significance

> Tes <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")
> Tes

	Pearson's product-moment correlation

data:  swiss$Fertility and swiss$Infant.Mortality
t = 3.0737, df = 45, p-value = 0.003585
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1469699 0.6285366
sample estimates:
     cor 
0.416556

Since the p-value (0.003585) is less than 0.05, we can conclude that Fertility and Infant Mortality are significantly correlated, with a correlation coefficient of about 0.42.
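The t statistic and p-value reported by cor.test() can be reproduced from the correlation r and the sample size n, using t = r × sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. The sketch below is for illustration only; the names r, n, t_manual and p_manual are assumptions, not part of the original code.

```r
Tes <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")

r <- unname(Tes$estimate)  # sample correlation
n <- nrow(swiss)           # number of observations

# Test statistic for H0: true correlation is 0
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

print(c(t = t_manual, df = n - 2, p = p_manual))

# Individual results can also be read straight from the test object
print(Tes$conf.int)
```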

Conclusion

As we can see, there is a positive correlation between fertility and infant mortality rate. The point to note is that correlation is just a measure of association: it tells us the strength of the association and its direction (positive or negative), but it does not by itself show that one variable causes the other.

Here we discussed only Pearson correlation. There are other types as well such as Kendall, Spearman, and Point-Biserial.
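For comparison, the same cor() function accepts a method argument for the Spearman and Kendall coefficients. This is a minimal sketch on the same two variables; the variable names are illustrative, and the values are not interpreted here.

```r
x <- swiss$Fertility
y <- swiss$Infant.Mortality

pearson  <- cor(x, y, method = "pearson")   # linear association
spearman <- cor(x, y, method = "spearman")  # Pearson correlation of the ranks
kendall  <- cor(x, y, method = "kendall")   # based on concordant and discordant pairs

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```

Spearman and Kendall work on ranks rather than raw values, so they are often considered when the normality assumption is doubtful or the relationship is monotonic but not linear.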

Linearity: a property where the relationship between the variables can be represented graphically as a straight line.

Normality: refers to the normal distribution (bell-shaped curve) of the data.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.