Correlation Analysis Using R
This article was published as a part of the Data Science Blogathon.
Can you tell how the price of gold will change if the stock market goes up, or how the price of gold is associated with the stock market? You can explore such questions with the help of correlation, one of the most common measures of association between two variables and one of the most widely used tools in analytics.
Table of contents
- What is Correlation
- Practical application using R
What is Correlation?
It is a statistical measure that describes the relationship between two variables, that is, how the two variables are linked with each other. It tells us how one variable tends to change when the other variable changes.
If the two variables increase or decrease together, they have a positive correlation. If one variable increases while the other decreases, they have a negative correlation. If a change in one variable is not accompanied by any systematic linear change in the other, they have zero correlation.
It is used to measure the degree of the linear relationship between two variables. It is represented by 𝝆 and calculated as:
𝜌 (𝑥, 𝑦) = 𝑐𝑜𝑣(𝑥, 𝑦) / (𝜎𝑥 × 𝜎𝑦)
where:
𝑐𝑜𝑣(𝑥, 𝑦) = covariance of x and y
𝜎𝑥 = standard deviation of x
𝜎𝑦 = standard deviation of y
𝜌 (𝑥, 𝑦) = correlation between x and y
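The formula can be checked by hand in R. The following is a minimal sketch on two small made-up vectors (the numbers are illustrative only, not real data); it computes the ratio of covariance to the product of standard deviations and compares it with R's built-in cor().

```r
# Illustrative, made-up values (not taken from any real dataset)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# cov() and sd() both use the sample (n - 1) denominator, so those
# factors cancel in the ratio and the result matches cor().
rho_manual  <- cov(x, y) / (sd(x) * sd(y))
rho_builtin <- cor(x, y)

print(rho_manual)   # 0.8 for these numbers
print(rho_builtin)  # same value from the built-in function
print(isTRUE(all.equal(rho_manual, rho_builtin)))
```

Here cov(x, y) is 4 and sd(x) × sd(y) is 5, which gives a correlation of 0.8.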
The value of 𝜌 (𝑥, 𝑦) ranges from -1 to +1.
A positive value lies between 0 and 1, where 𝜌 (𝑥, 𝑦) = 1 indicates a perfect positive linear correlation between the variables.
A negative value lies between -1 and 0, where 𝜌 (𝑥, 𝑦) = -1 indicates a perfect negative linear correlation between the variables.
A value of 𝜌 (𝑥, 𝑦) = 0 indicates no linear correlation. The variables may still be related in a non-linear way.
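To make the three cases concrete, here is a small sketch with synthetic data (the vectors are constructed purely for illustration). An exact increasing line gives +1, an exact decreasing line gives -1, and a symmetric curve gives 0 even though y is completely determined by x.

```r
# Synthetic x values, symmetric around zero
x <- -5:5

r_pos  <- cor(x, 2 * x + 1)   # exact increasing line
r_neg  <- cor(x, -3 * x + 4)  # exact decreasing line
r_zero <- cor(x, x^2)         # symmetric curve: dependent, but no linear trend

print(round(c(positive = r_pos, negative = r_neg, zero = r_zero), 6))
```

The last case shows why a correlation of zero should be read as "no linear relationship" rather than "no relationship at all".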
Practical application of correlation using R
We will determine the association between Fertility and Infant Mortality Rate using R's built-in dataset “swiss”.
Below is the code to compute the correlation.
1. Loading the dataset
> data1 <- swiss
> head(data1, 4)
             Fertility Agriculture Examination Education Catholic Infant.Mortality
Courtelary        80.2        17.0          15        12     9.96             22.2
Delemont          83.1        45.1           6         9    84.84             22.2
Franches-Mnt      92.5        39.7           5         5    93.40             20.2
Moutier           85.8        36.5          12         7    33.77             20.3
2. Creating a scatter plot using ggplot2 library
> library(ggplot2)
> ggplot(data1, aes(x = Fertility, y = Infant.Mortality)) + geom_point() +
+   geom_smooth(method = "lm", se = TRUE, color = 'black')
3. Testing the assumptions (Linearity and Normality)
Linearity#: Judged from the scatter plot itself. The points follow a roughly straight-line trend, so the assumption appears to hold.
Normality$: Checked with the Shapiro-Wilk test. This is a test of normality whose null hypothesis is that the variable is normally distributed.
> shapiro.test(data1$Fertility)

	Shapiro-Wilk normality test

data:  data1$Fertility
W = 0.97307, p-value = 0.3449

> shapiro.test(data1$Infant.Mortality)

	Shapiro-Wilk normality test

data:  data1$Infant.Mortality
W = 0.97762, p-value = 0.4978
Both p-values are greater than 0.05, so we fail to reject the null hypothesis of normality and can treat both variables as approximately normally distributed.
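The same check can be done programmatically instead of reading the p-values off the printed output. This is a minimal sketch, assuming the 0.05 threshold used in this article; the names vars and p_values are introduced here for illustration only.

```r
# Columns of the built-in swiss dataset that we want to check
vars <- c("Fertility", "Infant.Mortality")

# shapiro.test() returns an "htest" object; $p.value holds the p-value
p_values <- sapply(swiss[vars], function(v) shapiro.test(v)$p.value)

print(round(p_values, 4))

# TRUE if normality is not rejected for either variable at the 5% level
print(all(p_values > 0.05))
```

The printed p-values should agree with the Shapiro-Wilk output shown above.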
4. Correlation Coefficient
> cor(data1$Fertility, data1$Infant.Mortality)
[1] 0.416556
5. Checking for the significance
> Tes <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")
> Tes

	Pearson's product-moment correlation

data:  swiss$Fertility and swiss$Infant.Mortality
t = 3.0737, df = 45, p-value = 0.003585
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1469699 0.6285366
sample estimates:
     cor 
0.416556 
Since the p-value (0.003585) is less than 0.05, we can conclude that Fertility and Infant Mortality are significantly correlated, with a correlation coefficient of about 0.42.
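To see where the t statistic and p-value in this output come from, the sketch below recomputes them by hand. It uses the standard relation for a Pearson test, t = r × sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom; the variable names (res, t_manual, p_manual) are introduced here for illustration.

```r
res <- cor.test(swiss$Fertility, swiss$Infant.Mortality, method = "pearson")

r <- unname(res$estimate)  # sample correlation coefficient
n <- nrow(swiss)           # number of observations (47 provinces)

# t statistic for H0: true correlation is 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)

print(c(t = t_manual, p = p_manual))
print(res$conf.int)        # 95% confidence interval for the correlation
```

The hand-computed values should match the t and p-value that cor.test() reports.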
As we can see, the correlation between fertility and infant mortality rate is positive. The point to note here is that correlation is just a measure of association. It tells us the degree of association and its direction (direct or inverse), but it does not by itself show that one variable causes the other.
Here we discussed only the Pearson correlation. There are other types as well, such as Kendall, Spearman, and Point-Biserial.
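For comparison, cor() can also compute the Spearman and Kendall coefficients through its method argument. The sketch below runs all three on the same two swiss columns; it is only an illustration of the call, not an analysis from this article, and Point-Biserial is not shown.

```r
x <- swiss$Fertility
y <- swiss$Infant.Mortality

# Same function, different method argument
coefs <- c(
  pearson  = cor(x, y, method = "pearson"),
  spearman = cor(x, y, method = "spearman"),
  kendall  = cor(x, y, method = "kendall")
)
print(round(coefs, 3))

# Spearman's coefficient is the Pearson correlation of the ranks
spearman_by_rank <- cor(rank(x), rank(y))
print(isTRUE(all.equal(coefs[["spearman"]], spearman_by_rank)))
```

Spearman and Kendall work on ranks, so they do not rely on the normality assumption checked earlier.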
# Linearity is a property where the relationship between the variables can be graphically represented as a straight line.
$ Normality refers to the normal distribution (bell-shaped curve) of the data.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.