In this article, I will share my thoughts on the following:
- Statistical Data Exploration on Variance of the Dependent Variable by an Independent Variable
- How to draw inference from P-Value and R Squared score with the real-time data
- Comparing the parameters of two different systems using statistical tests like ANOVA
In any data science project, the Statistical Data Exploration phase, or Exploratory Data Analysis (EDA), is key to any model building. It commences as soon as we have converted our business problem into a data science problem and have identified and listed all the hypotheses surrounding it. Here we try to find the main characteristics and hidden patterns in the given dataset. The focus of this article will be on how to go about data exploration using some of the statistical measures like the P-value, R Squared, hypothesis testing, and the analysis of variance for comparing two different groups, with the focus more on the application side than on the concepts themselves.
I have used analytical tools like Tableau for getting some useful plots, and Python packages like scipy for statistical tests like one-way ANOVA and comparing the f-ratio. Most statistical tests give good results if the data has the shape of a bell curve. In my case, the dependent variable (study variable) follows a sort of Gaussian curve, hence I would like to explore the data statistically and draw inferences from it.
The two most important measures used in regression analysis and statistical data exploration tests like hypothesis testing are R Squared and the P-value, but we often hardly consider them in our analysis. With modern analytical tools like Tableau or Power BI, we can generate good plots with trend lines and get these measures computed easily, instead of writing our own code, and we can use them for inferences.
This article is divided into 3 sections, as in the overview above. But before we go to the individual sections, here are a few statistical data exploration terms we should be familiar with:
Coefficient of determination:
This is often denoted as R2 or r2 and is more commonly known as R Squared. It measures how much influence a particular independent variable has on the dependent variable. The value will usually range between 0 and 1: a value < 0.3 indicates a weak effect, a value between 0.3 and 0.5 a moderate effect, and a value > 0.7 a strong effect on the dependent variable. We will come back to this later in the blog.
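As a quick illustration of R Squared, here is a minimal Python sketch on synthetic data (the numbers below are made up for demonstration, not from the project dataset). It fits a straight line and computes R Squared as 1 - SS_res / SS_tot:

```python
import numpy as np

# Synthetic example: measure the strength of a linear relationship via R Squared
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])  # roughly y = 2x, with small noise

# Fit a straight line and compute R Squared as 1 - SS_res / SS_tot
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))  # close to 1, i.e. a strong effect on the dependent variable
```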
P-Value:
This is the probability that an observed result arose by random chance, i.e., that there was no significant change in the dependent variable when the corresponding independent variable changed. Thus, the lower the P-value, the greater the significance of the observed difference. It is generally used in statistical hypothesis testing: usually P < 0.05 means the null hypothesis can be rejected, and P > 0.05 means no significant difference was observed when the variable changed. In the figure below, the shaded portion represents the P-value.
Null Hypothesis H0:
This is the default assumption that there is no effect or no relationship between the variables. The idea here is to reject or nullify the Null Hypothesis and come up with an Alternate Hypothesis that better explains the phenomenon.
Alternate Hypothesis Ha:
This is contrary to the Null Hypothesis, that is to say, its opposite. For example, if a Null Hypothesis states “I am going to win $10”, then the Alternate Hypothesis would be “I am going to win more than $10”. Basically, we are checking whether there is enough evidence (in favor of the Alternate Hypothesis) to reject the Null Hypothesis. The hypothesis test can be one-tailed or two-tailed, as in the figure below, which depicts the standard normal model (mean = 0, standard deviation = 1). Here Pc is the critical value or test statistic.
Confidence Interval and Level of Significance (Alpha):
The Confidence Interval (CI) is the range of values (-R, +R) within which we are confident that our population parameter (the true value) lies. This is mainly used in hypothesis testing. The significance level defines how much evidence we require to reject H0 in favor of Ha; it serves as the cutoff, and the default cutoff commonly used is 0.05. A CI table with critical values at the 1%, 5%, and 10% significance levels for a standard normal distribution is listed below.
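The critical values in such a table can be reproduced with scipy. A small sketch computing the two-tailed cutoffs of the standard normal at the 1%, 5%, and 10% significance levels:

```python
from scipy.stats import norm

# Two-tailed critical values of the standard normal for common alpha levels
for alpha in (0.01, 0.05, 0.10):
    z_crit = norm.ppf(1 - alpha / 2)  # upper-tail cutoff; lower tail is symmetric
    print(f"alpha = {alpha:.2f}  critical value = +/-{z_crit:.2f}")
# alpha = 0.01 -> 2.58, alpha = 0.05 -> 1.96, alpha = 0.10 -> 1.64
```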
Regression lines and Equation:
Usually, when regression is referred to in the context of machine learning, we mean the line of linear regression and its y-intercept, the point where this line cuts the y-axis. This line can be mathematically represented as a straight line passing through the data point coordinates (independent variable, dependent variable). In equation form,
y = m * x + C, where C is the y-intercept and m is the gradient or slope
In real-time situations, this may not always be a straight line: there can be nonlinearity in the independent variables (predictors) in relation to the dependent variable, the variable whose outcome we want to predict. So we need to look at other regressions, such as polynomial, exponential, or even logarithmic, based on the dataset we are mining. In this article, my data (the target variable) looks somewhat like a Gaussian curve, and hence I will be trying to fit a polynomial regression on it.
In statistics, polynomial regression is a form of regression analysis that considers the nonlinearity of the independent variables: the target variable is modeled as an nth-degree polynomial of the predictor variables. That is,
y = b0 + b1 * x1 + b2 * x2^2 + b3 * x3^3 + ….. + bn * xn^n
where y is the target or dependent variable,
b0 is the y-intercept, b1, b2, …, bn are the regression coefficients for each degree of the polynomial, and x1, x2, …, xn are the predictors or the independent variables.
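As a small illustration of fitting a polynomial to bell-shaped data (synthetic here, not the project dataset), numpy's polyfit can fit a degree-6 polynomial, and we can score the fit with R Squared:

```python
import numpy as np

# Synthetic bell-shaped target, standing in for a Gaussian-like "Power" profile
x = np.linspace(-3, 3, 61)
y = np.exp(-x ** 2 / 2)

# Fit a degree-6 polynomial: coefficients come back highest degree first
coeffs = np.polyfit(x, y, 6)
y_pred = np.polyval(coeffs, x)

# Score the fit with R Squared
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("R Squared:", round(1 - ss_res / ss_tot, 3))
```

A lower degree (say 3) can be tried the same way; the degree acts like a hyperparameter to tune against the data.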
Statistical Data Exploration on Variance of the Dependent Variable by an Independent Variable
For the demonstration, I will take 3 independent variables (Temperature, Current, Voltage) and the dependent variable (Power) from my private project dataset. The data pertains to the energy system, wherein we have continuous instantaneous power generated at each timestep on any given day for the time the system is active. Let’s take a look at the power trend plot ( generated using tableau) on any given day.
The above plot is quite similar to a bell curve, with lots of spikes, as this is the instantaneous power generated at intervals of 35 – 45 seconds.
Datetime       object
Power          float64
Temperature    float64
Current        float64
Voltage        float64
dtype: object
Sample data frame records
As we can see, the Power value changes every 30-40 seconds. The dataset contains data for two years, 2019 and 2020. Let us look at the scatter plots of the dependent variable against each of the independent variables for a particular month.
We can see that
- Temperature values range from 42 to 65 for most of the cases when the device actively generated Power (Power > 0)
- Voltage values range from 18 to 45 for most of the cases when the device actively generated Power (Power > 0)
- Current seems to have a strong, roughly linear relationship with Power, and Power is at its maximum when the Current value is close to 10
As the output seems to follow the trend of a normal curve, I will be testing it with a polynomial regression (with nonlinearity of degree 6). We can also try to fit a 3rd-order polynomial; the degree is basically a sort of hyperparameter. I have used the Tableau analytical tool here, as we can do a bit of statistical analytics and draw trend lines with ease, without having to write our own code.
Let us see how to interpret these values in the next section.
This can be drawn from Tableau Desktop: Analytics - Model - Trend Lines - Polynomial
How to draw inference from P-Value and R Squared score with the real-time data
Before we do some interpretation of the data, we need to gather it all somewhere. I have collected these values month-wise for a device and stored them in the form of tabular data (see below). Let us understand the data first. There are 12 rows and 9 columns. The rows contain each month’s data, and the columns have the data of the 3 independent variables in relation to the target. The first three columns have the median value (you can also use mean values) for that particular month, the next three columns have the P-values, and the last three have the R Squared values. The green lines are the polynomial trend lines.
Data interpretations – I: Gathering some visible facts
From the above table, we can draw some first-hand inferences:
1. All the independent variables point towards rejecting the Null Hypothesis. That is, there is evidence that these predictors do influence the target. Usually we consider only the 0.05 (5%) significance level, but in the above data most of the values have P < 0.0001, which means there is less than a 1-in-10,000 chance that the observed relationship is due to random chance: statistically strong evidence in favor of the Alternate Hypothesis that if a predictor changes, there will be a change in the target too.
2. From the R2 score, we can infer the magnitude of the influence the predictors will have on the dependent variable. We can see that the Current is the most influential variable followed by Temperature and Voltage.
3. This device has consistently produced 140-160 watts of power throughout the study time period. In other words, we can safely infer that this device is capable of producing at least 140 watts of power on any given day.
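Per-predictor P-values and R Squared scores like those in the table can also be computed in Python. A minimal sketch using scipy's linregress on synthetic data that mimics the strong linear Current-to-Power relationship (the numbers are illustrative, not from the project dataset):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)

# Synthetic stand-in for the Current vs Power relationship (strongly linear)
current = rng.uniform(0, 10, 200)
power = 36 * current + rng.normal(0, 5, 200)  # assumed ~36 W per amp, plus noise

result = linregress(current, power)
print("R Squared:", round(result.rvalue ** 2, 3))  # close to 1: strong influence
print("P-value:", result.pvalue)                   # far below 0.05: reject H0
```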
Data interpretations – II: Deeper insights
1. There is a strong linear relationship between Current and Power: as the value of Current increases or decreases, the value of Power increases or decreases proportionately. This can also be inferred from the R2 score, which consistently has a value close to 0.99. Refer to the scatter plot above, which also shows the linear relationship with Current.
2. The max value of the target variable (Power, 378 watts) was reached when the predictors in review had the values (Current = 10.42, Temperature = 60.62, Voltage = 36.30). This is just informational for now. We need to dig further into it to get more clues or pattern combinations for getting maximum output.
3. During the entire experiment time period, the months of March and April seem unusual: during these months the Temperature median was at its lowest (<= 53) while the Voltage median was on the higher side (>= 38). The Current median was also at its lowest compared to other months. This can also be inferred from the corresponding R2 scores of those months. One more interesting inference is that whenever the Temperature was on the lower side, it was Voltage that influenced the target variable more, and vice versa, and the degree of influence can be intuitively seen from the value of the Current median.
4. Based on these, we can set up real-time alerts in the system to monitor these predictors for those key values along with the last D days’ Current median, and validate two hypotheses: ‘based on the Current median, the degree of dominance may vary’ and ‘when the Temperature median is on the higher side, it is Temperature that has more influence on the target than Voltage, and in case the Temperature median falls, then Voltage would influence more’.
5. The P-value for Temperature is unusually high in the months of March and April, and encountering such a P-value in the future is again a trigger for some abnormal behaviour in the device. This basically means a situation where we fail to reject the Null Hypothesis.
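To make the alert idea in point 4 concrete, here is a hypothetical sketch of such a rule. The function names, the Temperature threshold of 53, and the alpha of 0.05 are illustrative values taken from the observations above, not a validated model:

```python
# Hypothetical alert rule sketched from the hypotheses above.
# Threshold values are illustrative, drawn from the March/April observations.

def dominant_predictor(temp_median, voltage_median, temp_threshold=53.0):
    """Guess which predictor currently dominates Power, per the hypothesis:
    high Temperature median -> Temperature dominates, else Voltage."""
    if temp_median > temp_threshold:
        return "Temperature"
    return "Voltage"

def should_alert(temp_p_value, alpha=0.05):
    """Flag periods where we fail to reject H0 for Temperature (unusual)."""
    return temp_p_value > alpha

print(dominant_predictor(60.6, 36.3))  # typical month -> Temperature
print(should_alert(0.2))               # high P-value -> raise an alert
```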
Comparing two different systems parameters using statistical tests like Anova
Analysis of Variance and F-statistic: an ANOVA test can be performed to compare two groups (in this case, 2 different devices) and compute the F-statistic to determine the variability.
In this section, I have used similar data from another device to do a few statistical hypothesis tests. I have demonstrated how a one-way ANOVA test can be done on a particular independent variable of two different devices. If these devices are placed adjacent to one another at the same location, then we will fail to reject the Null Hypothesis, as both devices would perform at par; but if these devices are placed elsewhere, at different geographical locations, then there will be variance observed. Below is the data of device 2 at another, distant location. Using Python's scipy, we can do a simple test to compare the Temperature variability of these 2 devices and evaluate the f-ratio for each month. For the demonstration, I have taken April till August to get the f-ratio.
We can also do more complex tests like
- Checking the day-wise variability instead of monthly values
- Using the f-ratio as a new feature that can later be used in predictive modeling to predict missing values (for instance, in device 2 we can see that the values for Sep and Oct are not available), or using it for complex analytics like What-If Analysis (a new device is placed in a particular location and we need to forecast the values for this device based on the results we have got from the current devices)
Simple Python example to test the Hypothesis using one-way ANOVA:
```python
# Enter the temperature scores of the 2 devices
device1 = [52.34, 57.36, 53.47, 57.84, 56.21]
device2 = [61.97, 65.42, 64.27, 62.98, 63.22]

# Perform one-way ANOVA
from scipy.stats import f_oneway
f_oneway(device1, device2)
# F_onewayResult(statistic=43.35900660252281, pvalue=0.00017210195536532808)
```

Since the p-value is < 0.05, we reject the Null Hypothesis, so the population means of the 2 devices are not the same. F = variation between sample means / variation within the samples (approximately 43 in this case).
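The f-ratio (variation between sample means divided by variation within the samples) can also be computed by hand. Below is a minimal sketch using the same temperature scores, with the standard one-way ANOVA decomposition into between-group and within-group sums of squares:

```python
import numpy as np

device1 = [52.34, 57.36, 53.47, 57.84, 56.21]
device2 = [61.97, 65.42, 64.27, 62.98, 63.22]
groups = [np.array(device1), np.array(device2)]

grand_mean = np.mean(np.concatenate(groups))

# Between-group variability: spread of each group mean around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
df_between = len(groups) - 1

# Within-group variability: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
df_within = sum(len(g) for g in groups) - len(groups)

f_ratio = (ss_between / df_between) / (ss_within / df_within)
print(round(f_ratio, 2))  # matches scipy's f_oneway statistic (~43.36)
```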
In this blog, I have shared a few ideas for statistical data exploration and for identifying new hypotheses surrounding the dependent and independent variables. A similar analysis can be done on any other dataset as well. Ideally, all the possible hypotheses are identified and listed even before the EDA (exploratory data analysis) stage begins, but sometimes during the EDA we also get a few insights related to the domain the data pertains to that were not conceptualized earlier, were missed out, or arise because of an unknown (invisible) variable influencing the target.