In any data science project, the Statistical Data Exploration phase, or Exploratory Data Analysis (EDA), plays a crucial role in model building. It begins once we’ve translated our business problem into a data science problem and have identified and listed all associated hypotheses. This phase aims to uncover key characteristics and hidden patterns within the dataset. This article focuses on conducting Data Exploration using statistical measures such as P-values, R-squared, hypothesis testing, and Analysis of Variance (ANOVA) to compare different groups, emphasizing practical application over theoretical concepts.

Analytical tools like Tableau are used for visualizations, and Python packages like scipy for statistical tests such as one-way ANOVA and comparison of the F-ratio. Many statistical tests assume a bell-curve (normal) distribution. In this case, the Dependent Variable (study variable) exhibits a roughly Gaussian shape, which makes it a suitable candidate for statistical exploration and inference.

In regression analysis and statistical data exploration, R-squared and P-value are critical measures often overlooked. However, modern analytical tools like Tableau or Power BI simplify the computation of these measures and facilitate the creation of informative plots with trend lines. Leveraging these tools allows for efficient inference generation without extensive coding.

- Statistical Data Exploration on Variance of the Dependent Variable by an Independent Variable
- How to draw inferences from P-Value and R Squared scores with real-time data
- Comparing the parameters of two different systems using statistical tests like ANOVA

**This article was published as a part of the ***Data Science Blogathon***.**


This article is divided into 3 sections as in the Overview. But before we go to the individual sections, here are a few statistical data exploration terms we should be familiar with:

R-squared, often written as R² or r², is the coefficient of determination. It indicates how much of the variation in the dependent variable is explained by a specific independent variable. It typically ranges between 0 and 1: values below 0.3 suggest weak influence, those between 0.3 and 0.5 indicate moderate influence, and values exceeding 0.7 signify a strong effect on the dependent variable. Further discussion on this topic is provided later in the blog.
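As a minimal sketch of how this score is computed, the snippet below applies the standard definition R² = 1 - SS_res / SS_tot to made-up numbers. The function name `r_squared` and the sample values are illustrative assumptions and do not come from the project dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and model predictions
y_obs = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]
score = r_squared(y_obs, y_hat)
print(round(score, 3))  # 0.95
```

A perfect prediction gives a score of 1, while a model that does no better than predicting the mean gives 0.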

The P-value is the probability of observing a result at least as extreme as the one actually measured, assuming the Null Hypothesis is true. It is used to assess the significance of differences observed in the dependent variable when the corresponding independent variable changes. A lower P-value means stronger evidence against the Null Hypothesis. In statistical hypothesis testing, a P-value < 0.05 typically leads to rejection of the Null Hypothesis, while P > 0.05 means the observed differences are not statistically significant. In the figure below, the shaded portion illustrates the P-value.

The Null Hypothesis (H0) is the default assumption that there is no effect or no difference. The idea here is to reject or nullify the Null Hypothesis in favor of the Alternate Hypothesis, which better explains the phenomenon.

The Alternate Hypothesis (Ha) is the opposite of the Null Hypothesis. For example, if the Null Hypothesis states that “I am going to win $10”, then the Alternate Hypothesis would be “I am going to win more than $10”. We check whether there is enough evidence in favor of the Alternate Hypothesis to reject the Null Hypothesis. The hypothesis test can be one-tailed or two-tailed, as in the figure below, which depicts the standard normal model (mean of 0, standard deviation of 1). Here Pc is the critical value against which the test statistic is compared.
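To make the one-tailed versus two-tailed distinction concrete, here is a small sketch on the standard normal model using `scipy.stats.norm`. The test statistic `z = 1.96` is a hypothetical value chosen for illustration, not a result from the dataset.

```python
from scipy.stats import norm

z = 1.96       # hypothetical test statistic on the standard normal scale
alpha = 0.05   # significance level (cutoff)

p_one_tailed = norm.sf(z)            # P(Z >= z): area in the upper tail only
p_two_tailed = 2 * norm.sf(abs(z))   # area in both tails

print(f"one-tailed p = {p_one_tailed:.4f}, reject H0: {p_one_tailed < alpha}")
print(f"two-tailed p = {p_two_tailed:.4f}, reject H0: {p_two_tailed < alpha}")
```

The same statistic yields a two-tailed P-value twice as large as the one-tailed one, so the choice of test should be made before looking at the data.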

The Confidence Interval (CI) is the range of values (-R, +R) within which we expect the population parameter (true value) to lie, at a stated confidence level. It is mainly used in hypothesis testing. The significance level (alpha) defines how much evidence we require to reject H0 in favor of Ha, and it serves as the cutoff. The commonly used default cutoff is 0.05. The CI table with critical values and alpha values at the 1%, 5%, and 10% significance levels for a standard normal distribution is listed below:
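The critical values at these three significance levels can be reproduced with the inverse CDF of the standard normal distribution. This is a sketch using `scipy.stats.norm.ppf`; the dictionary name `critical` is just an illustrative choice.

```python
from scipy.stats import norm

critical = {}
for alpha in (0.01, 0.05, 0.10):
    z_two = norm.ppf(1 - alpha / 2)   # two-tailed: alpha split across both tails
    z_one = norm.ppf(1 - alpha)       # one-tailed: all of alpha in one tail
    critical[alpha] = (round(z_one, 3), round(z_two, 3))
    print(f"alpha={alpha:.2f}  confidence={1 - alpha:.0%}  "
          f"one-tailed z={z_one:.3f}  two-tailed z=+/-{z_two:.3f}")
```

For example, at the default 0.05 cutoff the two-tailed critical value is about 1.96, so a standardized test statistic beyond ±1.96 leads to rejecting H0.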

Usually, when regression is referred to in the context of machine learning, we mean the line of linear regression and y-intercept, the point where this line cuts the y-axis. This line can be mathematically represented as a straight line passing through the data point coordinates of (independent variable, dependent variable). In an equation form,

y = m * x + C, where C is the y-intercept and m is the gradient or slope

In real-world situations, the relationship may not always be a straight line: the independent variables or predictors can be nonlinear in relation to the dependent variable, that is, the variable whose outcome we want to predict. So we need to look at other regressions such as polynomial, exponential, or even logarithmic, based on the dataset we are mining. In this article, I have data (the target variable) which looks roughly like a Gaussian curve, and hence I will try to fit a polynomial regression on it.

In statistics, polynomial regression is a form of regression analysis that considers the nonlinearity of independent variables, and the target variable is modeled as the nth-degree polynomial of the predictor variables. That is

y = b0 + b1 * x1 + b2 * x2^2 + b3 * x3^3 + … + bn * xn^n

where y is the target or dependent variable, b0 is the y-intercept, b1, b2, …, bn are the regression coefficients for each degree of the polynomial, and x1, x2, …, xn are the predictors or independent variables.
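For readers who prefer code to a BI tool, the following is a rough sketch of a single-predictor polynomial fit with NumPy. The bell-shaped data is synthetic (an assumption standing in for the private dataset), and degree 6 is used only because it is the degree explored later in this article.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic bell-shaped "power" curve with noise (hypothetical data)
rng = np.random.default_rng(0)
x = np.linspace(0, 12, 200)                       # e.g. hours since the system became active
y = 350 * np.exp(-((x - 6) ** 2) / 8) + rng.normal(0, 10, x.size)

degree = 6
poly = Polynomial.fit(x, y, deg=degree)           # least-squares fit; x is rescaled internally
y_hat = poly(x)                                   # fitted values on the original x scale

ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"degree={poly.degree()}, R-squared={r2:.3f}")
```

`Polynomial.fit` is used here instead of raw powers of x because its internal rescaling keeps higher-degree fits numerically stable.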

For the demonstration, I will take 3 independent variables (Temperature, Current, Voltage) and the dependent variable (Power) from my private project dataset. The data pertains to an energy system, wherein we have the continuous instantaneous power generated at each timestep on any given day while the system is active. Let’s take a look at the power trend plot (generated using Tableau) on any given day.

The above plot is quite similar to a bell curve. It shows many spikes because it plots the instantaneous power generated at intervals of 35 to 45 seconds.

`df.dtypes`

```
Datetime object
Power float64
Temperature float64
Current float64
Voltage float64
dtype: object
```

Sample data frame records

As we can see Power value changes every 30-40 sec. The dataset contains data for two years 2019 and 2020. Let us look at the scatter plots of the dependent and each of the independent variables for a particular month.

- Temperature value ranges from 42 to 65 for most of the cases when the device actively (Power >0) generated Power
- Voltage value ranges from 18 to 45 for most of the cases when the device actively (Power >0) generated Power
- Current appears to have a strong, roughly linear relationship with Power, and Power is at its maximum when the Current value is close to 10

As the output seems to follow the trend of a normal curve, I will test it with a polynomial regression of degree 6 to capture the nonlinearity. We could also try a 3rd-order polynomial; the degree is effectively a hyperparameter. I have used the Tableau analytical tool here because it lets us do some statistical analytics and draw trend lines with ease, without having to write our own code.

Next, let us see how to interpret these values.

The trend lines can be drawn from *Tableau Desktop > Analytics > Model > Trend lines > Polynomial*

Before interpreting the data, we need to gather it in one place. I collected these values month-wise for a device and stored them as tabular data (see below). Let us understand the data first. There are 12 rows and 9 columns. Each row holds one month’s data, and the columns hold the data of the 3 independent variables in relation to the target. The first three columns have the median value (you can also use the mean) for that month, the next three columns have the P-values, and the last three have the R-squared values. The green lines are the polynomial trend lines.

From the above table, we can make some first-hand inferences like:

- All independent variables indicate a rejection of the Null Hypothesis, suggesting strong evidence that these predictors influence the target. Most values exhibit P < 0.0001, signifying robust statistical support for the Alternate Hypothesis, indicating a change in predictors correlates with a change in the target variable.
- The R-squared score reveals the predictors’ impact on the dependent variable. Current emerges as the most influential variable, followed by Temperature and Voltage.
- Throughout the study period, the device consistently generates 140-160 watts of power. Hence, it can be reliably inferred that the device is capable of producing at least 140 watts of power daily.

- A robust linear relationship between Current and Power is evident, as changes in Current proportionately affect Power. This observation is supported by consistently high R-squared scores, often nearing 0.99, indicating strong predictive power. The scatterplot above visually confirms this linear relationship with Current.
- The maximum Power output of 378 watts corresponds to specific predictor values: Current at 10.42, Temperature at 60.62, and Voltage at 36.30. However, further investigation is warranted to uncover additional patterns or combinations that optimize output.
- During March and April, unusual trends emerge, characterized by lower median Temperatures (≤53) and higher median Voltages (≥38), accompanied by lower median Currents compared to other months. These observations align with corresponding R-squared scores. Notably, when Temperature decreases, Voltage appears to exert greater influence on the Target variable, and vice versa. The degree of influence can be discerned from the median Current values.
- Leveraging these insights, real-time alerts can be established to monitor predictor values, focusing on key thresholds and recent median Current trends. This approach validates two hypotheses: varying degrees of dominance based on Current median and Temperature’s fluctuating influence on the target, particularly when Temperature median shifts.
- The unusually high P-value for Temperature in March and April signals potential abnormal device behavior. Such occurrences, where the Null Hypothesis cannot be rejected, should prompt further investigation into anomalous device performance in future instances.
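A month-wise table of medians, P-values, and R-squared scores like the one discussed in this section could also be assembled in code. The sketch below is an assumption-laden approximation, not Tableau's exact computation: it fits a degree-6 polynomial per month and derives the P-value from the overall F-test of the fit. The data is synthetic, the helper `trend_stats` is hypothetical, and only the column names follow the `df.dtypes` output shown earlier.

```python
import numpy as np
import pandas as pd
from numpy.polynomial import Polynomial
from scipy.stats import f as f_dist

def trend_stats(x, y, degree=6):
    """R-squared and overall F-test p-value for a polynomial trend of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = Polynomial.fit(x, y, deg=degree)(x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    dof_model, dof_resid = degree, len(x) - degree - 1
    f_stat = (r2 / dof_model) / ((1.0 - r2) / dof_resid)
    return r2, f_dist.sf(f_stat, dof_model, dof_resid)

# Synthetic stand-in for the private dataset (hypothetical values)
rng = np.random.default_rng(42)
n = 600
df = pd.DataFrame({
    "Datetime": pd.date_range("2019-03-01", periods=n, freq=pd.Timedelta(hours=3)),
    "Current": rng.uniform(0, 10.5, n),
})
df["Power"] = 36 * df["Current"] + rng.normal(0, 5, n)  # near-linear relation plus noise

rows = []
for month, grp in df.groupby(df["Datetime"].dt.month):
    r2, p = trend_stats(grp["Current"], grp["Power"])
    rows.append({"month": month, "median_current": grp["Current"].median(),
                 "p_value": p, "r_squared": r2})
summary = pd.DataFrame(rows)
print(summary)
```

Repeating the loop for Temperature and Voltage would give the remaining columns, and unusually high P-values in such a table could then be flagged for further investigation.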

Analysis of Variance and F-statistics: We perform ANOVA tests to compare two groups (in this case, 2 different devices) and compute the F-statistics to determine variability.

In this section, I conduct several statistical hypotheses tests using similar data from another device. I demonstrate how to perform a one-way ANOVA test on a particular independent variable of two different devices. If these devices are placed adjacent to one another at the same location, then we fail to reject the Null Hypothesis as both devices would perform similarly. However, if these devices are placed elsewhere at different geographical locations, then we observe variance. Below, we present the data of device 2 at another distant location. Using Python’s scipy, we conduct a simple test to compare the Temperature variability of these 2 devices and evaluate the f-ratio for each month. For demonstration purposes, we focus on data from April to August to calculate the f-ratio.

We can also do more complex tests like

- checking the day-wise variability instead of the monthly value
- using the f-ratio as a new feature for predictive modeling, for instance to predict the values for Sep and Oct, which are not available for device 2, or for more complex analytics such as What-if Analysis (a new device is placed in a particular location, and we need to forecast its values based on the results from the current devices)

```
# Enter the temperature scores of the 2 devices
from scipy.stats import f_oneway

device1 = [52.34, 57.36, 53.47, 57.84, 56.21]
device2 = [61.97, 65.42, 64.27, 62.98, 63.22]

# Perform one-way ANOVA
result = f_oneway(device1, device2)
print(result)
# F_onewayResult(statistic=43.35900660252281, pvalue=0.00017210195536532808)

# Since the pvalue is < 0.05, we reject the Null Hypothesis,
# so the population means of the 2 devices are not the same.
# F = variation between sample means / variation within the samples (about 43 in this case)
```
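To see where the F-ratio comes from, the same statistic can be computed by hand as the between-group mean square divided by the within-group mean square. This sketch reuses the two temperature lists from the example above; the variable names are illustrative.

```python
import numpy as np

device1 = np.array([52.34, 57.36, 53.47, 57.84, 56.21])
device2 = np.array([61.97, 65.42, 64.27, 62.98, 63.22])
groups = [device1, device2]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size      # number of groups, total observations

# Variation between sample means (weighted by group size)
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation within the samples (around each group's own mean)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # degrees of freedom: k - 1
ms_within = ss_within / (n - k)          # degrees of freedom: n - k
f_ratio = ms_between / ms_within
print(round(f_ratio, 2))                 # about 43.36, matching f_oneway
```

A large ratio means the device means differ far more than the month-to-month scatter within each device would explain.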

This article emphasizes statistical data exploration’s vital role in model building within data science projects. Using polynomial regression trend lines, we analyzed the variance of the dependent variable against the independent variables, uncovering nuanced relationships. Interpreting P-values and R-squared scores on real-time data offered actionable insights. Moreover, ANOVA facilitated comparing system parameters across devices, shedding light on device performance. This article underscores the importance of meticulous exploration, hypothesis testing, and continuous inquiry in data analysis, essential for robust model development across diverse datasets.

A. R-squared, or the coefficient of determination, measures the proportion of the dependent variable’s variance predictable from the independent variable(s). A higher R squared (closer to 1) indicates better explanatory power, but no universal threshold defines a “good” value.

A. A good R-squared varies based on factors like dataset, predictors, and sample size. Generally, higher values suggest better model fit. Adjusted R-squared, considering predictors and sample size, provides a more accurate measure.

A. A high R-squared in regression analysis signifies strong model fit, indicating how well the model explains variability in the response variable. However, context, outliers, and other diagnostics are crucial for interpretation.

A. An R-squared of 0.3 implies that 30% of the dependent variable’s variability is explained by the predictors. Context, the nature of the data, and model specifics influence whether this is adequate.

A. An R-squared of 0.4 indicates that 40% of the dependent variable’s variability is explained by the model’s independent variables. Context, the nature of the data, and model criteria affect the assessment of model fit.

*The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.*
