Discovering Insights with Chi Square Tests: A Hands-on Approach in Python

Aashish Last Updated : 26 Feb, 2024

Introduction

Let me take you into the world of chi-square tests and how we can run them in Python with the scipy library. We'll be going over the chi-square goodness of fit test and its close relatives, then work through a test of independence by hand and in code. Whether you are just starting out or are an experienced data analyst, this guide will equip you with practical examples and insights so you can confidently apply chi-square tests in your own work.

Learning objectives

By the end of this article, readers will have:

Understood what a Chi-Square Test is and its purpose.
Recognized the different types of Chi-Square Tests.
Calculated Chi-Square in order to test any relation between two categorical variables.
Followed a step-by-step Chi-Square project implemented in Python.

This article was published as a part of the Data Science Blogathon.

What is Chi-Square Test?
Types of Chi-Square Tests
Calculating Chi-Square
Real World Example
Frequency of Heart Failure by Obesity
Calculating Chi-Square in Python
Frequently Asked Questions

What is Chi-Square Test?

The Chi-Square test is a statistical procedure used to assess whether there is an association between two categorical variables. The test is straightforward: it compares the observed frequencies of the variables with the frequencies we would expect under the assumption that there is no relationship between them. The Chi-Square test of independence is the most commonly used kind of Chi-Square test. It is applied in situations where we have two categorical variables, such as obesity and the occurrence of heart failure, and we want to investigate whether there is an association between them. By doing this we can decide whether the sample falls into categories in line with our expectations for the variables' distribution.

Types of Chi-Square Tests

There are several types of Chi-Square Tests, including the chi-square goodness of fit test, the chi-square test of independence, and the chi-square test for homogeneity. The type of test used will depend on the specific research question being addressed and the type of data being analyzed.

Chi-square Goodness of Fit Test: This type of test is used to find out whether the observed frequencies of a single categorical variable differ significantly from the expected frequencies.

Chi-square Test of Independence: This type of test is a statistical hypothesis test used to determine whether two categorical (nominal) variables are likely to be related or not.

Chi-square Test for Homogeneity: This type of test is used by statisticians to check whether different groups, laid out as the rows or columns of a table, come from populations with the same distribution of a categorical variable.
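To make the first of these concrete, here is a minimal sketch of a goodness of fit test using scipy.stats.chisquare. The die-roll counts are made up purely for illustration and are not part of this article's data.

```python
from scipy.stats import chisquare

# Hypothetical counts from 120 rolls of a six-sided die (illustrative data only).
observed = [25, 17, 15, 23, 24, 16]

# With f_exp omitted, chisquare assumes every category is equally likely,
# so each face is expected 120 / 6 = 20 times.
statistic, p_value = chisquare(observed)

print(f"chi-square = {statistic:.2f}, p-value = {p_value:.4f}")
```

A large p-value here would mean the observed counts are compatible with a fair die; a small one would suggest the die is biased.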

Calculating Chi-Square

To calculate the Chi-Square statistic, the observed frequencies are compared to the expected frequencies. The formula for the Chi-Square statistic is:

Chi-Square = Σ((Observed – Expected)^2 / Expected)

Where Observed is the observed frequency for each category and Expected is the expected frequency for each category.
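As a rough sketch, the formula translates almost word for word into NumPy. The function name chi_square_statistic and the numbers below are my own illustrative choices, not something taken from a library or from the dataset used later.

```python
import numpy as np

def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all categories."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

# Hypothetical frequencies for three categories (illustrative only).
observed = [30, 20, 50]
expected = [25, 25, 50]

stat = chi_square_statistic(observed, expected)
print(stat)  # 25/25 + 25/25 + 0/50 = 2.0
```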

Real World Example

Let me talk through a real-world example of the Chi-Square test to see how it can help determine whether there is a relationship between obesity and heart failure. For this, I used a sample of patients with recorded heart failure status and body mass index (BMI) data, which I used to categorize them as obese or non-obese.

To calculate the Chi-Square statistic, I created a contingency table showing the number of patients in each combination of the obesity (based on BMI) and heart failure categories. I then estimated the expected frequency for each cell in this table, assuming no association between the variables. Finally, using the Chi-Square formula, I compared the observed and expected frequencies to find out whether there was a significant association between the two variables.

If the calculated Chi-Square statistic is greater than the critical value, I reject the null hypothesis that there is no link between obesity and heart failure. This indicates that obesity and heart failure are associated in the sample, which is consistent with obesity being a risk factor for heart failure, although the test alone does not establish causation. Conducting such tests helps us gain valuable insights into relationships within a sample population and can inform preventative measures to improve patient outcomes.

Now that we have seen how the process works in theory, let me show you how the calculations work in practice:

Frequency of Heart Failure by Obesity

Our Hypothesis

H0: Obesity and heart failure are independent
HA: Obesity and heart failure are not independent

Frequencies

Here, we calculated the total frequencies by summing up the observed frequencies.

To find the number of obese patients we would expect not to have experienced heart failure if the two variables were independent, we use the expected values. Each expected value is calculated by multiplying the cell's row total by its column total, then dividing the result by the overall sample total. Doing this for every cell gives the expected frequencies for the whole table.
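The row-total times column-total rule can be sketched in a few lines of NumPy. The counts below are hypothetical placeholders that I picked so the arithmetic is easy to follow; they are not the frequencies from this article's sample.

```python
import numpy as np

# Hypothetical observed counts (NOT the article's data).
# Rows: obese, not obese. Columns: heart failure, no heart failure.
observed = np.array([[40, 60],
                     [20, 80]])

row_totals = observed.sum(axis=1)   # [100, 100]
col_totals = observed.sum(axis=0)   # [60, 140]
grand_total = observed.sum()        # 200

# Expected count for each cell = row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

With these made-up counts, every row has the same expected split (30 and 70), which is exactly what "no association" means: the heart failure rate is assumed to be identical in both groups.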

Now, let’s calculate the Chi-Square value using the below formula:

Chi-Square = Σ((Observed – Expected)^2 / Expected)

And here are the results:

Finally, let’s add all the values to find out Chi-Square

Chi-Square = 53.63

Now, we need to choose an alpha level for our test. Let's set alpha to 0.05 and find the p-value for our Chi-Square statistic. I used a Chi-Square calculator to calculate the p-value.

The p-value is less than .00001, which is well below .05 (our alpha value).

Hence, the result is significant. In other words, we reject the null hypothesis, which tells us that there is a relationship between obesity and heart failure.
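If you would rather not rely on an online calculator, the same lookup can be sketched with scipy.stats.chi2. Note that the degrees of freedom value is my assumption here: I am treating the table as 2x2 (obese or non-obese by heart failure or no heart failure), which gives (2 - 1) * (2 - 1) = 1.

```python
from scipy.stats import chi2

chi_square = 53.63   # statistic from the hand calculation above
alpha = 0.05
dof = 1              # ASSUMPTION: a 2x2 table, so (2 - 1) * (2 - 1) = 1

# Critical value: the statistic must exceed this to be significant at alpha.
critical_value = chi2.ppf(1 - alpha, dof)

# p-value: probability of a statistic at least this large under H0.
p_value = chi2.sf(chi_square, dof)

reject_null = p_value < alpha
print(f"critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.3e}, reject H0: {reject_null}")
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.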

Calculating Chi-Square in Python

Now that we did all the fun calculations manually, let’s see if we can do the same using Python. As I previously mentioned, we will be using the scipy package’s chi2_contingency function in Python to do this.

Step 1: Create a Contingency Table

I'm using the crosstab() function from the pandas library, with the obesity variable grouped in the rows and the heart failure variable grouped in the columns. I also set margins to True to add row and column subtotals.

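import pandas as pd
# df is assumed to hold one row per patient, with 'obesity' and 'heart_failure' columns
heartfailure_crosstab = pd.crosstab(df['obesity'], df['heart_failure'],
                                    margins=True, margins_name="subtotal")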

It returns a contingency table that has this data:

Step 2: Compute Chi-Square and p-values

I used the scipy.stats.chi2_contingency function to calculate both my chi-square and p values.

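To use this function, I pass it only the observed counts, so the subtotal row and column are dropped first (leaving them in would inflate the degrees of freedom):

from scipy.stats import chi2_contingency
chi, p, dof, expected = chi2_contingency(heartfailure_crosstab.drop(index="subtotal", columns="subtotal"))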

On a successful run, the function returns the chi-square value to chi, the p-value to p, the degrees of freedom to dof, and the expected frequencies to expected.
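Here is a self-contained sketch of Steps 1 and 2 that you can run without my dataset. The DataFrame is built from hypothetical records, so the numbers will not match the article's results. Two details are worth watching: chi2_contingency should only see the observed cells (not the subtotals), and by default it applies Yates' continuity correction when the table has one degree of freedom, so its statistic can be slightly smaller than a hand calculation with the plain formula.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical patient records (NOT the article's dataset).
df = pd.DataFrame({
    "obesity": ["obese"] * 100 + ["not obese"] * 100,
    "heart_failure": ["yes"] * 40 + ["no"] * 60 + ["yes"] * 20 + ["no"] * 80,
})

# Contingency table with subtotals, handy for display.
table = pd.crosstab(df["obesity"], df["heart_failure"],
                    margins=True, margins_name="subtotal")
print(table)

# Keep only the observed cells for the test itself.
observed = table.drop(index="subtotal", columns="subtotal")

# Default call: Yates' correction is applied because dof == 1.
chi, p, dof, expected = chi2_contingency(observed)

# Without the correction, the statistic matches the plain formula.
chi_raw, p_raw, _, _ = chi2_contingency(observed, correction=False)

# Passing the table WITH subtotals changes the degrees of freedom.
_, _, dof_with_margins, _ = chi2_contingency(table)

print(f"dof = {dof}, chi (Yates) = {chi:.4f}, chi (raw) = {chi_raw:.4f}")
print(f"dof if subtotals are left in = {dof_with_margins}")
```

The last line is the reason for dropping the subtotals: a 2x2 table has one degree of freedom, while the same table with margins is treated as 3x3 with four, which changes the p-value.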

Step 3: Print P Value

In the above step we stored the output values in the variables chi, p, dof, and expected. To check whether the p-value from this calculation is less than our alpha value (0.05), we print it:

print(p)

The output of the above command will be:

0.0000000000004257

All I need to do now is compare the p-value with my alpha to reach a conclusion. The above value is less than 0.0001, which is clearly less than 0.05 (our alpha value), so the result is significant. Hence, we reject the null hypothesis, which tells us that there is a relationship between obesity and heart failure.
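The comparison itself can also be left to Python. This is only a small sketch: I hard-code the p-value printed above so the snippet runs on its own, and the variable name decision is just my choice.

```python
alpha = 0.05
p = 4.257e-13  # the p-value printed in Step 3

# Reject the null hypothesis when the p-value falls below alpha.
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"p = {p:.3e} -> {decision}")
```

Printing in scientific notation makes very small p-values much easier to read than a long run of zeros.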

Conclusion

The manual calculation of chi-square tests, as you have seen, takes quite a bit of time and effort, whereas Python can do the same calculation with a single function call. In this article, I discussed what this test is, the different types of chi-square tests, and how to perform a sample chi-square test. We also learned how to carry out the same computation in Python with a single function, saving time and effort.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is a Chi-Square Test and why is it important in data analysis?

A. A Chi-Square Test is a statistical method used to determine if there is a significant association between categorical variables. It’s crucial in data analysis as it helps identify relationships and patterns within data sets, aiding decision-making and hypothesis testing.

Q2. How do you perform a Chi-Square Test in Python?

A. Performing a Chi-Square Test in Python involves using libraries like scipy and pandas. You can use functions such as scipy.stats.chisquare() or scipy.stats.chi2_contingency() to conduct the test on categorical data, enabling hands-on analysis of relationships and dependencies.

Q3. What are the prerequisites for conducting Chi-Square Tests in Python?

A. Before conducting Chi-Square Tests in Python, one should have a basic understanding of Python programming and familiarity with libraries like pandas and scipy. Additionally, having knowledge of categorical data and the concepts of statistical hypothesis testing would be beneficial for effective utilization.

Q4. How to do Chi Square?

A. To perform a Chi-Square Test: define the hypotheses, collect categorical data, create a contingency table, calculate the expected frequencies, compute the Chi-Square statistic, find the critical value (or the p-value), compare it with your threshold, interpret the result, and repeat if necessary.

Aashish

Product Manager with 10+ years of experience in driving product innovation, managing product development and leading agile product teams. Experienced in developing product strategies and executing on tactical plans to ensure successful product launches and profitable product roadmaps. Skilled in developing product roadmaps, managing product portfolios, and creating product requirements.
