Kajal Kumari — May 3, 2021
This article was published as a part of the Data Science Blogathon.

DECISION TREE

Decision tree learning, or classification trees, is a collection of divide-and-conquer problem-solving strategies that use tree-like structures to predict the value of an outcome variable.

The tree starts with the root node consisting of the complete data and thereafter uses intelligent strategies to split the nodes into multiple branches.

The original dataset is divided into subsets in this process.

Consider an everyday question: how much milk should you buy today? To answer it, your subconscious brain makes a few computations (based on the example questions below) and you end up purchasing the necessary amount of milk.

Is it a weekday? On weekdays we require 1 liter of milk.

Is it a weekend? On weekends we require 1.5 liters of milk.

Are we expecting any guests today? We need to buy an additional 250 mL of milk for every guest, and so on.
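The milk questions above already behave like a tiny decision tree. As a rough sketch (the function name and parameters are made up for illustration), they could be written like this:

```python
def milk_to_buy_litres(is_weekend: bool, guests: int = 0) -> float:
    """Return the litres of milk to buy, following the rules in the text."""
    # First question: weekday (1 litre) or weekend (1.5 litres)?
    base = 1.5 if is_weekend else 1.0
    # Second question: 250 mL (0.25 litre) extra for every expected guest.
    return base + 0.25 * guests


print(milk_to_buy_litres(is_weekend=False))           # 1.0
print(milk_to_buy_litres(is_weekend=True, guests=2))  # 2.0
```

Each question narrows the answer, which is the same idea a decision tree applies to data.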

Before jumping into the theory of decision trees, let us first explain what decision trees are and why we should use them.

 

Why use decision trees?

Tree-based algorithms are among the best supervised learning methods. They are predictive models that offer high accuracy and are simple to understand.

How does the decision tree work?

Several algorithms have been written to build a decision tree; which one to use depends on the problem.

A few of the commonly used algorithms are listed below:

• CART

• ID3

• C4.5

• CHAID

We will now explain the CHAID algorithm step by step. Before that, we will briefly discuss chi-square.

chi_square

Chi-square is a statistical measure of the difference between child and parent nodes. To calculate it, we find the difference between the observed and expected counts of the target variable for each node; the squared sum of these standardized differences gives the chi-square value.

Formula

To find the most dominant feature, CHAID (Chi-square Automatic Interaction Detection) uses chi-square tests, whereas ID3 uses information gain, C4.5 uses gain ratio, and CART uses the Gini index.

Today, most programming libraries (e.g. Pandas for Python) use the Pearson metric for correlation by default.

The formula for chi-square:

$\sqrt{\dfrac{(y - y')^2}{y'}}$

where y is the actual value and y' is the expected value.
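As a small sketch, the formula can be written as a Python function. The name `chi_value` is an assumption for illustration; the square-root form follows this article's formula.

```python
import math


def chi_value(actual: float, expected: float) -> float:
    """Square-root form of the chi-square term used in this article."""
    return math.sqrt((actual - expected) ** 2 / expected)


# Example: an actual count of 3 against an expected count of 3.5
print(round(chi_value(3, 3.5), 3))  # 0.267
```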

Data set

We are going to build decision rules for the following data set. The decision column is the target we would like to find based on some features.

By the way, we will ignore the day column because it is just the row number.

[Image: CHAID data set]

The Python code below reads the data set from a CSV file:

import pandas as pd
data = pd.read_csv("dataset.csv")

data.head()

To choose the node on which to split the data, we need to find the most important feature with respect to the target column.

 

Humidity feature

The humidity column has two classes: high and normal. We will now calculate the chi-square values for them.

| Humidity | Yes | No | Total | Expected | Chi-square (Yes) | Chi-square (No) |
| --- | --- | --- | --- | --- | --- | --- |
| High | 3 | 4 | 7 | 3.5 | 0.267 | 0.267 |
| Normal | 6 | 1 | 7 | 3.5 | 1.336 | 1.336 |

 

For each row, the Total column is the sum of the yes and no decisions. Half of the total is the expected value, because the decision has 2 classes. It is easy to calculate the chi-square values from this table.

For example,

the chi-square value for yes with high humidity is $\sqrt{(3 - 3.5)^2 / 3.5} = 0.267$,

where the actual value is 3 and the expected value is 3.5.

So, the chi-square value of the humidity feature is

=  0.267 + 0.267 + 1.336 + 1.336

= 3.207
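The same sum can be checked with a short sketch. The helper name `feature_chi_square` and the dictionary layout are assumptions for illustration; the counts are the yes/no counts from the humidity table.

```python
import math


def feature_chi_square(counts):
    """Sum the article's chi values over every class of one feature.

    counts maps a class name to its (yes_count, no_count) pair.
    """
    total = 0.0
    for yes, no in counts.values():
        expected = (yes + no) / 2  # two decision classes, so half of the row total
        for actual in (yes, no):
            total += math.sqrt((actual - expected) ** 2 / expected)
    return total


humidity = {"high": (3, 4), "normal": (6, 1)}
print(round(feature_chi_square(humidity), 3))  # 3.207
```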

Now, we will find chi-square values for other features also. The feature having the maximum chi-square value will be the decision point. What about the wind feature?

 

Wind feature

The wind column has two classes: weak and strong. The table below shows the calculation.

[Table: chi-square calculation for the wind feature]

Here, the chi-square value of the wind feature is

= 0.802 + 0.802 + 0 + 0

= 1.604

This value is lower than the chi-square value of humidity. What about the temperature feature?

Temperature feature

The temperature column has three classes: hot, cool, and mild. The table below shows the calculation.

[Table: chi-square calculation for the temperature feature]

Here, the chi-square value of the temperature feature is

= 0 + 0 + 0.577 + 0.577 + 0.707 + 0.707

= 2.569

This value is lower than the chi-square value of humidity but greater than that of wind. What about the outlook feature?

Outlook feature

The outlook column has three classes: sunny, rain, and overcast. The table below shows the calculation.

[Table: chi-square calculation for the outlook feature]

Herein, the chi-square test value of the outlook feature is

= 0.316 + 0.316 + 1.414 + 1.414 + 0.316 + 0.316

= 4.092

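We have calculated the chi-square values of all features. Let's see them all in one table.

| Feature | Chi-square value |
| --- | --- |
| Outlook | 4.092 |
| Temperature | 2.569 |
| Humidity | 3.207 |
| Wind | 1.604 |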

 

As seen, the outlook column has the highest chi-square value. This means it is the most dominant feature, so we put it at the root node.
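Choosing the root is simply a matter of taking the feature with the largest value. A minimal sketch using the four chi-square values computed above (the variable names are illustrative):

```python
# Chi-square values computed for each feature on the full data set
chi_values = {
    "outlook": 4.092,
    "temperature": 2.569,
    "humidity": 3.207,
    "wind": 1.604,
}

# The feature with the highest chi-square value becomes the root node
root = max(chi_values, key=chi_values.get)
print(root)  # outlook
```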

[Image: CHAID tree with outlook at the root node]

In the illustration above, we have split the raw data by the outlook classes. For instance, the overcast branch has only yes decisions in its sub-dataset. This means that the CHAID tree returns YES if the outlook is overcast.

Both the sunny and rain branches have yes and no decisions. We will apply chi-square tests to these sub-datasets.
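The split-and-check step can be sketched in plain Python. The rows below are hypothetical stand-ins, not the article's data set, and the helper names `split_by_feature` and `is_pure` are made up for illustration.

```python
from collections import defaultdict


def split_by_feature(rows, feature):
    """Group rows into sub-datasets, one per class of the chosen feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    return dict(groups)


def is_pure(rows, target="decision"):
    """A branch is finished when all of its rows share one decision."""
    return len({row[target] for row in rows}) == 1


# Hypothetical rows, only to show the mechanics
rows = [
    {"outlook": "overcast", "decision": "yes"},
    {"outlook": "overcast", "decision": "yes"},
    {"outlook": "sunny", "decision": "no"},
    {"outlook": "sunny", "decision": "yes"},
]

branches = split_by_feature(rows, "outlook")
for value, subset in branches.items():
    print(value, "-> leaf" if is_pure(subset) else "-> split further")
```

A pure branch becomes a leaf, while a mixed branch goes through another round of chi-square tests.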

Outlook = Sunny branch

This branch has 5 instances. We now search for the most dominant feature. By the way, we will disregard the outlook feature from now on, since it has the same value in every row of this branch. In other words, we will find the most dominant column among temperature, humidity, and wind.

[Table: sub-dataset for the sunny outlook branch]

Humidity feature when the outlook is sunny

[Table: chi-square calculation for humidity, sunny outlook]

Chi-square value of humidity feature for sunny outlook is

=   1.225 + 1.225 + 1 + 1

= 4.449

Wind feature when the outlook is sunny

[Table: chi-square calculation for wind, sunny outlook]

Chi-square value of wind feature for sunny outlook is

=     0.408 + 0.408 + 0 + 0

= 0.816

Temperature feature when the outlook is sunny

[Table: chi-square calculation for temperature, sunny outlook]

 

So, the chi-square value of temperature feature for sunny outlook is

=     1 + 1 + 0 + 0 + 0.707 + 0.707

= 3.414

We have found the chi-square values for the sunny outlook branch. Let's see them all in one table.

| Feature | Chi-square value |
| --- | --- |
| Temperature | 3.414 |
| Humidity | 4.449 |
| Wind | 0.816 |

Humidity is the most dominant feature for the sunny outlook branch. We will use this feature as the decision rule.
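The branch-level comparison can be reproduced with a short sketch. The per-class yes/no counts below do not appear in the text; they are assumptions reconstructed from the chi-square terms above for the 5 sunny rows. The helper name is also illustrative.

```python
import math


def feature_chi_square(counts):
    """Sum of sqrt((actual - expected)^2 / expected) over all classes and decisions."""
    total = 0.0
    for yes, no in counts.values():
        expected = (yes + no) / 2
        total += sum(math.sqrt((a - expected) ** 2 / expected) for a in (yes, no))
    return total


# Assumed (yes, no) counts per class within the sunny branch
sunny_counts = {
    "humidity": {"high": (0, 3), "normal": (2, 0)},
    "wind": {"weak": (1, 2), "strong": (1, 1)},
    "temperature": {"hot": (0, 2), "mild": (1, 1), "cool": (1, 0)},
}

scores = {name: feature_chi_square(c) for name, c in sunny_counts.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 3))
print("best:", best)  # best: humidity
```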

[Image: tree with the humidity rule under the sunny branch]

Now both humidity branches for the sunny outlook have only one decision, as illustrated above. The CHAID tree returns NO for sunny outlook and high humidity, and YES for sunny outlook and normal humidity.

Rain outlook branch

This branch still has both yes and no decisions. We need to apply the chi-square test to this branch to reach a decision. It has 5 instances, as shown in the sub-dataset below. Let's find the most dominant feature among temperature, humidity, and wind.

[Table: sub-dataset for the rain outlook branch]

 

Wind feature for rain outlook

The wind feature has two classes for the rain outlook: weak and strong.

[Table: chi-square calculation for wind, rain outlook]

So, the chi-square value of wind feature for rain outlook is

=     1.225 + 1.225 + 1 + 1

= 4.449

Humidity feature for rain outlook

The humidity feature has two classes for the rain outlook: high and normal.

[Table: chi-square calculation for humidity, rain outlook]

Chi-square value of humidity feature for rain outlook is

=      0 + 0 + 0.408 + 0.408

=    0.816

Temperature feature for rain outlook

The temperature feature has two classes for the rain outlook: mild and cool.

[Table: chi-square calculation for temperature, rain outlook]

 

Chi-square value of temperature feature for rain outlook is

= 0 + 0 + 0.408 + 0.408

                                                   = 0.816

We have found all the chi-square values for the rain outlook branch. Let's see them all in a single table.

| Feature | Chi-square value |
| --- | --- |
| Wind | 4.449 |
| Humidity | 0.816 |
| Temperature | 0.816 |

Thus, the wind feature is the winner for the rain outlook branch. Put this column in the corresponding branch and look at the resulting sub-datasets.

[Image: sub-datasets under the rain outlook branch]

As seen, every branch now has a sub-dataset with a single decision, either yes or no. In this way, we can generate the CHAID tree illustrated below.

[Image: CHAID tree]

The final form of the CHAID tree.
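The whole procedure (score every feature, split on the best one, repeat until a branch has a single decision) can be sketched recursively. This is a simplified illustration under stated assumptions, not chefboost's implementation: the rows are hypothetical toy data, and the names `chi_score` and `build_tree` are made up.

```python
import math
from collections import Counter, defaultdict


def chi_score(rows, feature, target):
    """Sum of sqrt((actual - expected)^2 / expected) over each class of `feature`."""
    labels = sorted({r[target] for r in rows})
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r[target])
    score = 0.0
    for decisions in groups.values():
        expected = len(decisions) / len(labels)  # equal share per decision class
        counts = Counter(decisions)
        score += sum(math.sqrt((counts[l] - expected) ** 2 / expected) for l in labels)
    return score


def build_tree(rows, features, target="decision"):
    """Return a nested dict {feature: {class: subtree_or_decision}}."""
    decisions = [r[target] for r in rows]
    if len(set(decisions)) == 1 or not features:
        return Counter(decisions).most_common(1)[0][0]  # leaf: single or majority decision
    best = max(features, key=lambda f: chi_score(rows, f, target))
    remaining = [f for f in features if f != best]
    branches = defaultdict(list)
    for r in rows:
        branches[r[best]].append(r)
    return {best: {v: build_tree(sub, remaining, target) for v, sub in branches.items()}}


# Hypothetical toy rows, not the article's data set
rows = [
    {"outlook": "sunny", "wind": "weak", "decision": "no"},
    {"outlook": "sunny", "wind": "strong", "decision": "no"},
    {"outlook": "overcast", "wind": "weak", "decision": "yes"},
    {"outlook": "overcast", "wind": "strong", "decision": "yes"},
    {"outlook": "rain", "wind": "weak", "decision": "yes"},
    {"outlook": "rain", "wind": "strong", "decision": "no"},
]

tree = build_tree(rows, ["outlook", "wind"])
print(tree)
```

On these toy rows, outlook scores highest at the root and only the rain branch needs a second split on wind.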

Python implementation of a Decision tree using CHAID

from chefboost import Chefboost as cb
import pandas as pd

# Load the weather data set; adjust the path to where your CSV file lives
data = pd.read_csv("/home/kajal/Downloads/weather.csv")
data.head()  # preview the first rows

# Select CHAID as the tree-building algorithm
config = {"algorithm": "CHAID"}

# Train the decision tree on the data set
tree = cb.fit(data, config)

tree

 

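# Use the third row of the data set as a test instance
# (alternative: test_instance = ['sunny','hot','high','weak','no'])
test_instance = data.iloc[2]

test_instance

# Predict the decision for the test instance with the trained tree
cb.predict(tree, test_instance)

Output: 'Yes'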

# obj[0]: outlook, obj[1]: temperature, obj[2]: humidity, obj[3]: windy
# {"feature": "outlook", "instances": 14, "metric_value": 4.0933, "depth": 1}

def findDecision(obj):
    if obj[0] == 'rainy':
        # {"feature": "windy", "instances": 5, "metric_value": 4.4495, "depth": 2}
        if obj[3] == 'weak':
            return 'yes'
        elif obj[3] == 'strong':
            return 'no'
        else:
            return 'no'
    elif obj[0] == 'sunny':
        # {"feature": "humidity", "instances": 5, "metric_value": 4.4495, "depth": 2}
        if obj[2] == 'high':
            return 'no'
        elif obj[2] == 'normal':
            return 'yes'
        else:
            return 'yes'
    elif obj[0] == 'overcast':
        return 'yes'
    else:
        return 'yes'
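The same rules can also be held as data rather than nested if statements. The sketch below is an illustrative alternative, not something chefboost produces: the nested-dict layout and the `predict` helper are assumptions, while the branch values mirror the rules above.

```python
# The tree from the rules above, written as nested dictionaries
tree = {
    "outlook": {
        "overcast": "yes",
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "rainy": {"windy": {"weak": "yes", "strong": "no"}},
    }
}


def predict(node, instance):
    """Walk down the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        feature, branches = next(iter(node.items()))
        node = branches[instance[feature]]
    return node


sample = {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "weak"}
print(predict(tree, sample))  # no
```

Unlike findDecision, this sketch has no fallback branch, so an unseen feature value raises a KeyError instead of returning a default decision.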

Conclusion

Thus, we have built a CHAID decision tree from start to finish in this post. CHAID uses the chi-square metric to find the most important feature and applies this recursively until each sub-dataset has a single decision. Even though this is a legacy decision tree algorithm, it is still a commonly used process for classification problems.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.
