CHAID Uses the Chi-Square Test, Not Info Gain or Gini
In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy or gain, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).
CHAID Splitting Criterion
The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.
A small p-value means strong evidence of a relationship between the predictor and the target.
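The selection rule can be sketched in a few lines of Python. This is only an illustration: the feature names and p-values below are made up, and the 0.05 threshold and the helper name `choose_split_feature` are assumptions, not part of any particular CHAID library.

```python
# Hypothetical p-values from a Chi-square test of each feature against the target.
# These numbers are invented for illustration, not computed from real data.
p_values = {"student": 0.014, "age_group": 0.135, "income": 0.42}

ALPHA = 0.05  # assumed significance threshold for allowing a split


def choose_split_feature(p_values, alpha=ALPHA):
    """Return the feature with the smallest p-value, or None if none is significant."""
    best = min(p_values, key=p_values.get)
    return best if p_values[best] < alpha else None


best_feature = choose_split_feature(p_values)
print(best_feature)
```

Full CHAID also merges similar categories and adjusts the p-values before comparing them, which this sketch leaves out.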
Formula Used in CHAID
The Chi-square formula is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
- $\chi^2$ = Chi-square statistic
- $O$ = Observed frequency (what’s in your data)
- $E$ = Expected frequency (if there's no relationship)
- The sum is over all combinations of feature and class
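The formula translates almost directly into code. The following is a minimal sketch in plain Python; the function name `chi_square_statistic` and the nested-list table layout are assumptions chosen for illustration, and the counts are those of the 2x2 Student example used later in this post.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two equally shaped tables."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            total += (o - e) ** 2 / e
    return total


# Rows are feature values, columns are classes.
observed = [[3, 0], [0, 3]]
expected = [[1.5, 1.5], [1.5, 1.5]]
chi2 = chi_square_statistic(observed, expected)
print(chi2)  # 6.0
```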
✅ Example: Mini Chi-square Test
Let’s go back to this dataset:
Student | Buys Laptop |
---|---|
Yes | Yes (3) |
No | No (3) |
We can fill a 2x2 table:
Student | Buys = Yes | Buys = No | Total |
---|---|---|---|
Student = Yes | 3 | 0 | 3 |
Student = No | 0 | 3 | 3 |
Total | 3 | 3 | 6 |
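Step 1: Compute Expected Values
For each cell:
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
Example:
- Expected (Student = Yes, Buys = Yes) = $\frac{3 \times 3}{6} = 1.5$
- Expected (Student = Yes, Buys = No) = $\frac{3 \times 3}{6} = 1.5$
- Expected (Student = No, Buys = Yes) = 1.5
- Expected (Student = No, Buys = No) = 1.5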
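The same expected values can be derived from the observed table alone. Here is a small sketch, assuming the table is stored as a list of rows; `expected_frequencies` is a hypothetical helper name.

```python
def expected_frequencies(observed):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Observed counts from the 2x2 table: rows are Student = Yes / No,
# columns are Buys = Yes / No.
observed = [[3, 0], [0, 3]]
expected = expected_frequencies(observed)
print(expected)  # [[1.5, 1.5], [1.5, 1.5]]
```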
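Step 2: Use Chi-square Formula
$$\chi^2 = \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5} = 1.5 + 1.5 + 1.5 + 1.5 = 6$$
Now compare this Chi-square statistic with the critical value from the Chi-square table for 1 degree of freedom, which is 3.84 at the 0.05 significance level.
If the statistic is greater, we say the relationship is significant → this feature is good for splitting. Here 6 > 3.84, so Student is a significant predictor of Buys Laptop.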
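Instead of looking up a table, the p-value can be computed directly. For the 2x2 Student table each of the four cells contributes 1.5, so the statistic is 6. With 1 degree of freedom the upper-tail probability has a closed form based on the complementary error function, which the sketch below uses; the helper name `p_value_df1` and the 0.05 level are assumptions, and the shortcut only holds for 1 degree of freedom.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a Chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


chi2 = 6.0    # statistic for the 2x2 Student table
alpha = 0.05  # assumed significance level
p = p_value_df1(chi2)
significant = p < alpha
print(round(p, 4), significant)  # roughly 0.0143, True
```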
Summary: CHAID vs. ID3/C4.5/CART
Feature | CHAID | ID3 / C4.5 / CART |
---|---|---|
Uses what to split? | Chi-square test | Entropy, Info Gain, Gini |
Data type preferred | Categorical (like Yes/No) | Categorical (ID3); categorical + numerical (C4.5, CART) |
Splits how? | Can have many branches per split | Binary (CART), multi (ID3) |
Good for | Marketing, social science data | General-purpose decision trees |
Next, let’s walk through a step-by-step, pen-and-paper-style Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.
🎯 GOAL:
We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).
🧾 Example Dataset
Person | Age Group | Buys Laptop |
---|---|---|
1 | Young | Yes |
2 | Young | No |
3 | Middle | Yes |
4 | Middle | No |
5 | Old | No |
6 | Old | Yes |
Step 1: Build a Frequency Table
Let’s count how many Yes/No for each age group:
Age Group | Buys = Yes | Buys = No | Row Total |
---|---|---|---|
Young | 1 | 1 | 2 |
Middle | 1 | 1 | 2 |
Old | 1 | 1 | 2 |
Col Total | 3 | 3 | 6 |
This is called a contingency table.
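Counting by hand works for six rows; for larger data the same table can be built with a counter. This is a minimal sketch using only the standard library, with the six records copied from the example dataset above. The variable names are illustrative.

```python
from collections import Counter

# (age_group, buys_laptop) pairs copied from the example dataset.
records = [
    ("Young", "Yes"), ("Young", "No"),
    ("Middle", "Yes"), ("Middle", "No"),
    ("Old", "No"), ("Old", "Yes"),
]

counts = Counter(records)  # how often each (group, outcome) pair occurs
groups = ["Young", "Middle", "Old"]
outcomes = ["Yes", "No"]

# One row per age group, one column per outcome.
table = [[counts[(g, o)] for o in outcomes] for g in groups]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
print(table, row_totals, col_totals)
```

`Counter` returns 0 for pairs that never occur, so empty cells are filled in without extra handling.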
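🧮 Step 2: Calculate Expected Frequencies (E)
Use this formula:
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
Each cell's expected value:
- For Young, Yes: $E = \frac{2 \times 3}{6} = 1$
- For Young, No: $E = \frac{2 \times 3}{6} = 1$
- Same for Middle and Old.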
So, the Expected table (E) is:
Age Group | Expected Yes | Expected No |
---|---|---|
Young | 1 | 1 |
Middle | 1 | 1 |
Old | 1 | 1 |
🧾 Step 3: Apply the Chi-Square Formula
Let’s compute for each cell:
$$\frac{(O - E)^2}{E}$$
All Observed = Expected (from the frequency table), so:
- $\frac{(1 - 1)^2}{1} = 0$ for every cell
- Therefore, $\chi^2 = 0 + 0 + 0 + 0 + 0 + 0 = 0$
✅ Final Result:
$$\chi^2 = 0$$
That means:
This dataset shows no association between Age Group and Buys Laptop: the observed values match the expected values exactly.
🎯 Now Let’s Try a More Interesting Case
Let’s change the dataset slightly:
Person | Age Group | Buys Laptop |
---|---|---|
1 | Young | Yes |
2 | Young | Yes |
3 | Middle | Yes |
4 | Middle | No |
5 | Old | No |
6 | Old | No |
Now build the frequency table:
Age Group | Buys = Yes | Buys = No | Row Total |
---|---|---|---|
Young | 2 | 0 | 2 |
Middle | 1 | 1 | 2 |
Old | 0 | 2 | 2 |
Col Total | 3 | 3 | 6 |
Step-by-step Expected Values
For each cell:
- Expected (Young, Yes) = $\frac{2 \times 3}{6} = 1$
- Expected (Young, No) = 1
- Expected (Middle, Yes) = 1
- Expected (Middle, No) = 1
- Expected (Old, Yes) = 1
- Expected (Old, No) = 1
So:
Age Group | O (Yes) | E (Yes) | O (No) | E (No) |
---|---|---|---|---|
Young | 2 | 1 | 0 | 1 |
Middle | 1 | 1 | 1 | 1 |
Old | 0 | 1 | 2 | 1 |
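Step 4: Apply the Chi-Square Formula
Now compute:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Let’s go cell by cell:
- (Young, Yes): $\frac{(2 - 1)^2}{1} = 1$
- (Young, No): $\frac{(0 - 1)^2}{1} = 1$
- (Middle, Yes): $\frac{(1 - 1)^2}{1} = 0$
- (Middle, No): $\frac{(1 - 1)^2}{1} = 0$
- (Old, Yes): $\frac{(0 - 1)^2}{1} = 1$
- (Old, No): $\frac{(2 - 1)^2}{1} = 1$
✅ Final Chi-square Value:
$$\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = 4$$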
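To see which cells drive the statistic, the per-cell terms can be computed programmatically. This is a minimal sketch; `cell_contributions` is a hypothetical helper name, and the observed counts are those of the second Age Group table.

```python
def cell_contributions(observed):
    """Return (O - E)^2 / E for each cell, with E from the row and column totals."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [
        [(o - r * c / n) ** 2 / (r * c / n) for o, c in zip(row, col_totals)]
        for row, r in zip(observed, row_totals)
    ]


# Rows: Young, Middle, Old. Columns: Buys = Yes, Buys = No.
observed = [[2, 0], [1, 1], [0, 2]]
contributions = cell_contributions(observed)
chi2 = sum(sum(row) for row in contributions)
print(contributions)  # [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
print(chi2)           # 4.0
```

The Middle row contributes nothing because its observed counts equal the expected ones; the whole statistic comes from the Young and Old rows.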
What Does This Mean?
To interpret this value, compare with the critical value from the Chi-square table for:
- Degrees of Freedom = (Rows − 1) × (Cols − 1) = (3 − 1) × (2 − 1) = 2
- For significance level 0.05 → Critical value = 5.99
Since $\chi^2 = 4 < 5.99$
→ ❌ Not significant at 0.05 level
→ So we do not split based on Age Group
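The table lookup can also be checked numerically. With exactly 2 degrees of freedom the Chi-square upper-tail probability simplifies to exp(−χ²/2), so both the p-value and the critical value have closed forms. The sketch below relies on that shortcut, which does not hold for other degrees of freedom; the helper names are assumptions for illustration.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """(rows - 1) * (cols - 1) for an n_rows x n_cols contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_df2(chi2):
    """Upper-tail probability of a Chi-square variable with 2 degrees of freedom."""
    return math.exp(-chi2 / 2)


alpha = 0.05                      # assumed significance level
df = degrees_of_freedom(3, 2)     # 3 age groups, 2 outcomes
p = p_value_df2(4.0)              # statistic for the Age Group table
critical = -2 * math.log(alpha)   # value of chi2 at which p equals alpha (df = 2)
split = p < alpha
print(df, round(p, 4), round(critical, 2), split)
```

Comparing the p-value with 0.05 and comparing the statistic with 5.99 are the same test, so both lead to the decision not to split on Age Group.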
✅ Summary of Steps
Step | What You Do |
---|---|
1 | Create a frequency table (observed values) |
2 | Calculate expected values |
3 | Use the formula: $\frac{(O - E)^2}{E}$ for each cell |
4 | Add up all the values |
5 | Compare with critical value to check significance |