CHAID Uses the Chi-Square Test, Not Info Gain or Gini
In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy or gain, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).
CHAID Splitting Criterion
The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.
A small p-value means strong evidence of a relationship between the predictor and the target.
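As a rough sketch of this selection rule, the snippet below computes a Chi-square p-value for two candidate features and picks the one with the smaller value. The function names `chi_square` and `p_value`, the two small count tables, and the restriction to 1 or 2 degrees of freedom (where the p-value has a closed form in Python's standard library) are all assumptions made for illustration. It is a simplified view of the criterion, not a full CHAID implementation.

```python
import math

def chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value(stat, dof):
    """Upper-tail p-value; this sketch only covers dof 1 and 2."""
    if dof == 1:
        return math.erfc(math.sqrt(stat / 2))
    if dof == 2:
        return math.exp(-stat / 2)
    raise ValueError("this sketch only handles 1 or 2 degrees of freedom")

# Hypothetical counts: rows are feature values, columns are (Buys=Yes, Buys=No)
tables = {
    "Student": [[3, 0], [0, 3]],
    "Age Group": [[2, 0], [1, 1], [0, 2]],
}
p_values = {name: p_value(*chi_square(t)) for name, t in tables.items()}
best_feature = min(p_values, key=p_values.get)  # smallest p-value wins
print(p_values, best_feature)
```

With these counts, Student has the smaller p-value, so it would be chosen as the splitting feature.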
Formula Used in CHAID
The Chi-square formula is:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Where:
- $\chi^2$ = Chi-square statistic
- $O$ = Observed frequency (what’s in your data)
- $E$ = Expected frequency (if there's no relationship)
- The sum is over all combinations of feature and class
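The formula translates almost directly into code. Here is a minimal sketch, assuming the observed and expected counts are already available as two flat lists of equal length. The function name and the sample numbers are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells, given flat lists of O and E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical cells: observed counts and their expected counts
observed = [8, 2, 2, 8]
expected = [5, 5, 5, 5]
statistic = chi_square_statistic(observed, expected)
print(statistic)
```

Each cell contributes (8 − 5)² / 5 or (2 − 5)² / 5, which is 1.8 in both cases, so the statistic comes out at about 7.2.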
✅ Example: Mini Chi-square Test
Let’s go back to this dataset (3 students who buy a laptop and 3 non-students who don't):
| Student | Buys Laptop | Count |
|---|---|---|
| Yes | Yes | 3 |
| No | No | 3 |
We can fill a 2x2 table:
| | Buys = Yes | Buys = No | Total |
|---|---|---|---|
| Student = Yes | 3 | 0 | 3 |
| Student = No | 0 | 3 | 3 |
| Total | 3 | 3 | 6 |
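Step 1: Compute Expected Values
For each cell:
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
Example:
- Expected (Student = Yes, Buys = Yes) = $\frac{3 \times 3}{6} = 1.5$
- Expected (Student = Yes, Buys = No) = $\frac{3 \times 3}{6} = 1.5$
- Expected (Student = No, Buys = Yes) = 1.5
- Expected (Student = No, Buys = No) = 1.5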
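The same calculation can be sketched in a few lines of Python. The function name `expected_frequencies` and the list-of-rows layout are assumptions for illustration. The counts are the ones from the 2x2 table above.

```python
def expected_frequencies(table):
    """Expected count per cell = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows: Student = Yes, Student = No; columns: Buys = Yes, Buys = No
observed = [[3, 0], [0, 3]]
expected = expected_frequencies(observed)
print(expected)  # [[1.5, 1.5], [1.5, 1.5]]
```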
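Step 2: Use Chi-square Formula
$$\chi^2 = \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5} = 1.5 + 1.5 + 1.5 + 1.5 = 6$$
Now compare this Chi-square statistic with the critical value from the Chi-square table (for degrees of freedom = 1, that is 3.84 at the 0.05 significance level).
If it’s greater, we say the relationship is significant → this feature is good for splitting. Here 6 > 3.84, so Student is a significant predictor of Buys Laptop.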
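To double-check the arithmetic, here is a small sketch that sums the four cells of the Student table and compares the result with the table value. The constant name is hypothetical. 3.841 is the standard critical value for 1 degree of freedom at the 0.05 level, which is an assumed significance level here.

```python
observed = [3, 0, 0, 3]          # the four cells of the 2x2 table
expected = [1.5, 1.5, 1.5, 1.5]  # from Step 1

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

CRITICAL_VALUE_DF1 = 3.841  # chi-square table, df = 1, significance level 0.05
is_significant = chi_square > CRITICAL_VALUE_DF1

print(chi_square, is_significant)  # 6.0 True
```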
Summary: CHAID vs. ID3/C4.5/CART
| Feature | CHAID | ID3 / C4.5 / CART |
|---|---|---|
| Splitting criterion | Chi-square test (smallest p-value) | Information Gain (ID3), Gain Ratio (C4.5), Gini (CART) |
| Data type preferred | Categorical (like Yes/No) | Categorical + Numerical |
| Branches per split | Can have many branches per split | Binary (CART), multiway (ID3, C4.5) |
| Good for | Marketing, social science data | General-purpose decision trees |
Next, let’s walk through a step-by-step, pen-and-paper-style Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.
🎯 GOAL:
We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).
🧾 Example Dataset
| Person | Age Group | Buys Laptop |
|---|---|---|
| 1 | Young | Yes |
| 2 | Young | No |
| 3 | Middle | Yes |
| 4 | Middle | No |
| 5 | Old | No |
| 6 | Old | Yes |
Step 1: Build a Frequency Table
Let’s count how many Yes/No for each age group:
| Age Group | Buys = Yes | Buys = No | Row Total |
|---|---|---|---|
| Young | 1 | 1 | 2 |
| Middle | 1 | 1 | 2 |
| Old | 1 | 1 | 2 |
| Col Total | 3 | 3 | 6 |
This is called a contingency table.
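If you'd rather let code do the counting, a contingency table can be built with `collections.Counter`. This is just a sketch. The variable names and the list-of-tuples layout for the six people are assumptions.

```python
from collections import Counter

# (age group, buys laptop) for the six people in the example dataset
records = [
    ("Young", "Yes"), ("Young", "No"),
    ("Middle", "Yes"), ("Middle", "No"),
    ("Old", "No"), ("Old", "Yes"),
]

counts = Counter(records)  # maps (age group, answer) -> frequency
age_groups = ["Young", "Middle", "Old"]
answers = ["Yes", "No"]

# One row per age group, one column per answer; missing pairs count as 0
table = [[counts[(age, ans)] for ans in answers] for age in age_groups]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

print(table)       # [[1, 1], [1, 1], [1, 1]]
print(row_totals)  # [2, 2, 2]
print(col_totals)  # [3, 3]
```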
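🧮 Step 2: Calculate Expected Frequencies (E)
Use this formula:
$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$
Each cell's expected value:
- For Young, Yes: $E = \frac{2 \times 3}{6} = 1$
- For Young, No: $E = \frac{2 \times 3}{6} = 1$
- Same for Middle and Old.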
So, the Expected table (E) is:
| Age Group | Expected Yes | Expected No |
|---|---|---|
| Young | 1 | 1 |
| Middle | 1 | 1 |
| Old | 1 | 1 |
🧾 Step 3: Apply the Chi-Square Formula
Let’s compute for each cell:
$$\frac{(O - E)^2}{E}$$
All Observed = Expected (from the frequency table), so:
- $O - E = 0$ for every cell
- Therefore, $\frac{(O - E)^2}{E} = 0$ for every cell
✅ Final Result:
$$\chi^2 = 0$$
That means:
There is no relationship between Age Group and Buys Laptop in this dataset — the observed values match the expected.
🎯 Now Let’s Try a More Interesting Case
Let’s change the dataset slightly:
| Person | Age Group | Buys Laptop |
|---|---|---|
| 1 | Young | Yes |
| 2 | Young | Yes |
| 3 | Middle | Yes |
| 4 | Middle | No |
| 5 | Old | No |
| 6 | Old | No |
Now build the frequency table:
| Age Group | Buys = Yes | Buys = No | Row Total |
|---|---|---|---|
| Young | 2 | 0 | 2 |
| Middle | 1 | 1 | 2 |
| Old | 0 | 2 | 2 |
| Col Total | 3 | 3 | 6 |
Step-by-step Expected Values
For each cell:
- Expected (Young, Yes) = $\frac{2 \times 3}{6} = 1$
- Expected (Young, No) = 1
- Expected (Middle, Yes) = 1
- Expected (Middle, No) = 1
- Expected (Old, Yes) = 1
- Expected (Old, No) = 1
So:
| Age Group | O (Yes) | E (Yes) | O (No) | E (No) |
|---|---|---|---|---|
| Young | 2 | 1 | 0 | 1 |
| Middle | 1 | 1 | 1 | 1 |
| Old | 0 | 1 | 2 | 1 |
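Step 4: Apply the Chi-Square Formula
Now compute:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
Let’s go cell by cell:
- (Young, Yes): $\frac{(2 - 1)^2}{1} = 1$
- (Young, No): $\frac{(0 - 1)^2}{1} = 1$
- (Middle, Yes): $\frac{(1 - 1)^2}{1} = 0$
- (Middle, No): $\frac{(1 - 1)^2}{1} = 0$
- (Old, Yes): $\frac{(0 - 1)^2}{1} = 1$
- (Old, No): $\frac{(2 - 1)^2}{1} = 1$
✅ Final Chi-square Value:
$$\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = 4$$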
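The cell-by-cell work can be mirrored in a short sketch like this one. The dictionary layout and variable names are assumptions. The counts come from the frequency table for this second dataset, where every expected value is 1.

```python
observed = {
    ("Young", "Yes"): 2, ("Young", "No"): 0,
    ("Middle", "Yes"): 1, ("Middle", "No"): 1,
    ("Old", "Yes"): 0, ("Old", "No"): 2,
}
expected = 1  # every cell: (2 * 3) / 6

# Contribution of each cell: (O - E)^2 / E
contributions = {cell: (o - expected) ** 2 / expected for cell, o in observed.items()}
chi_square = sum(contributions.values())

for cell, value in contributions.items():
    print(cell, value)
print("chi-square =", chi_square)  # chi-square = 4.0
```

Printing the contributions per cell makes it easy to see that the Young and Old rows drive the statistic, while the Middle row adds nothing.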
What Does This Mean?
To interpret this value, compare with the critical value from the Chi-square table for:
- Degrees of Freedom = (Rows − 1) × (Cols − 1) = (3 − 1) × (2 − 1) = 2
- For significance level 0.05 → Critical value = 5.99
Since $\chi^2 = 4 < 5.99$
→ ❌ Not significant at 0.05 level
→ So we do not split based on Age Group
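Here is the same decision as a small sketch. The variable names are hypothetical. As an extra not used in the walkthrough above, it also computes the p-value, which for exactly 2 degrees of freedom has the closed form exp(−χ²/2), so no statistics library is needed.

```python
import math

rows, cols = 3, 2   # age groups x answers
chi_square = 4.0    # from the cell-by-cell sum

dof = (rows - 1) * (cols - 1)
CRITICAL_VALUE_DF2 = 5.991  # chi-square table, df = 2, significance level 0.05

# For df = 2 the upper-tail p-value has the closed form exp(-x / 2)
p_value = math.exp(-chi_square / 2)

split_on_age_group = chi_square > CRITICAL_VALUE_DF2
print(dof, round(p_value, 3), split_on_age_group)  # 2 0.135 False
```

A p-value of about 0.135 is above 0.05, which agrees with the critical-value comparison.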
✅ Summary of Steps
| Step | What You Do |
|---|---|
| 1 | Create a frequency table (observed values) |
| 2 | Calculate expected values |
| 3 | Use the formula: $\frac{(O - E)^2}{E}$ for each cell |
| 4 | Add up all the values |
| 5 | Compare with critical value to check significance |
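Putting the five steps together, one possible end-to-end sketch looks like this. The function name `chi_square_test`, the input format of (feature value, target value) pairs, and the small lookup of 0.05-level critical values are assumptions for illustration. A statistics library would normally supply the critical value or p-value instead.

```python
from collections import Counter

# Critical values at the 0.05 level, copied from a standard chi-square table
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi_square_test(pairs):
    """Run the five steps on (feature value, target value) pairs."""
    # Step 1: frequency table of observed counts
    counts = Counter(pairs)
    feature_values = sorted({f for f, _ in pairs})
    target_values = sorted({t for _, t in pairs})
    row_totals = {f: sum(counts[(f, t)] for t in target_values) for f in feature_values}
    col_totals = {t: sum(counts[(f, t)] for f in feature_values) for t in target_values}
    n = len(pairs)

    statistic = 0.0
    for f in feature_values:
        for t in target_values:
            # Step 2: expected value for this cell
            expected = row_totals[f] * col_totals[t] / n
            # Steps 3 and 4: apply the formula and add up the values
            statistic += (counts[(f, t)] - expected) ** 2 / expected

    # Step 5: compare with the critical value for this table's degrees of freedom
    dof = (len(feature_values) - 1) * (len(target_values) - 1)
    return statistic, dof, statistic > CRITICAL_0_05[dof]

pairs = [("Young", "Yes"), ("Young", "Yes"), ("Middle", "Yes"),
         ("Middle", "No"), ("Old", "No"), ("Old", "No")]
print(chi_square_test(pairs))  # (4.0, 2, False)
```

The lookup only covers 1 to 4 degrees of freedom, so larger tables would need more entries.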