Thursday, 4 September 2025

CHAID (Chi-squared Automatic Interaction Detection) made easy

๐Ÿ” CHAID Uses Chi-Square Test — Not Info Gain or Gini

In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy or gain, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).


📌 CHAID Splitting Criterion

👉 The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.

A small p-value is strong evidence that the predictor and the target are related.
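The selection rule can be sketched in a few lines of Python. This is only an illustration, not a full CHAID implementation: the counts are made up, the helper names (`chi_square`, `chi2_p_value`, `best_split`) are hypothetical, and real CHAID additionally merges similar categories and adjusts the p-values (Bonferroni) before comparing them.

```python
import math

def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            stat += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df

def chi2_p_value(stat, df):
    """Upper-tail probability of the chi-square distribution (stdlib only)."""
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    # Start from the closed form for df = 1 (odd df) or from 0 (even df),
    # then raise df in steps of 2 with the recurrence
    # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    p, k = (math.erfc(math.sqrt(half)), 1) if df % 2 else (0.0, 0)
    while k < df:
        p += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return p

def best_split(tables):
    """Pick the feature whose contingency table gives the smallest p-value."""
    p_values = {name: chi2_p_value(*chi_square(t)) for name, t in tables.items()}
    return min(p_values, key=p_values.get), p_values

# Hypothetical counts: rows = feature categories, columns = (Buys = Yes, Buys = No).
tables = {
    "Student": [[30, 10], [10, 30]],
    "Age Group": [[14, 12], [13, 14], [13, 14]],
}
feature, p_values = best_split(tables)
print(feature, p_values)
```

With these made-up counts, "Student" gets a far smaller p-value than "Age Group", so it would be picked as the splitting feature.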


๐Ÿ“ Formula Used in CHAID

The Chi-square formula is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:

  • $\chi^2$ = Chi-square statistic

  • $O$ = Observed frequency (what’s in your data)

  • $E$ = Expected frequency (if there's no relationship)

  • The sum is over all combinations of feature and class
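As a quick illustration, the formula translates almost word for word into Python. The function name `chi_square_statistic` and the counts below are hypothetical, chosen only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over all cells.

    Both arguments are flat lists of equal length, one entry per cell.
    """
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts for four cells, with an expected count of 15 in each.
observed = [10, 20, 20, 10]
expected = [15.0, 15.0, 15.0, 15.0]

stat = chi_square_statistic(observed, expected)
print(round(stat, 4))  # each cell contributes 25 / 15
```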


✅ Example: Mini Chi-square Test

Let’s go back to this dataset:

| Student | Buys Laptop |
|---|---|
| Yes | Yes (3) |
| No | No (3) |

We can fill a 2x2 table:

| | Buys = Yes | Buys = No | Total |
|---|---|---|---|
| Student = Yes | 3 | 0 | 3 |
| Student = No | 0 | 3 | 3 |
| Total | 3 | 3 | 6 |

Step 1: Compute Expected Values

For each cell:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Example:

  • Expected (Student = Yes, Buys = Yes) = $\frac{3 \times 3}{6} = 1.5$

  • Expected (Student = Yes, Buys = No) = $\frac{3 \times 3}{6} = 1.5$

  • Expected (Student = No, Buys = Yes) = 1.5

  • Expected (Student = No, Buys = No) = 1.5
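The same expected values can be reproduced with a small Python sketch. The helper name `expected_frequencies` is an assumption for illustration; the counts are the ones from the 2x2 Student table.

```python
def expected_frequencies(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Rows: Student = Yes, Student = No. Columns: Buys = Yes, Buys = No.
observed = [[3, 0], [0, 3]]
expected = expected_frequencies(observed)
print(expected)  # [[1.5, 1.5], [1.5, 1.5]]
```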

Step 2: Use Chi-square Formula

$$\begin{aligned}
\chi^2 &= \sum \frac{(O - E)^2}{E} \\
&= \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5} \\
&= \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5} \\
&= 4 \times 1.5 = 6.0
\end{aligned}$$

Now compare this Chi-square statistic with the critical value from the Chi-square table for 1 degree of freedom, which is 3.84 at the 0.05 significance level.
Since 6.0 > 3.84, the relationship is significant → this feature is good for splitting.
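The comparison can also be done in code. The sketch below assumes a 0.05 significance level and uses the table value 3.841 for 1 degree of freedom. It also computes a p-value with a closed form that holds only for 1 degree of freedom, so treat it as an illustration rather than a general routine.

```python
import math

chi2_stat = 6.0          # value computed for the Student table
critical_value = 3.841   # chi-square table, df = 1, significance level 0.05

# For df = 1 only, the upper-tail probability has a closed form:
# P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2_stat / 2.0))

is_significant = chi2_stat > critical_value
print(f"p-value = {p_value:.4f}, significant = {is_significant}")
```

The p-value comes out at roughly 0.014, below 0.05, which agrees with the critical-value comparison.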


๐Ÿ“ Summary: CHAID vs. ID3/C4.5/CART

| Feature | CHAID | ID3 / C4.5 / CART |
|---|---|---|
| Uses what to split? | Chi-square test | Entropy, Info Gain, Gini |
| Data type preferred | Categorical (like Yes/No) | Categorical + Numerical |
| Splits how? | Can have many branches per split | Binary (CART), multi (ID3) |
| Good for | Marketing, social science data | General-purpose decision trees |

Let’s walk through a step-by-step, pen-and-paper-style Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.


🎯 GOAL:

We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).


🧾 Example Dataset

| Person | Age Group | Buys Laptop |
|---|---|---|
| 1 | Young | Yes |
| 2 | Young | No |
| 3 | Middle | Yes |
| 4 | Middle | No |
| 5 | Old | No |
| 6 | Old | Yes |

🪜 Step 1: Build a Frequency Table

Let’s count how many Yes/No for each age group:

| Age Group | Buys = Yes | Buys = No | Row Total |
|---|---|---|---|
| Young | 1 | 1 | 2 |
| Middle | 1 | 1 | 2 |
| Old | 1 | 1 | 2 |
| Col Total | 3 | 3 | 6 |

This is called a contingency table.
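Counting by hand works for six rows. For larger data, the same table can be built with a short Python sketch. The variable names are illustrative, and the records are the six people from the example dataset.

```python
from collections import Counter

# The six records from the example dataset: (age group, buys laptop).
records = [
    ("Young", "Yes"), ("Young", "No"),
    ("Middle", "Yes"), ("Middle", "No"),
    ("Old", "No"), ("Old", "Yes"),
]

counts = Counter(records)          # how often each (group, answer) pair occurs
groups = ["Young", "Middle", "Old"]
answers = ["Yes", "No"]

# One row per age group, one column per answer.
table = [[counts[(g, a)] for a in answers] for g in groups]
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

for g, row, total in zip(groups, table, row_totals):
    print(g, row, total)
print("Col Total", col_totals, sum(row_totals))
```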


🧮 Step 2: Calculate Expected Frequencies (E)

Use this formula:

$$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$$

Each cell's expected value:

  • For Young, Yes: $\frac{2 \times 3}{6} = 1$

  • For Young, No: $\frac{2 \times 3}{6} = 1$

  • Same for Middle and Old.

So, the Expected table (E) is:

| Age Group | Expected Yes | Expected No |
|---|---|---|
| Young | 1 | 1 |
| Middle | 1 | 1 |
| Old | 1 | 1 |

🧾 Step 3: Apply the Chi-Square Formula

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Let’s compute for each cell:

All Observed = Expected (from the frequency table), so:

  • $(O - E)^2 = 0$ for every cell

  • Therefore, $\chi^2 = 0$

✅ Final Result:

$$\chi^2 = 0$$

That means:
There is no relationship between Age Group and Buys Laptop in this dataset — the observed values match the expected.


🎯 Now Let’s Try a More Interesting Case

Let’s change the dataset slightly:

| Person | Age Group | Buys Laptop |
|---|---|---|
| 1 | Young | Yes |
| 2 | Young | Yes |
| 3 | Middle | Yes |
| 4 | Middle | No |
| 5 | Old | No |
| 6 | Old | No |

Now build the frequency table:

| Age Group | Buys = Yes | Buys = No | Row Total |
|---|---|---|---|
| Young | 2 | 0 | 2 |
| Middle | 1 | 1 | 2 |
| Old | 0 | 2 | 2 |
| Col Total | 3 | 3 | 6 |

🔢 Step-by-step Expected Values

For each cell:

  • Expected (Young, Yes) = $\frac{2 \times 3}{6} = 1$

  • Expected (Young, No) = 1

  • Expected (Middle, Yes) = 1

  • Expected (Middle, No) = 1

  • Expected (Old, Yes) = 1

  • Expected (Old, No) = 1

So:

| Age Group | O (Yes) | E (Yes) | O (No) | E (No) |
|---|---|---|---|---|
| Young | 2 | 1 | 0 | 1 |
| Middle | 1 | 1 | 1 | 1 |
| Old | 0 | 1 | 2 | 1 |

๐Ÿ” Step 4: Apply the Chi-Square Formula

Now compute:

ฯ‡2=(OE)2E\chi^2 = \sum \frac{(O - E)^2}{E}

Let’s go cell by cell:

  • (Young, Yes): $\frac{(2 - 1)^2}{1} = 1$

  • (Young, No): $\frac{(0 - 1)^2}{1} = 1$

  • (Middle, Yes): $\frac{(1 - 1)^2}{1} = 0$

  • (Middle, No): $\frac{(1 - 1)^2}{1} = 0$

  • (Old, Yes): $\frac{(0 - 1)^2}{1} = 1$

  • (Old, No): $\frac{(2 - 1)^2}{1} = 1$

✅ Final Chi-square Value:

$$\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = \boxed{4}$$
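To double-check the hand calculation, the cell-by-cell contributions can be reproduced with a short Python sketch. The variable names are illustrative, and the counts come from the frequency table for this second dataset.

```python
# Rows: Young, Middle, Old. Columns: Buys = Yes, Buys = No.
observed = [[2, 0], [1, 1], [0, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# One (O - E)^2 / E term per cell, in the same order as the bullets above.
contributions = []
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # expected count
        contributions.append((o - e) ** 2 / e)

chi2_stat = sum(contributions)
print(contributions, chi2_stat)  # [1.0, 1.0, 0.0, 0.0, 1.0, 1.0] 4.0
```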


📊 What Does This Mean?

To interpret this value, compare with the critical value from the Chi-square table for:

  • Degrees of Freedom = (Rows - 1) × (Cols - 1) = (3 − 1) × (2 − 1) = 2

  • For significance level 0.05 → Critical value = 5.99

Since $\chi^2 = 4 < 5.99$
→ ❌ Not significant at the 0.05 level
→ So we do not split based on Age Group
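The decision can be checked in Python without a printed table. This sketch relies on a shortcut that is valid only for 2 degrees of freedom, where the upper-tail probability is exp(-x / 2). For other degrees of freedom you would need a table or a statistics library.

```python
import math

rows, cols = 3, 2
df = (rows - 1) * (cols - 1)   # degrees of freedom = 2

alpha = 0.05       # significance level
chi2_stat = 4.0    # value computed for the Age Group table

# For df = 2 only: P(X > x) = exp(-x / 2),
# so the critical value for a given alpha is -2 * ln(alpha).
critical_value = -2.0 * math.log(alpha)   # about 5.99
p_value = math.exp(-chi2_stat / 2.0)       # about 0.135

split_on_age_group = chi2_stat > critical_value
print(df, round(critical_value, 2), round(p_value, 3), split_on_age_group)
```

Both views agree: the statistic is below the critical value and the p-value is above 0.05, so Age Group is not used for the split.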


✅ Summary of Steps

| Step | What You Do |
|---|---|
| 1 | Create a frequency table (observed values) |
| 2 | Calculate expected values |
| 3 | Use the formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$ |
| 4 | Add up all the values |
| 5 | Compare with critical value to check significance |
