Technology Blog : CHAID (Chi-squared Automatic Interaction Detection) made easy

🔍 CHAID Uses Chi-Square Test — Not Info Gain or Gini

In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy, information gain, or the Gini index, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).

📌 CHAID Splitting Criterion

👉 The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.

A small p-value means a strong relationship between the predictor and the target.

📐 Formula Used in CHAID

The Chi-square formula is:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Where:

$\chi^2$ = Chi-square statistic
$O$ = Observed frequency (what’s in your data)
$E$ = Expected frequency (if there's no relationship)
The sum is over all combinations of feature and class
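To make the formula concrete, here is a minimal Python sketch. The function name `chi_square_from_cells` is my own (it is not from any library), and the counts in the usage line are made-up illustrative numbers, not data from this post.

```python
def chi_square_from_cells(observed, expected):
    """Sum (O - E)^2 / E over paired observed and expected cell counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    # Assumes every expected count is greater than zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative counts only: four cells, each expected to hold 5 cases.
stat = chi_square_from_cells([8, 2, 2, 8], [5, 5, 5, 5])
print(stat)
```

The bigger the gap between observed and expected counts, the bigger the statistic; if every cell matches its expected count, the sum is zero.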

✅ Example: Mini Chi-square Test

Let’s go back to this dataset:

Student	Buys Laptop	Count
Yes	Yes	3
No	No	3

We can fill a 2x2 table:

	Buys = Yes	Buys = No	Total
Student = Yes	3	0	3
Student = No	0	3	3
Total	3	3	6

Step 1: Compute Expected Values

For each cell:

$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$

Example:

Expected (Student = Yes, Buys = Yes) = $\frac{3 \times 3}{6} = 1.5$
Expected (Student = Yes, Buys = No) = $\frac{3 \times 3}{6} = 1.5$
Expected (Student = No, Buys = Yes) = 1.5
Expected (Student = No, Buys = No) = 1.5
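The same expected values can be computed from the row and column totals in a few lines of Python. This is only a sketch; `expected_counts` is a hypothetical helper name, and the table is stored as a plain list of rows.

```python
def expected_counts(observed):
    """Return the table of expected counts if rows and columns were unrelated."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # E = (row total * column total) / grand total, cell by cell.
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Rows: Student = Yes, Student = No. Columns: Buys = Yes, Buys = No.
observed = [[3, 0], [0, 3]]
expected = expected_counts(observed)
print(expected)  # [[1.5, 1.5], [1.5, 1.5]]
```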

Step 2: Use Chi-square Formula

$\chi^2 = \sum \frac{(O - E)^2}{E}$

$= \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5}$

$= \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5} = 4 \times 1.5 = 6.0$

Now compare this Chi-square statistic with the critical value from the Chi-square table. For 1 degree of freedom at the 0.05 significance level, that value is 3.84.
Since 6.0 > 3.84, we say the relationship is significant → this feature is good for splitting.
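If you would rather see a p-value than look up a table, the 1-degree-of-freedom case has a closed form that needs only Python's standard `math` module. This is a sketch under that assumption (it works for df = 1 only); `p_value_df1` is my own helper name, and 3.841 is the standard table value for df = 1 at the 0.05 level.

```python
import math


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    # With df = 1 the statistic is a squared standard normal, so
    # P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


CRITICAL_DF1_ALPHA_05 = 3.841  # table value for df = 1, alpha = 0.05

chi2 = 6.0
significant = chi2 > CRITICAL_DF1_ALPHA_05
p = p_value_df1(chi2)
print(significant)   # True
print(round(p, 3))   # 0.014
```

Both checks agree: the statistic is above the critical value, and the p-value is below 0.05.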

📝 Summary: CHAID vs. ID3/C4.5/CART

Feature	CHAID	ID3 / C4.5 / CART
Uses what to split?	Chi-square test	Entropy, Info Gain, Gini
Data type preferred	Categorical (like Yes/No)	Categorical only (ID3); categorical + numerical (C4.5, CART)
Splits how?	Can have many branches per split	Binary (CART), multiway on categorical features (ID3, C4.5)
Good for	Marketing, social science data	General-purpose decision trees

Next, let’s walk through a step-by-step, pen-and-paper-style Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.

🎯 GOAL:

We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).

🧾 Example Dataset

Person	Age Group	Buys Laptop
1	Young	Yes
2	Young	No
3	Middle	Yes
4	Middle	No
5	Old	No
6	Old	Yes

🪜 Step 1: Build a Frequency Table

Let’s count how many Yes/No for each age group:

Age Group	Buys = Yes	Buys = No	Row Total
Young	1	1	2
Middle	1	1	2
Old	1	1	2
Col Total	3	3	6

This is called a contingency table.
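Counting by hand is fine for six rows, but the same contingency table can be built in code. Here is a small sketch using `collections.Counter` from the standard library; the helper name `contingency_table` and the tuple layout of `records` are my own choices for illustration.

```python
from collections import Counter

# The six rows of the example dataset: (age group, buys laptop).
records = [
    ("Young", "Yes"), ("Young", "No"),
    ("Middle", "Yes"), ("Middle", "No"),
    ("Old", "No"), ("Old", "Yes"),
]


def contingency_table(records, row_levels, col_levels):
    """Count how often each (row level, column level) pair occurs."""
    counts = Counter(records)  # missing pairs count as 0
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]


table = contingency_table(records, ["Young", "Middle", "Old"], ["Yes", "No"])
print(table)  # [[1, 1], [1, 1], [1, 1]]
```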

🧮 Step 2: Calculate Expected Frequencies (E)

Use this formula:

$E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}$

Each cell's expected value:

For Young, Yes: $\frac{2 \times 3}{6} = 1$
For Young, No: $\frac{2 \times 3}{6} = 1$
Same for Middle and Old.

So, the Expected table (E) is:

Age Group	Expected Yes	Expected No
Young	1	1
Middle	1	1
Old	1	1

🧾 Step 3: Apply the Chi-Square Formula

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Let’s compute for each cell:

All Observed = Expected (from the frequency table), so:

$(O - E)^2 = 0$ for every cell
Therefore, $\chi^2 = 0$

✅ Final Result:

$\chi^2 = 0$

That means:
This dataset shows no evidence of a relationship between Age Group and Buys Laptop: the observed values match the expected values exactly.

🎯 Now Let’s Try a More Interesting Case

Let’s change the dataset slightly:

Person	Age Group	Buys Laptop
1	Young	Yes
2	Young	Yes
3	Middle	Yes
4	Middle	No
5	Old	No
6	Old	No

Now build the frequency table:

Age Group	Buys = Yes	Buys = No	Row Total
Young	2	0	2
Middle	1	1	2
Old	0	2	2
Col Total	3	3	6

🔢 Step-by-step Expected Values

For each cell:

Expected (Young, Yes) = $\frac{2 \times 3}{6} = 1$
Expected (Young, No) = 1
Expected (Middle, Yes) = 1
Expected (Middle, No) = 1
Expected (Old, Yes) = 1
Expected (Old, No) = 1

So:

Age Group	O (Yes)	E (Yes)	O (No)	E (No)
Young	2	1	0	1
Middle	1	1	1	1
Old	0	1	2	1

🔍 Step 3: Apply the Chi-Square Formula

Now compute:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

Let’s go cell by cell:

(Young, Yes): $\frac{(2 - 1)^2}{1} = 1$
(Young, No): $\frac{(0 - 1)^2}{1} = 1$
(Middle, Yes): $\frac{(1 - 1)^2}{1} = 0$
(Middle, No): $\frac{(1 - 1)^2}{1} = 0$
(Old, Yes): $\frac{(0 - 1)^2}{1} = 1$
(Old, No): $\frac{(2 - 1)^2}{1} = 1$

✅ Final Chi-square Value:

$\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = \boxed{4}$
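The cell-by-cell arithmetic can be double-checked with a short Python sketch. The dictionary layout is just one convenient way to keep the labels next to the numbers; it is not part of CHAID itself.

```python
# Observed counts per age group: (Buys = Yes, Buys = No).
observed = {"Young": (2, 0), "Middle": (1, 1), "Old": (0, 2)}
expected = 1.0  # every cell: 2 * 3 / 6

# One (O - E)^2 / E term per cell, keyed by (age group, answer).
contributions = {
    (age, label): (o - expected) ** 2 / expected
    for age, cells in observed.items()
    for label, o in zip(("Yes", "No"), cells)
}

chi2 = sum(contributions.values())
print(contributions[("Middle", "Yes")])  # 0.0
print(chi2)                              # 4.0
```

Keeping the individual terms around shows where the signal comes from: Young and Old each contribute 2, while Middle contributes nothing.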

📊 What Does This Mean?

To interpret this value, compare with the critical value from the Chi-square table for:

Degrees of Freedom = (Rows - 1) × (Cols - 1) = (3 − 1) × (2 − 1) = 2
For significance level 0.05 → Critical value = 5.99

Since $\chi^2 = 4 < 5.99$
→ ❌ Not significant at 0.05 level
→ So we do not split based on Age Group
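For 2 degrees of freedom the Chi-square tail also has a simple closed form, so the table lookup above can be reproduced with the standard `math` module. This is a sketch that only holds for df = 2; the helper names are my own.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """(Rows - 1) x (Cols - 1) for a contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def p_value_df2(chi2):
    """Upper-tail probability for 2 degrees of freedom: exp(-chi2 / 2)."""
    return math.exp(-chi2 / 2)


def critical_value_df2(alpha):
    """Value the statistic must exceed to be significant at level alpha (df = 2)."""
    return -2 * math.log(alpha)


df = degrees_of_freedom(3, 2)
critical = critical_value_df2(0.05)
p = p_value_df2(4.0)
print(df)                  # 2
print(round(critical, 2))  # 5.99
print(round(p, 3))         # 0.135
```

A p-value of about 0.135 is above 0.05, which is the same conclusion as comparing 4 with 5.99.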

✅ Summary of Steps

Step	What You Do
1	Create a frequency table (observed values)
2	Calculate expected values
3	Use the formula: $\chi^2 = \sum \frac{(O - E)^2}{E}$
4	Add up all the values
5	Compare with critical value to check significance
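Putting the five steps together, here is a rough end-to-end sketch of how the CHAID splitting rule (pick the smallest p-value) could look in plain Python. Two loud assumptions: the Student table and the second Age Group table from this post are treated as if they were two candidate predictors measured on the same six people, which is for illustration only, and the p-value helper only covers df = 1 and df = 2 using closed forms. A full CHAID implementation also merges categories and adjusts p-values, which this sketch skips.

```python
import math


def chi_square(observed):
    """Return (statistic, degrees of freedom) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count
            stat += (o - e) ** 2 / e               # assumes e > 0
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, df


def p_value(stat, df):
    """Upper-tail p-value; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2))
    if df == 2:
        return math.exp(-stat / 2)
    raise NotImplementedError("this sketch only covers df = 1 and df = 2")


# Hypothetical candidate predictors, using the tables from this post.
candidates = {
    "Student": [[3, 0], [0, 3]],
    "Age Group": [[2, 0], [1, 1], [0, 2]],
}

p_values = {name: p_value(*chi_square(t)) for name, t in candidates.items()}
best = min(p_values, key=p_values.get)
print(best)  # Student
```

Student wins because its p-value (about 0.014) is smaller than Age Group's (about 0.135), and it is also the only one below 0.05.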


Thursday, 4 September 2025
