Thursday, 4 September 2025

Predictive and Advanced Analytics Using R: Notes for Krishna University B.Sc. Data Science Honours

Data Caching (In-Memory Data Management for Efficiency)

Imagine you're a shopkeeper who wants to quickly answer questions like:

  • What items are sold together?

  • What are the best-selling combinations?

  • How to group customers?

To answer these quickly, you don’t want to go back to your storeroom (disk) every time — instead, you want to keep useful data in memory (brain or register).

This is what in-memory data caching helps with in data mining — it avoids slow disk operations.


Concept 1: Data Caching (In-Memory Data Management)

Definition: Storing frequently accessed or preprocessed data in memory (RAM) to avoid reading from disk repeatedly, making the process faster.


Now let’s explain six common in-memory caching techniques using shop examples and simple data-mining analogies:


✅ 1. Data Cube Materialization

What it is: Pre-calculating and storing summary tables (cuboids) for different combinations of dimensions (like "item", "month", "region").

Analogy:
You create ready-made sales summaries for:

  • Item + Region

  • Item + Month

  • Region + Month

So when someone asks:

“How many T-shirts were sold in June in Delhi?”
You don't have to add up everything — it's already precomputed and stored in memory.

Use: Makes OLAP queries super fast.
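
In R, a tiny cuboid can be materialized once with the base function xtabs() and then queried as a simple lookup. This is only a sketch; the sales data frame and its columns (item, month, region, qty) are made up for illustration.

# Illustrative sales data (made-up values)
sales <- data.frame(
  item   = c("T-shirt", "T-shirt", "Jeans", "T-shirt"),
  month  = c("June", "June", "May", "June"),
  region = c("Delhi", "Mumbai", "Delhi", "Delhi"),
  qty    = c(10, 5, 7, 3)
)

# Materialize the Item x Month x Region cuboid once and keep it in memory
cube <- xtabs(qty ~ item + month + region, data = sales)

# "How many T-shirts were sold in June in Delhi?" is now a lookup, not a re-scan
cube["T-shirt", "June", "Delhi"]   # 13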


✅ 2. Vertical Data Format (TID Lists) — used in Eclat Algorithm

What it is: Instead of storing full transactions, you store:

  • Each item → list of transaction IDs (TIDs) where it appears

Analogy:
Instead of:

T1 → Milk, Bread  
T2 → Milk, Eggs  

You store:

Milk → T1, T2  
Bread → T1  
Eggs → T2

Now to find common items:

  • Milk ∩ Bread → T1

  • Milk ∩ Eggs → T2

Challenge: These TID lists can be large → so we split or partition them into smaller sets to keep them in memory.
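
A quick way to see the vertical format in base R, using the two toy transactions from the analogy above:

# Transactions in the usual horizontal format: transaction -> items
transactions <- list(
  T1 = c("Milk", "Bread"),
  T2 = c("Milk", "Eggs")
)

# Flip to vertical format: item -> vector of transaction IDs (TID list)
long <- data.frame(
  tid  = rep(names(transactions), lengths(transactions)),
  item = unname(unlist(transactions))
)
tid_lists <- split(long$tid, long$item)
# $Bread "T1"   $Eggs "T2"   $Milk "T1" "T2"

# Support of an itemset = size of the intersection of its TID lists
length(intersect(tid_lists$Milk, tid_lists$Bread))   # 1 (only T1)
length(intersect(tid_lists$Milk, tid_lists$Eggs))    # 1 (only T2)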


✅ 3. FP-Tree (Frequent Pattern Tree)

What it is: A compressed tree structure that stores frequent itemsets without listing every transaction.

Analogy:
Instead of remembering every customer's shopping list, you draw a tree:

[Milk, Bread]  
[Milk, Eggs]  
[Milk, Bread, Eggs]

→ becomes a tree:

Milk (3)
 ├── Bread (2)
 │    └── Eggs (1)
 └── Eggs (1)

Use: Makes pattern mining faster and saves space.
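
For intuition only, here is a rough base-R sketch that builds this prefix tree as nested lists. It assumes the items in each transaction are already sorted by frequency and it omits the item header/link table that a real FP-growth implementation keeps.

# Insert each (frequency-ordered) transaction into a nested-list trie;
# shared prefixes collapse into one path whose count goes up.
insert_path <- function(node, items) {
  if (length(items) == 0) return(node)
  first <- items[1]
  if (is.null(node$children[[first]])) {
    node$children[[first]] <- list(count = 0, children = list())
  }
  node$children[[first]]$count <- node$children[[first]]$count + 1
  node$children[[first]] <- insert_path(node$children[[first]], items[-1])
  node
}

root <- list(count = 0, children = list())
for (tx in list(c("Milk", "Bread"),
                c("Milk", "Eggs"),
                c("Milk", "Bread", "Eggs"))) {
  root <- insert_path(root, tx)
}
str(root)   # Milk: 3 -> Bread: 2 -> Eggs: 1, plus Milk -> Eggs: 1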


✅ 4. Projected Databases

What it is: Smaller sub-databases focused only on frequent items — built during recursion in algorithms like PrefixSpan.

Analogy:
If you’re analyzing "customers who bought Milk", you ignore others and only use a filtered copy of the database with Milk-based transactions.

Why in memory?
To avoid reading filtered data from disk again and again during recursive calls.
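
A minimal base-R sketch of building the "Milk-projected" database (the three toy transactions are invented for illustration):

transactions <- list(
  T1 = c("Milk", "Bread"),
  T2 = c("Milk", "Eggs"),
  T3 = c("Bread", "Butter")
)

# Keep only transactions that contain Milk, and drop Milk itself,
# because the recursion only cares about what co-occurs with it
milk_projected <- lapply(
  Filter(function(items) "Milk" %in% items, transactions),
  function(items) setdiff(items, "Milk")
)
milk_projected
# $T1 "Bread"    $T2 "Eggs"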


✅ 5. Partitioned Ensembles

What it is: Split the full transaction data into small pieces that fit into RAM, and process them one by one.

Analogy:
If your notebook is too big to read at once, you tear out 10 pages at a time, work on them, and stitch the results.

Use: Especially helpful when memory is limited.
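
A sketch of the same idea in base R: split the row indices into memory-sized chunks, process each chunk, then combine the partial results. The chunk size and the "mining" step here are placeholders.

n_rows     <- 1000      # pretend this table is too big to process at once
chunk_size <- 250

chunk_id <- ceiling(seq_len(n_rows) / chunk_size)
partial  <- lapply(split(seq_len(n_rows), chunk_id), function(rows) {
  length(rows)          # placeholder for "mine this partition"
})
Reduce(`+`, partial)    # stitch the partial results back together: 1000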


✅ 6. AVC-Sets (Attribute-Value-Class Sets)

What it is: For each attribute (feature), store a summary of how many times each value appears with each class (label). Used in decision tree building.

Analogy:
You’re building a tree to predict "Will customer return?"
You keep in memory:

Age | Returns? | Count
20s | Yes      | 4
30s | No       | 6
20s | No       | 1

So you don’t have to scan full data again to calculate "best split".
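
In R, the AVC-set for one attribute is just a contingency table; base table() builds it in one pass. The data frame below is made up so that it reproduces the counts above.

customers <- data.frame(
  age     = c(rep("20s", 5), rep("30s", 6)),
  returns = c(rep("Yes", 4), "No", rep("No", 6))
)

# AVC-set for "age": how often each age value appears with each class.
# A split criterion (Gini, entropy, chi-square) can be computed from this
# small summary without re-scanning the full data set.
table(customers$age, customers$returns)
#        No Yes
#   20s   1   4
#   30s   6   0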


Summary Table

Technique             | What It Does                        | Simple Analogy
Data Cube             | Precompute summaries                | Ready-made total sales by category
TID Lists (Eclat)     | Store item → transaction IDs        | Item lookup book
FP-Tree               | Compress frequent items into a tree | Combine repeated paths like Milk → Bread
Projected DB          | Use filtered, smaller datasets      | Only analyze “Milk buyers” group
Partitioned Ensembles | Split DB into memory-sized chunks   | Tear out a few pages at a time
AVC-Sets              | Store class counts per feature-value| Summary table for decision trees

Final Thoughts

All of these techniques are about managing large data in small memory by:

  • Preprocessing

  • Compressing

  • Partitioning

  • Avoiding disk reads

This is very important in data mining where speed matters and datasets can be huge.



CHAID (Chi-squared Automatic Interaction Detection) made easy

CHAID Uses the Chi-Square Test, Not Info Gain or Gini

In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy or gain, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).


CHAID Splitting Criterion

The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.

A small p-value means a strong relationship between the predictor and the target.


Formula Used in CHAID

The Chi-square formula is:

\chi^2 = \sum \frac{(O - E)^2}{E}

Where:

  • \chi^2 = the Chi-square statistic

  • O = observed frequency (what’s in your data)

  • E = expected frequency (if there were no relationship)

  • The sum runs over all combinations of feature value and class


✅ Example: Mini Chi-square Test

Let’s go back to this dataset:

Student | Buys Laptop | Count
Yes     | Yes         | 3
No      | No          | 3

We can fill a 2x2 table:

              | Buys = Yes | Buys = No | Total
Student = Yes | 3          | 0         | 3
Student = No  | 0          | 3         | 3
Total         | 3          | 3         | 6

Step 1: Compute Expected Values

For each cell:

E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}

Example:

  • Expected (Student = Yes, Buys = Yes) = \frac{3 \times 3}{6} = 1.5

  • Expected (Student = Yes, Buys = No) = \frac{3 \times 3}{6} = 1.5

  • Expected (Student = No, Buys = Yes) = 1.5

  • Expected (Student = No, Buys = No) = 1.5

Step 2: Use Chi-square Formula

\chi^2 = \sum \frac{(O - E)^2}{E}
       = \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5}
       = \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5} + \frac{2.25}{1.5}
       = 4 \times 1.5 = 6.0

Now compare this Chi-square statistic with the critical value from the Chi-square table for degrees of freedom = 1, which is 3.84 at the 0.05 significance level.
Since 6.0 > 3.84, the relationship is significant → this feature is a good candidate for splitting.
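
You can verify the hand calculation with R’s built-in chisq.test(); correct = FALSE switches off Yates’ continuity correction so the statistic matches the manual 6.0. (R will warn about small expected counts, which is expected for a toy table.)

# Student (Yes/No) vs Buys Laptop (Yes/No) from the example above
obs <- matrix(c(3, 0,
                0, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(Student = c("Yes", "No"),
                              Buys    = c("Yes", "No")))

chisq.test(obs, correct = FALSE)   # X-squared = 6, df = 1, p-value ~ 0.014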


๐Ÿ“ Summary: CHAID vs. ID3/C4.5/CART

Feature             | CHAID                            | ID3 / C4.5 / CART
Uses what to split? | Chi-square test                  | Entropy, Info Gain, Gini
Data type preferred | Categorical (like Yes/No)        | Categorical + Numerical
Splits how?         | Can have many branches per split | Binary (CART), multi-way (ID3)
Good for            | Marketing, social science data   | General-purpose decision trees

Next, let’s walk through a step-by-step, pen-and-paper Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.


GOAL:

We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).


Example Dataset

Person | Age Group | Buys Laptop
1      | Young     | Yes
2      | Young     | No
3      | Middle    | Yes
4      | Middle    | No
5      | Old       | No
6      | Old       | Yes

Step 1: Build a Frequency Table

Let’s count how many Yes/No for each age group:

Age Group | Buys = Yes | Buys = No | Row Total
Young     | 1          | 1         | 2
Middle    | 1          | 1         | 2
Old       | 1          | 1         | 2
Col Total | 3          | 3         | 6

This is called a contingency table.


Step 2: Calculate Expected Frequencies (E)

Use this formula:

E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}

Each cell's expected value:

  • For Young, Yes: \frac{2 \times 3}{6} = 1

  • For Young, No: \frac{2 \times 3}{6} = 1

  • Same for Middle and Old.

So, the Expected table (E) is:

Age Group | Expected Yes | Expected No
Young     | 1            | 1
Middle    | 1            | 1
Old       | 1            | 1

Step 3: Apply the Chi-Square Formula

\chi^2 = \sum \frac{(O - E)^2}{E}

Let’s compute for each cell:

All Observed = Expected (from the frequency table), so:

  • (O - E)^2 = 0 for every cell

  • Therefore, \chi^2 = 0

✅ Final Result:

\chi^2 = 0

That means:
There is no relationship between Age Group and Buys Laptop in this dataset — the observed values match the expected.


Now Let’s Try a More Interesting Case

Let’s change the dataset slightly:

Person | Age Group | Buys Laptop
1      | Young     | Yes
2      | Young     | Yes
3      | Middle    | Yes
4      | Middle    | No
5      | Old       | No
6      | Old       | No

Now build the frequency table:

Age Group | Buys = Yes | Buys = No | Row Total
Young     | 2          | 0         | 2
Middle    | 1          | 1         | 2
Old       | 0          | 2         | 2
Col Total | 3          | 3         | 6

Step-by-step Expected Values

For each cell:

  • Expected (Young, Yes) = \frac{2 \times 3}{6} = 1

  • Expected (Young, No) = 1

  • Expected (Middle, Yes) = 1

  • Expected (Middle, No) = 1

  • Expected (Old, Yes) = 1

  • Expected (Old, No) = 1

So:

Age Group | O (Yes) | E (Yes) | O (No) | E (No)
Young     | 2       | 1       | 0      | 1
Middle    | 1       | 1       | 1      | 1
Old       | 0       | 1       | 2      | 1

๐Ÿ” Step 4: Apply the Chi-Square Formula

Now compute:

\chi^2 = \sum \frac{(O - E)^2}{E}

Let’s go cell by cell:

  • (Young, Yes): \frac{(2 - 1)^2}{1} = 1

  • (Young, No): \frac{(0 - 1)^2}{1} = 1

  • (Middle, Yes): \frac{(1 - 1)^2}{1} = 0

  • (Middle, No): \frac{(1 - 1)^2}{1} = 0

  • (Old, Yes): \frac{(0 - 1)^2}{1} = 1

  • (Old, No): \frac{(2 - 1)^2}{1} = 1

✅ Final Chi-square Value:

\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = 4


What Does This Mean?

To interpret this value, compare with the critical value from the Chi-square table for:

  • Degrees of Freedom = (Rows - 1) × (Cols - 1) = (3 − 1) × (2 − 1) = 2

  • For significance level 0.05 → Critical value = 5.99

Since \chi^2 = 4 < 5.99
→ ❌ Not significant at 0.05 level
→ So we do not split based on Age Group
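
The same check in R for the 3x2 table. chisq.test() applies no continuity correction for tables larger than 2x2, and will again warn about small expected counts.

obs <- matrix(c(2, 0,
                1, 1,
                0, 2),
              nrow = 3, byrow = TRUE,
              dimnames = list(AgeGroup = c("Young", "Middle", "Old"),
                              Buys     = c("Yes", "No")))

test <- chisq.test(obs)
test$statistic            # X-squared = 4
test$parameter            # df = 2
test$p.value              # ~0.135, greater than 0.05 -> not significant
qchisq(0.95, df = 2)      # critical value ~ 5.99, matching the table lookup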


✅ Summary of Steps

Step | What You Do
1    | Create a frequency table (observed values)
2    | Calculate expected values
3    | Use the formula \chi^2 = \sum \frac{(O - E)^2}{E}
4    | Add up all the values
5    | Compare with the critical value to check significance

Support Vector Machine Simplified

We’ll use a tiny dataset with just 4 points. The goal is to separate Class A (+1) and Class B (−1) using a straight line (in 2D).


Dataset

Point | x₁ | x₂ | Class
A     | 1  | 2  | +1
B     | 2  | 3  | +1
C     | 3  | 3  | −1
D     | 2  | 1  | −1

Plot these on graph paper:

  • A (1,2) → Class +1

  • B (2,3) → Class +1

  • C (3,3) → Class −1

  • D (2,1) → Class −1


✏️ Step 1: Try a line x₂ = x₁ → i.e., line through origin at 45°

The equation of the line is:

f(x) = x₂ - x₁ = 0

Let’s test each point:

Point | x₁ | x₂ | f(x) = x₂ - x₁ | Result | Prediction
A     | 1  | 2  | 2 - 1 = +1     | ≥ 0    | +1 ✅
B     | 2  | 3  | 3 - 2 = +1     | ≥ 0    | +1 ✅
C     | 3  | 3  | 3 - 3 = 0      | ≥ 0    | +1 ❌
D     | 2  | 1  | 1 - 2 = -1     | < 0    | -1 ✅

❌ C is wrongly classified, so this line does not separate the two classes.


✏️ Step 2: Try a better line: x₂ = x₁ + 0.5

This shifts the line upward a bit.
The equation becomes:

f(x) = x₂ - x₁ - 0.5 = 0

Let’s test:

Point | x₁ | x₂ | f(x) = x₂ - x₁ - 0.5 | Result | Prediction
A     | 1  | 2  | 2 - 1 - 0.5 = +0.5   | ≥ 0    | +1 ✅
B     | 2  | 3  | 3 - 2 - 0.5 = +0.5   | ≥ 0    | +1 ✅
C     | 3  | 3  | 3 - 3 - 0.5 = -0.5   | < 0    | -1 ✅
D     | 2  | 1  | 1 - 2 - 0.5 = -1.5   | < 0    | -1 ✅

✅ All 4 points are correctly classified!
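
Before putting it in SVM form, here is a quick base-R check of this hand-picked line on the four points:

# The four training points and their labels
X <- matrix(c(1, 2,
              2, 3,
              3, 3,
              2, 1),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B", "C", "D"), c("x1", "x2")))
y <- c(1, 1, -1, -1)

# Hand-picked line x2 - x1 - 0.5 = 0, i.e. w = (-1, 1), b = -0.5
w <- c(-1, 1)
b <- -0.5

f    <- as.vector(X %*% w + b)    #  0.5  0.5 -0.5 -1.5
pred <- ifelse(f >= 0, 1, -1)
all(pred == y)                    # TRUE: every point lands on the correct side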


Step 3: Express the Equation in SVM Form

SVM wants the line in this form:

w₁·x₁ + w₂·x₂ + b = 0

Our equation:

x₂ - x₁ - 0.5 = 0

Can be rewritten as:

-x₁ + x₂ - 0.5 = 0

So,

  • w₁ = -1

  • w₂ = +1

  • b = -0.5

This is our final separating hyperplane.


Step 4: Margin and Support Vectors

The support vectors are the closest points to the decision boundary.

Check distances from the line:

For point A (1,2): f(x) = 2 - 1 - 0.5 = +0.5

For point B (2,3): f(x) = 3 - 2 - 0.5 = +0.5

For point C (3,3): f(x) = 3 - 3 - 0.5 = -0.5

For point D (2,1): f(x) = 1 - 2 - 0.5 = -1.5

Points A, B, and C all sit at the same smallest distance (|f(x)| = 0.5) from this line, so they are the points that act as support vectors for it; D is farther away and does not constrain the boundary. (An SVM solver would go one step further and tilt or shift the line so that this smallest distance becomes as large as possible.)


✅ Final Summary (like a notebook page)

Final Equation:

x₂ - x₁ - 0.5 = 0

or in SVM form:

w = [-1, 1], \quad b = -0.5

๐Ÿ“ Support Vectors:

  • A(1,2)

  • C(3,3)

✅ Classification Rule:

  • If f(x) ≥ 0 → Class +1

  • If f(x) < 0 → Class -1


SVM in 1 Sentence:

SVM finds the best line (or curve) that maximizes the gap between two classes, using only the closest points (support vectors) to make the decision.
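
If you want R to find the separating line for you, the svm() function in the e1071 package (not part of base R, so assumed to be installed) fits a linear SVM on these four points. The recovered w and b may differ in sign or scale from the hand-picked line above, because the solver searches for the maximum-margin hyperplane rather than any line that merely separates the points.

library(e1071)   # assumes install.packages("e1071") has been run

X <- matrix(c(1, 2,
              2, 3,
              3, 3,
              2, 1), ncol = 2, byrow = TRUE)
y <- factor(c(1, 1, -1, -1))

# Linear kernel, no rescaling; a large cost approximates a hard margin
model <- svm(X, y, kernel = "linear", scale = FALSE, cost = 1000)

# Recover the fitted hyperplane w.x + b = 0
w <- t(model$coefs) %*% model$SV
b <- -model$rho
w; b
model$index      # which training rows ended up as support vectors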




GOAL of SVM (in Math Terms)

Given labeled data, find the hyperplane (line) that:

  1. Separates the two classes correctly

  2. Maximizes the margin (distance from the line to the closest points)


✍️ 1. The Equation of a Hyperplane

In 2D, a line is:

w_1 x_1 + w_2 x_2 + b = 0

Or, in vector form:

\mathbf{w}^\top \mathbf{x} + b = 0

  • \mathbf{w} = [w_1, w_2] → weight vector (controls the direction of the line)

  • b → bias (controls the shift up/down of the line)

  • \mathbf{x} = [x_1, x_2] → input point


2. Classification Rule

For any point x\mathbf{x}:

\text{Class} = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}

๐Ÿ“ 3. What is Margin?

Let’s say you have a line that separates the data. The margin is the distance between the line and the closest data points (called support vectors).

We want this margin to be as wide as possible.

Let’s define:

  • The distance from a point \mathbf{x} to the line \mathbf{w}^\top \mathbf{x} + b = 0 is:

\text{Distance} = \frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|}

Where \|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2}
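
Plugging the numbers from the worked example above into this formula in R (w, b and the point are the ones used earlier):

w <- c(-1, 1)
b <- -0.5
x <- c(1, 2)    # point A

# |w.x + b| / ||w||
abs(sum(w * x) + b) / sqrt(sum(w^2))   # 0.5 / sqrt(2) ~ 0.354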


๐Ÿ 4. Optimization Objective

We want:

  • All data points classified correctly:

    y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all } i

    This ensures the points are on the correct side of the margin.

  • Maximize the margin = minimize \|\mathbf{w}\|

So the optimization problem becomes:

Minimize:

\frac{1}{2} \|\mathbf{w}\|^2

Subject to:

y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all } i

This is called a convex optimization problem — it has one global minimum, which we can find using Lagrange Multipliers.


5. Solving Using the Lagrangian (Intuitive Explanation)

We use the method of Lagrange Multipliers to solve this constrained optimization.

We build the Lagrangian:

L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i \left[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \right]

Where:

  • \alpha_i \geq 0 are the Lagrange multipliers

Then we find the saddle point (minimize L with respect to \mathbf{w}, b and maximize with respect to \boldsymbol{\alpha}).

This leads to a dual problem, which is easier to solve using tools like quadratic programming.


✳️ 6. Final Classifier

Once solved, we get:

\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i

This means the support vectors (where \alpha_i > 0) are the only points used to define \mathbf{w}. All other data points don’t affect the boundary!

Then you get the decision function:

f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

Predict class:

  • If f(\mathbf{x}) \geq 0 → +1

  • If f(\mathbf{x}) < 0 → −1


Intuition Summary

Concept           | In Simple Words
Hyperplane        | The best line that separates classes
Margin            | Gap between the line and the nearest points
Support Vectors   | Points lying closest to the line
Optimization Goal | Maximize margin (i.e., minimize \|\mathbf{w}\|)
Constraint        | Keep all points on the correct side
Lagrange Method   | A tool to solve optimization with constraints