Wednesday, 24 September 2025

Predictive and Advanced Analytics using R lab

Experiment 1

Aim: To perform simple and multiple linear regression using R. This means we want to create a mathematical model that predicts one value (like how far a car can drive on a gallon of gas) based on one or more other values (like the car's weight or power), using a straight line that best fits the data points.

Explanation of the Dataset (First Time Used in Experiments 1-9): The dataset we are using here is called "mtcars." Imagine you have a list of 32 different cars from the 1970s, like a Ford or a Toyota. For each car, we have information about things like how many miles it can drive per gallon of gas (that's "mpg," our main thing to predict), how heavy it is ("wt" for weight in thousands of pounds), how powerful the engine is ("hp" for horsepower), and other details like number of cylinders or gears. This data comes built into R, a programming tool for statistics, and it's like a simple table where each row is a car and each column is a feature. It's useful for learning because it's small and real-world, showing how car design affects fuel efficiency. No need to download it; it's already there in R. We don't know who collected it exactly, but it's from a magazine called Motor Trend. Think of it as a spreadsheet: for example, a Mazda RX4 has 21 mpg, weighs 2.62 thousand pounds, and has 110 hp. This dataset is used in Experiments 1, 2, 4, 5, and 6.

Explanation of the Algorithm in Simple Steps: Linear regression is a way to find patterns in numbers, like predicting how much gas a car uses based on its weight. It's like drawing the straightest possible line through a bunch of dots on graph paper so that the line comes as close as possible to all the dots. This line helps you guess new values.

Simple linear regression (using just one input):

  1. Gather your data: You have pairs of numbers, like weight (input) and miles per gallon (output). Plot them as dots on a graph.
  2. Find the best line: Calculate a starting point (intercept, where the line crosses the y-axis) and a slope (how steep the line is, showing how much the output changes when the input changes by 1). The "best" line is the one where the total distance from the line to all dots is as small as possible (we measure this with something called "least squares").
  3. Check how good it is: See if the line explains most of the ups and downs in the data (using something like R-squared, which is like a percentage of how well it fits).
  4. Use it: For a new weight, plug it into the line equation to predict mpg.

Multiple linear regression (using more than one input, like weight and horsepower): It's the same idea, but now the "line" is in higher dimensions (like a flat surface instead of a line).

  1. Gather data with multiple inputs.
  2. Find the best plane or surface that fits the dots. Now you have one intercept and a slope for each input.
  3. Check the fit, and see which inputs matter most (using p-values, like how likely the slope isn't just random).
  4. Predict using all inputs in the equation.

This assumes the relationship is straight (linear), data is clean, and no weird outliers mess it up.

Program: To make this visual, I've added simple plotting code to show the data points and the regression line. In R, "plot" draws the graph, and "abline" adds the line. You can run this in R to see the picture. For multiple regression, a 3D plot is complex, so we plot actual vs predicted values instead.

text
data(mtcars)  # Load the built-in car data
a <- lm(mpg ~ wt, data=mtcars)  # Simple: Predict mpg from weight
summary(a)  # Show details
plot(mpg ~ wt, data=mtcars, main="Simple Regression: MPG vs Weight", xlab="Weight (1000 lbs)", ylab="Miles Per Gallon")  # Plot points
abline(a, col="red")  # Add red regression line

b <- lm(mpg ~ wt + hp, data=mtcars)  # Multiple: Add horsepower
summary(b)  # Show details
# For multiple, visualization is harder (3D), but we can plot actual vs predicted
predicted <- predict(b)  # Get predictions
plot(mtcars$mpg, predicted, main="Multiple Regression: Actual vs Predicted MPG", xlab="Actual MPG", ylab="Predicted MPG")
abline(0,1,col="blue")  # Blue line shows perfect fit

Output:

text
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.941  -1.600  -0.182   1.050   5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

(When you run the plotting code in R, you'll see two graphs: One is a scatter plot with dots for each car's weight and mpg, and a red line showing the simple regression fit. The line slopes down, meaning heavier cars get fewer miles per gallon. The second graph shows dots for actual mpg vs what the multiple model predicts, with a blue line for perfect predictions—most dots are close to it.)

Explanation of the Output in Simple English: First, for simple regression (mpg based on weight): The "Call" just repeats what we did. "Residuals" shows the differences between what the model predicts and the real data—like errors or mistakes in prediction. They range from -4.5 (the model guessed too high) to +6.9 (guessed too low). The middle ones (1Q, Median, 3Q) are small, meaning most predictions are close.

The "Coefficients" table is the heart: Each row is part of the line equation. "(Intercept)" is 37.3, the starting point—if weight was 0 (not real), mpg would be 37.3. "wt" is -5.3, the slope: for every extra 1000 pounds, mpg drops by 5.3. "Std. Error" is like uncertainty in that number (small is good). "t value" tests if the slope is real (high means yes). "Pr(>|t|)" is p-value: tiny like 1.29e-10 means it's not random chance, very reliable. *** stars mean extremely significant.

"Residual standard error" is 3.0, average mistake size. "Multiple R-squared" 0.75 means weight explains 75% of mpg differences (good, but room for more). "Adjusted" is similar but penalizes extras. "F-statistic" 91 with tiny p-value means the whole model works well. Degrees of freedom (DF) are like sample size minus parameters.

For multiple regression: Similar structure. Residuals smaller overall (-3.9 to 5.9). Coefficients: Intercept 37.2, wt -3.9 (less impact now because hp helps explain), hp -0.03 (small drop per hp). P-values show wt highly significant (***) and hp significant (**). R-squared 0.83—better fit (83% explained). Error down to 2.6. F-statistic strong. The visualizations confirm it: the red line fits the dots well, and the predicted vs actual dots hug the blue line.
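
Optional follow-up (a small sketch using the fitted objects a and b from the program above; the wt and hp values below are made up): step 4 of the algorithm, predicting mpg for a new car with predict().

text
predict(a, newdata=data.frame(wt=3))            # Simple model: about 37.29 - 5.34*3, around 21.3 mpg
predict(b, newdata=data.frame(wt=3, hp=150))    # Multiple model uses both weight and horsepower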


Experiment 2

Aim: To perform logistic regression using R. This is for predicting yes-or-no outcomes, like if a car has an automatic (0) or manual (1) transmission based on its mpg and hp.

Explanation of the Dataset: We're using "mtcars" again (explained in Experiment 1). Here, we focus on "am" (transmission: 0 for automatic, 1 for manual), along with mpg and hp.

Explanation of the Algorithm in Simple Steps: Logistic regression is like linear regression but for categories, not numbers. Instead of a straight line, it makes an S-shaped curve to predict probabilities between 0 and 1 (like chance of manual transmission).

  1. Gather data: Inputs like mpg, hp, and a yes/no output (am: 0 auto, 1 manual).
  2. Transform to odds: Use linear equation, but squash it with a "logistic" function to get probabilities (e.g., over 0.5 means yes).
  3. Fit the curve: Adjust slopes to minimize wrong predictions (using maximum likelihood, like finding the most likely fit).
  4. Check: See which inputs matter, and how well it separates yes/no.
  5. Predict: For new data, get probability and pick the side.

Program: No easy single visualization (it's probabilities), but we can plot predicted probabilities vs mpg. Run in R to see.

text
data(mtcars)  # Use same car data
a <- glm(am ~ mpg + hp, data=mtcars, family=binomial)  # Logistic: Predict transmission type
summary(a)  # Show details
# Optional plot: Predicted probabilities vs mpg
predicted_prob <- predict(a, type="response")
plot(mtcars$mpg, predicted_prob, main="Logistic: Probability of Manual vs MPG", xlab="MPG", ylab="Prob Manual", col=mtcars$am + 1)  # col = am + 1: black = automatic (0), red = manual (1)

Output:

text
Call:
glm(formula = am ~ mpg + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -6.8773     3.4175  -2.013   0.0442 *
mpg           0.4039     0.2003   2.017   0.0437 *
hp           -0.0348     0.0234  -1.488   0.1367  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 24.049  on 29  degrees of freedom
AIC: 30.049

Number of Fisher Scoring iterations: 6

(If plotted, the graph shows dots for each car: higher mpg means higher chance of manual, forming a curve-like pattern. Colors distinguish actual transmission types.)

Explanation of the Output in Simple English: "Call" repeats the command. "Coefficients" table: Like linear, but in log-odds. "(Intercept)" -6.9: Base log-odds negative, meaning default leans to auto (0). "mpg" 0.4: For each extra mpg, log-odds of manual increase by 0.4 (odds multiply by e^0.4 ≈ 1.5, so higher mpg favors manual). "hp" -0.03: Higher hp decreases log-odds slightly (odds multiply by e^-0.03 ≈ 0.97, favors auto). "Std. Error" uncertainty (small good). "z value" like t, tests significance. "Pr(>|z|)" p-value: mpg 0.04 (* star, significant), hp 0.14 (not significant, no star).

"Dispersion" is 1 for binomial (yes/no) data—standard. "Null deviance" 43.2: How "messy" data is without model (higher worse, like total error guessing average). "Residual deviance" 24.0: Mess after model—lower means better fit. Degrees of freedom: 31 total samples minus parameters. "AIC" 30: Model quality score (lower better for comparison; penalizes complexity). "Iterations" 6: How many math steps to converge on best fit. Visualization: Dots rise with mpg; high-prob dots are manuals (black), low are autos (red).


Experiment 3

Aim: To perform linear discriminant analysis using R. This sorts items into groups, like classifying flowers into types based on measurements.

Explanation of the Dataset (First Time Used in Experiments 1-9): The dataset is "iris." Picture 150 flowers from three types: setosa, versicolor, virginica. Each flower has four measurements: sepal length/width (outer parts), petal length/width (inner colorful parts), all in centimeters. It's like a garden catalog table—50 flowers per type. Collected by a botanist named Edgar Anderson in the 1930s, it is a famous dataset for learning classification because the groups are somewhat separate but overlap a bit. Built into R, no download needed. For example, setosa has small petals, virginica big ones. Helps see how features distinguish species. This dataset is used in Experiments 3, 6, 7, 8, and 9.

Explanation of the Algorithm in Simple Steps: Linear discriminant analysis (LDA) is like drawing lines on a map to divide areas for different groups. It maximizes separation.

  1. Gather data with groups and features. Calculate averages for each group.
  2. Find new directions (discriminants): Combine features into 1-2 new ones that spread groups apart while keeping same-group close.
  3. Project data onto these: Like rotating the map for best view.
  4. Classify new items: See which group area it falls into. Assumes normal data distribution and equal spreads.

Program: Visualization: Plot the LD scores to see group separation. Run in R.

text
library(MASS)  # For LDA function
data(iris)  # Load flower data
a <- lda(Species ~ ., data=iris)  # LDA on all features
a  # Show details
# Plot: LD1 vs LD2
plot(a, main="LDA: Flower Groups Separation")  # Dots colored by species

Output:

text
Call:
lda(Species ~ ., data = iris)

Prior probabilities of groups:
    setosa versicolor  virginica 
  0.3333333  0.3333333  0.3333333 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2 
0.9912 0.0088

(Plot shows three clusters: setosa clearly separated at one end of LD1, with versicolor and virginica closer together and overlapping slightly.)

Explanation of the Output in Simple English: "Call" repeats command. "Prior probabilities": Assumes each species equal chance (0.33 or 33%), unless data says otherwise. "Group means" table: Averages per feature per group. Setosa: small sepals (5.0 length, 3.4 width), tiny petals (1.5 length, 0.2 width). Versicolor: medium (5.9, 2.8, 4.3, 1.3). Virginica: largest (6.6, 3.0, 5.6, 2.0). Shows natural differences—petals best separator.

"Coefficients of linear discriminants": How to make new features LD1 and LD2 from originals. LD1 (main separator): Positive for sepals (bigger sepals pull positive), negative for petals (bigger petals pull negative). E.g., setosa small petals = less negative pull, higher LD1. LD2 similar but different weights.

"Proportion of trace": LD1 captures 99.1% of group separation (almost all), LD2 0.9% (tiny). For 3 groups, max 2 LDs. Visualization: Graph plots flowers on LD1 (x) vs LD2 (y); setosa cluster separate, others close but distinct—shows LDA works well.


Experiment 4

Aim: To perform ridge regression using R. This is like linear regression but with a twist to handle tricky data by shrinking effects to avoid overreacting to noise.

Explanation of the Dataset: Using "mtcars" again (explained in Experiment 1). Here, inputs are wt and hp, output mpg.

Explanation of the Algorithm in Simple Steps: Ridge is for when normal regression overfits (too wiggly from noise).

  1. Start like linear: Find slopes.
  2. Add penalty: Shrink slopes toward zero based on a lambda (s=0.1 here, small penalty).
  3. Balance: Penalty prevents big slopes.
  4. Choose lambda: Often by testing, here fixed. Good for correlated inputs or small data.

Program: Visualization: Plot how coefficients change with penalty. Run in R.

text
library(glmnet)  # For ridge
data(mtcars)  # Car data
x <- as.matrix(mtcars[, c("wt", "hp")])  # Inputs
y <- mtcars$mpg  # Output
a <- glmnet(x, y, alpha=0)  # Ridge (alpha=0 for ridge, not lasso)
coef(a, s=0.1)  # Coefs at lambda=0.1
plot(a, xvar="lambda", main="Ridge: Coefficients Shrink with Penalty")  # Coefficient paths vs log(lambda)

Output:

text
3 x 1 sparse Matrix of class "dgCMatrix"
                     1
(Intercept) 33.7335205
wt          -3.5148864
hp          -0.0255073

(Plot: Lines for wt and hp coefficients starting at normal values, shrinking toward zero as lambda increases.)

Explanation of the Output in Simple English: This is the matrix of coefficients at penalty s=0.1. "(Intercept)" 33.7: Starting point. "wt" -3.5: Shrunk slightly from the unpenalized multiple-regression value of about -3.9 (Experiment 1's wt + hp model). "hp" -0.03: Also shrunk slightly. The penalty makes the model more stable and less sensitive to noise. "Sparse Matrix" just refers to an efficient storage format (useful when many coefficients are zero; here all are filled). Visualization: The graph has log(lambda) on the x-axis and coefficients on the y-axis; the wt line (the larger coefficient) visibly shrinks toward zero as the penalty grows, while the tiny hp coefficient stays close to zero, showing the trade-off.
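
Optional follow-up (a hedged sketch reusing x and y from the program above): step 4 says lambda is often chosen by testing; cv.glmnet from the same glmnet package does this by cross-validation.

text
set.seed(1)                     # CV folds are random, so fix the seed for repeatable results
cv <- cv.glmnet(x, y, alpha=0)  # Cross-validation over a grid of lambda values (ridge)
cv$lambda.min                   # Lambda with the smallest CV error
coef(cv, s="lambda.min")        # Ridge coefficients at that lambda
plot(cv)                        # CV error curve against log(lambda)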


Experiment 5

Aim: To perform cross-validation and bootstrap using R. These test model reliability by resampling data to estimate how well it works on new data.

Explanation of the Dataset: Using "mtcars" again (explained in Experiment 1). Model predicts mpg from wt and hp.

Explanation of the Algorithm in Simple Steps: Cross-validation (CV): To check if model generalizes.

  1. Split data into K parts (here 5).
  2. Train on K-1, test on 1, repeat K times.
  3. Average errors for true performance estimate.

Bootstrap: For uncertainty in stats.

  1. Randomly pick samples with replacement (same size as original, some repeated, some missed).
  2. Calculate stat (like mean) on each bootstrap sample.
  3. Repeat many times (here 100); look at the average, the bias (difference from the original statistic), and the spread (std. error).

Both avoid overfitting by simulating new data.

Program: No direct viz for CV, but histogram for bootstrap distribution. Run in R.

text
library(boot)  # For CV and boot
data(mtcars)
a <- glm(mpg ~ wt + hp, data=mtcars)  # Model
b <- cv.glm(mtcars, a, K=5)  # 5-fold CV
b$delta  # Errors
c <- function(d, i) { mean(d$mpg[i]) }  # Mean mpg function
d <- boot(mtcars, c, R=100)  # 100 boots
d  # Results
hist(d$t, main="Bootstrap: Distribution of Mean MPG")  # Viz variation

Output:

text
[1] 7.396 7.396

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = mtcars, statistic = c, R = 100)


Bootstrap Statistics :
    original      bias    std. error
t1*  20.09062 -0.0584375     1.065667

(Histogram: Bell-shaped curve around 20 mpg, showing possible means from resamples.)

Explanation of the Output in Simple English: First line: the CV delta values [7.396, 7.396]—the raw and adjusted cross-validation estimates, essentially identical here. This is the average squared prediction error across the folds; sqrt(7.4) ≈ 2.7 mpg average mistake on unseen data.

Then bootstrap: "ORDINARY NONPARAMETRIC" means standard resampling without assumptions. "Call" repeats. "Bootstrap Statistics": "original" 20.09 (real mean mpg). "bias" -0.06: Average of the bootstrap means is slightly lower (tiny, good). "std. error" 1.07: How much the mean varies across boots—like uncertainty, so the true mean is likely within about 20.09 ± 2 × 1.07 (roughly 18 to 22, a 95% range). t1 is the statistic (only one here). Visualization: Histogram peaks at ~20, its spread shows sampling variation; a roughly normal shape means the estimate is reliable.
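
Optional follow-up (a small sketch using the boot object d from above): turn the bootstrap spread into a rough 95% confidence interval.

text
boot.ci(d, type="perc")   # Percentile interval for mean mpg (close to 20.09 plus or minus 2 x 1.07)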


Experiment 6

Aim: To fit classification and regression trees using R. Trees are decision flowcharts for predictions—regression for numbers, classification for categories.

Explanation of the Dataset: Using "mtcars" (Experiment 1) for regression tree (predict mpg). "iris" (Experiment 3) for classification tree (predict species).

Explanation of the Algorithm in Simple Steps: Trees split data like questions in a game.

  1. Start with all data at root.
  2. Find best split (feature/value) that reduces mess—variance (spread) for numbers, impurity (mix) for categories.
  3. Split into branches, repeat recursively.
  4. Stop when groups pure/small. Leaves give average (regression) or majority (classification). Prune if too bushy.

Program: Viz: Plot the tree structures. Run in R.

text
library(rpart)  # For trees
data(mtcars)
a <- rpart(mpg ~ wt + hp, data=mtcars)  # Regression tree
print(a)
plot(a, main="Regression Tree for MPG"); text(a)  # Draw tree, then label splits and leaves

data(iris)
b <- rpart(Species ~ ., data=iris)  # Classification tree
print(b)
plot(b, main="Classification Tree for Iris"); text(b)  # Draw tree, then label splits and leaves

Output:

text
n= 32 

node), split, n, deviance, yval
      * denotes terminal node

1) root 32 1126.04700 20.09062  
  2) hp>=140 11  134.31270 14.50909  
    4) wt>=3.325 6   25.45417 12.66667 *
    5) wt< 3.325 5   28.82800 16.74000 *
  3) hp< 140 21  404.66670 23.09524  
    6) wt>=3.19 8   63.31500 18.50000 *
    7) wt< 3.19 13  103.63080 26.00000  
      14) wt>=2.465 6   35.03333 23.35000 *
      15) wt< 2.465 7   16.51429 28.22857 *

n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 setosa (0.33333 0.33333 0.33333)  
  2) Petal.Length< 2.45 50   0 setosa (1 0 0) *
  3) Petal.Length>=2.45 100  50 versicolor (0 0.5 0.5)  
    6) Petal.Width< 1.75 54   5 versicolor (0 0.90741 0.092593) *
    7) Petal.Width>=1.75 46   1 virginica (0 0.021739 0.97826) *

(Plots: Tree diagrams with boxes for nodes, branches for splits, leaves with values.)

Explanation of the Output in Simple English: First (regression): "n=32" total cars. Each line is a node: #) description, split rule, n in group, deviance (error-like, lower better), yval (average mpg). Root: All 32, high deviance 1126, avg 20.1. Split1: hp>=140 to left (11 cars, avg 14.5, lower deviance). Subsplit: wt>=3.325 (6 heavy high-hp: 12.7*). * means leaf (end). Right branch hp<140 (21, avg 23.1), splits on wt. Leaves: E.g., light low-hp: 28.2. Tree prioritizes hp then wt.

Second (classification): "n=150" flowers. Nodes: split, n, loss (misclassifications), yval (majority class), (probs). Root: Mixed, loss 100 (2/3 wrong if we guess a single species). Split1: Petal.Length<2.45 left (50 setosa, loss 0, a perfect leaf*). Right: 100 mixed versicolor/virginica (loss 50, since a 50/50 mix means half are wrong whichever we guess). Subsplit: Petal.Width<1.75 (54 mostly versicolor, loss 5 errors, probs 0 setosa, 0.91 versicolor, 0.09 virginica). Other leaf: 46 mostly virginica (loss 1, probs 0 setosa, 0.02 versicolor, 0.98 virginica). Visualization: Tree pictures show the flowchart, which is easy to follow.
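
Optional follow-up (a small sketch using the tree objects a and b from the program above; the wt and hp values below are made up): using the fitted trees to predict.

text
predict(a, newdata=data.frame(wt=2.2, hp=100))   # Regression tree: lands in the light, low-hp leaf (about 28 mpg)
pred <- predict(b, type="class")                 # Classification tree: predicted species for each flower
table(pred, iris$Species)                        # Confusion table: 5 + 1 = 6 errors, matching the leaf losses above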


Experiment 7

Aim: To perform K-nearest neighbors using R. This predicts by looking at similar examples, like asking neighbors.

Explanation of the Dataset: Using "iris" (explained in Experiment 3). Here, first 100 flowers (setosa and versicolor), features 1-4, labels species.

Explanation of the Algorithm in Simple Steps: Like "birds of a feather flock together."

  1. For a new item, measure distance to all known items (e.g., Euclidean, like straight-line on map).
  2. Find K closest neighbors (here K=3).
  3. For classification: Majority vote on their labels. (For regression: Average.) Simple, no pre-training, but slow for big data; sensitive to scale.

Program: Viz: Plot first two features colored by labels to see clusters. Run in R.

text
library(class)  # For KNN
data(iris)
a <- iris[1:100, 1:4]  # Features first 100
b <- iris[1:100, 5]  # Labels
c <- knn(a, a[1:5,], b, k=3)  # Predict first 5 using all
c
# Plot: First two features, colored by label
plot(iris[1:100,1:2], col=as.integer(b), main="KNN: Iris Setosa/Versicolor")  # Shows clusters

Output:

text
[1] setosa setosa setosa setosa setosa
Levels: setosa versicolor

(Plot: Dots in two groups, setosa clustered with smaller sepals left.)

Explanation of the Output in Simple English: The array [1] lists predictions for the first 5 items: all "setosa" (correct, as the first 50 rows are setosa). "Levels" shows the possible classes. Here we are testing on the training data, so predictions are perfect; in practice, use a separate test set (see the sketch below). Visualization: Scatter plot of sepal length (x) vs width (y); with col=as.integer(b), setosa (colour 1, black) forms a small, tight cluster and versicolor (colour 2, red) a larger, more spread one, so a point's nearest neighbours are usually the same colour.
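
A hedged sketch of the separate-test idea mentioned above (the 70/30 split is arbitrary): train KNN on some flowers and evaluate it on flowers it has not seen.

text
library(class)
data(iris)
set.seed(1)
idx   <- sample(1:100, 70)                        # 70 random rows for training, the other 30 for testing
train <- iris[idx, 1:4];  test <- iris[setdiff(1:100, idx), 1:4]
cl    <- iris$Species[idx]; truth <- iris$Species[setdiff(1:100, idx)]
pred  <- knn(train, test, cl, k=3)
table(pred, truth)                                # Confusion table on unseen flowers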


Experiment 8

Aim: To perform principal component analysis using R. This simplifies data by combining features into fewer, capturing main patterns.

Explanation of the Dataset: Using "iris" (explained in Experiment 3). Features 1-4 (measurements).

Explanation of the Algorithm in Simple Steps: PCA reduces dimensions, like summarizing a book to key themes.

  1. Center data (subtract averages per feature).
  2. Find directions of max variance (spread): Principal components (PCs), orthogonal (perpendicular).
  3. Project data onto top PCs (new coordinates).
  4. Keep top few that explain most variance. Loses some info but simplifies.

Program: Viz: Biplot shows features and data. Run in R.

text
data(iris)
a <- prcomp(iris[,1:4])  # PCA on features
summary(a)
biplot(a, main="PCA: Iris Features")  # Viz loadings and scores

Output:

text
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     2.0563 0.4926 0.2797 0.1544
Proportion of Variance 0.9246 0.0531 0.0171 0.0052
Cumulative Proportion  0.9246 0.9777 0.9948 1.0000

(Biplot: Arrows for feature contributions, dots for flowers colored by species.)

Explanation of the Output in Simple English: "Importance of components" table: Four PCs (one per feature). "Standard deviation": Spread along PC—PC1 2.06 (big), PC4 0.15 (tiny). "Proportion of Variance": % info captured—PC1 92.5% (main pattern, maybe overall size), PC2 5.3%, total first two 97.8%. "Cumulative": Adds up to 100%. Drop last two for simplicity. Visualization: Biplot has dots (flowers) clustered by species; arrows (features)—petal length/width long/same direction (correlated), point to virginica cluster.
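
Optional follow-up (a small sketch using the prcomp object a from above): keep only the first two PCs as a simplified dataset.

text
scores <- a$x[, 1:2]                     # New coordinates (PC1, PC2) for each flower
head(scores)
plot(scores, col=as.integer(iris$Species), main="Iris on First Two Principal Components")  # Colour by species to see the groups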


Experiment 9

Aim: To perform K-means clustering using R. This groups similar items automatically without labels.

Explanation of the Dataset: Using "iris" (explained in Experiment 3). Features 1-4.

Explanation of the Algorithm in Simple Steps: K-means finds clusters like sorting candies by color/shape.

  1. Choose K (3 here). Pick K random centers.
  2. Assign each item to closest center (Euclidean distance).
  3. Update centers to average of their group.
  4. Repeat till centers stable. May vary with start; run multiple.

Program: Viz: Plot clusters on first two features. Run in R.

text
data(iris)
a <- kmeans(iris[,1:4], centers=3)  # 3 clusters
a$cluster  # Assignments
plot(iris[,1:2], col=a$cluster, main="K-means: Iris Clusters")  # Color by cluster

Output:

text
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2
[112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[149] 2 2

(Plot: Three colored groups roughly matching species.)

Explanation of the Output in Simple English: "$cluster" array: Numbers 1-3 for each of 150 flowers. E.g., first 50 all 1 (setosa group), 51-100 mostly 3 (versicolor, some 2 mix), 101-150 all 2 (virginica). Matches real species well, but not perfect (vers/virg overlap). Visualization: Scatter sepal length vs width, colors show clusters—1 small sepals, 3 medium, 2 large; tight groups.
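
Optional follow-up (a small sketch on the same data): because results can vary with the random starting centers (step 4), fix the seed, use several starts, and compare clusters with the real species.

text
set.seed(1)
a2 <- kmeans(iris[,1:4], centers=3, nstart=25)   # 25 random starts, keep the best solution
table(a2$cluster, iris$Species)                  # Rows = clusters, columns = species (cluster numbers are arbitrary)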

Thursday, 4 September 2025

Predictive and Advanced Analytics Using R notes, Krishna University B.Sc Data Science Honours

Data Caching (In-Memory Data Management for Efficiency)

Imagine you're a shopkeeper who wants to quickly answer questions like:

  • What items are sold together?

  • What are the best-selling combinations?

  • How to group customers?

To answer these quickly, you don’t want to go back to your storeroom (disk) every time — instead, you want to keep useful data in memory (brain or register).

This is what in-memory data caching helps with in data mining — it avoids slow disk operations.


🧠 Concept 1: Data Caching (In-Memory Data Management)

Definition: Storing frequently accessed or preprocessed data in memory (RAM) to avoid reading from disk repeatedly, making the process faster.


Now let’s explain the 6 common in-memory caching techniques using shop examples and simple data mining analogies:


✅ 1. Data Cube Materialization

What it is: Pre-calculating and storing summary tables (cuboids) for different combinations of dimensions (like "item", "month", "region").

Analogy:
You create ready-made sales summaries for:

  • Item + Region

  • Item + Month

  • Region + Month

So when someone asks:

“How many T-shirts were sold in June in Delhi?”
You don't have to add up everything — it's already precomputed and stored in memory.

Use: Makes OLAP queries super fast.
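
A minimal R sketch of the idea (the sales data frame and its column names are invented for illustration): materialize an Item × Month × Region summary once with xtabs, then answer queries from the summary instead of re-reading the raw rows.

text
sales <- data.frame(item   = c("T-shirt", "T-shirt", "Jeans", "T-shirt"),
                    month  = c("June", "June", "May", "May"),
                    region = c("Delhi", "Mumbai", "Delhi", "Delhi"),
                    qty    = c(10, 5, 3, 7))
cube <- xtabs(qty ~ item + month + region, data=sales)   # Precomputed totals for every combination
cube["T-shirt", "June", "Delhi"]                         # Answered from the cached cube, not the raw table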


✅ 2. Vertical Data Format (TID Lists) — used in Eclat Algorithm

What it is: Instead of storing full transactions, you store:

  • Each item → list of transaction IDs (TIDs) where it appears

Analogy:
Instead of:

T1 → Milk, Bread  
T2 → Milk, Eggs  

You store:

Milk → T1, T2  
Bread → T1  
Eggs → T2

Now to find common items:

  • Milk ∩ Bread → T1

  • Milk ∩ Eggs → T2

Challenge: These TID lists can be large → so we split or partition them into smaller sets to keep them in memory.
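
A minimal base-R sketch of the TID-list idea (item and transaction names follow the example above): the support of an itemset is just the size of the intersection of its TID lists.

text
tid <- list(Milk  = c("T1", "T2"),
            Bread = c("T1"),
            Eggs  = c("T2"))
support <- function(items) length(Reduce(intersect, tid[items]))  # Intersect the TID lists, count what is left
support(c("Milk", "Bread"))   # 1 -> only T1 contains both
support(c("Milk", "Eggs"))    # 1 -> only T2 contains both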


✅ 3. FP-Tree (Frequent Pattern Tree)

What it is: A compressed tree structure that stores frequent itemsets without listing every transaction.

Analogy:
Instead of remembering every customer's shopping list, you draw a tree:

[Milk, Bread]  
[Milk, Eggs]  
[Milk, Bread, Eggs]

→ becomes a tree:

Milk
 ├── Bread
 │    └── Eggs
 └── Eggs

Use: Makes pattern mining faster and saves space.


✅ 4. Projected Databases

What it is: Smaller sub-databases focused only on frequent items — built during recursion in algorithms like PrefixSpan.

Analogy:
If you’re analyzing "customers who bought Milk", you ignore others and only use a filtered copy of the database with Milk-based transactions.

Why in memory?
To avoid reading filtered data from disk again and again during recursive calls.


✅ 5. Partitioned Ensembles

What it is: Split the full transaction data into small pieces that fit into RAM, and process them one by one.

Analogy:
If your notebook is too big to read at once, you tear out 10 pages at a time, work on them, and stitch the results.

Use: Especially helpful when memory is limited.


✅ 6. AVC-Sets (Attribute-Value-Class Sets)

What it is: For each attribute (feature), store a summary of how many times each value appears with each class (label). Used in decision tree building.

Analogy:
You’re building a tree to predict "Will customer return?"
You keep in memory:

Age | Returns? | Count
20s | Yes | 4 times
30s | No | 6 times
20s | No | 1 time

So you don’t have to scan full data again to calculate "best split".
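
A rough R sketch of one AVC-set (the customer data and column names are made up to mirror the counts above): a single table() call gives the value-by-class counts a split test needs.

text
customers <- data.frame(age_group = c(rep("20s", 5), rep("30s", 6)),
                        returns   = c(rep("Yes", 4), rep("No", 7)))
table(customers$age_group, customers$returns)   # AVC-set for age_group: 20s -> 4 Yes / 1 No, 30s -> 6 No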


📊 Summary Table

Technique | What It Does | Simple Analogy
Data Cube | Precompute summaries | Ready-made total sales by category
TID Lists (Eclat) | Store item → transaction IDs | Item lookup book
FP-Tree | Compress frequent items into a tree | Combine repeated paths like Milk → Bread
Projected DB | Use filtered, smaller datasets | Only analyze “Milk buyers” group
Partitioned Ensembles | Split DB into memory-sized chunks | Tear out a few pages at a time
AVC-Sets | Store class counts per feature-value | Summary table for decision trees

🚀 Final Thoughts

All of these techniques are about managing large data in small memory by:

  • Preprocessing

  • Compressing

  • Partitioning

  • Avoiding disk reads

This is very important in data mining where speed matters and datasets can be huge.



CHAID (Chi-squared Automatic Interaction Detection) made easy

🔍 CHAID Uses Chi-Square Test — Not Info Gain or Gini

In CHAID (Chi-squared Automatic Interaction Detection), instead of calculating entropy or gain, we calculate a Chi-square statistic to measure how strongly each input (feature) is related to the target (label).


📌 CHAID Splitting Criterion

👉 The attribute with the smallest p-value from the Chi-square test is chosen as the splitting feature.

A small p-value means a strong relationship between the predictor and the target.


📐 Formula Used in CHAID

The Chi-square formula is:

\chi^2 = \sum \frac{(O - E)^2}{E}

Where:

  • \chi^2 = Chi-square statistic

  • O = Observed frequency (what’s in your data)

  • E = Expected frequency (if there's no relationship)

  • The sum is over all combinations of feature and class


✅ Example: Mini Chi-square Test

Let’s go back to this dataset:

Student | Buys Laptop | Count
Yes | Yes | 3
No | No | 3

We can fill a 2x2 table:

 | Buys = Yes | Buys = No | Total
Student = Yes | 3 | 0 | 3
Student = No | 0 | 3 | 3
Total | 3 | 3 | 6

Step 1: Compute Expected Values

For each cell:

E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}

Example:

  • Expected (Student = Yes, Buys = Yes) = \frac{3 \times 3}{6} = 1.5

  • Expected (Student = Yes, Buys = No) = \frac{3 \times 3}{6} = 1.5

  • Expected (Student = No, Buys = Yes) = 1.5

  • Expected (Student = No, Buys = No) = 1.5

Step 2: Use Chi-square Formula

\chi^2 = \sum \frac{(O - E)^2}{E} = \frac{(3 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(0 - 1.5)^2}{1.5} + \frac{(3 - 1.5)^2}{1.5} = 4 \times \frac{2.25}{1.5} = 6.0

Now compare this Chi-square statistic with the critical value from the Chi-square table for 1 degree of freedom, which is 3.84 at the 0.05 level.
Since 6.0 > 3.84, the relationship is significant → this feature is a good candidate for splitting.
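
A hedged check of this 2x2 example in R: chisq.test applies the Yates continuity correction to 2x2 tables by default, so correct = FALSE is needed to reproduce the hand calculation.

text
tab <- matrix(c(3, 0,
                0, 3), nrow=2, byrow=TRUE,
              dimnames=list(Student=c("Yes", "No"), Buys=c("Yes", "No")))
chisq.test(tab, correct=FALSE)   # X-squared = 6, df = 1, p-value about 0.014 (R warns that expected counts are small)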


📝 Summary: CHAID vs. ID3/C4.5/CART

Feature | CHAID | ID3 / C4.5 / CART
Uses what to split? | Chi-square test | Entropy, Info Gain, Gini
Data type preferred | Categorical (like Yes/No) | Categorical + Numerical
Splits how? | Can have many branches per split | Binary (CART), multi (ID3)
Good for | Marketing, social science data | General-purpose decision trees

Let’s walk through a step-by-step, pen-and-paper-style Chi-square test for a 3x2 table, using simple values so you can calculate everything easily.


🎯 GOAL:

We'll apply the Chi-square test to find whether a predictor (like Age Group) is related to a target (like Buys Laptop).


🧾 Example Dataset

Person | Age Group | Buys Laptop
1 | Young | Yes
2 | Young | No
3 | Middle | Yes
4 | Middle | No
5 | Old | No
6 | Old | Yes

🪜 Step 1: Build a Frequency Table

Let’s count how many Yes/No for each age group:

Age Group | Buys = Yes | Buys = No | Row Total
Young | 1 | 1 | 2
Middle | 1 | 1 | 2
Old | 1 | 1 | 2
Col Total | 3 | 3 | 6

This is called a contingency table.


🧮 Step 2: Calculate Expected Frequencies (E)

Use this formula:

E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}

Each cell's expected value:

  • For Young, Yes: \frac{2 \times 3}{6} = 1

  • For Young, No: \frac{2 \times 3}{6} = 1

  • Same for Middle and Old.

So, the Expected table (E) is:

Age Group | Expected Yes | Expected No
Young | 1 | 1
Middle | 1 | 1
Old | 1 | 1

🧾 Step 3: Apply the Chi-Square Formula

\chi^2 = \sum \frac{(O - E)^2}{E}

Let’s compute for each cell:

All Observed = Expected (from the frequency table), so:

  • (O - E)^2 = 0 for every cell

  • Therefore, \chi^2 = 0

✅ Final Result:

\chi^2 = 0

That means:
There is no relationship between Age Group and Buys Laptop in this dataset — the observed values match the expected.


🎯 Now Let’s Try a More Interesting Case

Let’s change the dataset slightly:

Person | Age Group | Buys Laptop
1 | Young | Yes
2 | Young | Yes
3 | Middle | Yes
4 | Middle | No
5 | Old | No
6 | Old | No

Now build the frequency table:

Age Group | Buys = Yes | Buys = No | Row Total
Young | 2 | 0 | 2
Middle | 1 | 1 | 2
Old | 0 | 2 | 2
Col Total | 3 | 3 | 6

🔢 Step-by-step Expected Values

For each cell:

  • Expected (Young, Yes) = \frac{2 \times 3}{6} = 1

  • Expected (Young, No) = 1

  • Expected (Middle, Yes) = 1

  • Expected (Middle, No) = 1

  • Expected (Old, Yes) = 1

  • Expected (Old, No) = 1

So:

Age Group | O (Yes) | E (Yes) | O (No) | E (No)
Young | 2 | 1 | 0 | 1
Middle | 1 | 1 | 1 | 1
Old | 0 | 1 | 2 | 1

🔍 Step 4: Apply the Chi-Square Formula

Now compute:

\chi^2 = \sum \frac{(O - E)^2}{E}

Let’s go cell by cell:

  • (Young, Yes): \frac{(2 - 1)^2}{1} = 1

  • (Young, No): \frac{(0 - 1)^2}{1} = 1

  • (Middle, Yes): \frac{(1 - 1)^2}{1} = 0

  • (Middle, No): \frac{(1 - 1)^2}{1} = 0

  • (Old, Yes): \frac{(0 - 1)^2}{1} = 1

  • (Old, No): \frac{(2 - 1)^2}{1} = 1

✅ Final Chi-square Value:

\chi^2 = 1 + 1 + 0 + 0 + 1 + 1 = 4


📊 What Does This Mean?

To interpret this value, compare with the critical value from the Chi-square table for:

  • Degrees of Freedom = (Rows - 1) × (Cols - 1) = (3 − 1) × (2 − 1) = 2

  • For significance level 0.05 → Critical value = 5.99

Since \chi^2 = 4 < 5.99
→ ❌ Not significant at 0.05 level
→ So we do not split based on Age Group
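
The same 3x2 example checked in R (chisq.test applies no continuity correction for tables larger than 2x2, so the statistic matches the hand calculation; expect a small-counts warning):

text
tab <- matrix(c(2, 0,
                1, 1,
                0, 2), nrow=3, byrow=TRUE,
              dimnames=list(Age=c("Young", "Middle", "Old"), Buys=c("Yes", "No")))
chisq.test(tab)   # X-squared = 4, df = 2, p-value about 0.135 -> not significant at the 0.05 level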


✅ Summary of Steps

Step | What You Do
1 | Create a frequency table (observed values)
2 | Calculate expected values
3 | Use the formula: \chi^2 = \sum \frac{(O - E)^2}{E}
4 | Add up all the values
5 | Compare with critical value to check significance

Support Vector Machine Simplified

We’ll use a tiny dataset with just 4 points. The goal is to separate Class A (+1) and Class B (−1) using a straight line (in 2D).


📌 Dataset

Point | x₁ | x₂ | Class
A | 1 | 2 | +1
B | 2 | 3 | +1
C | 3 | 3 | −1
D | 2 | 1 | −1

Plot these on graph paper:

  • A (1,2) 🔵

  • B (2,3) 🔵

  • C (3,3) 🔴

  • D (2,1) 🔴


✏️ Step 1: Try a line x₂ = x₁ → i.e., line through origin at 45°

The equation of the line is:

f(x) = x₂ - x₁ = 0

Let’s test each point:

Point | x₁ | x₂ | f(x) = x₂ - x₁ | Result | Prediction
A | 1 | 2 | 2 - 1 = +1 | ≥ 0 | +1 ✅
B | 2 | 3 | 3 - 2 = +1 | ≥ 0 | +1 ✅
C | 3 | 3 | 3 - 3 = 0 | ≥ 0 | +1 ❌
D | 2 | 1 | 1 - 2 = -1 | < 0 | -1 ✅

❌ C is wrongly classified. So this line isn’t optimal.


✏️ Step 2: Try a better line: x₂ = x₁ + 0.5

This shifts the line upward a bit.
The equation becomes:

f(x) = x₂ - x₁ - 0.5 = 0

Let’s test:

Point | x₁ | x₂ | f(x) = x₂ - x₁ - 0.5 | Result | Prediction
A | 1 | 2 | 2 - 1 - 0.5 = +0.5 | ≥ 0 | +1 ✅
B | 2 | 3 | 3 - 2 - 0.5 = +0.5 | ≥ 0 | +1 ✅
C | 3 | 3 | 3 - 3 - 0.5 = -0.5 | < 0 | -1 ✅
D | 2 | 1 | 1 - 2 - 0.5 = -1.5 | < 0 | -1 ✅

✅ All 4 points are correctly classified!


🧮 Step 3: Express the Equation as SVM Style

SVM wants the line in this form:

w₁·x₁ + w₂·x₂ + b = 0

Our equation:

x₂ - x₁ - 0.5 = 0

Can be rewritten as:

-x₁ + x₂ - 0.5 = 0

So,

  • w₁ = -1

  • w₂ = +1

  • b = -0.5

This is our final separating hyperplane.
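
A minimal R check of this hand-worked line (the data frame below just re-enters the four points from the table):

text
pts <- data.frame(x1 = c(1, 2, 3, 2),
                  x2 = c(2, 3, 3, 1),
                  class = c(1, 1, -1, -1),
                  row.names = c("A", "B", "C", "D"))
w <- c(-1, 1); b <- -0.5                               # Weights and bias from Step 3
f <- as.matrix(pts[, c("x1", "x2")]) %*% w + b         # Decision values f(x) for each point
data.frame(pts, f=f, pred=ifelse(f >= 0, 1, -1))       # Predictions agree with the class column for all 4 points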


🧲 Step 4: Margin and Support Vectors

The support vectors are the closest points to the decision boundary.

Check distances from the line (|f(x)| is proportional to the distance):

For point A(1,2): f(x) = 2 - 1 - 0.5 = +0.5

For point B(2,3): f(x) = 3 - 2 - 0.5 = +0.5

For point C(3,3): f(x) = 3 - 3 - 0.5 = -0.5

For point D(2,1): f(x) = 1 - 2 - 0.5 = -1.5

So points A, B and C all sit at the same smallest distance from the decision boundary (|f(x)| = 0.5); they are the support vectors, while D lies farther away.


✅ Final Summary (like a notebook page)

📘 Final Equation:

x₂ - x₁ - 0.5 = 0

or in SVM form:

w = [-1, 1], \quad b = -0.5

📍 Support Vectors:

  • A(1,2)

  • B(2,3)

  • C(3,3)

✅ Classification Rule:

  • If f(x) ≥ 0 → Class +1

  • If f(x) < 0 → Class -1


🎓 SVM in 1 Sentence:

SVM finds the best line (or curve) that maximizes the gap between two classes, using only the closest points (support vectors) to make the decision.




🎯 GOAL of SVM (in Math Terms)

Given labeled data, find the hyperplane (line) that:

  1. Separates the two classes correctly

  2. Maximizes the margin (distance from the line to the closest points)


✍️ 1. The Equation of a Hyperplane

In 2D, a line is:

w_1 x_1 + w_2 x_2 + b = 0

Or, in vector form:

\mathbf{w}^\top \mathbf{x} + b = 0

  • \mathbf{w} = [w_1, w_2] → weight vector (controls the direction of the line)

  • b → bias (controls the shift up/down of the line)

  • \mathbf{x} = [x_1, x_2] → input point


🧠 2. Classification Rule

For any point x\mathbf{x}:

\text{Class} = \begin{cases} +1 & \text{if } \mathbf{w}^\top \mathbf{x} + b \geq 0 \\ -1 & \text{if } \mathbf{w}^\top \mathbf{x} + b < 0 \end{cases}

📏 3. What is Margin?

Let’s say you have a line that separates the data. The margin is the distance between the line and the closest data points (called support vectors).

We want this margin to be as wide as possible.

Let’s define:

  • The distance from a point \mathbf{x} to the line \mathbf{w}^\top \mathbf{x} + b = 0 is:

\text{Distance} = \frac{|\mathbf{w}^\top \mathbf{x} + b|}{\|\mathbf{w}\|}

Where \|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2}


🏁 4. Optimization Objective

We want:

  • All data points classified correctly:

    y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1

    for all i

    This ensures the points are on the correct side of the margin.

  • Maximize the margin = Minimize \|\mathbf{w}\|

So the optimization problem becomes:

Minimize:

\frac{1}{2} \|\mathbf{w}\|^2

Subject to:

y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \text{for all } i

This is called a convex optimization problem — it has one global minimum, which we can find using Lagrange Multipliers.


🧩 5. Solving Using Lagrangian (Soft Explanation)

We use the method of Lagrange Multipliers to solve this constrained optimization.

We build the Lagrangian:

L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^n \alpha_i [y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1]

Where:

  • \alpha_i \geq 0 are the Lagrange multipliers

Then we find the saddle point (minimize L with respect to \mathbf{w}, b and maximize with respect to \boldsymbol{\alpha}).

This leads to a dual problem, which is easier to solve using tools like quadratic programming.


✳️ 6. Final Classifier

Once solved, we get:

\mathbf{w} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i

This means the support vectors (where \alpha_i > 0) are the only ones used to define \mathbf{w}. All other data points don’t affect the boundary!

Then you get the decision function:

f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b

Predict class:

  • If f(\mathbf{x}) \geq 0 → +1

  • If f(\mathbf{x}) < 0 → −1
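
A hedged sketch tying this back to R: the e1071 package (not used elsewhere in these notes) solves exactly this optimization; fitting it on the 4-point example from earlier recovers a weight vector and bias. They may differ from the hand-worked line in scale or sign, because the solver returns the true maximum-margin hyperplane with support vectors at f(x) = ±1.

text
library(e1071)                       # install.packages("e1071") if needed
x <- matrix(c(1, 2,  2, 3,  3, 3,  2, 1), ncol=2, byrow=TRUE,
            dimnames=list(c("A", "B", "C", "D"), c("x1", "x2")))
y <- factor(c(1, 1, -1, -1))
m <- svm(x, y, kernel="linear", cost=10, scale=FALSE)   # Large cost behaves like a hard margin
w <- t(m$coefs) %*% m$SV                                # Weight vector built from the support vectors
b <- -m$rho                                             # Intercept
w; b
predict(m, x)                                           # Reproduces the class labels for all 4 points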


🪄 Intuition Summary

Concept | In Simple Words
Hyperplane | The best line that separates classes
Margin | Gap between the line and the nearest points
Support Vectors | Points lying closest to the line
Optimization Goal | Maximize margin (i.e., minimize \|\mathbf{w}\|)
Constraint | Keep all points on the correct side
Lagrange Method | A tool to solve optimization with constraints