Sunday 26 November 2023

Factor analysis, Dimensionality reduction, Predictive analytics, Cluster Analysis, Decision Tree, Types of Decision Trees notes

Factor analysis: in factor analysis, observed variables are grouped into a smaller number of underlying (latent) factors, i.e., they are reduced according to their function

Example: waiting time, cleanliness, healthiness and taste are the observed variables

waiting time and cleanliness form a service factor

healthiness and taste form a food experience factor

In a hotel or restaurant, these variables are reduced to two factors.

Descriptive analysis like this is very important for model building

Merits of factor analysis:
1. reduces the amount of data
2. makes categorization easy
3. uses sound statistical methods
4. very useful in model building
5. the reduced factors are easier to interpret than the raw variables
6. is an important descriptive analysis technique
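The grouping idea above can be sketched in plain Python: variables that correlate strongly with each other end up under one factor. This is only an illustration of the intuition (real factor analysis estimates loadings statistically), and the ratings below are made-up data, not a real survey.

```python
# Sketch of the idea behind factor analysis: variables that are highly
# correlated with each other are grouped under one factor.
# The ratings below are illustrative, made-up data.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

ratings = {
    "waiting_time": [1, 2, 3, 4, 5, 6, 7, 8],
    "cleanliness":  [2, 4, 6, 8, 10, 12, 14, 16],  # moves with waiting_time
    "taste":        [5, 1, 4, 2, 8, 6, 7, 3],
    "healthy":      [6, 2, 5, 3, 9, 7, 8, 4],      # moves with taste
}

# Greedily place each variable into the first factor whose members it
# correlates strongly with (|r| > 0.7); otherwise start a new factor.
factors = []
for name, vals in ratings.items():
    for group in factors:
        if all(abs(pearson(vals, ratings[m])) > 0.7 for m in group):
            group.append(name)
            break
    else:
        factors.append([name])

print(factors)  # a "service" factor and a "food experience" factor
```

With this data the service variables and the food variables fall into separate factors, exactly as in the restaurant example.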

Predictive analytics: estimating unknown or future values with the help of past/current data
There are 4 types of predictive analytics:
    1. Classification
    2. Prediction
    3. Regression
    4. Time series analysis
These types are used in different situations accordingly
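As a minimal sketch of the regression flavour: fit a straight line to past data by least squares, then use it to forecast an unseen value. The numbers here are made up purely for illustration.

```python
# Regression as predictive analytics: fit y = slope * x + intercept
# on past data by least squares, then predict an unseen value.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

past_x = [1, 2, 3, 4, 5]    # e.g. month number (made-up data)
past_y = [2, 4, 6, 8, 10]   # e.g. sales in that month

slope, intercept = fit_line(past_x, past_y)
prediction = slope * 6 + intercept   # forecast for month 6
print(prediction)                    # 12.0 for this data
```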

Dimensionality reduction: when data has high dimensionality, working with it becomes a difficult and tedious task.

To address this we perform the data reduction step of the KDD process.

Dimensionality reduction is a part of data reduction.

Some data/dimensionality reduction techniques:
1. Data cube aggregation: data is summarized in cube form so fewer dimensions need to be handled
2. Numerosity reduction: the data is replaced by a smaller representation, e.g. sampling, histograms or a fitted model
3. Feature selection: we keep only the most relevant features, using filter or wrapper methods etc.
4. Normalization: bringing all the data into the same range or interval
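Normalization (point 4) is the simplest of these to show in code. Below is a sketch of min-max normalization, which rescales every value into the interval [0, 1]:

```python
# Min-max normalization: rescale every value into [0, 1] so that all
# attributes share the same range.

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max([2, 4, 6, 10]))  # [0.0, 0.25, 0.5, 1.0]
```

(A real implementation would also handle the degenerate case where all values are equal, which divides by zero here.)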

Lazy Learning: lazy learning is a technique where the model simply stores the training data and its rules/criteria, and only does the real computation when a query is given to it, using the stored data and its neighbours

If we give it data, say a = 2 and b = 3, only then does it evaluate the formula; otherwise it won't

It is useful when we have TBs of data and can't afford to process all of it up front.

kNN is a lazy learning algorithm
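A bare-bones kNN classifier shows the lazy idea directly: "training" is just storing the labelled points, and all the work (distances, voting) happens only when a query arrives. The points and labels below are made up for illustration.

```python
# A minimal kNN classifier, the classic lazy learner: nothing is computed
# until a query point arrives; the "model" is just the stored training data.
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs; query: an (x, y) point."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2
                         + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]   # majority label among k nearest

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2)))   # "A": its 3 nearest neighbours are all "A"
```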

Cluster Analysis
Cluster analysis is a kind of descriptive analysis where we group (cluster) similar kinds of data.
It is helpful for large amounts of unlabelled data.
It is used for further analysis and for building models.

There are different types of cluster analysis, but the most popular ones are
1. partitioning clusters
2. hierarchical 
3. Density Based
4. Grid Based
1. Partitioning Methods: as the name suggests, partitioning methods divide the data set into k groups and iteratively refine the groups using some criterion to derive the clusters.
K-means and K-medoids are partitioning methods.
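The partitioning idea can be sketched with a bare-bones k-means: assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, repeat. The points are made up, and the initial centroids are hand-picked here so the result is deterministic (real k-means usually initializes randomly).

```python
# A bare-bones k-means sketch (the best-known partitioning method).
# Initial centroids are chosen by hand for determinism.

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                              + (p[1] - centroids[j][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            for c in clusters
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (8.5, 8), (8, 8.5)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(clusters)   # two groups of three points each
```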

2. Hierarchical Methods: as the name suggests, we build a hierarchy of clusters, either bottom-up by merging small clusters or top-down by splitting large ones.
Agglomerative (merging) and Divisive (splitting) are the popular hierarchical methods.

3. Density Methods: points that lie in dense regions are grouped together, so clusters of arbitrary shape can be found. DBSCAN is one of the popular density-based clustering techniques.

4. Grid Based: in this method, the data space is quantized into a rectangular grid of cells and statistics such as density are stored per cell.
The cells are organized into resolution levels, with each higher-level cell summarizing a group of lower-level cells.
Clusters are then formed from adjacent dense cells.

In cluster analysis, we need to choose the clustering technique appropriate for the dataset.

Decision Tree: in the example below,
weather is the root node
humidity and speed are attributes/columns
yes or no are the class labels
The example shows whether a child can play outside or not.

In a decision tree, a rectangle represents an attribute/column
and an ellipse represents a class label

The main purpose of a decision tree is to extract rules for classification.

Example: if weather = sunny and humidity = normal then play = yes

if  weather = cloudy then play = yes

if weather = windy and speed = low then play = yes
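The three extracted rules translate directly into a plain function. Note one assumption: the notes don't say what happens for combinations not covered by a rule, so the default "no" below is a guess, not part of the tree.

```python
# The three extracted decision-tree rules as a plain function.
# Combinations not covered by a listed rule default to "no"
# (an assumption, not taken from the tree itself).

def play(weather, humidity=None, speed=None):
    if weather == "sunny" and humidity == "normal":
        return "yes"
    if weather == "cloudy":
        return "yes"
    if weather == "windy" and speed == "low":
        return "yes"
    return "no"

print(play("cloudy"))                     # yes
print(play("sunny", humidity="normal"))   # yes
print(play("windy", speed="high"))        # no (assumed default)
```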


Types of Decision Trees
1. unweighted decision tree: when there is no weight on any node of the decision tree, i.e., there are no biases in the tree

2. weighted decision tree: when weights (biases) are assigned to the nodes of the tree

3. binary decision tree: where every node splits into at most two branches (e.g. yes/no)

4. Random forest: an ensemble of n decision trees whose predictions are combined, e.g. by majority vote
