Introduction to IoT - Lecture 8: Big Data Analytics Techniques
Supervised and Unsupervised Learning
- Supervised Learning: Involves labeled data; the algorithm learns from input-output pairs.
- Examples: Classification, Regression
- Unsupervised Learning: No labeled data; the algorithm tries to learn structure from the input.
- Examples: Clustering, Association Rule Mining
Supervised Learning
- Input Data: Labeled (predefined outputs).
- Training: Uses a training dataset to learn patterns.
- Goal: Prediction (e.g., classifying spam, predicting prices).
- Types:
- Regression (continuous outputs, like house prices).
- Classification (discrete outputs, like "cat" vs. "dog").
- Classes: Known in advance.
- Analysis: Typically offline (pre-processed data).
Unsupervised Learning
- Input Data: Unlabeled (no predefined outputs).
- Training: Works directly on raw input data.
- Goal: Analysis (e.g., finding hidden patterns).
- Types:
- Clustering (grouping similar data, like customer segments).
- Association (discovering relationships, like "people who buy X also buy Y").
- Classes: Unknown; structure is learned from the data itself.
- Analysis: Often real-time (dynamic data).
Big Data Techniques Overview
| Technique | Category | Use Case |
|---|---|---|
| K-Means Clustering | Clustering | Group items by similarity |
| Apriori | Association Rules | Discover relationships between items |
| Linear/Logistic Regression | Regression | Find relationship between inputs and outcomes |
| TF-IDF | Text Analysis | Analyze and weight terms in textual data |
| Naïve Bayes, Decision Tree | Classification | Assign objects to known classes |
| ARIMA | Time Series Analysis | Forecast future values in temporal data |
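To make the TF-IDF row concrete, here is a minimal sketch in plain Python (no libraries; the toy "documents" are invented for illustration). A term's weight is its frequency within a document multiplied by the log of how rare it is across documents:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF = term count / document length; IDF = log(N / docs containing term).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({term: (count / length) * math.log(n / df[term])
                        for term, count in tf.items()})
    return weights

# Hypothetical IoT log snippets, tokenized into word lists.
docs = [["sensor", "data", "stream"],
        ["sensor", "alert"],
        ["data", "alert", "alert"]]
w = tf_idf(docs)
```

In this sketch, "stream" appears in only one document, so it receives a higher weight there than the more common "sensor".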
Clustering
- Clustering groups similar data points together.
- Unsupervised learning technique.
- Data points in the same cluster are highly similar; different clusters are dissimilar.
Applications of Cluster Analysis
- Marketing: Target customer segments
- Biology: Classify species or gene functions
- City Planning: Group houses by features
- Other: Pattern recognition, image processing, etc.
Types of Clustering
- Exclusive (Hard) Clustering: Each data point belongs to only one cluster (e.g., K-Means).
- Overlapping (Soft) Clustering: Data points can belong to multiple clusters (e.g., Fuzzy C-Means).
- Hierarchical Clustering: Builds a tree-like cluster structure (dendrogram).
K-Means Clustering
- Unsupervised technique for partitioning n data points into k clusters.
- Each point is assigned to the nearest centroid.
- Input: Numeric features with a defined distance metric (e.g., Euclidean).
- Output: Cluster centroids and point-cluster assignments.
Steps in K-Means
- Choose k and initialize centroids.
- Assign each point to the nearest centroid.
- Recompute centroids based on new assignments.
- Repeat steps 2–3 until convergence (centroids no longer change) or an iteration limit is reached.
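The steps above can be sketched in a few lines of plain Python (the four sample points are made up for illustration; a real run would use numeric feature vectors from the dataset):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(c) / n for c in zip(*cluster))

def k_means(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: choose k, initialize
    for _ in range(iters):
        # Step 2: assign each point to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[idx].append(p)
        # Step 3: recompute centroids as cluster means (keep old if empty).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:           # convergence: centroids stable
            break
        centroids = new_centroids
    return centroids, clusters

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
cents, groups = k_means(pts, 2)
```

With two well-separated groups of points, the algorithm converges to one centroid near each group.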
Association Rules
- Unsupervised technique for discovering relationships between items.
- Does not predict an outcome; identifies patterns.
- Example format: If X is observed, then Y is also observed.
- Commonly used in Market Basket Analysis (e.g., customers who buy bread also buy butter).
- Example algorithm: Apriori
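A compact sketch of the Apriori idea in plain Python (the basket data is a made-up example). The key pruning insight is that a k-itemset can only be frequent if all of its (k−1)-subsets are frequent, so candidates are grown level by level:

```python
from collections import Counter

def frequent_itemsets(transactions, min_support):
    """Apriori-style frequent-itemset mining.

    Returns all itemsets whose support (fraction of transactions
    containing them) is at least min_support.
    """
    n = len(transactions)
    tx = [frozenset(t) for t in transactions]
    # Level 1: frequent single items.
    counts = Counter(item for t in tx for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Keep only candidates meeting the support threshold.
        current = {c for c in candidates
                   if sum(c <= t for t in tx) / n >= min_support}
        frequent |= current
        k += 1
    return frequent

baskets = [{"bread", "butter", "milk"},
           {"bread", "butter"},
           {"bread", "jam"},
           {"butter", "milk"}]
freq = frequent_itemsets(baskets, min_support=0.5)
```

Here {bread, butter} appears in 2 of 4 baskets, so it survives at support 0.5, matching the "customers who buy bread also buy butter" pattern; rules of the form "if X then Y" are then derived from such frequent itemsets.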
Regression
- Determines the relationship between input features and an output variable.
- Identifies influential variables and helps improve outcomes.
- Two major types: Linear and Logistic regression.
Linear Regression
- Models relationship between continuous outcome and input variables.
- Assumes a linear relationship.
- Is probabilistic, not deterministic.
- Can include transformations to achieve linearity.
Linear Regression Model Equation
y = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ + ε
Key Components
- Outcome Variable (y): The continuous target variable being predicted.
- Input Variables (x₁, …, xₚ): Features influencing y (e.g., x₁, x₂).
- Intercept (β₀): Baseline value of y when all xᵢ = 0.
- Coefficients (β₁, …, βₚ): Quantify the effect of each xᵢ on y. For example, β₁ is the change in y for a 1-unit increase in x₁, holding other variables constant.
- Error Term (ε): Captures unexplained variability, reflecting the model's probabilistic (not deterministic) nature.
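The coefficients can be estimated from data with ordinary least squares; a minimal single-variable sketch in plain Python (the sample points are toy data scattered around y = 2 + 3x, so the scatter plays the role of the error term ε):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1*x (one input variable).

    b1 = cov(x, y) / var(x);  b0 = mean(y) - b1 * mean(x).
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

# Noisy samples around y = 2 + 3x (invented for illustration).
xs = [0, 1, 2, 3, 4]
ys = [2.1, 4.9, 8.2, 10.9, 14.1]
b0, b1 = fit_line(xs, ys)
```

The fitted slope lands close to the true coefficient 3, and the intercept close to 2, illustrating how each coefficient captures the change in y per unit change in x.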
Logistic Regression
- Used when the outcome is categorical (e.g., Yes/No, Pass/Fail).
- Based on the logistic function (sigmoid).
- Outputs probabilities in the range (0, 1).
- Suitable for binary classification problems.
Logistic Regression Model Equation
Linear Component: z = β₀ + β₁x₁ + β₂x₂ + … + βₚxₚ
Logistic Function (Sigmoid): p = 1 / (1 + e^(−z))
Key Components
- Linear Predictor (z): A linear combination of input variables (xᵢ) and coefficients (βᵢ), similar to linear regression.
- Logistic Function: Transforms z into a probability between 0 and 1, ensuring outputs are valid probabilities.
- Probability Interpretation: p represents the likelihood of a binary outcome (e.g., "success" or "failure").
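As a sketch of how these pieces fit together, here is a tiny logistic regression fitted by gradient descent in plain Python (the data, framed as "hours studied → pass/fail", is invented for illustration):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p = sigmoid(b0 + b1*x) by gradient descent on the log-loss.

    The log-loss gradient for each coefficient is (p - y) times its input.
    """
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += (p - y)          # gradient w.r.t. intercept
            g1 += (p - y) * x      # gradient w.r.t. slope
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Binary outcome: hours studied -> pass (1) / fail (0), toy data.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
```

After fitting, sigmoid(b0 + b1·x) gives a probability above 0.5 for large x and below 0.5 for small x, which is exactly the binary-classification behavior described above.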