Introduction to IoT - Lecture 7: Big Data
What is Big Data?
Big Data refers to datasets that are too large, complex, or fast-changing for traditional data processing methods to handle. These datasets require new tools, architectures, and analytical models to extract value and insights.
Why Big Data?
Key enablers for Big Data growth include:
- Growing availability of data
- Increasing processing power
- Increasing storage capacity
Big Data Sources
Common sources include:
- Social media platforms
- IoT devices and sensors
- Enterprise systems
- Web logs and clickstreams
- Multimedia content
- Public/government data
Big Data Characteristics
Often described by the "5 Vs":
- Volume: Huge amounts of data
- Velocity: High speed of data generation and processing
- Variety: Different types (text, video, etc.)
- Veracity: Uncertainty in data quality and trustworthiness
- Value: Usefulness of data insights
Types of Big Data
- Structured Data: Organized and easily searchable (e.g., relational databases, CSV)
- Unstructured Data: No formal structure (e.g., images, audio, video)
- Semi-Structured Data: Partially organized (e.g., XML, JSON)
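To make the distinction concrete, here is a minimal Java sketch contrasting a structured and a semi-structured representation of the same hypothetical sensor reading (the field names and values are illustrative; unstructured data such as images or audio has no field layout at all):

```java
public class DataTypesExample {
    public static void main(String[] args) {
        // Structured: fixed columns in a known order, easily searchable
        // (a CSV line or a relational table row)
        String csvRow = "sensor-42,2024-05-01T12:00:00Z,23.5";
        String[] fields = csvRow.split(",");
        System.out.println("temperature = " + fields[2]);

        // Semi-structured: self-describing tags, partially organized (JSON);
        // fields can appear in any order or be missing entirely
        String json = "{\"sensor\":\"sensor-42\","
                + "\"time\":\"2024-05-01T12:00:00Z\",\"temp\":23.5}";
        System.out.println(json);
    }
}
```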
Big Data Job Roles
- Business User: Uses insights to make decisions
- Project Sponsor: Funds the project
- Project Manager: Oversees delivery and quality
- Business Intelligence Analyst: Offers domain-specific analytics
- Data Engineer: Handles data ingestion and ETL
- Database Administrator: Manages database environments
- Data Scientist: Performs modeling and advanced analytics
Data Analytics Lifecycle
1. Business Issue Understanding
- Goal: Align the analysis with business needs.
- Key Actions: Define objectives, gather context, choose methods, set scope, and outline deliverables.
2. Data Understanding
- Goal: Assess data relevance and quality.
- Key Actions: Collect raw data, check requirements/availability, and explore its characteristics.
3. Data Preparation
- Goal: Transform raw data into an analysis-ready format.
- Key Actions: Merge sources, clean errors, standardize formats, blend datasets, and sample if needed (see the sketch after this list).
4. Exploratory Analysis & Modeling
- Goal: Extract insights and build predictive models.
- Key Actions: Select methodology, identify key variables, construct models, and evaluate performance.
5. Validation
- Goal: Ensure results are reliable.
- Key Actions: Test outcomes, review the process, and either proceed or iterate (if results are invalid).
6. Visualization & Presentation
- Goal: Communicate findings effectively.
- Key Actions: Tailor visuals/storytelling to the audience, highlight insights, and propose actionable recommendations.
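As an illustration of step 3 (Data Preparation), here is a minimal, self-contained Java sketch that drops malformed rows, standardizes timestamps, and de-duplicates some hypothetical raw CSV sensor rows; all names and values are illustrative:

```java
import java.time.Instant;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical raw rows in the form "deviceId,timestamp,temperature".
public class PrepareData {
    public static void main(String[] args) {
        List<String> raw = List.of(
                "dev-1,2024-05-01T12:00:00Z,23.5",
                "dev-1,2024-05-01T12:00:00Z,23.5",   // duplicate -> dropped
                "dev-2,not-a-timestamp,19.0",        // malformed -> dropped
                "dev-3,2024-05-01T12:05:00Z,21.75");

        Set<String> clean = new LinkedHashSet<>();   // de-duplicates, keeps order
        for (String row : raw) {
            String[] f = row.split(",");
            if (f.length != 3) continue;             // wrong column count -> drop
            try {
                Instant ts = Instant.parse(f[1]);    // standardize: ISO-8601 only
                double temp = Double.parseDouble(f[2]);
                clean.add(f[0] + "," + ts + "," + temp);
            } catch (RuntimeException e) {
                // unparseable timestamp or number -> drop the row
            }
        }
        clean.forEach(System.out::println);
    }
}
```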
Data Repositories
Data Warehouse
- Centralized, structured, cleaned
- Supports analytics across subjects
Data Mart
- Subset of data warehouse
- Specific to departments or applications
Data Lake
- Stores raw, unprocessed data
- High flexibility, low upfront cost
Data Warehouse vs. Data Lake vs. Data Mart
| Aspect | Data Warehouse | Data Lake | Data Mart |
|---|---|---|---|
| Data Type | Structured | All types (raw) | Structured, specific subset |
| Purpose | Analytics, reporting | Storage, advanced analytics | Focused business analysis |
| Schema | Defined before storage (schema-on-write) | Defined on read (schema-on-read) | Predefined |
| Example | Enterprise-wide sales data | IoT sensor logs | Marketing team data |
Summary: Data Warehouses are optimized for structured queries, Data Lakes store diverse raw data, and Data Marts serve specific needs.
Big Data Tools
Examples include:
- Apache Hadoop
- Apache Spark
- Apache Flink
- Apache Hive
- NoSQL Databases (e.g., MongoDB, Cassandra)
Apache Big Data Projects
Open-source projects under Apache that support big data processing, storage, and analytics. Core example: Apache Hadoop.
Apache Hadoop
An open-source framework for distributed storage and processing of large datasets using commodity hardware.
Key Features
- Scalable and fault-tolerant
- Processes structured and unstructured data
- Uses parallel processing
- Written in Java
Key Components of Hadoop
- MapReduce – Data processing
- YARN – Resource management
- HDFS – Distributed storage
- Hadoop Common – Core libraries
Hadoop Distributed File System (HDFS)
- Stores large files by splitting them into blocks (128 MB by default)
- Highly fault tolerant
- Supports create, delete, move, and rename operations
- Serves as default storage in Hadoop
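To make the supported operations concrete, here is a minimal sketch using the Hadoop FileSystem Java API; the NameNode URI and the paths are illustrative placeholders, and a real cluster normally supplies them through its configuration files:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/sensor-readings.csv");

        // Create: write a file (HDFS splits it into blocks and replicates them)
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("dev-1,2024-05-01T12:00:00Z,23.5\n");
        }

        // Rename (move) and delete are namespace operations handled by the NameNode
        fs.mkdirs(new Path("/archive"));
        Path moved = new Path("/archive/sensor-readings.csv");
        fs.rename(file, moved);
        fs.delete(moved, false);   // false = not recursive

        fs.close();
    }
}
```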
HDFS Architecture
- Master/Slave Architecture
- NameNode: Manages metadata and file system namespace
- DataNodes: Store actual data blocks and report to NameNode
- Heartbeat and Block Reports: Ensure data integrity
NameNode
Responsible for:
- Managing file system namespace
- Regulating access
- Storing metadata (namespace image and edit log)
DataNode
Responsible for:
- Storing data blocks
- Executing commands from NameNode (read/write, replicate, delete)
- Hosting computation (e.g., machine learning, statistics) close to the stored data
Block Replication
- Default replication factor: 3
- Ensures high availability and fault tolerance
- Data remains accessible if nodes fail
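A minimal sketch of inspecting the cluster-wide default and overriding the replication factor for a single file via the Java API; the file path is hypothetical and must already exist on the cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default (3 unless overridden in hdfs-site.xml)
        System.out.println("dfs.replication = " + conf.get("dfs.replication", "3"));

        FileSystem fs = FileSystem.get(conf);
        // Per-file override: keep only 2 copies of this (hypothetical) file
        fs.setReplication(new Path("/data/sensor-readings.csv"), (short) 2);
        fs.close();
    }
}
```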
YARN (Yet Another Resource Negotiator)
- Manages system resources
- Schedules tasks
- Allows multiple data processing frameworks (e.g., batch, stream)
MapReduce
- Programming model (Hadoop implements it in Java)
- Processes large datasets in parallel
- Composed of two main functions:
- Map: Processes and emits intermediate key-value pairs
- Reduce: Aggregates results
Data Distribution
- Data is split into chunks across cluster nodes
- Distributed file systems manage chunk allocation
- All chunks form a unified namespace
MapReduce Example: Word Count
- Input file is split
- Each Mapper processes a chunk and emits (word, 1)
- Reducers sum up counts for each word
- Final output: (word, total count)
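The sketch below is essentially the classic Hadoop word-count job in Java; the input and output paths come from the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input chunk
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts shuffled to each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```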
MapReduce Steps
- Splitting: Input divided into chunks
- Mapping: Each chunk processed by a mapper
- Shuffling: Intermediate outputs grouped by key
- Reducing: Aggregated results produced
Keys and Values in MapReduce
- Data structured as (Key, Value) pairs throughout
- Mapper and Reducer must define:
- Input and output key-value pair formats
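In the Java API, these formats are declared through the Mapper and Reducer generic type parameters, as in this minimal sketch (the class names are illustrative; the types match the word-count example above):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>:
// (byte offset, line of text) in, (word, count) out.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// The reducer's input pair (Text, IntWritable) must match the mapper's
// output pair; its own output pair defines the job's final format.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
```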