Big Data Analysis

Big Data Characteristics (5 Vs)
Hadoop Ecosystem
Spark Architecture
NoSQL Databases
Stream Processing
MapReduce Programming Paradigm
CAP Theorem
Lambda vs Kappa Architecture
Big Data Analytics Tools
Data Governance and Privacy in Big Data

1. Big Data Characteristics (5 Vs)

Overview

Big Data refers to datasets that are too large or complex for traditional data processing applications. The 5 Vs define its core characteristics:

The 5 Vs

V	Description	Details
Volume	Massive scale of data	Terabytes to petabytes; generated from social media, IoT sensors, transactions, logs. In India, platforms such as Aadhaar, GSTN, and UPI generate data at this scale
Velocity	Speed of data generation/processing	Real-time or near real-time streams; need for rapid processing. Example: UPI processes millions of transactions per hour
Variety	Different data types and sources	Structured (databases), semi-structured (XML, JSON, logs), unstructured (images, videos, audio, social media posts)
Veracity	Data quality and reliability	Inconsistency, incompleteness, ambiguity, latency; "garbage in, garbage out" problem
Value	Extractable useful information	Converting raw data into actionable insights; the ultimate goal of big data analytics

Extended Vs (Additional Characteristics)

Variability: Inconsistency in data flow rates (seasonal spikes)
Visualization: Presenting data in a meaningful and understandable format
Viscosity: Resistance to flow of data between systems

2. Hadoop Ecosystem

Overview

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets on commodity hardware.

Architecture

┌───────────────────────────────────────────────┐
│               HADOOP ECOSYSTEM                │
├──────────────┬─────────────────┬──────────────┤
│  Storage     │  Processing     │  Management  │
│              │                 │              │
│  HDFS        │  MapReduce      │  YARN        │
│              │  Spark          │  Oozie       │
│              │  Pig            │  ZooKeeper   │
│              │  Hive           │  Ambari      │
├──────────────┼─────────────────┼──────────────┤
│  Data Access │  Data Transfer  │  NoSQL       │
│  Hive        │  Sqoop          │  HBase       │
│  Pig         │  Flume          │              │
│              │  Kafka          │              │
└──────────────┴─────────────────┴──────────────┘

2.1 HDFS (Hadoop Distributed File System)

Design Principles:
- Store large files (GBs to TBs)
- Streaming data access (write-once, read-many)
- Runs on commodity hardware

Architecture:
- NameNode (Master): Manages file system namespace, metadata, block mapping, replication
- DataNode (Slave): Stores actual data blocks, serves read/write requests
- Secondary NameNode: Periodically merges fsimage and edit logs; NOT a backup for NameNode

Key Features:
- Block size: 128 MB by default in Hadoop 2.x and later (64 MB in Hadoop 1.x); configurable, with 256 MB common for large files
- Replication factor: 3 (default) — data stored on 3 different nodes
- Rack-aware replication for fault tolerance
- Read Operation: Client gets block locations from NameNode, reads directly from DataNode
- Write Operation: Client requests NameNode, which provides DataNodes; data written in pipeline
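
As a rough illustration of the block size and replication figures above, the sketch below estimates how many blocks a file occupies and how much raw cluster storage it consumes. The helper name `hdfs_footprint` and the 1 GiB example file are illustrative assumptions, not part of any Hadoop API. HDFS does not pad the final block, so raw usage is simply file size times the replication factor.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate block count and raw storage for one file (illustrative helper)."""
    # The last block may be only partially filled; HDFS stores just the actual bytes.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


blocks, raw = hdfs_footprint(1024 * MB)  # a hypothetical 1 GiB file
print(blocks, "blocks,", raw // MB, "MB of raw storage")  # 8 blocks, 3072 MB of raw storage
```

The NameNode keeps metadata per block, which is one reason HDFS favors a few large files over many small ones.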

HDFS Commands Example:

hdfs dfs -mkdir /user/data
hdfs dfs -put localfile /user/data/
hdfs dfs -cat /user/data/localfile
hdfs dfs -ls /user/data

2.2 MapReduce

(Detailed in Section 6)

Programming model for parallel processing of large datasets on distributed clusters.

2.3 YARN (Yet Another Resource Negotiator)

Purpose: Resource management and job scheduling in Hadoop 2.0+

Components:
| Component | Role |
|-----------|------|
| ResourceManager (RM) | Global resource scheduler; manages cluster resources |
| NodeManager (NM) | Per-node agent; manages containers on a single node |
| ApplicationMaster (AM) | Per-application; negotiates resources and tracks progress |
| Container | Resource allocation (CPU, memory) on a specific node |

Workflow:
1. Client submits application to ResourceManager
2. ResourceManager allocates container and starts ApplicationMaster
3. ApplicationMaster requests containers from ResourceManager
4. ApplicationMaster asks the NodeManagers to launch the granted containers; NodeManagers monitor their execution

2.4 Hive

Data warehouse infrastructure built on Hadoop
HiveQL: SQL-like query language for HDFS data
Converts queries into MapReduce/Spark/Tez jobs
Supports partitions and buckets for query optimization
Metastore: Stores table schemas and metadata
Use Case: Batch processing, ad-hoc queries, ETL

2.5 Pig

High-level platform for creating MapReduce programs
Pig Latin: Data flow language (procedural)
Supports loading, filtering, joining, grouping, sorting data
Use Case: ETL pipelines, exploratory data analysis, iterative processing

2.6 HBase

NoSQL column-oriented database on top of HDFS
Inspired by Google's BigTable
Provides random real-time read/write access to big data
Column-family storage with automatic sharding
Use Case: When you need real-time read/write access to large datasets (e.g., time-series data, messaging)

2.7 Sqoop (SQL to Hadoop)

Imports/Exports data between relational databases and Hadoop
Uses MapReduce under the hood
sqoop import --connect jdbc:mysql://host/db --table employees --target-dir /data/employees
sqoop export --connect jdbc:mysql://host/db --table employees --export-dir /data/employees

2.8 Flume

Distributed service for collecting, aggregating, and moving large log data
Components: Source → Channel → Sink
Source: Receives data (e.g., Avro, Kafka, HTTP)
Channel: Stores data temporarily (memory or file-based)
Sink: Delivers data to destination (e.g., HDFS, HBase)

2.9 Oozie

Workflow scheduler for Hadoop jobs
Manages dependencies between jobs (MapReduce, Pig, Hive, Sqoop)
Workflow: Directed Acyclic Graph (DAG) of actions
Coordinator: Schedules workflows based on time or data availability

3. Spark Architecture

Overview

Apache Spark is a fast, general-purpose cluster computing engine. It extends MapReduce with in-memory processing.

Key Advantages over Hadoop MapReduce

Feature	MapReduce	Spark
Processing	Disk-based	In-memory (up to 100x faster for some in-memory workloads)
Latency	Higher (batch)	Lower (supports streaming)
Programming	Map + Reduce only	Rich transformations
Ecosystem	Needs separate tools	Unified engine
Iterative ML	Poor (repeated disk I/O)	Excellent (caching in memory)

Spark Architecture

┌──────────────────────────────────────────────┐
│              DRIVER PROGRAM                   │
│  (SparkContext, Main function, DAG Scheduler)│
└──────────────────┬───────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────┐
│            CLUSTER MANAGER                    │
│  (Standalone, YARN, Mesos, Kubernetes)       │
└──────────────────┬───────────────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
   ┌─────────┐┌─────────┐┌─────────┐
   │Executor ││Executor ││Executor │
   │(Worker) ││(Worker) ││(Worker) │
   └─────────┘└─────────┘└─────────┘

Core Components

3.1 RDD (Resilient Distributed Dataset)

Fundamental data structure of Spark
Immutable, distributed collection of objects
Fault-tolerant through lineage (tracks transformations to rebuild lost data)
Operations:
Transformations (lazy): map(), filter(), flatMap(), reduceByKey(), join()
Actions (trigger execution): count(), collect(), saveAsTextFile(), take()
Persistence: persist() or cache() to keep RDD in memory
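
The difference between lazy transformations and actions can be sketched without a cluster. The `ToyRDD` class below is a pure-Python stand-in written for these notes, not the Spark API: transformations only record lineage, and nothing runs until an action such as `collect()` is called. Recomputing from the recorded lineage is also, in spirit, how Spark rebuilds a lost partition.

```python
class ToyRDD:
    """Pure-Python stand-in for an RDD (hypothetical class, not the Spark API)."""

    def __init__(self, source, lineage=()):
        self._source = source    # original in-memory data
        self._lineage = lineage  # recorded transformations, applied on demand

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: replay the lineage and produce a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []


def noisy_square(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * x


rdd = ToyRDD([1, 2, 3, 4]).map(noisy_square).filter(lambda x: x % 2 == 0)
print(len(calls))     # 0: no transformation has executed yet
print(rdd.collect())  # [4, 16]
```

Because this toy class has no cache, every action recomputes the whole lineage; that repeated work is what `persist()` and `cache()` avoid in Spark.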

3.2 DataFrames

Distributed collection of data organized into named columns
Similar to a relational table or a Pandas DataFrame
Uses Catalyst Optimizer for query optimization
More efficient than RDDs due to schema awareness

3.3 Spark SQL

Module for working with structured data
Supports SQL queries and DataFrame API
Integrates with Hive metastore
Example: spark.sql("SELECT * FROM employees WHERE salary > 50000")

3.4 MLlib (Machine Learning Library)

Scalable machine learning library
Algorithms: Classification, regression, clustering, collaborative filtering
Utilities: Feature extraction, transformation, pipeline construction
Distributed computation across cluster

3.5 GraphX

API for graphs and graph-parallel computation
Built on top of RDDs
Supports graph algorithms: PageRank, connected components, triangle counting

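3.6 Spark Streaming and Structured Streaming

Real-time stream processing
Spark Streaming (the older DStream API) processes data in micro-batches
Structured Streaming (DataFrame-based) uses micro-batches by default, with an experimental continuous processing mode
Supports sources: Kafka, Flume, HDFS, Socket
Structured Streaming's integration with the DataFrames API enables SQL queries on streaming data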

Data Source → Spark Streaming → DStream/Batches → Transformations → Output Sink

4. NoSQL Databases

Overview

NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data storage and distributed processing.

Why NoSQL?

Handle massive volumes of data (horizontal scaling)
Flexible schema (schema-less or dynamic schema)
High availability and fault tolerance
Better performance for specific use cases (key-value lookups, graph traversals)

Types of NoSQL Databases

4.1 Document Databases (MongoDB)

Feature	Description
Data Model	JSON-like documents (BSON)
Schema	Flexible; documents in a collection can have different fields
Query	Rich query language, aggregation pipeline
Scaling	Horizontal via sharding
Use Case	Content management, catalogs, user profiles
Strengths	Flexible schema, rich queries, high performance

MongoDB Example Structure:

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Manupal",
  "city": "Mumbai",
  "skills": ["Python", "Django", "ML"]
}

4.2 Column-Family Databases (Apache Cassandra)

Feature	Description
Data Model	Column families (table-like but columns can vary per row)
Partition Key	Determines data distribution across nodes
Schema	Flexible within column families
Scaling	Linear horizontal scaling, peer-to-peer architecture
Use Case	Time-series data, IoT, messaging, write-heavy workloads
Strengths	High write throughput, no single point of failure

Cassandra Architecture:
- Peer-to-peer (no master-slave)
- Tunable consistency (ONE, QUORUM, ALL)
- Data distributed using consistent hashing
- Write path: Commit Log → MemTable → SSTable
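
A minimal sketch of two ideas from the list above: consistent hashing and tunable consistency. The `HashRing` class and `is_strongly_consistent` helper are hypothetical names for illustration; MD5 is used here only to spread keys around the ring and is an assumption of this sketch, not a claim about Cassandra's default partitioner.

```python
import bisect
import hashlib


def _token(value):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each physical node owns several virtual positions to even out the load.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key):
        # Walk clockwise to the first token at or after the key, wrapping around.
        idx = bisect.bisect_left(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]


def is_strongly_consistent(replicas, write_acks, read_acks):
    """A read is guaranteed to overlap the latest write when R + W > N."""
    return read_acks + write_acks > replicas


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("sensor-42"))
print(is_strongly_consistent(replicas=3, write_acks=2, read_acks=2))  # True (QUORUM/QUORUM)
```

When a node is added to the ring, only the keys that now fall on the new node's tokens move; every other key keeps its owner, which is what makes scaling out cheap.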

4.3 Key-Value Databases (Redis)

Feature	Description
Data Model	Key-value pairs
Data Types	Strings, Lists, Sets, Sorted Sets, Hashes, Streams
Storage	In-memory (optionally persisted to disk)
Scaling	Redis Cluster for horizontal scaling
Use Case	Caching, session management, leaderboards, real-time analytics
Strengths	Extremely fast (sub-millisecond), versatility

4.4 Graph Databases (Neo4j)

Feature	Description
Data Model	Nodes, Relationships, Properties
Query Language	Cypher
Schema	Optional (schema-free or schema-optional)
Scaling	Causal clustering (Enterprise Edition only; Community Edition runs as a single instance)
Use Case	Social networks, fraud detection, recommendation engines
Strengths	Efficient relationship queries, pattern matching

Cypher Query Example:

 MATCH (p:Person)-[:FRIENDS_WITH]->(friend:Person)
 WHERE p.name = "Manupal"
 RETURN friend.name

NoSQL Comparison Summary

Feature	MongoDB	Cassandra	Redis	Neo4j
Type	Document	Column-Family	Key-Value	Graph
Query Language	MQL	CQL	Redis commands	Cypher
Consistency	Strong (primary reads)	Tunable	Strong on a single node; asynchronous replication in Cluster	Strong
Scaling	Sharding	Peer-to-peer	Cluster	Limited
Best For	Flexible schema	Write-heavy	Caching	Relationships
BASE/ACID	ACID (multi-document transactions since v4.0)	BASE	Partial (atomic commands; MULTI/EXEC has no rollback)	ACID

BASE vs ACID Properties

Property	ACID (RDBMS)	BASE (NoSQL)
Consistency	Strong	Eventual
Availability	May sacrifice for consistency	High priority
Partition Tolerance	Limited	Built-in
Examples	PostgreSQL, MySQL	Cassandra, DynamoDB
Acronym	Atomicity, Consistency, Isolation, Durability	Basically Available, Soft state, Eventual consistency

5. Stream Processing

Overview

Stream processing handles continuous data streams in real-time, processing data as it arrives rather than in batches.

Apache Kafka

Aspect	Description
Type	Distributed streaming platform / Message broker
Model	Publish-subscribe
Components	Topics, Partitions, Brokers, Producers, Consumers, Consumer Groups
Persistence	Messages persisted to disk and replicated
Use Case	Log aggregation, event sourcing, real-time analytics

Producer → [Topic (Partition 0, Partition 1, ...)] → Consumer Group

Key Concepts:
- Topic: Category/feed name to which messages are published
- Partition: Topics split across brokers for parallelism
- Offset: Unique ID for each message within a partition
- Consumer Group: Multiple consumers work together to process partitions
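- ZooKeeper/KRaft: Cluster coordination (ZooKeeper in older releases; KRaft, Kafka's built-in Raft-based quorum, replaces it in newer ones)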
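
The concepts above (partitions, offsets, consumer groups) can be sketched with an in-memory toy. `ToyTopic` and `assign` are hypothetical names and not the Kafka client API. CRC32 stands in for the partitioner's hash, and round-robin assignment is just one possible strategy.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a topic (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Same key -> same partition, which preserves per-key ordering.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, partition, offset, max_records=10):
        # Consumers track their own offset and read forward from it.
        return self.partitions[partition][offset:offset + max_records]


def assign(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group."""
    plan = defaultdict(list)
    for p in range(num_partitions):
        plan[consumers[p % len(consumers)]].append(p)
    return dict(plan)


topic = ToyTopic(num_partitions=3)
print(topic.send("acct-1", "debit 100"))
print(topic.send("acct-1", "credit 40"))
print(assign(3, ["consumer-1", "consumer-2"]))  # {'consumer-1': [0, 2], 'consumer-2': [1]}
```

Each partition is read by at most one consumer in a group, so the partition count caps the useful parallelism of that group.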

Apache Storm

Aspect	Description
Type	Distributed real-time computation system
Concepts	Spouts (sources), Bolts (processing), Topology (DAG)
Guarantee	At-least-once processing (Trident: exactly-once)
Latency	Extremely low (sub-second)
Use Case	Real-time analytics, ETL, continuous monitoring

Apache Flink

Aspect	Description
Type	Stream processing framework with batch capabilities
Model	True stream processing (not micro-batch)
Event Time	Handles out-of-order events using watermarks
State	Managed state for complex aggregations
Guarantee	Exactly-once semantics
Use Case	Real-time analytics, CEP (complex event processing), fraud detection
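
To make the watermark idea concrete, here is a small pure-Python sketch of event-time tumbling windows. The function `tumbling_counts` and its parameters are assumptions for illustration and do not mirror Flink's API. The watermark trails the largest timestamp seen by a fixed allowed lateness, and a window is emitted once the watermark passes its end.

```python
from collections import defaultdict


def tumbling_counts(events, window_size, max_lateness):
    """Count events per event-time window.

    events: iterable of (event_time, payload) in arrival order.
    Returns (emitted, dropped), where emitted holds (window_start, count)
    pairs in emission order and dropped holds events that arrived too late.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")
    for event_time, payload in events:
        start = event_time - event_time % window_size
        if start + window_size <= watermark:
            dropped.append((event_time, payload))  # its window already closed
        else:
            open_windows[start] += 1
        # The watermark never moves backwards.
        watermark = max(watermark, event_time - max_lateness)
        for w in sorted(w for w in open_windows if w + window_size <= watermark):
            emitted.append((w, open_windows.pop(w)))
    return emitted, dropped


stream = [(1, "a"), (4, "b"), (3, "c"), (12, "d"), (2, "e"), (16, "f"), (5, "g"), (30, "h")]
print(tumbling_counts(stream, window_size=10, max_lateness=5))
# ([(0, 4), (10, 2)], [(5, 'g')])
```

Event `(2, "e")` arrives out of order but is still counted because the watermark has not yet passed its window; `(5, "g")` arrives after the window closed and is dropped. Windows still open when the input ends are not flushed in this sketch.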

Stream Processing Comparison

Feature	Kafka	Storm	Flink
Type	Message Broker	Stream Processor	Stream Processor
Processing Model	N/A	True streaming (Trident: micro-batch)	True streaming
Latency	Low	Very low	Low
Throughput	Very high	Moderate	High
Exactly-once	Yes (transactions)	Trident only	Yes (native)
State management	Limited	Limited	Built-in

6. MapReduce Programming Paradigm

Overview

MapReduce is a programming model for processing large datasets in parallel across a distributed cluster.

Phases

Input → Split → MAP → SHUFFLE & SORT → REDUCE → Output

Phase	Description
Map	Process input records, emit key-value pairs `(k, v)`
Shuffle & Sort	Group values by key, transfer to reducers
Reduce	Aggregate values for each key, emit final `(k, v)`

Word Count Example

Input: "hello world hello"

Map Phase:

"hello world hello" → [(hello, 1), (world, 1), (hello, 1)]

Shuffle & Sort:

hello → [1, 1]
world → [1]

Reduce Phase:

hello → [1, 1] → (hello, 2)
world → [1] → (world, 1)
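
The same word count can be written as three small Python functions, one per phase. This is a single-process sketch with hypothetical function names; in Hadoop each phase runs distributed across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) for every word in one input record."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Aggregate the values for one key."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return [reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped).items()]


print(word_count(["hello world hello"]))  # [('hello', 2), ('world', 1)]
```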

Key Concepts

InputFormat: Splits input data and creates key-value pairs for mappers
Combiner: Mini-reducer that runs on mapper node (optimization)
Partitioner: Determines which reducer receives which key (default: hash partitioning)
OutputFormat: Writes final output to storage
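
A brief sketch of the combiner and partitioner roles, assuming word-count style `(word, count)` pairs. The function names are illustrative. CRC32 is used instead of Python's built-in `hash()` because string hashes in Python vary between processes, and a partitioner must route a key to the same reducer on every mapper.

```python
import zlib
from collections import Counter


def combine(mapper_output):
    """Combiner: pre-aggregate pairs on the mapper node before the shuffle."""
    totals = Counter()
    for word, count in mapper_output:
        totals[word] += count
    return list(totals.items())


def partition(key, num_reducers):
    """Hash partitioner: pick the reducer that receives this key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


local_pairs = [("hello", 1), ("world", 1), ("hello", 1)]
print(combine(local_pairs))  # [('hello', 2), ('world', 1)]
print(partition("hello", 4) == partition("hello", 4))  # True
```

The combiner cuts shuffle traffic from three pairs to two here; it is only safe when the reduce function is associative and commutative, as summation is.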

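MapReduce Job Execution (Hadoop 1.x)

1. JobClient submits job to JobTracker
2. JobTracker assigns tasks to TaskTrackers
3. Map Tasks read input splits, apply map function
4. Sort & Shuffle moves intermediate data to reducers
5. Reduce Tasks aggregate and write final output
6. Output stored in HDFS

In Hadoop 2.0+, YARN takes over these roles: the ResourceManager and a per-job ApplicationMaster replace the JobTracker, and NodeManagers replace the TaskTrackers.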

Limitations of MapReduce

Multiple disk I/O operations (map output → disk → reduce read)
Not suitable for iterative algorithms (machine learning)
Not suitable for real-time processing
High latency for interactive queries

7. CAP Theorem

Statement

A distributed data system can provide only two out of three guarantees simultaneously: Consistency, Availability, and Partition Tolerance.

Three Properties

Property	Description
Consistency (C)	All nodes see the same data at the same time
Availability (A)	Every request to a non-failing node receives a non-error response, though not necessarily the most recent data
Partition Tolerance (P)	System continues to operate despite network partitions

Trade-offs

           Consistency
                /\
               /  \
              /    \
         CA  /      \  CP
            /        \
           /          \
Availability ---------- Partition Tolerance
                AP

Combination	Description	Examples
CA	Consistency + Availability (no network partition)	Traditional RDBMS (single node)
CP	Consistency + Partition Tolerance (sacrifices availability)	MongoDB, HBase, Redis Cluster
AP	Availability + Partition Tolerance (sacrifices consistency)	Cassandra, CouchDB, DynamoDB

Important Notes

Network partitions are inevitable in distributed systems, so practically the choice is between CP and AP
Brewer's revised view: The "2 out of 3" is oversimplified; systems can provide partial guarantees of all three
PACELC Theorem extends CAP: In case of Partition (P), choose between Availability (A) and Consistency (C); Else (E), choose between Latency (L) and Consistency (C)
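
The CP versus AP choice can be illustrated with a two-replica toy store. `ToyStore` is a hypothetical class written for these notes; real systems use far more elaborate replication and recovery protocols.

```python
class ToyStore:
    """Two replicas; mode is 'CP' or 'AP' (illustrative only)."""

    def __init__(self, mode):
        self.mode = mode
        self.values = [None, None]  # one value per replica
        self.partitioned = False    # True when the replicas cannot talk

    def write(self, replica_idx, value):
        if not self.partitioned:
            self.values = [value, value]  # healthy network: replicate to both
        elif self.mode == "CP":
            # Refuse the write rather than let the replicas diverge.
            raise RuntimeError("unavailable during partition")
        else:
            self.values[replica_idx] = value  # AP: accept locally, reconcile later

    def read(self, replica_idx):
        return self.values[replica_idx]


ap = ToyStore("AP")
ap.write(0, "v1")
ap.partitioned = True
ap.write(0, "v2")
print(ap.read(0), ap.read(1))  # v2 v1
```

Under the same partition, a `ToyStore("CP")` raises on the second write: both replicas still agree on `v1`, but the system has given up availability for that request.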

8. Lambda vs Kappa Architecture

8.1 Lambda Architecture

Purpose: Handle both batch and real-time data processing

                    ┌──────────────┐
                    │  DATA SOURCE │
                    └──────┬───────┘
                           │
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼────────┐      ┌────────▼────────┐
     │   BATCH LAYER   │      │  SPEED LAYER    │
     │ (Hadoop/Spark)  │      │(Storm/Spark     │
     │ Complete,       │      │ Streaming)      │
     │ accurate view   │      │ Real-time,      │
     └────────┬────────┘      │ approximate     │
              │               └────────┬────────┘
              └────────────┬───────────┘
                           │
                    ┌──────▼───────┐
                    │   SERVING    │
                    │    LAYER     │
                    └──────────────┘

Layers:
1. Batch Layer: Processes all historical data using Hadoop/Spark; produces accurate batch views
2. Speed Layer (Real-time): Processes recent data not yet in batch layer; produces real-time views
3. Serving Layer: Merges batch and real-time views; answers queries
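
A minimal sketch of how the three layers cooperate, assuming a page-view counting use case. The function names and the event shape (`{"page": ...}`) are hypothetical choices for illustration.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from all events up to the last batch run."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: counts for events that arrived after the last batch run."""
    return Counter(event["page"] for event in recent_events)


def serve(page, batch, speed):
    """Serving layer: answer a query by merging both views."""
    return batch.get(page, 0) + speed.get(page, 0)


history = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
recent = [{"page": "/home"}]
print(serve("/home", batch_view(history), speed_view(recent)))  # 3
```

Here both views happen to share the same counting logic; in practice the two layers often run on different engines, which is where the two-code-base cost comes from.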

Advantages: Fault tolerant, handles both batch and streaming, immutable data
Disadvantages: Complex (two code bases), double computation, hard to maintain consistency between layers

8.2 Kappa Architecture

Purpose: Simplified alternative using only stream processing

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     DATA     │────▶│   STREAM     │────▶│   SERVING    │
│    SOURCE    │     │  PROCESSING  │     │    LAYER     │
│  (Kafka)     │     │  (Flink/     │     │              │
└──────────────┘     │   Kafka      │     └──────────────┘
                     │   Streams)   │
                     └──────────────┘

Principles:
- Everything is a stream
- Data is stored in immutable log (e.g., Kafka)
- Batch = replaying historical stream data
- Single code base for all processing

Advantages: Simpler than Lambda, single code base, real-time by default
Disadvantages: Replaying full historical data can be expensive, stream processing can be more complex than batch
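
The "batch = replay" principle can be sketched in a few lines. A Python list stands in for the immutable log, and `process` and `replay` are hypothetical names; the point is that one function serves both live processing and reprocessing.

```python
def process(state, event):
    """Single stream-processing function used for live and replayed events."""
    account, amount = event
    state[account] = state.get(account, 0) + amount
    return state


def replay(log, from_offset=0):
    """Rebuild state by re-reading the log from a chosen offset."""
    state = {}
    for event in log[from_offset:]:
        process(state, event)
    return state


log = [("acct-a", 10), ("acct-b", 5), ("acct-a", -3)]  # append-only event log
print(replay(log))  # {'acct-a': 7, 'acct-b': 5}
```

If the processing logic changes, the new version is deployed and the log is replayed from offset 0 to rebuild the serving view; no separate batch job is required.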

Comparison

Feature	Lambda	Kappa
Complexity	High (two code bases)	Lower (single code base)
Batch Processing	Native	Via stream replay
Real-time	Separate speed layer	Native
Data Storage	HDFS + real-time store	Event log only
Maintenance	Difficult	Easier
Use Case	When batch is primary	When stream is primary

9. Big Data Analytics Tools

Hadoop Ecosystem Tools (covered in Section 2)

Additional Analytics Tools

Tool	Category	Description
Apache Zeppelin	Visualization	Web-based notebook for interactive data analytics
Tableau	BI/Visualization	Drag-and-drop dashboard creation
Power BI	BI/Visualization	Microsoft's business analytics platform
Apache Superset	BI/Open Source	Modern data exploration and visualization
Elasticsearch	Search/Analytics	Distributed search and analytics engine
Kibana	Visualization	Data visualization for Elasticsearch
Jupyter Notebook	Analytics	Interactive computing environment for data science
Grafana	Monitoring	Open-source visualization and monitoring

10. Data Governance and Privacy in Big Data

Data Governance

Framework for managing the availability, usability, integrity, and security of data.

Key Components

Component	Description
Data Quality	Accuracy, completeness, consistency, timeliness
Data Stewardship	Assigning responsibility for data management
Data Catalog	Inventory of data sources, definitions, and relationships
Data Lineage	Tracking data from source through transformations to consumption
Master Data Management (MDM)	Single source of truth for key business entities
Data Security	Access controls, encryption, masking, auditing

Data Privacy

Key Challenges

Volume: More data → more potential for privacy breaches
Variety: Unstructured data harder to classify and protect
Velocity: Real-time data harder to anonymize in transit
Data Proliferation: Data copied across multiple systems

Privacy Techniques

Technique	Description
Anonymization	Remove PII (Personally Identifiable Information)
Pseudonymization	Replace PII with artificial identifiers
Encryption	Encrypt data at rest and in transit
Access Control	Role-based access (RBAC) to sensitive data
Data Masking	Hide sensitive data while maintaining format
Differential Privacy	Add statistical noise to query results
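
Three of the techniques in the table can be sketched with the Python standard library. The function names, the 16-character token length, and the demo key are illustrative assumptions; a production system would need proper key management and a vetted differential privacy library.

```python
import hashlib
import hmac
import random


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash; equal inputs give equal tokens."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask_card(number):
    """Data masking: keep the format and the last four characters, hide the rest."""
    return "".join("X" if ch.isdigit() else ch for ch in number[:-4]) + number[-4:]


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (sensitivity 1)."""
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


key = b"demo-key"  # hypothetical secret; never hard-code real keys
print(pseudonymize("alice@example.com", key))
print(mask_card("4111-1111-1111-1234"))  # XXXX-XXXX-XXXX-1234
print(noisy_count(100, epsilon=1.0))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy. Pseudonymized data can still be re-linked by anyone holding the key, which is why it is treated differently from anonymized data.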

Key Regulations

Regulation	Region	Key Points
GDPR	EU	Right to be forgotten, consent, data portability
DPDP Act 2023	India	Consent-based processing, data fiduciary obligations, right to erasure
CCPA	California, US	Consumer rights to know, delete, opt-out
HIPAA	US	Health data protection

India's Digital Personal Data Protection Act (DPDP) 2023 — Key Points

Data Principal: Individual whose data is processed
Data Fiduciary: Entity determining purpose and means of processing
Consent: Free, specific, informed, unconditional, and unambiguous, given through a clear affirmative action
Rights: Right to access, correct, erase, and grievance redressal
Significant Data Fiduciary: Additional obligations (DPO, DPIA, audit)
Penalties: Up to ₹250 crore for breaches

Key Formulas and Theorems Summary

Concept	Formula/Statement
CAP Theorem	Only 2 of {C, A, P} guaranteed
PACELC	If Partition → A or C; Else → L or C
Data Replication	Default HDFS replication factor = 3
Consistent Hashing	Used for distributed data partitioning

Exam Tips

5 Vs of Big Data — memorize and know examples of each
Hadoop ecosystem — know each component's role
Spark vs MapReduce — in-memory processing advantage
NoSQL types — document, column, key-value, graph with examples
CAP theorem — understand which databases fall into CP vs AP
Lambda vs Kappa — Lambda runs separate batch and speed layers behind a serving layer; Kappa is stream-only
DPDP Act 2023 — India's data protection law, key terms (Data Principal, Data Fiduciary)
BASE vs ACID — fundamental trade-off in distributed databases
Kafka — distributed streaming platform, key architectural components

Practice Questions

11 MCQs for Big Data Analysis with detailed explanations.

Q1. Which of the following best describes Apache Spark?

A. A fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing
B. published
C. oversimplified; systems can provide partial guarantees of all three
D. stream-only

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A. Apache Spark is a fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing.