Big Data Analysis

Table of Contents

  1. Big Data Characteristics (5 Vs)
  2. Hadoop Ecosystem
  3. Spark Architecture
  4. NoSQL Databases
  5. Stream Processing
  6. MapReduce Programming Paradigm
  7. CAP Theorem
  8. Lambda vs Kappa Architecture
  9. Big Data Analytics Tools
  10. Data Governance and Privacy in Big Data

1. Big Data Characteristics (5 Vs)

Overview

Big Data refers to datasets that are too large or complex for traditional data processing applications. The 5 Vs define its core characteristics:

The 5 Vs

| V | Description | Details |
|---|-------------|---------|
| Volume | Massive scale of data | Terabytes to petabytes; generated from social media, IoT sensors, transactions, logs. Indian govt generates massive data through Aadhaar, GSTN, UPI |
| Velocity | Speed of data generation/processing | Real-time or near real-time streams; need for rapid processing. Example: UPI processes millions of transactions per hour |
| Variety | Different data types and sources | Structured (databases), semi-structured (XML, JSON, logs), unstructured (images, videos, audio, social media posts) |
| Veracity | Data quality and reliability | Inconsistency, incompleteness, ambiguity, latency; "garbage in, garbage out" problem |
| Value | Extractable useful information | Converting raw data into actionable insights; the ultimate goal of big data analytics |

Extended Vs (Additional Characteristics)


2. Hadoop Ecosystem

Overview

Apache Hadoop is an open-source framework for distributed storage and processing of large datasets on commodity hardware.

Architecture

┌─────────────────────────────────────────────┐
│              HADOOP ECOSYSTEM               │
├────────────┬──────────────┬─────────────────┤
│  Storage   │  Processing  │  Management     │
│            │              │                 │
│  HDFS      │  MapReduce   │  YARN           │
│            │  Spark       │  Oozie          │
│            │  Pig         │  ZooKeeper      │
│            │  Hive        │  Ambari         │
├────────────┴─┬────────────┴────┬────────────┤
│  Data Access │  Data Transfer  │  NoSQL     │
│  Hive        │  Sqoop          │  HBase     │
│  Pig         │  Flume          │            │
│              │  Kafka          │            │
└──────────────┴─────────────────┴────────────┘

2.1 HDFS (Hadoop Distributed File System)

Design Principles:
- Store large files (GBs to TBs)
- Streaming data access (write-once, read-many)
- Runs on commodity hardware

Architecture:
- NameNode (Master): Manages file system namespace, metadata, block mapping, replication
- DataNode (Slave): Stores actual data blocks, serves read/write requests
- Secondary NameNode: Periodically merges fsimage and edit logs; NOT a backup for NameNode

Key Features:
- Block size: 128 MB by default (configurable; 256 MB is a common alternative)
- Replication factor: 3 (default) — data stored on 3 different nodes
- Rack-aware replication for fault tolerance
- Read Operation: Client gets block locations from NameNode, reads directly from DataNode
- Write Operation: Client requests NameNode, which provides DataNodes; data written in pipeline
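As a rough worked example of the block size and replication figures above, the sketch below estimates how many blocks a file occupies and how much raw cluster storage it consumes. The function name `hdfs_footprint` and the 1 GB file are assumptions made for illustration; they are not part of HDFS.

```python
import math

MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Estimate block count and raw storage for one file (illustrative only)."""
    # A file is split into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_bytes / block_size_bytes)
    # Every block is stored `replication` times on different DataNodes.
    block_replicas = num_blocks * replication
    # A partial last block only occupies its real size, so raw usage is
    # simply file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return num_blocks, block_replicas, raw_bytes

blocks, replicas, raw = hdfs_footprint(1024 * MB)  # a 1 GB file
print(blocks, replicas, raw // MB)  # 8 24 3072
```

With the defaults, a 1 GB file therefore needs 8 blocks, 24 block replicas, and about 3 GB of raw disk space across the cluster.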

HDFS Commands Example:

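hdfs dfs -mkdir /user/data            # create a directory in HDFS
hdfs dfs -put localfile /user/data/   # copy a local file into that directory
hdfs dfs -cat /user/data/localfile    # print the file's contents
hdfs dfs -ls /user/data               # list the directory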

2.2 MapReduce

(Detailed in Section 6)

Programming model for parallel processing of large datasets on distributed clusters.

2.3 YARN (Yet Another Resource Negotiator)

Purpose: Resource management and job scheduling in Hadoop 2.0+

Components:
| Component | Role |
|-----------|------|
| ResourceManager (RM) | Global resource scheduler; manages cluster resources |
| NodeManager (NM) | Per-node agent; manages containers on a single node |
| ApplicationMaster (AM) | Per-application; negotiates resources and tracks progress |
| Container | Resource allocation (CPU, memory) on a specific node |

Workflow:
1. Client submits application to ResourceManager
2. ResourceManager allocates container and starts ApplicationMaster
3. ApplicationMaster requests containers from ResourceManager
4. NodeManager launches containers and monitors execution
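The workflow above can be sketched as a toy simulation. This is not the YARN API: the class names mirror the components in the table, but the methods, the memory-only resource model, and the first-fit placement policy are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    name: str
    free_mem_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, mem_mb):
        # Step 4: the NodeManager launches the container on its node.
        self.free_mem_mb -= mem_mb
        container = (app_id, self.name, mem_mb)
        self.containers.append(container)
        return container

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_mb):
        # Toy policy: first node with enough free memory wins.
        # Real YARN schedulers are far more elaborate.
        for node in self.nodes:
            if node.free_mem_mb >= mem_mb:
                return node.launch(app_id, mem_mb)
        return None  # no capacity left anywhere in the cluster

    def submit(self, app_id, am_mem_mb, task_mem_mb, num_tasks):
        # Steps 1-2: accept the application and start its ApplicationMaster.
        am = self.allocate(app_id, am_mem_mb)
        if am is None:
            return []
        # Step 3: the ApplicationMaster asks for one container per task.
        tasks = [self.allocate(app_id, task_mem_mb) for _ in range(num_tasks)]
        return [am] + [t for t in tasks if t is not None]

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 4096)])
granted = rm.submit("app-1", am_mem_mb=1024, task_mem_mb=2048, num_tasks=3)
for container in granted:
    print(container)
```

In this run the ApplicationMaster and the first task fit on node1, and the remaining two tasks spill over to node2.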

2.4 Hive

2.5 Pig

2.6 HBase

2.7 Sqoop (SQL to Hadoop)

2.8 Flume

2.9 Oozie


3. Spark Architecture

Overview

Apache Spark is a fast, general-purpose cluster computing engine. It extends MapReduce with in-memory processing.

Key Advantages over Hadoop MapReduce

| Feature | MapReduce | Spark |
|---------|-----------|-------|
| Processing | Disk-based | In-memory (claimed up to 100x faster for some in-memory workloads) |
| Latency | Higher (batch) | Lower (supports streaming) |
| Programming | Map + Reduce only | Rich transformations |
| Ecosystem | Needs separate tools | Unified engine |
| Iterative ML | Poor (repeated disk I/O) | Excellent (caching in memory) |

Spark Architecture

┌──────────────────────────────────────────────┐
│              DRIVER PROGRAM                  │
│  (SparkContext, Main function, DAG Scheduler)│
└──────────────────┬───────────────────────────┘
                   │
                   ▼
┌──────────────────────────────────────────────┐
│            CLUSTER MANAGER                   │
│  (Standalone, YARN, Mesos, Kubernetes)       │
└──────────────────┬───────────────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
   ┌─────────┐┌─────────┐┌─────────┐
   │Executor ││Executor ││Executor │
   │(Worker) ││(Worker) ││(Worker) │
   └─────────┘└─────────┘└─────────┘

Core Components

3.1 RDD (Resilient Distributed Dataset)

3.2 DataFrames

3.3 Spark SQL

3.4 MLlib (Machine Learning Library)

3.5 GraphX

3.6 Spark Streaming (Structured Streaming)

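Data Source → Spark Streaming → DStream/Batches → Transformations → Output Sink

Note: DStreams belong to the original Spark Streaming API; Structured Streaming expresses the same micro-batch flow using DataFrames.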
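To make the micro-batch idea in this flow concrete, here is a minimal plain-Python sketch (it does not use Spark). The helper `micro_batches`, the sample events, and the one-second interval are assumptions for illustration only.

```python
from collections import defaultdict

def micro_batches(events, interval_s):
    """Group (timestamp_s, value) events into fixed-length time batches."""
    batches = defaultdict(list)
    for ts, value in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(ts // interval_s)].append(value)
    # Emit batches in time order, as a streaming engine would.
    return [batches[k] for k in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "c"), (2.7, "a")]
for batch in micro_batches(events, interval_s=1.0):
    # The "transformation" here is a per-batch count of each value.
    counts = {v: batch.count(v) for v in sorted(set(batch))}
    print(counts)
```

Each printed dictionary corresponds to one batch; a real engine would hand each batch to the configured output sink instead of printing it.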

4. NoSQL Databases

Overview

NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data storage and distributed processing.

Why NoSQL?

Types of NoSQL Databases

4.1 Document Databases (MongoDB)

| Feature | Description |
|---------|-------------|
| Data Model | JSON-like documents (BSON) |
| Schema | Flexible; documents in a collection can have different fields |
| Query | Rich query language, aggregation pipeline |
| Scaling | Horizontal via sharding |
| Use Case | Content management, catalogs, user profiles |
| Strengths | Flexible schema, rich queries, high performance |

MongoDB Example Structure:

{
  "_id": ObjectId("507f1f77bcf86cd799439011"),
  "name": "Manupal",
  "city": "Mumbai",
  "skills": ["Python", "Django", "ML"]
}

4.2 Column-Family Databases (Apache Cassandra)

| Feature | Description |
|---------|-------------|
| Data Model | Column families (table-like but columns can vary per row) |
| Partition Key | Determines data distribution across nodes |
| Schema | Flexible within column families |
| Scaling | Linear horizontal scaling, peer-to-peer architecture |
| Use Case | Time-series data, IoT, messaging, write-heavy workloads |
| Strengths | High write throughput, no single point of failure |

Cassandra Architecture:
- Peer-to-peer (no master-slave)
- Tunable consistency (ONE, QUORUM, ALL)
- Data distributed using consistent hashing
- Write path: Commit Log → MemTable → SSTable
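The consistent hashing mentioned above can be illustrated with a small hash ring. This is a sketch under stated assumptions, not Cassandra's implementation: the `HashRing` class, the use of MD5, and 8 virtual nodes per node are all choices made for the example.

```python
import bisect
import hashlib

def _hash(key):
    # Map any string to a position on the ring (a large integer).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Each node owns several points ("virtual nodes") on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, partition_key):
        # Walk clockwise to the first point at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._points, _hash(partition_key))
        return self._ring[idx % len(self._ring)][1]

ring3 = HashRing(["node-a", "node-b", "node-c"])
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(1000)]
moved = sum(ring3.node_for(k) != ring4.node_for(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after adding node-d")
```

The useful property is that adding a node only moves keys onto the new node; every other key stays where it was, which keeps rebalancing cheap compared with `hash(key) % number_of_nodes`.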

4.3 Key-Value Databases (Redis)

| Feature | Description |
|---------|-------------|
| Data Model | Key-value pairs |
| Data Types | Strings, Lists, Sets, Sorted Sets, Hashes, Streams |
| Storage | In-memory (optionally persisted to disk) |
| Scaling | Redis Cluster for horizontal scaling |
| Use Case | Caching, session management, leaderboards, real-time analytics |
| Strengths | Extremely fast (sub-millisecond), versatility |

4.4 Graph Databases (Neo4j)

| Feature | Description |
|---------|-------------|
| Data Model | Nodes, Relationships, Properties |
| Query Language | Cypher |
| Schema | Optional (schema-free or schema-optional) |
| Scaling | Causal clustering (community edition limited) |
| Use Case | Social networks, fraud detection, recommendation engines |
| Strengths | Efficient relationship queries, pattern matching |

Cypher Query Example:

 MATCH (p:Person)-[:FRIENDS_WITH]->(friend:Person)
 WHERE p.name = "Manupal"
 RETURN friend.name

NoSQL Comparison Summary

| Feature | MongoDB | Cassandra | Redis | Neo4j |
|---------|---------|-----------|-------|-------|
| Type | Document | Column-Family | Key-Value | Graph |
| Query Language | MQL | CQL | Redis commands | Cypher |
| Consistency | Strong | Tunable | Strong | Strong |
| Scaling | Sharding | Peer-to-peer | Cluster | Limited |
| Best For | Flexible schema | Write-heavy | Caching | Relationships |
| BASE/ACID | ACID (v4.0+) | BASE | ACID | ACID |

BASE vs ACID Properties

| Property | ACID (RDBMS) | BASE (NoSQL) |
|----------|--------------|--------------|
| Consistency | Strong | Eventual |
| Availability | May sacrifice for consistency | High priority |
| Partition Tolerance | Limited | Built-in |
| Examples | PostgreSQL, MySQL | Cassandra, DynamoDB |

BASE stands for:
- BA - Basically Available
- S - Soft state
- E - Eventual consistency

5. Stream Processing

Overview

Stream processing handles continuous data streams in real-time, processing data as it arrives rather than in batches.

Apache Kafka

| Aspect | Description |
|--------|-------------|
| Type | Distributed streaming platform / Message broker |
| Model | Publish-subscribe |
| Components | Topics, Partitions, Brokers, Producers, Consumers, Consumer Groups |
| Persistence | Messages persisted to disk and replicated |
| Use Case | Log aggregation, event sourcing, real-time analytics |

Producer → [Topic (Partition 0, Partition 1, ...)] → Consumer Group

Key Concepts:
- Topic: Category/feed name to which messages are published
- Partition: Topics split across brokers for parallelism
- Offset: Unique ID for each message within a partition
- Consumer Group: Multiple consumers work together to process partitions
- ZooKeeper/KRaft: Cluster coordination
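The relationship between topics, partitions, offsets, and consumer groups can be sketched in plain Python. This does not use a Kafka client: the `Topic` class, the CRC32-based partitioner, and the round-robin `assign` function are assumptions for illustration and differ from Kafka's real partitioner and assignment strategies.

```python
import zlib

class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Records with the same key always land in the same partition.
        # (CRC32 is an assumption for this sketch, not Kafka's partitioner.)
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        offset = len(self.partitions[p]) - 1
        return p, offset

def assign(num_partitions, consumers):
    # Round-robin assignment: each partition goes to exactly one consumer
    # in the group, so the group shares the work without duplicates.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

topic = Topic("payments", num_partitions=4)
for i in range(6):
    print(topic.publish(key=f"account-{i % 2}", value=f"txn-{i}"))
print(assign(4, ["consumer-1", "consumer-2"]))
```

Because ordering is only guaranteed within a partition, keying records by account keeps each account's transactions in order while still spreading load across partitions.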

Apache Storm

| Aspect | Description |
|--------|-------------|
| Type | Distributed real-time computation system |
| Concepts | Spouts (sources), Bolts (processing), Topology (DAG) |
| Guarantee | At-least-once processing (Trident: exactly-once) |
| Latency | Extremely low (sub-second) |
| Use Case | Real-time analytics, ETL, continuous monitoring |

Apache Flink

| Aspect | Description |
|--------|-------------|
| Type | Stream processing framework with batch capabilities |
| Model | True stream processing (not micro-batch) |
| Event Time | Handles out-of-order events using watermarks |
| State | Managed state for complex aggregations |
| Guarantee | Exactly-once semantics |
| Use Case | Real-time analytics, CEP (complex event processing), fraud detection |
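The way watermarks let a stream processor handle out-of-order events can be sketched as follows. This is a simplified model, not the Flink API: the function `tumbling_window_counts`, the bounded out-of-orderness rule (watermark = largest timestamp seen minus a fixed delay), and the sample timestamps are assumptions for illustration.

```python
def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per event-time window, firing windows as the watermark advances.

    `events` is an iterable of event timestamps in arrival order.
    """
    open_windows = {}   # window start -> count so far
    fired = []          # (window_start, count) in firing order
    dropped = []        # events that arrived after their window fired
    watermark = float("-inf")
    for ts in events:
        window_start = (ts // window_size) * window_size
        if window_start + window_size <= watermark:
            # The window this event belongs to has already been emitted.
            dropped.append(ts)
        else:
            open_windows[window_start] = open_windows.get(window_start, 0) + 1
        # The watermark trails the largest timestamp seen so far.
        watermark = max(watermark, ts - max_out_of_orderness)
        # Fire every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + window_size <= watermark:
                fired.append((start, open_windows.pop(start)))
    return fired, open_windows, dropped

arrivals = [1, 4, 3, 11, 9, 14, 2, 25]
fired, still_open, dropped = tumbling_window_counts(arrivals, 10, 3)
print(fired)       # [(0, 4), (10, 2)]
print(still_open)  # {20: 1}
print(dropped)     # [2]
```

The event with timestamp 9 arrives after 11 but is still counted in window [0, 10), because the watermark has not yet passed 10. The event with timestamp 2 arrives after that window has fired and is treated as late.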

Stream Processing Comparison

| Feature | Kafka | Storm | Flink |
|---------|-------|-------|-------|
| Type | Message Broker | Stream Processor | Stream Processor |
| Processing Model | N/A | Micro-batch/True streaming | True streaming |
| Latency | Low | Very low | Low |
| Throughput | Very high | Moderate | High |
| Exactly-once | Yes (transactions) | Trident only | Yes (native) |
| State management | Limited | Limited | Built-in |

6. MapReduce Programming Paradigm

Overview

MapReduce is a programming model for processing large datasets in parallel across a distributed cluster.

Two Phases

Input → Split → MAP → SHUFFLE & SORT → REDUCE → Output

| Phase | Description |
|-------|-------------|
| Map | Process input records, emit key-value pairs (k, v) |
| Shuffle & Sort | Group values by key, transfer to reducers |
| Reduce | Aggregate values for each key, emit final (k, v) |

Word Count Example

Input: "hello world hello"

Map Phase:

"hello world hello" → [(hello, 1), (world, 1), (hello, 1)]

Shuffle & Sort:

hello → [1, 1]
world → [1]

Reduce Phase:

hello → [1, 1] → (hello, 2)
world → [1] → (world, 1)
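The three phases of the word count example can be reproduced in a few lines of single-process Python. This is a sketch of the programming model only; the function names are assumptions, and a real Hadoop job would distribute these steps across many machines.

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the input record.
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    # Group all values by key, then order the keys.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Aggregate the values for one key.
    return key, sum(values)

mapped = map_phase("hello world hello")
grouped = shuffle_and_sort(mapped)
result = [reduce_phase(k, vs) for k, vs in grouped]
print(mapped)   # [('hello', 1), ('world', 1), ('hello', 1)]
print(grouped)  # [('hello', [1, 1]), ('world', [1])]
print(result)   # [('hello', 2), ('world', 1)]
```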

Key Concepts

MapReduce Job Execution (Hadoop)

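  1. JobClient submits job to JobTracker
  2. JobTracker assigns tasks to TaskTrackers
  3. Map Tasks read input splits, apply map function
  4. Sort & Shuffle moves intermediate data to reducers
  5. Reduce Tasks aggregate and write final output
  6. Output stored in HDFS

Note: JobTracker and TaskTracker are Hadoop 1.x (MRv1) components. From Hadoop 2.0 onward, YARN's ResourceManager, NodeManagers, and a per-job ApplicationMaster take over these roles (see Section 2.3).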

Limitations of MapReduce


7. CAP Theorem

Statement

A distributed data system can simultaneously provide at most two of three guarantees: Consistency, Availability, and Partition Tolerance.

Three Properties

| Property | Description |
|----------|-------------|
| Consistency (C) | All nodes see the same data at the same time |
| Availability (A) | Every request receives a response (success or failure) |
| Partition Tolerance (P) | System continues to operate despite network partitions |

Trade-offs

         Consistency
            /\
           /  \
          /    \
         /  CP  \
        /        \
       /    CA    \
      /            \
Availability -------- Partition Tolerance
              AP

| Combination | Description | Examples |
|-------------|-------------|----------|
| CA | Consistency + Availability (no network partition) | Traditional RDBMS (single node) |
| CP | Consistency + Partition Tolerance (sacrifices availability) | MongoDB, HBase, Redis Cluster |
| AP | Availability + Partition Tolerance (sacrifices consistency) | Cassandra, CouchDB, DynamoDB |
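The CP and AP trade-offs can be illustrated with a toy two-replica store. The `TwoNodeStore` class and its behaviour are assumptions made for this sketch; real systems use quorums, leader election, and conflict resolution rather than a single flag.

```python
class Replica:
    def __init__(self):
        self.value = None

class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True when the two nodes cannot talk

    def write(self, node, value):
        if not self.partitioned:
            # Healthy network: replicate to both nodes.
            for replica in self.replicas:
                replica.value = value
            return "ok"
        if self.mode == "CP":
            # Refuse the write rather than let the replicas diverge.
            return "unavailable"
        # AP: accept locally; the other replica is now stale.
        self.replicas[node].value = value
        return "ok"

    def read(self, node):
        return self.replicas[node].value

for mode in ("CP", "AP"):
    store = TwoNodeStore(mode)
    store.write(0, "v1")            # replicated to both nodes
    store.partitioned = True        # the network splits
    status = store.write(0, "v2")   # attempted during the partition
    print(mode, status, store.read(0), store.read(1))
# CP unavailable v1 v1
# AP ok v2 v1
```

During the partition the CP store stays consistent but rejects the write, while the AP store accepts it and lets the two replicas disagree until the partition heals.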

Important Notes


8. Lambda vs Kappa Architecture

8.1 Lambda Architecture

Purpose: Handle both batch and real-time data processing

                    ┌──────────────┐
                    │   SERVING    │
                    │    LAYER     │
                    └──────┬───────┘
                           │
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼────────┐      ┌────────▼────────┐
     │   BATCH LAYER   │      │  SPEED LAYER    │
     │ (Hadoop/Spark)  │      │(Storm/Spark     │
     │ Complete,       │      │ Streaming)      │
     │ accurate view   │      │ Real-time,      │
     └────────┬────────┘      │ approximate     │
              │               └────────┬────────┘
              └────────────┬───────────┘
                           │
                    ┌──────▼───────┐
                    │  DATA SOURCE │
                    └──────────────┘

Layers:
1. Batch Layer: Processes all historical data using Hadoop/Spark; produces accurate batch views
2. Speed Layer (Real-time): Processes recent data not yet in batch layer; produces real-time views
3. Serving Layer: Merges batch and real-time views; answers queries
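How the serving layer combines the two views can be sketched with page-view counts. The function names and the sample events are assumptions for illustration; in a real deployment the batch and speed layers would be separate systems with separate code.

```python
from collections import Counter

def batch_view(master_dataset):
    # Batch layer: recompute counts from all historical events.
    return Counter(event["page"] for event in master_dataset)

def realtime_view(recent_events):
    # Speed layer: count only events not yet absorbed by the batch layer.
    return Counter(event["page"] for event in recent_events)

def serve(page, batch, realtime):
    # Serving layer: answer a query by merging both views.
    return batch[page] + realtime[page]

historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
recent = [{"page": "/home"}, {"page": "/docs"}]
batch, realtime = batch_view(historical), realtime_view(recent)
print(serve("/home", batch, realtime))  # 3
print(serve("/docs", batch, realtime))  # 1
```

The two view functions are nearly identical here, which hints at the main drawback of Lambda: the same logic has to be written and kept in sync in two places.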

Advantages: Fault tolerant, handles both batch and streaming, immutable data
Disadvantages: Complex (two code bases), double computation, hard to maintain consistency between layers

8.2 Kappa Architecture

Purpose: Simplified alternative using only stream processing

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│     DATA     │────▶│   STREAM     │────▶│   SERVING    │
│    SOURCE    │     │  PROCESSING  │     │    LAYER     │
│  (Kafka)     │     │  (Flink/     │     │              │
└──────────────┘     │   Kafka      │     └──────────────┘
                     │  Streams)    │
                     └──────────────┘

Principles:
- Everything is a stream
- Data is stored in immutable log (e.g., Kafka)
- Batch = replaying historical stream data
- Single code base for all processing
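The principles above can be sketched with an in-memory append-only log standing in for Kafka. The `EventLog` class, the `running_totals` function, and the sample events are assumptions for illustration, not a Kafka or Flink API.

```python
class EventLog:
    """An append-only log standing in for a Kafka topic (illustrative)."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def read_from(self, offset=0):
        # Consumers can start at any offset, including the very beginning.
        return iter(self._events[offset:])

def running_totals(events):
    # The single processing function used for both live and replayed data.
    totals = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
    return totals

log = EventLog()
for event in [("user-1", 50), ("user-2", 20), ("user-1", 30)]:
    log.append(event)

live_view = running_totals(log.read_from(offset=2))     # only the newest event
rebuilt_view = running_totals(log.read_from(offset=0))  # full replay
print(live_view)     # {'user-1': 30}
print(rebuilt_view)  # {'user-1': 80, 'user-2': 20}
```

Reprocessing after a logic change means running the same function again from offset 0, which is what "batch = replaying historical stream data" refers to.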

Advantages: Simpler than Lambda, single code base, real-time by default
Disadvantages: Replaying full historical data can be expensive, stream processing can be more complex than batch

Comparison

| Feature | Lambda | Kappa |
|---------|--------|-------|
| Complexity | High (two code bases) | Lower (single code base) |
| Batch Processing | Native | Via stream replay |
| Real-time | Separate speed layer | Native |
| Data Storage | HDFS + real-time store | Event log only |
| Maintenance | Difficult | Easier |
| Use Case | When batch is primary | When stream is primary |

9. Big Data Analytics Tools

Hadoop Ecosystem Tools (covered in Section 2)

Additional Analytics Tools

| Tool | Category | Description |
|------|----------|-------------|
| Apache Zeppelin | Visualization | Web-based notebook for interactive data analytics |
| Tableau | BI/Visualization | Drag-and-drop dashboard creation |
| Power BI | BI/Visualization | Microsoft's business analytics platform |
| Apache Superset | BI/Open Source | Modern data exploration and visualization |
| Elasticsearch | Search/Analytics | Distributed search and analytics engine |
| Kibana | Visualization | Data visualization for Elasticsearch |
| Jupyter Notebook | Analytics | Interactive computing environment for data science |
| Grafana | Monitoring | Open-source visualization and monitoring |

10. Data Governance and Privacy in Big Data

Data Governance

Framework for managing the availability, usability, integrity, and security of data.

Key Components

| Component | Description |
|-----------|-------------|
| Data Quality | Accuracy, completeness, consistency, timeliness |
| Data Stewardship | Assigning responsibility for data management |
| Data Catalog | Inventory of data sources, definitions, and relationships |
| Data Lineage | Tracking data from source through transformations to consumption |
| Master Data Management (MDM) | Single source of truth for key business entities |
| Data Security | Access controls, encryption, masking, auditing |

Data Privacy

Key Challenges

Privacy Techniques

| Technique | Description |
|-----------|-------------|
| Anonymization | Remove PII (Personally Identifiable Information) |
| Pseudonymization | Replace PII with artificial identifiers |
| Encryption | Encrypt data at rest and in transit |
| Access Control | Role-based access (RBAC) to sensitive data |
| Data Masking | Hide sensitive data while maintaining format |
| Differential Privacy | Add statistical noise to query results |
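Three of the techniques in the table (pseudonymization, data masking, and differential privacy) lend themselves to a short sketch. The function names, the demo key, the sample identifier, and the choice of the Laplace mechanism for a counting query are assumptions for illustration; production systems need proper key management and a privacy budget.

```python
import hashlib
import hmac
import random

def pseudonymize(value, secret_key):
    # Replace an identifier with a keyed hash: the same input always maps
    # to the same token, but the token cannot be reversed without the key.
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def mask(value, visible=4, fill="X"):
    # Hide every letter or digit except those in the last `visible`
    # characters; separators stay so the format is preserved.
    cutoff = max(len(value) - visible, 0)
    return "".join(
        fill if i < cutoff and ch.isalnum() else ch
        for i, ch in enumerate(value)
    )

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(true_count, epsilon, rng):
    # A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)

key = b"demo-secret"  # hypothetical key, for illustration only
print(pseudonymize("1234-5678-9012", key))
print(mask("1234-5678-9012"))  # XXXX-XXXX-9012
print(dp_count(100, epsilon=1.0, rng=random.Random(0)))
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy.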

Key Regulations

| Regulation | Region | Key Points |
|------------|--------|------------|
| GDPR | EU | Right to be forgotten, consent, data portability |
| DPDP Act 2023 | India | Consent-based processing, data fiduciary obligations, right to erasure |
| CCPA | California, US | Consumer rights to know, delete, opt-out |
| HIPAA | US | Health data protection |

India's Digital Personal Data Protection Act (DPDP) 2023 — Key Points


Key Formulas and Theorems Summary

| Concept | Formula/Statement |
|---------|-------------------|
| CAP Theorem | Only 2 of {C, A, P} guaranteed |
| PACELC | If Partition → A or C; Else → L or C |
| Data Replication | Default HDFS replication factor = 3 |
| Consistent Hashing | Used for distributed data partitioning |

Exam Tips


Practice Questions

11 MCQs for Big Data Analysis with detailed explanations.

Q1. Which of the following best describes Apache Spark?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: a fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing.

This concept is covered under Big Data Analysis in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q2. Which statement correctly describes an HDFS write operation?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B: the client sends a request to the NameNode, which provides the DataNodes; the data is then written in a pipeline.



Q3. Regarding the concept "provides random, real-time read/write access to big data", which statement is correct?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: it provides random, real-time read/write access to big data.



Q4. Which statement about the DataNode (Slave) is correct?

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D, which describes the DataNode (Slave): it stores the actual data blocks and serves read/write requests (see Section 2.1).



Q5. Lambda architecture has two processing layers; which of the following best describes Kappa architecture?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B: stream-only.



Q6. Which statement correctly describes Veracity, one of the 5 Vs?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: data quality and reliability, covering inconsistency, incompleteness, ambiguity, and latency (the "garbage in, garbage out" problem).



Q7. Which of the following best describes Brewer's revised view of the "2 out of 3" formulation of the CAP theorem?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A: it is oversimplified; systems can provide partial guarantees of all three.



Q8. Which statement about the HDFS read operation is correct?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A, which describes the read operation: the client gets block locations from the NameNode and reads directly from the DataNode (see Section 2.1).



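Q9. Which statement about the HDFS design principles is correct?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C, which lists the design principles: store large files (GBs to TBs), streaming data access (write-once, read-many), and running on commodity hardware (see Section 2.1).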



Q10. Which statement about the Secondary NameNode is correct?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C, which describes the Secondary NameNode: it periodically merges the fsimage and edit logs and is NOT a backup for the NameNode (see Section 2.1).



Q11. Which of the following best describes NoSQL (Not Only SQL) databases?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C: non-relational databases designed for large-scale data storage and distributed processing.
