Big Data Analysis
Table of Contents
- Big Data Characteristics (5 Vs)
- Hadoop Ecosystem
- Spark Architecture
- NoSQL Databases
- Stream Processing
- MapReduce Programming Paradigm
- CAP Theorem
- Lambda vs Kappa Architecture
- Big Data Analytics Tools
- Data Governance and Privacy in Big Data
1. Big Data Characteristics (5 Vs)
Overview
Big Data refers to datasets that are too large or complex for traditional data processing applications. The 5 Vs define its core characteristics:
The 5 Vs
| V | Description | Details |
|---|---|---|
| Volume | Massive scale of data | Terabytes to petabytes; generated from social media, IoT sensors, transactions, logs. Indian govt generates massive data through Aadhaar, GSTN, UPI |
| Velocity | Speed of data generation/processing | Real-time or near real-time streams; need for rapid processing. Example: UPI processes millions of transactions per hour |
| Variety | Different data types and sources | Structured (databases), semi-structured (XML, JSON, logs), unstructured (images, videos, audio, social media posts) |
| Veracity | Data quality and reliability | Inconsistency, incompleteness, ambiguity, latency; "garbage in, garbage out" problem |
| Value | Extractable useful information | Converting raw data into actionable insights; the ultimate goal of big data analytics |
Extended Vs (Additional Characteristics)
- Variability: Inconsistency in data flow rates (seasonal spikes)
- Visualization: Presenting data in a meaningful and understandable format
- Viscosity: Resistance to flow of data between systems
2. Hadoop Ecosystem
Overview
Apache Hadoop is an open-source framework for distributed storage and processing of large datasets on commodity hardware.
Architecture
┌─────────────────────────────────────────────┐
│ HADOOP ECOSYSTEM │
├────────────┬──────────────┬─────────────────┤
│ Storage │ Processing │ Management │
│ │ │ │
│ HDFS │ MapReduce │ YARN │
│ │ Spark │ Oozie │
│ │ Pig │ ZooKeeper │
│ │ Hive │ Ambari │
├────────────┼──────────────┼─────────────────┤
│ Data Access │ Data Transfer │ NoSQL │
│ Hive │ Sqoop │ HBase │
│ Pig │ Flume │ │
│ │ Kafka │ │
└────────────┴──────────────┴─────────────────┘
2.1 HDFS (Hadoop Distributed File System)
Design Principles:
- Store large files (GBs to TBs)
- Streaming data access (write-once, read-many)
- Runs on commodity hardware
Architecture:
- NameNode (Master): Manages file system namespace, metadata, block mapping, replication
- DataNode (Slave): Stores actual data blocks, serves read/write requests
- Secondary NameNode: Periodically merges fsimage and edit logs; NOT a backup for NameNode
Key Features:
- Block size: 128 MB by default (Hadoop 2+); configurable, commonly raised to 256 MB
- Replication factor: 3 (default) — data stored on 3 different nodes
- Rack-aware replication for fault tolerance
- Read Operation: Client gets block locations from NameNode, reads directly from DataNode
- Write Operation: Client requests NameNode, which provides DataNodes; data written in pipeline
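The block size and replication factor above translate into simple arithmetic. The sketch below is only an illustration: the 1000 MB file size is a hypothetical input and `hdfs_footprint` is a helper name invented here, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # default block size from the notes above
REPLICATION_FACTOR = 3   # default replication factor from the notes above


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB across the cluster)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The final block only occupies as much disk as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_storage_mb


blocks, raw = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(blocks, raw)                  # 8 blocks, 3000 MB of raw storage
```

Each of the 8 blocks is stored on 3 different DataNodes, which is why the cluster needs roughly three times the logical file size.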
HDFS Commands Example:
hdfs dfs -mkdir /user/data
hdfs dfs -put localfile /user/data/
hdfs dfs -cat /user/data/localfile
hdfs dfs -ls /user/data
2.2 MapReduce
(Detailed in Section 6)
Programming model for parallel processing of large datasets on distributed clusters.
2.3 YARN (Yet Another Resource Negotiator)
Purpose: Resource management and job scheduling in Hadoop 2.0+
Components:
| Component | Role |
|-----------|------|
| ResourceManager (RM) | Global resource scheduler; manages cluster resources |
| NodeManager (NM) | Per-node agent; manages containers on a single node |
| ApplicationMaster (AM) | Per-application; negotiates resources and tracks progress |
| Container | Resource allocation (CPU, memory) on a specific node |
Workflow:
1. Client submits application to ResourceManager
2. ResourceManager allocates container and starts ApplicationMaster
3. ApplicationMaster requests containers from ResourceManager
4. NodeManager launches containers and monitors execution
2.4 Hive
- Data warehouse infrastructure built on Hadoop
- HiveQL: SQL-like query language for HDFS data
- Converts queries into MapReduce/Spark/Tez jobs
- Supports partitions and buckets for query optimization
- Metastore: Stores table schemas and metadata
- Use Case: Batch processing, ad-hoc queries, ETL
2.5 Pig
- High-level platform for creating MapReduce programs
- Pig Latin: Data flow language (procedural)
- Supports loading, filtering, joining, grouping, sorting data
- Use Case: ETL pipelines, exploratory data analysis, iterative processing
2.6 HBase
- NoSQL column-oriented database on top of HDFS
- Inspired by Google's BigTable
- Provides random real-time read/write access to big data
- Column-family storage with automatic sharding
- Use Case: When you need real-time read/write access to large datasets (e.g., time-series data, messaging)
2.7 Sqoop (SQL to Hadoop)
- Imports/Exports data between relational databases and Hadoop
- Uses MapReduce under the hood
sqoop import --connect jdbc:mysql://host/db --table employees --target-dir /data/employees
sqoop export --connect jdbc:mysql://host/db --table employees --export-dir /data/employees
2.8 Flume
- Distributed service for collecting, aggregating, and moving large log data
- Components: Source → Channel → Sink
- Source: Receives data (e.g., Avro, Kafka, HTTP)
- Channel: Stores data temporarily (memory or file-based)
- Sink: Delivers data to destination (e.g., HDFS, HBase)
2.9 Oozie
- Workflow scheduler for Hadoop jobs
- Manages dependencies between jobs (MapReduce, Pig, Hive, Sqoop)
- Workflow: Directed Acyclic Graph (DAG) of actions
- Coordinator: Schedules workflows based on time or data availability
3. Spark Architecture
Overview
Apache Spark is a fast, general-purpose cluster computing engine. It extends MapReduce with in-memory processing.
Key Advantages over Hadoop MapReduce
| Feature | MapReduce | Spark |
|---|---|---|
| Processing | Disk-based | In-memory (up to 100x faster for some in-memory workloads) |
| Latency | Higher (batch) | Lower (supports streaming) |
| Programming | Map + Reduce only | Rich transformations |
| Ecosystem | Needs separate tools | Unified engine |
| Iterative ML | Poor (repeated disk I/O) | Excellent (caching in memory) |
Spark Architecture
┌──────────────────────────────────────────────┐
│ DRIVER PROGRAM │
│ (SparkContext, Main function, DAG Scheduler)│
└──────────────────┬───────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ CLUSTER MANAGER │
│ (Standalone, YARN, Mesos, Kubernetes) │
└──────────────────┬───────────────────────────┘
│
┌──────────┼──────────┐
▼ ▼ ▼
┌─────────┐┌─────────┐┌─────────┐
│Executor ││Executor ││Executor │
│(Worker) ││(Worker) ││(Worker) │
└─────────┘└─────────┘└─────────┘
Core Components
3.1 RDD (Resilient Distributed Dataset)
- Fundamental data structure of Spark
- Immutable, distributed collection of objects
- Fault-tolerant through lineage (tracks transformations to rebuild lost data)
- Operations:
- Transformations (lazy): map(), filter(), flatMap(), reduceByKey(), join()
- Actions (trigger execution): count(), collect(), saveAsTextFile(), take()
- Persistence: persist() or cache() to keep an RDD in memory
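To make "lazy transformations" and "lineage" concrete, here is a toy model in plain Python. It is not Spark code: `MiniRDD` is a hypothetical class written for these notes that only mimics how transformations are recorded and then replayed when an action runs.

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, lazily evaluated, rebuilt from lineage."""

    def __init__(self, source, lineage=()):
        self._source = list(source)
        self._lineage = tuple(lineage)  # ordered (kind, function) steps

    def map(self, fn):
        # Transformation: returns a new MiniRDD and does no work yet.
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Transformation: also lazy, only extends the lineage.
        return MiniRDD(self._source, self._lineage + (("filter", fn),))

    def collect(self):
        # Action: replays the recorded lineage over the source data.
        data = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        # Action: triggers the same replay, then counts the results.
        return len(self.collect())


numbers = MiniRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [4, 16]
print(numbers.collect())       # [1, 2, 3, 4, 5], the original is unchanged
```

Because the result can always be recomputed from the source plus the lineage, a lost partition can be rebuilt without keeping replicas of intermediate data, which is the idea behind RDD fault tolerance.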
3.2 DataFrames
- Distributed collection of data organized into named columns
- Similar to a relational table or a Pandas DataFrame
- Uses Catalyst Optimizer for query optimization
- More efficient than RDDs due to schema awareness
3.3 Spark SQL
- Module for working with structured data
- Supports SQL queries and DataFrame API
- Integrates with Hive metastore
- Example:
spark.sql("SELECT * FROM employees WHERE salary > 50000")
3.4 MLlib (Machine Learning Library)
- Scalable machine learning library
- Algorithms: Classification, regression, clustering, collaborative filtering
- Utilities: Feature extraction, transformation, pipeline construction
- Distributed computation across cluster
3.5 GraphX
- API for graphs and graph-parallel computation
- Built on top of RDDs
- Supports graph algorithms: PageRank, connected components, triangle counting
3.6 Spark Streaming (Structured Streaming)
- Real-time stream processing
- Processes data in micro-batches (or continuous processing)
- Supports sources: Kafka, Flume, HDFS, Socket
- Integration with DataFrames API enables SQL queries on streaming data
Data Source → Spark Streaming → DStream/Batches → Transformations → Output Sink
4. NoSQL Databases
Overview
NoSQL (Not Only SQL) databases are non-relational databases designed for large-scale data storage and distributed processing.
Why NoSQL?
- Handle massive volumes of data (horizontal scaling)
- Flexible schema (schema-less or dynamic schema)
- High availability and fault tolerance
- Better performance for specific use cases (key-value lookups, graph traversals)
Types of NoSQL Databases
4.1 Document Databases (MongoDB)
| Feature | Description |
|---|---|
| Data Model | JSON-like documents (BSON) |
| Schema | Flexible; documents in a collection can have different fields |
| Query | Rich query language, aggregation pipeline |
| Scaling | Horizontal via sharding |
| Use Case | Content management, catalogs, user profiles |
| Strengths | Flexible schema, rich queries, high performance |
MongoDB Example Structure:
{
"_id": ObjectId("507f1f77bcf86cd799439011"),
"name": "Manupal",
"city": "Mumbai",
"skills": ["Python", "Django", "ML"]
}
4.2 Column-Family Databases (Apache Cassandra)
| Feature | Description |
|---|---|
| Data Model | Column families (table-like but columns can vary per row) |
| Partition Key | Determines data distribution across nodes |
| Schema | Flexible within column families |
| Scaling | Linear horizontal scaling, peer-to-peer architecture |
| Use Case | Time-series data, IoT, messaging, write-heavy workloads |
| Strengths | High write throughput, no single point of failure |
Cassandra Architecture:
- Peer-to-peer (no master-slave)
- Tunable consistency (ONE, QUORUM, ALL)
- Data distributed using consistent hashing
- Write path: Commit Log → MemTable → SSTable
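Consistent hashing, mentioned above, can be sketched in a few lines. This is a minimal illustration under stated assumptions: `HashRing`, the node names, and the virtual-node count are invented for the example, and MD5 is only a convenient standard-library stand-in, not a claim about Cassandra's partitioner.

```python
import bisect
import hashlib


def _token(value):
    """Map a string to a position on the ring (assumption: MD5 as the hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with a few virtual nodes per physical node."""

    def __init__(self, nodes, vnodes=8):
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, partition_key):
        # Walk clockwise to the first token after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._tokens, _token(partition_key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
smaller = HashRing(["node-a", "node-b"])  # the same ring after node-c leaves

keys = [f"user-{i}" for i in range(200)]
moved = [k for k in keys if ring.node_for(k) != smaller.node_for(k)]
print(len(moved), "of", len(keys), "keys changed owner")
```

Only keys that were owned by the departed node change owner; everything else stays where it was, which is what makes adding and removing nodes cheap compared with `hash(key) % node_count`.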
4.3 Key-Value Databases (Redis)
| Feature | Description |
|---|---|
| Data Model | Key-value pairs |
| Data Types | Strings, Lists, Sets, Sorted Sets, Hashes, Streams |
| Storage | In-memory (optionally persisted to disk) |
| Scaling | Redis Cluster for horizontal scaling |
| Use Case | Caching, session management, leaderboards, real-time analytics |
| Strengths | Extremely fast (sub-millisecond), versatility |
4.4 Graph Databases (Neo4j)
| Feature | Description |
|---|---|
| Data Model | Nodes, Relationships, Properties |
| Query Language | Cypher |
| Schema | Optional (schema-free or schema-optional) |
| Scaling | Causal clustering (community edition limited) |
| Use Case | Social networks, fraud detection, recommendation engines |
| Strengths | Efficient relationship queries, pattern matching |
Cypher Query Example:
MATCH (p:Person)-[:FRIENDS_WITH]->(friend:Person)
WHERE p.name = "Manupal"
RETURN friend.name
NoSQL Comparison Summary
| Feature | MongoDB | Cassandra | Redis | Neo4j |
|---|---|---|---|---|
| Type | Document | Column-Family | Key-Value | Graph |
| Query Language | MQL | CQL | Redis commands | Cypher |
| Consistency | Strong | Tunable | Strong | Strong |
| Scaling | Sharding | Peer-to-peer | Cluster | Limited |
| Best For | Flexible schema | Write-heavy | Caching | Relationships |
| BASE/ACID | ACID (v4.0+) | BASE | ACID | ACID |
BASE vs ACID Properties
| Property | ACID (RDBMS) | BASE (NoSQL) |
|---|---|---|
| Consistency | Strong | Eventual |
| Availability | May sacrifice for consistency | High priority |
| Partition Tolerance | Limited | Built-in |
| Examples | PostgreSQL, MySQL | Cassandra, DynamoDB |
| B | - | Basically Available |
| S | - | Soft state |
| E | - | Eventual consistency |
5. Stream Processing
Overview
Stream processing handles continuous data streams in real-time, processing data as it arrives rather than in batches.
Apache Kafka
| Aspect | Description |
|---|---|
| Type | Distributed streaming platform / Message broker |
| Model | Publish-subscribe |
| Components | Topics, Partitions, Brokers, Producers, Consumers, Consumer Groups |
| Persistence | Messages persisted to disk and replicated |
| Use Case | Log aggregation, event sourcing, real-time analytics |
Producer → [Topic (Partition 0, Partition 1, ...)] → Consumer Group
Key Concepts:
- Topic: Category/feed name to which messages are published
- Partition: Topics split across brokers for parallelism
- Offset: Unique ID for each message within a partition
- Consumer Group: Multiple consumers work together to process partitions
- ZooKeeper/KRaft: Cluster coordination
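The relationship between keys, partitions, offsets, and consumer groups can be shown with a toy in-memory model. Nothing here is the Kafka client API: `MiniTopic` and `assign_partitions` are hypothetical names, CRC32 stands in for the real key hash, and round-robin is just one simple assignment strategy.

```python
import zlib


class MiniTopic:
    """Toy in-memory topic; each partition is an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Assumption: CRC32 stands in for the client's real key hash.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append((key, value))
        offset = len(log) - 1  # position of the message within its partition
        return partition, offset


def assign_partitions(num_partitions, consumers):
    """Hypothetical round-robin assignment of partitions to group members."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


topic = MiniTopic(num_partitions=3)
first = topic.produce("user-42", "login")
second = topic.produce("user-42", "purchase")
print(first, second)                       # same partition, consecutive offsets
print(assign_partitions(4, ["c1", "c2"]))  # {'c1': [0, 2], 'c2': [1, 3]}
```

Messages with the same key land in the same partition, so their relative order is preserved, and each partition is read by exactly one consumer within a group.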
Apache Storm
| Aspect | Description |
|---|---|
| Type | Distributed real-time computation system |
| Concepts | Spouts (sources), Bolts (processing), Topology (DAG) |
| Guarantee | At-least-once processing (Trident: exactly-once) |
| Latency | Extremely low (sub-second) |
| Use Case | Real-time analytics, ETL, continuous monitoring |
Apache Flink
| Aspect | Description |
|---|---|
| Type | Stream processing framework with batch capabilities |
| Model | True stream processing (not micro-batch) |
| Event Time | Handles out-of-order events using watermarks |
| State | Managed state for complex aggregations |
| Guarantee | Exactly-once semantics |
| Use Case | Real-time analytics, CEP (complex event processing), fraud detection |
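Watermarks are easier to follow with a worked example. The sketch below is a toy model in plain Python, not Flink's API: `tumbling_counts` is a hypothetical function, timestamps are bare integers, and the watermark is assumed to trail the largest timestamp seen by a fixed out-of-orderness bound.

```python
from collections import defaultdict


def tumbling_counts(timestamps, window_size, max_out_of_orderness):
    """Count events per tumbling window, closing windows as the watermark advances.

    Returns (closed_windows, still_open), where closed_windows is a list of
    (window_start, count) pairs. Events whose window has already closed are dropped.
    """
    open_windows = defaultdict(int)
    closed = []
    watermark = float("-inf")
    for ts in timestamps:  # arrival order, possibly out of event-time order
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            continue  # late event: its window was already emitted
        open_windows[start] += 1
        watermark = max(watermark, ts - max_out_of_orderness)
        for window_start in sorted(open_windows):
            if window_start + window_size <= watermark:
                closed.append((window_start, open_windows.pop(window_start)))
    return closed, dict(open_windows)


arrivals = [1, 4, 3, 12, 9, 15, 2]  # hypothetical event times in arrival order
result = tumbling_counts(arrivals, window_size=10, max_out_of_orderness=3)
print(result)  # ([(0, 4)], {10: 2})
```

The event with timestamp 9 arrives after 12 but is still counted, because the watermark had only reached 9 at that point. The event with timestamp 2 arrives after the watermark passed 10, so window [0, 10) is already closed and the event is dropped.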
Stream Processing Comparison
| Feature | Kafka | Storm | Flink |
|---|---|---|---|
| Type | Message Broker | Stream Processor | Stream Processor |
| Processing Model | N/A | True streaming (Trident: micro-batch) | True streaming |
| Latency | Low | Very low | Low |
| Throughput | Very high | Moderate | High |
| Exactly-once | Yes (transactions) | Trident only | Yes (native) |
| State management | Limited | Limited | Built-in |
6. MapReduce Programming Paradigm
Overview
MapReduce is a programming model for processing large datasets in parallel across a distributed cluster.
Two Phases
Input → Split → MAP → SHUFFLE & SORT → REDUCE → Output
| Phase | Description |
|---|---|
| Map | Process input records, emit key-value pairs (k, v) |
| Shuffle & Sort | Group values by key, transfer to reducers |
| Reduce | Aggregate values for each key, emit final (k, v) |
Word Count Example
Input: "hello world hello"
Map Phase:
"hello world hello" → [(hello, 1), (world, 1), (hello, 1)]
Shuffle & Sort:
hello → [1, 1]
world → [1]
Reduce Phase:
hello → [1, 1] → (hello, 2)
world → [1] → (world, 1)
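The same word count can be traced in a few lines of single-process Python. This is a sketch of the three phases only; the function names are chosen for illustration and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Group values by key and order the keys, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Sum the list of values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


pairs = map_phase("hello world hello")
grouped = shuffle_and_sort(pairs)
counts = reduce_phase(grouped)
print(pairs)    # [('hello', 1), ('world', 1), ('hello', 1)]
print(grouped)  # {'hello': [1, 1], 'world': [1]}
print(counts)   # {'hello': 2, 'world': 1}
```

On a real cluster the map calls run in parallel on different input splits and the shuffle moves data over the network, but the data shapes at each step match the ones printed here.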
Key Concepts
- InputFormat: Splits input data and creates key-value pairs for mappers
- Combiner: Mini-reducer that runs on mapper node (optimization)
- Partitioner: Determines which reducer receives which key (default: hash partitioning)
- OutputFormat: Writes final output to storage
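The combiner and partitioner can be illustrated the same way. In this sketch `combine` and `partition` are hypothetical helpers, and CRC32 is an assumed stand-in for the key hash (chosen because Python's built-in string hash changes between runs).

```python
import zlib
from collections import Counter


def combine(mapper_output):
    """Mini-reducer on one mapper: pre-sum counts per key before the shuffle."""
    local = Counter()
    for key, value in mapper_output:
        local[key] += value
    return sorted(local.items())


def partition(key, num_reducers):
    """Hash partitioning: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("hello", 1), ("world", 1), ("hello", 1)]
combined = combine(mapper_output)
print(combined)  # [('hello', 2), ('world', 1)]

for key, value in combined:
    print(key, value, "-> reducer", partition(key, num_reducers=4))
```

The combiner shrinks three intermediate records to two before anything crosses the network. It is only safe when the reduce function is associative and commutative, as summation is.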
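MapReduce Job Execution (Hadoop 1.x)
- JobClient submits job to JobTracker (in Hadoop 2+, YARN's ResourceManager and a per-job ApplicationMaster take over this role)
- JobTracker assigns tasks to TaskTrackers (replaced by NodeManagers under YARN)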
- Map Tasks read input splits, apply map function
- Sort & Shuffle moves intermediate data to reducers
- Reduce Tasks aggregate and write final output
- Output stored in HDFS
Limitations of MapReduce
- Multiple disk I/O operations (map output → disk → reduce read)
- Not suitable for iterative algorithms (machine learning)
- Not suitable for real-time processing
- High latency for interactive queries
7. CAP Theorem
Statement
A distributed data system can provide only two out of three guarantees: Consistency, Availability, and Partition Tolerance.
Three Properties
| Property | Description |
|---|---|
| Consistency (C) | All nodes see the same data at the same time |
| Availability (A) | Every request to a non-failing node receives a non-error response, though not necessarily the most recent data |
| Partition Tolerance (P) | System continues to operate despite network partitions |
Trade-offs
(Diagram: a triangle with Consistency, Availability, and Partition Tolerance at its corners. Each side names the pair of guarantees it joins: CA, CP, and AP.)
| Combination | Description | Examples |
|---|---|---|
| CA | Consistency + Availability (no network partition) | Traditional RDBMS (single node) |
| CP | Consistency + Partition Tolerance (sacrifices availability) | MongoDB, HBase, Redis Cluster |
| AP | Availability + Partition Tolerance (sacrifices consistency) | Cassandra, CouchDB, DynamoDB |
Important Notes
- Network partitions are inevitable in distributed systems, so practically the choice is between CP and AP
- Brewer's revised view: The "2 out of 3" is oversimplified; systems can provide partial guarantees of all three
- PACELC Theorem extends CAP: In case of Partition (P), choose between Availability (A) and Consistency (C); Else (E), choose between Latency (L) and Consistency (C)
8. Lambda vs Kappa Architecture
8.1 Lambda Architecture
Purpose: Handle both batch and real-time data processing
┌──────────────┐
│ SERVING │
│ LAYER │
└──────┬───────┘
│
┌────────────┴────────────┐
│ │
┌────────▼────────┐ ┌────────▼────────┐
│ BATCH LAYER │ │ SPEED LAYER │
│ (Hadoop/Spark) │ │(Storm/Spark │
│ Complete, │ │ Streaming) │
│ accurate view │ │ Real-time, │
└────────┬────────┘ │ approximate │
│ └────────┬────────┘
└────────────┬───────────┘
│
┌──────▼───────┐
│ DATA SOURCE │
└──────────────┘
Layers:
1. Batch Layer: Processes all historical data using Hadoop/Spark; produces accurate batch views
2. Speed Layer (Real-time): Processes recent data not yet in batch layer; produces real-time views
3. Serving Layer: Merges batch and real-time views; answers queries
Advantages: Fault tolerant, handles both batch and streaming, immutable data
Disadvantages: Complex (two code bases), double computation, hard to maintain consistency between layers
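How the serving layer combines the two views can be sketched as below. This is a simplified illustration: the page-view counts are made-up sample data and `serve` is a hypothetical helper, assuming both views hold additive counters.

```python
def serve(batch_view, realtime_view):
    """Merge the batch view with the increments seen since the last batch run."""
    merged = dict(batch_view)
    for key, delta in realtime_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Hypothetical page-view counts.
batch_view = {"page_a": 100, "page_b": 40}  # computed by the last batch job
realtime_view = {"page_a": 3, "page_c": 1}  # events that arrived afterwards

answer = serve(batch_view, realtime_view)
print(answer)  # {'page_a': 103, 'page_b': 40, 'page_c': 1}
```

When the next batch run finishes, its view replaces `batch_view` and the speed layer discards the increments that the batch has now absorbed. Keeping the two computations in agreement is where the "two code bases" cost comes from.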
8.2 Kappa Architecture
Purpose: Simplified alternative using only stream processing
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ DATA │────▶│ STREAM │────▶│ SERVING │
│ SOURCE │ │ PROCESSING │ │ LAYER │
│ (Kafka) │ │ (Flink/ │ │ │
└──────────────┘ │ Kafka │ └──────────────┘
│ Streams) │
└──────────────┘
Principles:
- Everything is a stream
- Data is stored in immutable log (e.g., Kafka)
- Batch = replaying historical stream data
- Single code base for all processing
Advantages: Simpler than Lambda, single code base, real-time by default
Disadvantages: Replaying full historical data can be expensive, stream processing can be more complex than batch
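The "batch = replaying the stream" principle can be sketched with a list standing in for the immutable log. The event fields and the `build_view` function are assumptions made for this example, not part of Kafka or Flink.

```python
def build_view(log, from_offset=0, measure=lambda event: event["amount"]):
    """Fold events from the log into a per-user view, starting at from_offset."""
    view = {}
    for event in log[from_offset:]:
        user = event["user"]
        view[user] = view.get(user, 0) + measure(event)
    return view


# Hypothetical immutable event log (in practice, a Kafka topic).
log = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

totals = build_view(log)                                  # current logic
event_counts = build_view(log, measure=lambda event: 1)   # new logic, full replay
print(totals)        # {'a': 17, 'b': 5}
print(event_counts)  # {'a': 2, 'b': 1}
```

Changing the processing logic means deploying the new version, replaying the log from offset 0 into a fresh view, and switching readers over once it has caught up. There is no separate batch code path to keep in sync.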
Comparison
| Feature | Lambda | Kappa |
|---|---|---|
| Complexity | High (two code bases) | Lower (single code base) |
| Batch Processing | Native | Via stream replay |
| Real-time | Separate speed layer | Native |
| Data Storage | HDFS + real-time store | Event log only |
| Maintenance | Difficult | Easier |
| Use Case | When batch is primary | When stream is primary |
9. Big Data Analytics Tools
Hadoop Ecosystem Tools (covered in Section 2)
Additional Analytics Tools
| Tool | Category | Description |
|---|---|---|
| Apache Zeppelin | Visualization | Web-based notebook for interactive data analytics |
| Tableau | BI/Visualization | Drag-and-drop dashboard creation |
| Power BI | BI/Visualization | Microsoft's business analytics platform |
| Apache Superset | BI/Open Source | Modern data exploration and visualization |
| Elasticsearch | Search/Analytics | Distributed search and analytics engine |
| Kibana | Visualization | Data visualization for Elasticsearch |
| Jupyter Notebook | Analytics | Interactive computing environment for data science |
| Grafana | Monitoring | Open-source visualization and monitoring |
10. Data Governance and Privacy in Big Data
Data Governance
Framework for managing the availability, usability, integrity, and security of data.
Key Components
| Component | Description |
|---|---|
| Data Quality | Accuracy, completeness, consistency, timeliness |
| Data Stewardship | Assigning responsibility for data management |
| Data Catalog | Inventory of data sources, definitions, and relationships |
| Data Lineage | Tracking data from source through transformations to consumption |
| Master Data Management (MDM) | Single source of truth for key business entities |
| Data Security | Access controls, encryption, masking, auditing |
Data Privacy
Key Challenges
- Volume: More data → more potential for privacy breaches
- Variety: Unstructured data harder to classify and protect
- Velocity: Real-time data harder to anonymize in transit
- Data Proliferation: Data copied across multiple systems
Privacy Techniques
| Technique | Description |
|---|---|
| Anonymization | Remove PII (Personally Identifiable Information) |
| Pseudonymization | Replace PII with artificial identifiers |
| Encryption | Encrypt data at rest and in transit |
| Access Control | Role-based access (RBAC) to sensitive data |
| Data Masking | Hide sensitive data while maintaining format |
| Differential Privacy | Add statistical noise to query results |
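Three of the techniques in the table can be sketched with the Python standard library. This is illustrative only: the secret key, the sample identifier, and the card number are made-up values, and the function names are hypothetical. A production system would need proper key management and a vetted differential-privacy library.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"hypothetical-secret-key"  # assumption: stored outside the dataset


def pseudonymize(identifier):
    """Replace PII with a keyed, repeatable artificial identifier."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def mask_card(number):
    """Hide all but the last four digits while keeping the length."""
    return "*" * (len(number) - 4) + number[-4:]


def noisy_count(true_count, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon (a counting query has sensitivity 1)."""
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


print(pseudonymize("user@example.com"))               # same input gives same pseudonym
print(mask_card("4111111111111111"))                  # ************1111
print(noisy_count(100, epsilon=1.0, rng=random.Random(0)))
```

Pseudonymization is reversible by whoever holds the key or a lookup table, which is why it is weaker than anonymization. A smaller `epsilon` adds more noise and therefore gives stronger privacy at the cost of accuracy.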
Key Regulations
| Regulation | Region | Key Points |
|---|---|---|
| GDPR | EU | Right to be forgotten, consent, data portability |
| DPDP Act 2023 | India | Consent-based processing, data fiduciary obligations, right to erasure |
| CCPA | California, US | Consumer rights to know, delete, opt-out |
| HIPAA | US | Health data protection |
India's Digital Personal Data Protection Act (DPDP) 2023 — Key Points
- Data Principal: Individual whose data is processed
- Data Fiduciary: Entity determining purpose and means of processing
- Consent: Free, specific, informed, unconditional, and unambiguous, given through a clear affirmative action
- Rights: Right to access, correct, erase, and grievance redressal
- Significant Data Fiduciary: Additional obligations (DPO, DPIA, audit)
- Penalties: Up to ₹250 crore for breaches
Key Formulas and Theorems Summary
| Concept | Formula/Statement |
|---|---|
| CAP Theorem | Only 2 of {C, A, P} guaranteed |
| PACELC | If Partition → A or C; Else → L or C |
| Data Replication | Default HDFS replication factor = 3 |
| Consistent Hashing | Used for distributed data partitioning |
Exam Tips
- 5 Vs of Big Data — memorize and know examples of each
- Hadoop ecosystem — know each component's role
- Spark vs MapReduce — in-memory processing advantage
- NoSQL types — document, column, key-value, graph with examples
- CAP theorem — understand which databases fall into CP vs AP
- Lambda vs Kappa: Lambda runs separate batch and speed layers behind a serving layer; Kappa is stream-only
- DPDP Act 2023 — India's data protection law, key terms (Data Principal, Data Fiduciary)
- BASE vs ACID — fundamental trade-off in distributed databases
- Kafka — distributed streaming platform, key architectural components
Practice Questions
11 MCQs for Big Data Analysis with detailed explanations.
Q1. Which of the following best describes Apache Spark?
- A. A fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing
- B. A distributed file system that stores large files as replicated blocks
- C. A workflow scheduler that manages dependencies between Hadoop jobs
- D. A tool that imports and exports data between relational databases and Hadoop
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A: Spark is a fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing (Section 3).
This concept is covered under Big Data Analysis in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option B: This describes HDFS, the storage layer of Hadoop.
- Option C: This describes Oozie, the workflow scheduler.
- Option D: This describes Sqoop.
Q2. Which statement correctly describes a write operation in HDFS?
- A. The client sends the data to the NameNode, which stores the blocks itself
- B. The client requests the NameNode, which provides a list of DataNodes; the data is then written through a pipeline of DataNodes
- C. The Secondary NameNode receives the data and forwards it to the DataNodes
- D. The data is written to a single DataNode with no replication by default
✅ Correct Answer: Option B
Explanation:
The correct answer is Option B: the NameNode only supplies the target DataNodes, and the data itself is written in a pipeline across them (Section 2.1).
Why other options are incorrect:
- Option A: The NameNode manages the namespace, metadata, and block mapping; it does not store data blocks.
- Option C: The Secondary NameNode only merges the fsimage and edit logs.
- Option D: The default replication factor is 3, so each block is stored on 3 different nodes.
Q3. Which Hadoop ecosystem component provides random, real-time read/write access to big data?
- A. HBase
- B. Sqoop
- C. Oozie
- D. Flume
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A: HBase is a column-oriented NoSQL database on top of HDFS that provides random, real-time read/write access (Section 2.6).
Why other options are incorrect:
- Option B: Sqoop imports and exports data between relational databases and Hadoop.
- Option C: Oozie is a workflow scheduler for Hadoop jobs.
- Option D: Flume collects, aggregates, and moves large volumes of log data.
Q4. What is the role of a DataNode in HDFS?
- A. It manages the file system namespace and block mapping
- B. It periodically merges the fsimage and edit logs
- C. It schedules resources globally across the cluster
- D. It stores the actual data blocks and serves read/write requests
✅ Correct Answer: Option D
Explanation:
The correct answer is Option D: DataNodes store the actual data blocks and serve read/write requests (Section 2.1).
Why other options are incorrect:
- Option A: This is the role of the NameNode.
- Option B: This is the role of the Secondary NameNode.
- Option C: This is the role of YARN's ResourceManager.
Q5. Which of the following best describes Kappa architecture in contrast to Lambda architecture?
- A. It maintains a separate batch layer and speed layer
- B. It is stream-only: a single stream-processing code base, with batch work done by replaying the event log
- C. It stores data in HDFS plus a separate real-time store
- D. It requires two code bases that must be kept consistent
✅ Correct Answer: Option B
Explanation:
The correct answer is Option B: Kappa treats everything as a stream and handles batch workloads by replaying historical data from the log (Section 8.2).
Why other options are incorrect:
- Option A: Separate batch and speed layers are the defining feature of Lambda.
- Option C: Per the comparison table, this is Lambda's storage model; Kappa uses the event log only.
- Option D: Two code bases are listed as a disadvantage of Lambda.
Q6. Which of the 5 Vs concerns data quality and reliability, including inconsistency, incompleteness, ambiguity, and the "garbage in, garbage out" problem?
- A. Veracity
- B. Velocity
- C. Variety
- D. Value
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A: Veracity covers data quality and reliability (Section 1).
Why other options are incorrect:
- Option B: Velocity is the speed of data generation and processing.
- Option C: Variety is the range of data types and sources.
- Option D: Value is the useful information that can be extracted from the data.
Q7. According to Brewer's revised view of the CAP theorem, the "2 out of 3" formulation is:
- A. Oversimplified; systems can provide partial guarantees of all three
- B. Exact; a system must give up one property entirely at all times
- C. Applicable only to single-node databases
- D. Obsolete, because all three guarantees can always be fully provided together
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A: the revised view holds that "2 out of 3" is oversimplified and that systems can provide partial guarantees of all three (Section 7).
Why other options are incorrect:
- Option B: This is the rigid reading that the revised view rejects.
- Option C: CAP is a statement about distributed data systems, not single-node ones.
- Option D: During a network partition a system still has to trade consistency against availability.
Q8. Which statement correctly describes a read operation in HDFS?
- A. The client gets block locations from the NameNode and reads the blocks directly from the DataNodes
- B. All file data flows through the NameNode to the client
- C. The client reads the blocks from the Secondary NameNode
- D. The client asks YARN's ResourceManager for the block locations
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A: the NameNode supplies block locations, and the client then reads directly from the DataNodes (Section 2.1).
Why other options are incorrect:
- Option B: The NameNode holds metadata only; it does not serve file data.
- Option C: The Secondary NameNode only merges the fsimage and edit logs.
- Option D: The ResourceManager manages cluster resources, not block locations.
Q9. Which of the following is a design principle of HDFS?
- A. It is optimized for storing many small files
- B. It requires specialized high-end hardware
- C. It provides streaming data access with a write-once, read-many pattern
- D. It keeps all data in memory and optionally persists it to disk
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C: HDFS is designed for streaming access to data that is written once and read many times (Section 2.1).
Why other options are incorrect:
- Option A: HDFS is designed to store large files (GBs to TBs).
- Option B: HDFS runs on commodity hardware.
- Option D: This describes an in-memory key-value store such as Redis.
Q10. Which statement about the Secondary NameNode is correct?
- A. It is a standby that automatically takes over when the NameNode fails
- B. It stores the actual data blocks
- C. It periodically merges the fsimage and edit logs and is not a backup for the NameNode
- D. It allocates containers to applications
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C: the Secondary NameNode periodically merges the fsimage and edit logs; it is not a backup for the NameNode (Section 2.1).
Why other options are incorrect:
- Option A: The notes state explicitly that it is NOT a backup for the NameNode.
- Option B: Data blocks are stored by DataNodes.
- Option D: Container allocation is handled by YARN.
Q11. Which of the following best describes NoSQL (Not Only SQL) databases?
- A. A workflow scheduler for Hadoop jobs
- B. A SQL-like query language for data stored in HDFS
- C. Non-relational databases designed for large-scale data storage and distributed processing
- D. A fast, general-purpose cluster computing engine that extends MapReduce with in-memory processing
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C: NoSQL databases are non-relational databases designed for large-scale data storage and distributed processing (Section 4).
Why other options are incorrect:
- Option A: This describes Oozie.
- Option B: This describes HiveQL.
- Option D: This describes Apache Spark.