High Performance Computing

Table of Contents

  1. Parallel Computing Architectures
  2. Flynn's Taxonomy
  3. Shared Memory vs Distributed Memory Systems
  4. GPU Computing
  5. Cluster Computing
  6. Grid Computing
  7. Cloud Computing for HPC
  8. Parallel Programming Models
  9. Performance Metrics
  10. Load Balancing
  11. Interconnection Networks
  12. Vector Processing
  13. Pipelining in HPC
  14. Fault Tolerance in HPC
  15. Benchmarking
  16. Quantum Computing Basics

1. Parallel Computing Architectures

Definition

Parallel computing is the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed concurrently.

Why Parallel Computing?

Types of Parallelism

Type Description Example
Bit-level Increase processor word size (8→16→32→64 bit) 64-bit processors
Instruction-level Execute multiple instructions simultaneously Pipelining, superscalar
Thread-level Multiple threads execute concurrently Multi-core processors
Data-level Same operation on multiple data elements SIMD, GPU
Task-level Different tasks execute in parallel Distributed computing

Architecture Classification

                    PARALLEL ARCHITECTURES
                           │
          ┌────────────────┼────────────────┐
          │                │                │
    Shared Memory    Distributed      Hybrid (Cluster
          │          Memory              of SMPs)
    ┌─────┴─────┐       │
    │           │       │
   UMA        NUMA    MPP
(Symmetric  (Non-   (Massively
Multiproc.) Uniform  Parallel
            Memory   Processing)
            Access)

2. Flynn's Taxonomy

Based on the number of instruction streams and data streams:

Type Instruction Streams Data Streams Description Example
SISD Single Single Sequential computer; one instruction operates on one data item at a time Traditional single-core CPU
SIMD Single Multiple One instruction operates on multiple data elements simultaneously GPU, vector processors, MMX/SSE
MISD Multiple Single Multiple instructions operate on the same data stream Fault-tolerant systems (rare)
MIMD Multiple Multiple Multiple processors execute different instructions on different data Multi-core CPUs, clusters, grids

Detailed Explanation

SISD (Single Instruction, Single Data)
- Classical von Neumann architecture
- Sequential execution
- One ALU, one control unit
- Example: Traditional single-processor computer

SIMD (Single Instruction, Multiple Data)
- Data-level parallelism
- Same operation applied to all data elements simultaneously
- Types:
- Array Processor: Each processing element has its own memory
- Vector Processor: Operates on vectors (1D arrays) using pipelined functional units
- Examples: Intel AVX, NVIDIA GPU (warp execution), MMX, SSE, NEON

MISD (Multiple Instruction, Single Data)
- Rarely used in practice
- Multiple processors apply different operations to the same data
- Example: Space shuttle flight control computer (redundant processing for fault tolerance)

MIMD (Multiple Instruction, Multiple Data)
- Most common parallel architecture
- Each processor executes its own program on its own data
- Subtypes:
- Tightly coupled (shared memory): All processors share common memory (SMP, NUMA)
- Loosely coupled (distributed memory): Each processor has its own memory; communicate via message passing (clusters, MPP)


3. Shared Memory vs Distributed Memory Systems

Shared Memory Systems

All processors share a common address space (memory).

┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  CPU 0  │ │  CPU 1  │ │  CPU 2  │ │  CPU 3  │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
     │           │           │           │
     └───────────┴─────┬─────┴───────────┘
                       │
                ┌──────▼──────┐
                │   SHARED    │
                │   MEMORY    │
                └─────────────┘

Types:
| Type | Description | Access Time |
|------|-------------|-------------|
| UMA (Uniform Memory Access) | All processors have equal access time to memory | Uniform |
| NUMA (Non-Uniform Memory Access) | Access time varies based on memory location | Non-uniform |
| COMA (Cache-Only Memory Access) | All local memories treated as cache | Varies |

Advantages:
- Simple programming model (shared variables)
- No explicit communication needed
- Easy data sharing

Disadvantages:
- Limited scalability (memory bandwidth bottleneck)
- Cache coherence overhead
- Costly hardware for large systems

Distributed Memory Systems

Each processor has its own private memory; processors communicate via message passing.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  CPU 0       │    │  CPU 1       │    │  CPU 2       │
│  Local Mem 0 │    │  Local Mem 1 │    │  Local Mem 2 │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌──────▼──────┐
                    │ INTERCONNECT │
                    │  NETWORK     │
                    └─────────────┘

Advantages:
- Highly scalable
- No memory contention
- Cost-effective (commodity hardware)

Disadvantages:
- Complex programming (explicit message passing)
- Communication overhead
- Data distribution challenges

Comparison

Feature Shared Memory Distributed Memory
Memory Common address space Private per processor
Communication Read/Write shared variables Message passing
Scalability Limited (typically <100 processors) High (thousands to millions)
Programming Easier (OpenMP) Harder (MPI)
Cost Higher (specialized hardware) Lower (commodity)
Cache Coherence Required Not applicable
Examples Multi-core workstation, SMP Cluster, MPP, grid

4. GPU Computing

GPU Architecture

GPU (Graphics Processing Unit): Massively parallel processor designed for graphics rendering, now widely used for general-purpose computing (GPGPU).

┌─────────────────────────────────────────────┐
│                GPU ARCHITECTURE              │
├─────────────────────────────────────────────┤
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐          │
│  │ SM  │ │ SM  │ │ SM  │ │ SM  │  ...     │
│  │(Str-│ │     │ │     │ │     │          │
│  │ eam │ │     │ │     │ │     │          │
│  │Mult-│ │     │ │     │ │     │          │
│  │proc)│ │     │ │     │ │     │          │
│  │32   │ │32   │ │32   │ │32   │          │
│  │CUDA │ │CUDA │ │CUDA │ │CUDA │          │
│  │cores│ │cores│ │cores│ │cores│          │
│  └─────┘ └─────┘ └─────┘ └─────┘          │
├─────────────────────────────────────────────┤
│         SHARED MEMORY / L1 CACHE            │
├─────────────────────────────────────────────┤
│              L2 CACHE                       │
├─────────────────────────────────────────────┤
│         GLOBAL MEMORY (VRAM)                │
└─────────────────────────────────────────────┘

NVIDIA CUDA Architecture

Component Description
CUDA Core Single floating-point/integer processing unit
Streaming Multiprocessor (SM) Group of CUDA cores (32-128 per SM)
Warp Group of 32 threads executing same instruction (SIMT)
Thread Block Group of threads executing on same SM
Grid Collection of thread blocks
Global Memory GPU's main memory (high bandwidth, high latency)
Shared Memory On-chip memory shared by threads in a block
Registers Fastest, private to each thread

CUDA Execution Model

Host (CPU) ──▶ Kernel Launch ──▶ Grid of Thread Blocks
                                      │
                              ┌───────┼───────┐
                              ▼       ▼       ▼
                           Block₀  Block₁  Block₂
                              │       │       │
                           ┌──┴──┐ ┌──┴──┐ ┌──┴──┐
                           │Warp │ │Warp │ │Warp │
                           │0,1..│ │0,1..│ │0,1..│
                           └─────┘ └─────┘ └─────┘

CUDA Memory Hierarchy (Fastest to Slowest)

  1. Registers — per-thread, fastest
  2. Shared Memory — per-block, ~100x faster than global
  3. L1/L2 Cache — automatic caching
  4. Global Memory — all threads, highest latency
  5. Constant/Texture Memory — cached read-only

OpenCL (Open Computing Language)

CUDA vs OpenCL

Feature CUDA OpenCL
Vendor NVIDIA only Cross-platform
Hardware NVIDIA GPUs GPUs, CPUs, FPGAs, DSPs
Maturity More mature, better tools Open standard
Performance Optimized for NVIDIA Varies by vendor
Ecosystem cuDNN, cuBLAS, TensorRT Standard libraries
Programming C/C++ extension C-based kernel language

GPU Computing Applications


5. Cluster Computing

Definition

A cluster is a group of interconnected computers (nodes) that work together as a single system.

Architecture

┌──────────────────────────────────────────────┐
│              CLUSTER ARCHITECTURE             │
├──────────────────────────────────────────────┤
│  ┌────────┐ ┌────────┐ ┌────────┐          │
│  │ Node 0 │ │ Node 1 │ │ Node 2 │  ...     │
│  │(Master)│ │(Worker)│ │(Worker)│          │
│  └───┬────┘ └───┬────┘ └───┬────┘          │
│      │          │          │                │
│      └──────────┼──────────┘                │
│                 │                            │
│          ┌──────▼──────┐                    │
│          │  HIGH-SPEED │                    │
│          │ INTERCONNECT│                    │
│          │(InfiniBand/ │                    │
│          │ Ethernet)   │                    │
│          └─────────────┘                    │
├──────────────────────────────────────────────┤
│  Shared Storage (NFS, Lustre, GPFS)         │
└──────────────────────────────────────────────┘

Types of Clusters

Type Description Use Case
High Availability (HA) Redundant nodes for failover Mission-critical applications
Load Balancing Distribute workload across nodes Web servers, databases
High Performance (HPC) Parallel processing for computation Scientific computing, simulation
Storage Cluster Distributed file systems Big data, cloud storage

Key Components

Component Description
Compute Nodes Worker machines performing computation
Master/Head Node Manages job scheduling and coordination
Interconnect High-speed network (InfiniBand, 10/100 GbE)
Storage Shared parallel file system (Lustre, GPFS, BeeGFS)
Scheduler Job management (SLURM, PBS, LSF)

India's HPC Initiatives


6. Grid Computing

Definition

Grid computing coordinates geographically distributed, heterogeneous resources across multiple administrative domains to solve large-scale computational problems.

Architecture

┌──────────────────────────────────────────────┐
│              GRID ARCHITECTURE                │
├──────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Site A   │  │ Site B   │  │ Site C   │  │
│  │(Univ.)   │  │(Lab)     │  │(Industry)│  │
│  │Resources │  │Resources │  │Resources │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│       └─────────────┼─────────────┘         │
│                     │                       │
│              ┌──────▼──────┐                │
│              │  GRID       │                │
│              │  MIDDLEWARE │                │
│              │  (Globus,   │                │
│              │   gLite)    │                │
│              └─────────────┘                │
└──────────────────────────────────────────────┘

Grid vs Cluster

Feature Cluster Grid
Location Single location Geographically distributed
Ownership Single organization Multiple organizations
Homogeneity Homogeneous Heterogeneous
Network Dedicated high-speed Internet/WAN
Coupling Tightly coupled Loosely coupled
Middleware MPI, job scheduler Globus Toolkit, UNICORE
Use Case Parallel computing Collaborative research

Grid Middleware

Grid Projects


7. Cloud Computing for HPC

Service Models in HPC Context

Model Description HPC Example
IaaS (Infrastructure as a Service) Virtual machines, storage, networking AWS EC2 HPC instances, Azure H-series
PaaS (Platform as a Service) Development platforms, tools AWS ParallelCluster, Azure CycleCloud
SaaS (Software as a Service) Ready-to-use applications Simulation software on cloud (ANSYS Cloud)

HPC in the Cloud

Cloud vs Traditional HPC

Feature Traditional HPC Cloud HPC
Cost High CAPEX OPEX, pay-per-use
Scalability Fixed capacity Elastic
Access Limited to organization Global access
Customization Full control Limited by provider
Performance Optimized hardware Variable (noisy neighbors)
Best For Steady, large workloads Bursty, variable workloads

8. Parallel Programming Models

8.1 MPI (Message Passing Interface)

Feature Description
Type Message passing library
Memory Model Distributed memory
Standard MPI Forum (MPI-1: 1994, MPI-2: 1997, MPI-3: 2012, MPI-4: 2021)
Languages C, C++, Fortran
Communication Point-to-point and collective

Key MPI Functions:

MPI_Init(&argc, &argv);           // Initialize MPI
MPI_Comm_size(MPI_COMM_WORLD, &size); // Get number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get process rank

// Point-to-point
MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, &status);

// Collective
MPI_Bcast(buf, count, datatype, root, comm);     // Broadcast
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm); // Reduce
MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm); // All-reduce

MPI_Finalize();                   // Finalize MPI

Communication Patterns:
| Pattern | Description |
|---------|-------------|
| Point-to-point | Send/receive between two processes |
| Broadcast | One process sends to all |
| Scatter | One process distributes data to all |
| Gather | All processes send data to one |
| Reduce | Combine data from all using operation (sum, max, etc.) |
| All-to-all | Every process sends to every other |

8.2 OpenMP (Open Multi-Processing)

Feature Description
Type Shared memory parallel programming
Memory Model Shared memory
Approach Compiler directives (pragmas)
Languages C, C++, Fortran
Parallelism Fork-join model

Key OpenMP Directives:

#include <omp.h>

// Parallel region
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    printf("Hello from thread %d\n", tid);
}

// Parallel for
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

// Reduction
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i];
}

// Critical section
#pragma omp critical
{
    shared_variable++;
}

// Barrier
#pragma omp barrier

OpenMP Clauses:
| Clause | Description |
|--------|-------------|
| shared(var) | Variable shared among all threads |
| private(var) | Each thread has its own copy |
| reduction(op:var) | Combine values using operation |
| schedule(type,chunk) | Control loop iteration distribution |
| num_threads(n) | Set number of threads |
| nowait | Remove implicit barrier |

8.3 MapReduce

(Detailed in Big Data Analysis chapter)

Feature Description
Type Distributed data processing model
Phases Map → Shuffle → Reduce
Framework Apache Hadoop, Spark
Use Case Large-scale batch processing

MPI vs OpenMP

Feature MPI OpenMP
Memory Model Distributed Shared
Scalability Thousands of nodes Single node (multi-core)
Programming Explicit message passing Compiler directives
Data Transfer Explicit (send/receive) Implicit (shared memory)
Fault Tolerance Process-level Thread-level
Best For Large clusters Multi-core processors

Hybrid MPI+OpenMP


9. Performance Metrics

9.1 Speedup

Speedup (S): Ratio of sequential execution time to parallel execution time.

S(p) = T(1) / T(p)

Where:
- T(1) = execution time on single processor
- T(p) = execution time on p processors

Ideal Speedup: S(p) = p (linear speedup)

9.2 Efficiency

Efficiency (E): Measures how effectively processors are utilized.

E(p) = S(p) / p = T(1) / (p × T(p))

9.3 Amdahl's Law

Statement: The speedup of a program using multiple processors is limited by the sequential fraction of the program.

S(p) = 1 / (f + (1-f)/p)

Where:
- f = fraction of program that is sequential (0 ≤ f ≤ 1)
- p = number of processors
- (1-f) = fraction that can be parallelized

Maximum Speedup: As p → ∞, S(max) = 1/f

Example: If 10% of a program is sequential (f = 0.1):
- Maximum speedup = 1/0.1 = 10x (regardless of number of processors!)
- With 100 processors: S = 1/(0.1 + 0.9/100) = 1/0.109 ≈ 9.17x

Implication: Even small serial fractions severely limit speedup.

9.4 Gustafson's Law

Statement: When problem size scales with the number of processors, speedup can be nearly linear.

S(p) = p - f × (p - 1)

Where:
- f = fraction of time spent in serial code (in parallel execution)
- p = number of processors

Key Insight: Amdahl's Law assumes fixed problem size; Gustafson's Law assumes scaled problem size.

Example: If serial fraction is 5% (f = 0.05) with 100 processors:
- S = 100 - 0.05 × 99 = 100 - 4.95 = 95.05x

9.5 Comparison

Aspect Amdahl's Law Gustafson's Law
Problem Size Fixed Scales with processors
Serial Fraction Fixed fraction of total code Fraction of parallel execution time
Speedup Limit Bounded by 1/f Nearly linear
Perspective Strong scaling Weak scaling
Realism Pessimistic More optimistic

9.6 Other Metrics

Metric Formula Description
Throughput Tasks / Time Work completed per unit time
Latency Time per task Time to complete one task
FLOPS Floating-point operations per second Computational speed
Parallel Overhead T(p) × p - T(1) Extra work due to parallelism
Scalability How performance changes with p Strong vs weak scaling

Strong vs Weak Scaling

Type Description Metric
Strong Scaling Fixed problem size, increase processors Speedup, efficiency
Weak Scaling Problem size grows with processors Efficiency (should stay constant)

10. Load Balancing

Definition

Distributing computational work evenly across all processors to minimize idle time and maximize throughput.

Types

Type Description When to Use
Static Work distributed before execution Known, uniform workload
Dynamic Work distributed during execution Unknown, varying workload

Static Load Balancing

Block:     [0,1,2,3] [4,5,6,7] [8,9,10,11]  → P0, P1, P2
Cyclic:    [0,3,6,9] [1,4,7,10] [2,5,8,11]  → P0, P1, P2
Block-Cyc: [0,1,6,7] [2,3,8,9] [4,5,10,11]  → P0, P1, P2

Dynamic Load Balancing

Challenges


11. Interconnection Networks

Network Topologies

Topology Description Diameter Bisection Width
Bus Single shared medium 1 1
Ring Nodes in a circle n/2 2
Mesh 2D grid of nodes 2(√n - 1) √n
Torus Mesh with wrap-around edges √n 2√n
Hypercube n-dimensional cube (2^k nodes) k = log₂n n/2
Fat Tree Tree with increasing bandwidth at higher levels 2log₂n n/2
Crossbar Full connectivity switch 1 n/2
Butterfly Multi-stage interconnection log₂n n/2

Hypercube (n-cube)

Fat Tree

Interconnect Technologies

Technology Bandwidth Latency Use Case
InfiniBand HDR 200 Gbps ~0.5 μs HPC clusters
InfiniBand NDR 400 Gbps ~0.3 μs Next-gen HPC
Ethernet 100GbE 100 Gbps ~10 μs General clusters
Omni-Path 100 Gbps ~1 μs Intel HPC fabric
Cray Slingshot 200 Gbps ~1 μs Cray/HP systems

12. Vector Processing

Definition

Vector processors operate on one-dimensional arrays (vectors) of data using pipelined functional units.

Architecture

┌──────────────────────────────────────────────┐
│            VECTOR PROCESSOR                   │
├──────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Vector   │  │ Vector   │  │ Vector   │  │
│  │ Register │  │ Register │  │ Register │  │
│  │ V0-V31   │  │ V0-V31   │  │ V0-V31   │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│  ┌────▼────┐  ┌────▼────┐  ┌────▼────┐    │
│  │Pipeline │  │Pipeline │  │Pipeline │    │
│  │  Add    │  │Multiply │  │  Load/  │    │
│  │  Unit   │  │  Unit   │  │  Store  │    │
│  └─────────┘  └─────────┘  └─────────┘    │
└──────────────────────────────────────────────┘

Key Concepts

Vector vs Scalar

Feature Scalar Vector
Operation One element at a time Multiple elements simultaneously
Instruction Count High (loop overhead) Low (one vector instruction)
Pipeline Utilization Lower Higher (deep pipeline)
Examples x86 scalar ops Cray, NEC SX, modern SIMD

Modern Vector Processing


13. Pipelining in HPC

Instruction Pipelining

Breaks instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.

Classic 5-Stage Pipeline:

IF  ID  EX  MEM  WB
(Instruction Fetch  Decode  Execute  Memory  Write Back)

Speedup: Ideally equal to number of stages (5x for 5-stage)

Hazards:
| Type | Description | Solution |
|------|-------------|----------|
| Data Hazard | Instruction depends on result of previous | Forwarding, stalling |
| Control Hazard | Branch changes program flow | Branch prediction, delay slots |
| Structural Hazard | Resource conflict | Duplicate resources |

Vector Pipelining

Superpipelining

Superscalar


14. Fault Tolerance in HPC

Importance

Techniques

Technique Description Overhead
Checkpoint/Restart Periodically save application state; restart from last checkpoint on failure I/O overhead
Replication Run multiple copies of tasks 2x+ resource usage
RAID Redundant disk arrays for data protection Storage overhead
ECC Memory Error-correcting code for memory ~12.5% overhead
Process Migration Move processes from failing nodes Migration time
Algorithm-Based Fault Tolerance (ABFT) Encode redundancy in algorithm Computation overhead

Checkpointing Strategies

Strategy Description
Coordinated All processes synchronize and checkpoint together
Uncoordinated Each process checkpoints independently
Communication-induced Combine coordinated and uncoordinated
Incremental Only save changed data since last checkpoint
Multi-level Use different storage (RAM, SSD, disk) for different checkpoint frequencies

15. Benchmarking

LINPACK

SPEC (Standard Performance Evaluation Corporation)

Other Benchmarks

Benchmark Measures Use Case
HPCG Conjugate gradient (memory-bound) Complements LINPACK
Graph 500 Graph traversal (data-intensive) Big data analytics
IO-500 I/O bandwidth and IOPS Storage performance
NAS Parallel Benchmarks Computational fluid dynamics Scientific computing
STREAM Memory bandwidth Memory subsystem
TPC benchmarks Database performance Transaction processing

India's Supercomputers in TOP500

System Peak Performance Institution
PARAM Siddhi-AI 5.3 PFLOPS C-DAC
Pratyush 4.0 PFLOPS IITM (Weather)
Mihir 2.8 PFLOPS NCMRWF (Weather)

16. Quantum Computing Basics

Classical vs Quantum Computing

Feature Classical Quantum
Basic Unit Bit (0 or 1) Qubit (0, 1, or superposition)
State Definite Probabilistic
Operations Logic gates Quantum gates
Parallelism Limited Massive (via superposition)
Best For General-purpose Specific problems (factoring, search, simulation)

Key Quantum Concepts

Concept Description
Qubit Quantum bit; can be in superposition of
Superposition Qubit exists in multiple states simultaneously
Entanglement Qubits correlated; measuring one instantly affects the other
Quantum Gates Operations on qubits (Hadamard, CNOT, Pauli-X/Y/Z)
Quantum Circuit Sequence of quantum gates applied to qubits
Measurement Collapses superposition to definite state (0 or 1)
Decoherence Loss of quantum properties due to environmental interaction

Quantum Algorithms

Algorithm Purpose Speedup
Shor's Integer factorization Exponential (breaks RSA)
Grover's Unstructured search Quadratic (√N vs N)
VQE Variational quantum eigensolver Chemistry simulation
QAOA Quantum approximate optimization Combinatorial optimization

Quantum Hardware Approaches

Approach Company Qubits
Superconducting IBM, Google 1000+ (IBM Condor)
Trapped Ion IonQ, Quantinuum 32+
Photonic Xanadu, PsiQuantum Variable
Topological Microsoft In development

India's Quantum Initiatives


Key Formulas Summary

Concept Formula
Speedup S(p) = T(1) / T(p)
Efficiency E(p) = S(p) / p
Amdahl's Law S = 1 / (f + (1-f)/p)
Gustafson's Law S = p - f(p-1)
Amdahl's Max Speedup S_max = 1/f (as p→∞)
FLOPS Floating-point operations / second
Parallel Overhead T(p) × p - T(1)
Strong Scaling Fixed problem, increase processors
Weak Scaling Problem scales with processors

Exam Tips


Practice Questions

11 MCQs for High Performance Computing with detailed explanations.

Q1. Regarding the following concept: 'Transistor count doubles ~every 2 years, but single-thread performance gains are...', which statement is correct?

- A. Transistor count doubles ~every 2 years, but single-thread performance gains are slowing

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
-.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q2. Regarding the following concept: ':

| Type | Instruction Streams | Data Streams | Description | Example |
|------...', which statement is correct?

Type Instruction Streams Data Streams Description Example

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D — :

Type Instruction Streams Data Streams Description Example
.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q3. Regarding the following concept: 'instruction streams...', which statement is correct?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B — instruction streams.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q4. Regarding the following concept: 'Physical limits:...', which statement is correct?

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D — Physical limits:.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q5. Regarding the following concept: 'Big data, scientific simulations, AI require massive computation

-...', which statement is correct?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — Big data, scientific simulations, AI require massive computation
-.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q6. Regarding the following concept: '- Moore's Law: Transistor count doubles ~every 2 years, but single-thread perfor...', which statement is correct?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — - Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q7. Which of the following best describes Parallel computing?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed c.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q8. Regarding the following concept: '| Single | Single | Sequential computer; one instruction operates on one data it...', which statement is correct?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B — | Single | Single | Sequential computer; one instruction operates on one data item at a time | Traditional single-core CPU |
|.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q9. Which of the following best describes Example: If 10% of a program?

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — sequential (f = 0.1):.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q10. Which of the following best describes Example: If serial fraction?

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — 5% (f = 0.05) with 100 processors:.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.


Q11. Which of the following best describes cluster?

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B — a group of interconnected computers (nodes) that work together as a single system..

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.