High Performance Computing

Parallel Computing Architectures
Flynn's Taxonomy
Shared Memory vs Distributed Memory Systems
GPU Computing
Cluster Computing
Grid Computing
Cloud Computing for HPC
Parallel Programming Models
Performance Metrics
Load Balancing
Interconnection Networks
Vector Processing
Pipelining in HPC
Fault Tolerance in HPC
Benchmarking
Quantum Computing Basics

1. Parallel Computing Architectures

Definition

Parallel computing is the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed concurrently.

Why Parallel Computing?

Physical limits: Single processor speed limited by heat dissipation and clock frequency
Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
Problem scale: Big data, scientific simulations, AI require massive computation
Cost-effectiveness: Multiple cheaper processors can outperform one expensive processor

Types of Parallelism

Type	Description	Example
Bit-level	Increase processor word size (8→16→32→64 bit)	64-bit processors
Instruction-level	Execute multiple instructions simultaneously	Pipelining, superscalar
Thread-level	Multiple threads execute concurrently	Multi-core processors
Data-level	Same operation on multiple data elements	SIMD, GPU
Task-level	Different tasks execute in parallel	Distributed computing

Architecture Classification

                    PARALLEL ARCHITECTURES
                           │
          ┌────────────────┼────────────────┐
          │                │                │
    Shared Memory    Distributed      Hybrid (Cluster
          │          Memory              of SMPs)
    ┌─────┴─────┐       │
    │           │       │
   UMA        NUMA    MPP
(Symmetric  (Non-   (Massively
Multiproc.) Uniform  Parallel
            Memory   Processing)
            Access)

2. Flynn's Taxonomy

Based on the number of instruction streams and data streams:

Type	Instruction Streams	Data Streams	Description	Example
SISD	Single	Single	Sequential computer; one instruction operates on one data item at a time	Traditional single-core CPU
SIMD	Single	Multiple	One instruction operates on multiple data elements simultaneously	GPU, vector processors, MMX/SSE
MISD	Multiple	Single	Multiple instructions operate on the same data stream	Fault-tolerant systems (rare)
MIMD	Multiple	Multiple	Multiple processors execute different instructions on different data	Multi-core CPUs, clusters, grids

Detailed Explanation

SISD (Single Instruction, Single Data)
- Classical von Neumann architecture
- Sequential execution
- One ALU, one control unit
- Example: Traditional single-processor computer

SIMD (Single Instruction, Multiple Data)
- Data-level parallelism
- Same operation applied to all data elements simultaneously
- Types:
- Array Processor: Each processing element has its own memory
- Vector Processor: Operates on vectors (1D arrays) using pipelined functional units
- Examples: Intel AVX, NVIDIA GPU (warp execution), MMX, SSE, NEON

MISD (Multiple Instruction, Single Data)
- Rarely used in practice
- Multiple processors apply different operations to the same data
- Example: Space shuttle flight control computer (redundant processing for fault tolerance)

MIMD (Multiple Instruction, Multiple Data)
- Most common parallel architecture
- Each processor executes its own program on its own data
- Subtypes:
- Tightly coupled (shared memory): All processors share common memory (SMP, NUMA)
- Loosely coupled (distributed memory): Each processor has its own memory; communicate via message passing (clusters, MPP)

3. Shared Memory vs Distributed Memory Systems

Shared Memory Systems

All processors share a common address space (memory).

┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│  CPU 0  │ │  CPU 1  │ │  CPU 2  │ │  CPU 3  │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
     │           │           │           │
     └───────────┴─────┬─────┴───────────┘
                       │
                ┌──────▼──────┐
                │   SHARED    │
                │   MEMORY    │
                └─────────────┘

Types:
| Type | Description | Access Time |
|------|-------------|-------------|
| UMA (Uniform Memory Access) | All processors have equal access time to memory | Uniform |
| NUMA (Non-Uniform Memory Access) | Access time varies based on memory location | Non-uniform |
| COMA (Cache-Only Memory Access) | All local memories treated as cache | Varies |

Advantages:
- Simple programming model (shared variables)
- No explicit communication needed
- Easy data sharing

Disadvantages:
- Limited scalability (memory bandwidth bottleneck)
- Cache coherence overhead
- Costly hardware for large systems

Distributed Memory Systems

Each processor has its own private memory; processors communicate via message passing.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  CPU 0       │    │  CPU 1       │    │  CPU 2       │
│  Local Mem 0 │    │  Local Mem 1 │    │  Local Mem 2 │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌──────▼──────┐
                    │ INTERCONNECT │
                    │  NETWORK     │
                    └─────────────┘

Advantages:
- Highly scalable
- No memory contention
- Cost-effective (commodity hardware)

Disadvantages:
- Complex programming (explicit message passing)
- Communication overhead
- Data distribution challenges

Comparison

Feature	Shared Memory	Distributed Memory
Memory	Common address space	Private per processor
Communication	Read/Write shared variables	Message passing
Scalability	Limited (typically <100 processors)	High (thousands to millions)
Programming	Easier (OpenMP)	Harder (MPI)
Cost	Higher (specialized hardware)	Lower (commodity)
Cache Coherence	Required	Not applicable
Examples	Multi-core workstation, SMP	Cluster, MPP, grid

4. GPU Computing

GPU Architecture

GPU (Graphics Processing Unit): Massively parallel processor designed for graphics rendering, now widely used for general-purpose computing (GPGPU).

┌─────────────────────────────────────────────┐
│                GPU ARCHITECTURE              │
├─────────────────────────────────────────────┤
│  ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐          │
│  │ SM  │ │ SM  │ │ SM  │ │ SM  │  ...     │
│  │(Str-│ │     │ │     │ │     │          │
│  │ eam │ │     │ │     │ │     │          │
│  │Mult-│ │     │ │     │ │     │          │
│  │proc)│ │     │ │     │ │     │          │
│  │32   │ │32   │ │32   │ │32   │          │
│  │CUDA │ │CUDA │ │CUDA │ │CUDA │          │
│  │cores│ │cores│ │cores│ │cores│          │
│  └─────┘ └─────┘ └─────┘ └─────┘          │
├─────────────────────────────────────────────┤
│         SHARED MEMORY / L1 CACHE            │
├─────────────────────────────────────────────┤
│              L2 CACHE                       │
├─────────────────────────────────────────────┤
│         GLOBAL MEMORY (VRAM)                │
└─────────────────────────────────────────────┘

NVIDIA CUDA Architecture

Component	Description
CUDA Core	Single floating-point/integer processing unit
Streaming Multiprocessor (SM)	Group of CUDA cores (32-128 per SM)
Warp	Group of 32 threads executing same instruction (SIMT)
Thread Block	Group of threads executing on same SM
Grid	Collection of thread blocks
Global Memory	GPU's main memory (high bandwidth, high latency)
Shared Memory	On-chip memory shared by threads in a block
Registers	Fastest, private to each thread

CUDA Execution Model

Host (CPU) ──▶ Kernel Launch ──▶ Grid of Thread Blocks
                                      │
                              ┌───────┼───────┐
                              ▼       ▼       ▼
                           Block₀  Block₁  Block₂
                              │       │       │
                           ┌──┴──┐ ┌──┴──┐ ┌──┴──┐
                           │Warp │ │Warp │ │Warp │
                           │0,1..│ │0,1..│ │0,1..│
                           └─────┘ └─────┘ └─────┘

CUDA Memory Hierarchy (Fastest to Slowest)

Registers — per-thread, fastest
Shared Memory — per-block, ~100x faster than global
L1/L2 Cache — automatic caching
Global Memory — all threads, highest latency
Constant/Texture Memory — cached read-only

OpenCL (Open Computing Language)

Open standard for parallel programming across heterogeneous platforms
Works on GPUs (NVIDIA, AMD, Intel), CPUs, FPGAs, DSPs
Platform Model: Host + Devices (Compute Units → Processing Elements)
Execution Model: Kernels executed as NDRange (N-dimensional index space)
Memory Model: Private, Local, Global, Constant memory

CUDA vs OpenCL

Feature	CUDA	OpenCL
Vendor	NVIDIA only	Cross-platform
Hardware	NVIDIA GPUs	GPUs, CPUs, FPGAs, DSPs
Maturity	More mature, better tools	Open standard
Performance	Optimized for NVIDIA	Varies by vendor
Ecosystem	cuDNN, cuBLAS, TensorRT	Standard libraries
Programming	C/C++ extension	C-based kernel language

GPU Computing Applications

Deep learning training and inference
Scientific simulations (molecular dynamics, weather)
Image and video processing
Cryptocurrency mining
Financial modeling
Medical imaging (CT, MRI reconstruction)

5. Cluster Computing

Definition

A cluster is a group of interconnected computers (nodes) that work together as a single system.

Architecture

┌──────────────────────────────────────────────┐
│              CLUSTER ARCHITECTURE             │
├──────────────────────────────────────────────┤
│  ┌────────┐ ┌────────┐ ┌────────┐          │
│  │ Node 0 │ │ Node 1 │ │ Node 2 │  ...     │
│  │(Master)│ │(Worker)│ │(Worker)│          │
│  └───┬────┘ └───┬────┘ └───┬────┘          │
│      │          │          │                │
│      └──────────┼──────────┘                │
│                 │                            │
│          ┌──────▼──────┐                    │
│          │  HIGH-SPEED │                    │
│          │ INTERCONNECT│                    │
│          │(InfiniBand/ │                    │
│          │ Ethernet)   │                    │
│          └─────────────┘                    │
├──────────────────────────────────────────────┤
│  Shared Storage (NFS, Lustre, GPFS)         │
└──────────────────────────────────────────────┘

Types of Clusters

Type	Description	Use Case
High Availability (HA)	Redundant nodes for failover	Mission-critical applications
Load Balancing	Distribute workload across nodes	Web servers, databases
High Performance (HPC)	Parallel processing for computation	Scientific computing, simulation
Storage Cluster	Distributed file systems	Big data, cloud storage

Key Components

Component	Description
Compute Nodes	Worker machines performing computation
Master/Head Node	Manages job scheduling and coordination
Interconnect	High-speed network (InfiniBand, 10/100 GbE)
Storage	Shared parallel file system (Lustre, GPFS, BeeGFS)
Scheduler	Job management (SLURM, PBS, LSF)

India's HPC Initiatives

PARAM series (C-DAC): PARAM Siddhi-AI (rank 63 in TOP500)
NSM (National Supercomputing Mission): Deploying supercomputers across academic/research institutions
Pratyush and Mihir: Weather and climate supercomputers

6. Grid Computing

Definition

Grid computing coordinates geographically distributed, heterogeneous resources across multiple administrative domains to solve large-scale computational problems.

Architecture

┌──────────────────────────────────────────────┐
│              GRID ARCHITECTURE                │
├──────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Site A   │  │ Site B   │  │ Site C   │  │
│  │(Univ.)   │  │(Lab)     │  │(Industry)│  │
│  │Resources │  │Resources │  │Resources │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│       └─────────────┼─────────────┘         │
│                     │                       │
│              ┌──────▼──────┐                │
│              │  GRID       │                │
│              │  MIDDLEWARE │                │
│              │  (Globus,   │                │
│              │   gLite)    │                │
│              └─────────────┘                │
└──────────────────────────────────────────────┘

Grid vs Cluster

Feature	Cluster	Grid
Location	Single location	Geographically distributed
Ownership	Single organization	Multiple organizations
Homogeneity	Homogeneous	Heterogeneous
Network	Dedicated high-speed	Internet/WAN
Coupling	Tightly coupled	Loosely coupled
Middleware	MPI, job scheduler	Globus Toolkit, UNICORE
Use Case	Parallel computing	Collaborative research

Grid Middleware

Globus Toolkit: Security, data management, resource management, information services
gLite: Used in European Grid Infrastructure (EGI)
UNICORE: Uniform Interface to Computing Resources

Grid Projects

SETI@home: Search for extraterrestrial intelligence (volunteer computing)
LHC Computing Grid (WLCG): Process data from Large Hadron Collider
EGI: European Grid Infrastructure
IndiaGrid: Indian grid computing initiative

7. Cloud Computing for HPC

Service Models in HPC Context

Model	Description	HPC Example
IaaS (Infrastructure as a Service)	Virtual machines, storage, networking	AWS EC2 HPC instances, Azure H-series
PaaS (Platform as a Service)	Development platforms, tools	AWS ParallelCluster, Azure CycleCloud
SaaS (Software as a Service)	Ready-to-use applications	Simulation software on cloud (ANSYS Cloud)

HPC in the Cloud

Elasticity: Scale up/down based on workload
Pay-per-use: No upfront hardware investment
Specialized instances: GPU instances (AWS p4d, Azure NCv3), high-memory instances
Challenges: Data transfer costs, latency, security, vendor lock-in

Cloud vs Traditional HPC

Feature	Traditional HPC	Cloud HPC
Cost	High CAPEX	OPEX, pay-per-use
Scalability	Fixed capacity	Elastic
Access	Limited to organization	Global access
Customization	Full control	Limited by provider
Performance	Optimized hardware	Variable (noisy neighbors)
Best For	Steady, large workloads	Bursty, variable workloads

8. Parallel Programming Models

8.1 MPI (Message Passing Interface)

Feature	Description
Type	Message passing library
Memory Model	Distributed memory
Standard	MPI Forum (MPI-1: 1994, MPI-2: 1997, MPI-3: 2012, MPI-4: 2021)
Languages	C, C++, Fortran
Communication	Point-to-point and collective

Key MPI Functions:

MPI_Init(&argc, &argv);           // Initialize MPI
MPI_Comm_size(MPI_COMM_WORLD, &size); // Get number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get process rank

// Point-to-point
MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, &status);

// Collective
MPI_Bcast(buf, count, datatype, root, comm);     // Broadcast
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm); // Reduce
MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm); // All-reduce

MPI_Finalize();                   // Finalize MPI

Communication Patterns:
| Pattern | Description |
|---------|-------------|
| Point-to-point | Send/receive between two processes |
| Broadcast | One process sends to all |
| Scatter | One process distributes data to all |
| Gather | All processes send data to one |
| Reduce | Combine data from all using operation (sum, max, etc.) |
| All-to-all | Every process sends to every other |

8.2 OpenMP (Open Multi-Processing)

Feature	Description
Type	Shared memory parallel programming
Memory Model	Shared memory
Approach	Compiler directives (pragmas)
Languages	C, C++, Fortran
Parallelism	Fork-join model

Key OpenMP Directives:

#include <omp.h>

// Parallel region
#pragma omp parallel
{
    int tid = omp_get_thread_num();
    printf("Hello from thread %d\n", tid);
}

// Parallel for
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

// Reduction
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += a[i];
}

// Critical section
#pragma omp critical
{
    shared_variable++;
}

// Barrier
#pragma omp barrier

OpenMP Clauses:
| Clause | Description |
|--------|-------------|
| shared(var) | Variable shared among all threads |
| private(var) | Each thread has its own copy |
| reduction(op:var) | Combine values using operation |
| schedule(type,chunk) | Control loop iteration distribution |
| num_threads(n) | Set number of threads |
| nowait | Remove implicit barrier |

8.3 MapReduce

(Detailed in Big Data Analysis chapter)

Feature	Description
Type	Distributed data processing model
Phases	Map → Shuffle → Reduce
Framework	Apache Hadoop, Spark
Use Case	Large-scale batch processing

MPI vs OpenMP

Feature	MPI	OpenMP
Memory Model	Distributed	Shared
Scalability	Thousands of nodes	Single node (multi-core)
Programming	Explicit message passing	Compiler directives
Data Transfer	Explicit (send/receive)	Implicit (shared memory)
Fault Tolerance	Process-level	Thread-level
Best For	Large clusters	Multi-core processors

Hybrid MPI+OpenMP

MPI for inter-node communication
OpenMP for intra-node parallelism
Reduces MPI communication overhead
Common in modern HPC applications

9. Performance Metrics

9.1 Speedup

Speedup (S): Ratio of sequential execution time to parallel execution time.

S(p) = T(1) / T(p)

Where:
- T(1) = execution time on single processor
- T(p) = execution time on p processors

Ideal Speedup: S(p) = p (linear speedup)

9.2 Efficiency

Efficiency (E): Measures how effectively processors are utilized.

E(p) = S(p) / p = T(1) / (p × T(p))

E = 1 (100%): Perfect utilization
E < 1: Overhead (communication, synchronization, idle time)

9.3 Amdahl's Law

Statement: The speedup of a program using multiple processors is limited by the sequential fraction of the program.

S(p) = 1 / (f + (1-f)/p)

Where:
- f = fraction of program that is sequential (0 ≤ f ≤ 1)
- p = number of processors
- (1-f) = fraction that can be parallelized

Maximum Speedup: As p → ∞, S(max) = 1/f

Example: If 10% of a program is sequential (f = 0.1):
- Maximum speedup = 1/0.1 = 10x (regardless of number of processors!)
- With 100 processors: S = 1/(0.1 + 0.9/100) = 1/0.109 ≈ 9.17x

Implication: Even small serial fractions severely limit speedup.

9.4 Gustafson's Law

Statement: When problem size scales with the number of processors, speedup can be nearly linear.

S(p) = p - f × (p - 1)

Where:
- f = fraction of time spent in serial code (in parallel execution)
- p = number of processors

Key Insight: Amdahl's Law assumes fixed problem size; Gustafson's Law assumes scaled problem size.

Example: If serial fraction is 5% (f = 0.05) with 100 processors:
- S = 100 - 0.05 × 99 = 100 - 4.95 = 95.05x

9.5 Comparison

Aspect	Amdahl's Law	Gustafson's Law
Problem Size	Fixed	Scales with processors
Serial Fraction	Fixed fraction of total code	Fraction of parallel execution time
Speedup Limit	Bounded by 1/f	Nearly linear
Perspective	Strong scaling	Weak scaling
Realism	Pessimistic	More optimistic

9.6 Other Metrics

Metric	Formula	Description
Throughput	Tasks / Time	Work completed per unit time
Latency	Time per task	Time to complete one task
FLOPS	Floating-point operations per second	Computational speed
Parallel Overhead	T(p) × p - T(1)	Extra work due to parallelism
Scalability	How performance changes with p	Strong vs weak scaling

Strong vs Weak Scaling

Type	Description	Metric
Strong Scaling	Fixed problem size, increase processors	Speedup, efficiency
Weak Scaling	Problem size grows with processors	Efficiency (should stay constant)

10. Load Balancing

Definition

Distributing computational work evenly across all processors to minimize idle time and maximize throughput.

Types

Type	Description	When to Use
Static	Work distributed before execution	Known, uniform workload
Dynamic	Work distributed during execution	Unknown, varying workload

Static Load Balancing

Block Distribution: Divide data into equal contiguous blocks
Cyclic Distribution: Round-robin assignment of work items
Block-Cyclic: Hybrid approach (block size < total/p)

Block:     [0,1,2,3] [4,5,6,7] [8,9,10,11]  → P0, P1, P2
Cyclic:    [0,3,6,9] [1,4,7,10] [2,5,8,11]  → P0, P1, P2
Block-Cyc: [0,1,6,7] [2,3,8,9] [4,5,10,11]  → P0, P1, P2

Dynamic Load Balancing

Centralized: Master distributes work to workers (master-worker model)
Distributed: Processes exchange work among themselves
Work Stealing: Idle processors "steal" work from busy ones
Diffusion: Work migrates from heavily loaded to lightly loaded processors

Challenges

Communication overhead
Load imbalance detection
Migration cost
Synchronization

11. Interconnection Networks

Network Topologies

Topology	Description	Diameter	Bisection Width
Bus	Single shared medium	1	1
Ring	Nodes in a circle	n/2	2
Mesh	2D grid of nodes	2(√n - 1)	√n
Torus	Mesh with wrap-around edges	√n	2√n
Hypercube	n-dimensional cube (2^k nodes)	k = log₂n	n/2
Fat Tree	Tree with increasing bandwidth at higher levels	2log₂n	n/2
Crossbar	Full connectivity switch	1	n/2
Butterfly	Multi-stage interconnection	log₂n	n/2

Hypercube (n-cube)

Nodes: 2^k where k = dimension
Degree: Each node connected to k neighbors
Diameter: k = log₂(n)
Example: 3-cube has 8 nodes, each connected to 3 others

Fat Tree

Multi-level tree with increasing bandwidth at higher levels
Used in: Modern supercomputers (e.g., many TOP500 systems)
Advantage: Non-blocking communication, scalable
Types: k-ary n-tree, butterfly fat tree

Interconnect Technologies

Technology	Bandwidth	Latency	Use Case
InfiniBand HDR	200 Gbps	~0.5 μs	HPC clusters
InfiniBand NDR	400 Gbps	~0.3 μs	Next-gen HPC
Ethernet 100GbE	100 Gbps	~10 μs	General clusters
Omni-Path	100 Gbps	~1 μs	Intel HPC fabric
Cray Slingshot	200 Gbps	~1 μs	Cray/HP systems

12. Vector Processing

Definition

Vector processors operate on one-dimensional arrays (vectors) of data using pipelined functional units.

Architecture

┌──────────────────────────────────────────────┐
│            VECTOR PROCESSOR                   │
├──────────────────────────────────────────────┤
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Vector   │  │ Vector   │  │ Vector   │  │
│  │ Register │  │ Register │  │ Register │  │
│  │ V0-V31   │  │ V0-V31   │  │ V0-V31   │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
│       │             │             │         │
│  ┌────▼────┐  ┌────▼────┐  ┌────▼────┐    │
│  │Pipeline │  │Pipeline │  │Pipeline │    │
│  │  Add    │  │Multiply │  │  Load/  │    │
│  │  Unit   │  │  Unit   │  │  Store  │    │
│  └─────────┘  └─────────┘  └─────────┘    │
└──────────────────────────────────────────────┘

Key Concepts

Vector Length (VL): Number of elements processed simultaneously
Chime: Time to complete one vector operation (pipeline depth)
Strip Mining: Breaking long vectors into chunks that fit vector registers
Vectorization: Converting scalar loops to vector operations

Vector vs Scalar

Feature	Scalar	Vector
Operation	One element at a time	Multiple elements simultaneously
Instruction Count	High (loop overhead)	Low (one vector instruction)
Pipeline Utilization	Lower	Higher (deep pipeline)
Examples	x86 scalar ops	Cray, NEC SX, modern SIMD

Modern Vector Processing

Intel AVX-512: 512-bit SIMD (16 single-precision floats per instruction)
ARM SVE (Scalable Vector Extension): Variable vector length
RISC-V V Extension: Vector extension for RISC-V

13. Pipelining in HPC

Instruction Pipelining

Breaks instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.

Classic 5-Stage Pipeline:

IF → ID → EX → MEM → WB
(Instruction Fetch → Decode → Execute → Memory → Write Back)

Speedup: Ideally equal to number of stages (5x for 5-stage)

Hazards:
| Type | Description | Solution |
|------|-------------|----------|
| Data Hazard | Instruction depends on result of previous | Forwarding, stalling |
| Control Hazard | Branch changes program flow | Branch prediction, delay slots |
| Structural Hazard | Resource conflict | Duplicate resources |

Vector Pipelining

Functional units are pipelined (e.g., FP add takes 6 cycles but accepts new input every cycle)
Chaining: Output of one vector operation fed directly as input to another
Example: Cray-1 had pipelined functional units with chaining

Superpipelining

Divides pipeline into more stages (finer granularity)
Higher clock frequency
More hazards to manage

Superscalar

Multiple instructions issued per clock cycle
Multiple functional units operating in parallel
Dynamic scheduling (out-of-order execution)

14. Fault Tolerance in HPC

Importance

Large systems have high failure rates (thousands of components)
A system with 10,000 nodes and MTBF of 1000 hours fails every 6 minutes on average
Long-running applications need protection against failures

Techniques

Technique	Description	Overhead
Checkpoint/Restart	Periodically save application state; restart from last checkpoint on failure	I/O overhead
Replication	Run multiple copies of tasks	2x+ resource usage
RAID	Redundant disk arrays for data protection	Storage overhead
ECC Memory	Error-correcting code for memory	~12.5% overhead
Process Migration	Move processes from failing nodes	Migration time
Algorithm-Based Fault Tolerance (ABFT)	Encode redundancy in algorithm	Computation overhead

Checkpointing Strategies

Strategy	Description
Coordinated	All processes synchronize and checkpoint together
Uncoordinated	Each process checkpoints independently
Communication-induced	Combine coordinated and uncoordinated
Incremental	Only save changed data since last checkpoint
Multi-level	Use different storage (RAM, SSD, disk) for different checkpoint frequencies

15. Benchmarking

LINPACK

Solves dense system of linear equations (Ax = b)
Measures FLOPS (Floating-point Operations Per Second)
Used for TOP500 supercomputer ranking
HPL (High Performance LINPACK): Standard TOP500 benchmark
Reports: Rmax (achieved), Rpeak (theoretical maximum)

SPEC (Standard Performance Evaluation Corporation)

SPEC CPU: Integer and floating-point performance
SPEC MPI: MPI parallel performance
SPEC ACCEL: Accelerator (GPU) performance
SPECpower: Performance per watt
Results normalized against reference machine

Other Benchmarks

Benchmark	Measures	Use Case
HPCG	Conjugate gradient (memory-bound)	Complements LINPACK
Graph 500	Graph traversal (data-intensive)	Big data analytics
IO-500	I/O bandwidth and IOPS	Storage performance
NAS Parallel Benchmarks	Computational fluid dynamics	Scientific computing
STREAM	Memory bandwidth	Memory subsystem
TPC benchmarks	Database performance	Transaction processing

India's Supercomputers in TOP500

System	Peak Performance	Institution
PARAM Siddhi-AI	5.3 PFLOPS	C-DAC
Pratyush	4.0 PFLOPS	IITM (Weather)
Mihir	2.8 PFLOPS	NCMRWF (Weather)

16. Quantum Computing Basics

Classical vs Quantum Computing

Feature	Classical	Quantum
Basic Unit	Bit (0 or 1)	Qubit (0, 1, or superposition)
State	Definite	Probabilistic
Operations	Logic gates	Quantum gates
Parallelism	Limited	Massive (via superposition)
Best For	General-purpose	Specific problems (factoring, search, simulation)

Key Quantum Concepts

Concept	Description
Qubit	Quantum bit; can be in superposition of
Superposition	Qubit exists in multiple states simultaneously
Entanglement	Qubits correlated; measuring one instantly affects the other
Quantum Gates	Operations on qubits (Hadamard, CNOT, Pauli-X/Y/Z)
Quantum Circuit	Sequence of quantum gates applied to qubits
Measurement	Collapses superposition to definite state (0 or 1)
Decoherence	Loss of quantum properties due to environmental interaction

Quantum Algorithms

Algorithm	Purpose	Speedup
Shor's	Integer factorization	Exponential (breaks RSA)
Grover's	Unstructured search	Quadratic (√N vs N)
VQE	Variational quantum eigensolver	Chemistry simulation
QAOA	Quantum approximate optimization	Combinatorial optimization

Quantum Hardware Approaches

Approach	Company	Qubits
Superconducting	IBM, Google	1000+ (IBM Condor)
Trapped Ion	IonQ, Quantinuum	32+
Photonic	Xanadu, PsiQuantum	Variable
Topological	Microsoft	In development

India's Quantum Initiatives

National Mission on Quantum Technologies (NM-QT): ₹6000 crore (2023)
Quantum Computer Development: ISRO, IISc, RRI involvement
Quantum Key Distribution (QKD): DRDO demonstrated
PARAM Shavak: Quantum computing simulator

Key Formulas Summary

Concept	Formula
Speedup	S(p) = T(1) / T(p)
Efficiency	E(p) = S(p) / p
Amdahl's Law	S = 1 / (f + (1-f)/p)
Gustafson's Law	S = p - f(p-1)
Amdahl's Max Speedup	S_max = 1/f (as p→∞)
FLOPS	Floating-point operations / second
Parallel Overhead	T(p) × p - T(1)
Strong Scaling	Fixed problem, increase processors
Weak Scaling	Problem scales with processors

Exam Tips

Flynn's Taxonomy — SISD, SIMD, MISD, MIMD with examples
Amdahl's Law vs Gustafson's Law — fixed vs scaled problem size
Shared vs Distributed Memory — UMA, NUMA, MPP
MPI vs OpenMP — message passing vs shared memory
CUDA architecture — threads, blocks, grids, warps, memory hierarchy
GPU vs CPU — throughput-oriented vs latency-oriented
Cluster vs Grid — tightly vs loosely coupled
Load balancing — static vs dynamic
Interconnection networks — hypercube, mesh, fat tree
Fault tolerance — checkpoint/restart, ECC, replication
Benchmarking — LINPACK, SPEC, HPCG, Graph 500
Quantum computing — qubit, superposition, entanglement, Shor's and Grover's algorithms
IaaS, PaaS, SaaS in HPC context

Practice Questions

11 MCQs for High Performance Computing with detailed explanations.

Q1. Regarding the following concept: 'Transistor count doubles ~every 2 years, but single-thread performance gains are...', which statement is correct?

- A. Transistor count doubles ~every 2 years, but single-thread performance gains are slowing

B. This concept applies only to analog systems and not digital ones
C. This approach has been deprecated in all modern implementations
D. This is defined exclusively at the physical layer of system design

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
-.

This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.

Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q2. Regarding the following concept: ':

A. This approach has been deprecated in all modern implementations
B. This is defined exclusively at the physical layer of system design
C. This concept applies only to analog systems and not digital ones
D. :

Type	Instruction Streams	Data Streams	Description	Example

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D — :

Type	Instruction Streams	Data Streams	Description	Example
.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q3. Regarding the following concept: 'instruction streams...', which statement is correct?

A. This approach has been deprecated in all modern implementations
B. instruction streams
C. This concept applies only to analog systems and not digital ones
D. This is defined exclusively at the physical layer of system design

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B — instruction streams.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q4. Regarding the following concept: 'Physical limits:...', which statement is correct?

A. This approach has been deprecated in all modern implementations
B. This is defined exclusively at the physical layer of system design
C. This concept applies only to analog systems and not digital ones
D. Physical limits:

✅ Correct Answer: Option D

Explanation:
The correct answer is Option D — Physical limits:.

Q5. Regarding the following concept: 'Big data, scientific simulations, AI require massive computation

-...', which statement is correct?

A. This approach has been deprecated in all modern implementations
B. This concept applies only to analog systems and not digital ones
C. Big data, scientific simulations, AI require massive computation
D. This is defined exclusively at the physical layer of system design

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — Big data, scientific simulations, AI require massive computation
-.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q6. Regarding the following concept: '- Moore's Law: Transistor count doubles ~every 2 years, but single-thread perfor...', which statement is correct?

A. This approach has been deprecated in all modern implementations
B. This is defined exclusively at the physical layer of system design
C. - Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
D. This concept applies only to analog systems and not digital ones

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — - Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q7. Which of the following best describes Parallel computing?

A. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed c
B. sequential (f = 0.1):
C. 5% (f = 0.05) with 100 processors:
D. utilized.

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed c.

Q8. Regarding the following concept: '| Single | Single | Sequential computer; one instruction operates on one data it...', which statement is correct?

A. This concept applies only to analog systems and not digital ones
B. | Single | Single | Sequential computer; one instruction operates on one data item at a time | Traditional single-core CPU |
|
C. This approach has been deprecated in all modern implementations
D. This is defined exclusively at the physical layer of system design

✅ Correct Answer: Option B

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q9. Which of the following best describes Example: If 10% of a program?

A. sequential (f = 0.1):
B. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller
C. 5% (f = 0.05) with 100 processors:
D. utilized.

✅ Correct Answer: Option A

Explanation:
The correct answer is Option A — sequential (f = 0.1):.

Q10. Which of the following best describes Example: If serial fraction?

A. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller
B. sequential (f = 0.1):
C. 5% (f = 0.05) with 100 processors:
D. utilized.

✅ Correct Answer: Option C

Explanation:
The correct answer is Option C — 5% (f = 0.05) with 100 processors:.

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

Q11. Which of the following best describes cluster?

A. utilized.
B. a group of interconnected computers (nodes) that work together as a single system.
C. sequential (f = 0.1):
D. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller

✅ Correct Answer: Option B

Explanation:
The correct answer is Option B — a group of interconnected computers (nodes) that work together as a single system..

Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.

High Performance Computing

Table of Contents

1. Parallel Computing Architectures

Definition

Why Parallel Computing?

Types of Parallelism

Architecture Classification

2. Flynn's Taxonomy

Detailed Explanation

3. Shared Memory vs Distributed Memory Systems

Shared Memory Systems

Distributed Memory Systems

Comparison

4. GPU Computing

GPU Architecture

NVIDIA CUDA Architecture

CUDA Execution Model

CUDA Memory Hierarchy (Fastest to Slowest)

OpenCL (Open Computing Language)

CUDA vs OpenCL

GPU Computing Applications

5. Cluster Computing

Definition

Architecture

Types of Clusters

Key Components

India's HPC Initiatives

6. Grid Computing

Definition

Architecture

Grid vs Cluster

Grid Middleware

Grid Projects

7. Cloud Computing for HPC

Service Models in HPC Context

HPC in the Cloud

Cloud vs Traditional HPC

8. Parallel Programming Models

8.1 MPI (Message Passing Interface)

8.2 OpenMP (Open Multi-Processing)

8.3 MapReduce

MPI vs OpenMP

Hybrid MPI+OpenMP

9. Performance Metrics

9.1 Speedup

9.2 Efficiency

9.3 Amdahl's Law

9.4 Gustafson's Law

9.5 Comparison

9.6 Other Metrics

Strong vs Weak Scaling

10. Load Balancing

Definition

Types

Static Load Balancing

Dynamic Load Balancing

Challenges

11. Interconnection Networks

Network Topologies

Hypercube (n-cube)

Fat Tree

Interconnect Technologies

12. Vector Processing

Definition

Architecture

Key Concepts

Vector vs Scalar

Modern Vector Processing

13. Pipelining in HPC

Instruction Pipelining

Vector Pipelining

Superpipelining

Superscalar

14. Fault Tolerance in HPC

Importance

Techniques

Checkpointing Strategies

15. Benchmarking

LINPACK

SPEC (Standard Performance Evaluation Corporation)