High Performance Computing
Table of Contents
- Parallel Computing Architectures
- Flynn's Taxonomy
- Shared Memory vs Distributed Memory Systems
- GPU Computing
- Cluster Computing
- Grid Computing
- Cloud Computing for HPC
- Parallel Programming Models
- Performance Metrics
- Load Balancing
- Interconnection Networks
- Vector Processing
- Pipelining in HPC
- Fault Tolerance in HPC
- Benchmarking
- Quantum Computing Basics
1. Parallel Computing Architectures
Definition
Parallel computing is the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed concurrently.
Why Parallel Computing?
- Physical limits: Single processor speed limited by heat dissipation and clock frequency
- Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
- Problem scale: Big data, scientific simulations, AI require massive computation
- Cost-effectiveness: Multiple cheaper processors can outperform one expensive processor
Types of Parallelism
| Type | Description | Example |
|---|---|---|
| Bit-level | Increase processor word size (8→16→32→64 bit) | 64-bit processors |
| Instruction-level | Execute multiple instructions simultaneously | Pipelining, superscalar |
| Thread-level | Multiple threads execute concurrently | Multi-core processors |
| Data-level | Same operation on multiple data elements | SIMD, GPU |
| Task-level | Different tasks execute in parallel | Distributed computing |
Architecture Classification
PARALLEL ARCHITECTURES
│
┌────────────────┼────────────────┐
│ │ │
Shared Memory Distributed Hybrid (Cluster
│ Memory of SMPs)
┌─────┴─────┐ │
│ │ │
UMA NUMA MPP
(Symmetric (Non- (Massively
Multiproc.) Uniform Parallel
Memory Processing)
Access)
2. Flynn's Taxonomy
Based on the number of instruction streams and data streams:
| Type | Instruction Streams | Data Streams | Description | Example |
|---|---|---|---|---|
| SISD | Single | Single | Sequential computer; one instruction operates on one data item at a time | Traditional single-core CPU |
| SIMD | Single | Multiple | One instruction operates on multiple data elements simultaneously | GPU, vector processors, MMX/SSE |
| MISD | Multiple | Single | Multiple instructions operate on the same data stream | Fault-tolerant systems (rare) |
| MIMD | Multiple | Multiple | Multiple processors execute different instructions on different data | Multi-core CPUs, clusters, grids |
Detailed Explanation
SISD (Single Instruction, Single Data)
- Classical von Neumann architecture
- Sequential execution
- One ALU, one control unit
- Example: Traditional single-processor computer
SIMD (Single Instruction, Multiple Data)
- Data-level parallelism
- Same operation applied to all data elements simultaneously
- Types:
- Array Processor: Each processing element has its own memory
- Vector Processor: Operates on vectors (1D arrays) using pipelined functional units
- Examples: Intel AVX, NVIDIA GPU (warp execution), MMX, SSE, NEON
MISD (Multiple Instruction, Single Data)
- Rarely used in practice
- Multiple processors apply different operations to the same data
- Example: Space shuttle flight control computer (redundant processing for fault tolerance)
MIMD (Multiple Instruction, Multiple Data)
- Most common parallel architecture
- Each processor executes its own program on its own data
- Subtypes:
- Tightly coupled (shared memory): All processors share common memory (SMP, NUMA)
- Loosely coupled (distributed memory): Each processor has its own memory; communicate via message passing (clusters, MPP)
3. Shared Memory vs Distributed Memory Systems
Shared Memory Systems
All processors share a common address space (memory).
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ CPU 0 │ │ CPU 1 │ │ CPU 2 │ │ CPU 3 │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
└───────────┴─────┬─────┴───────────┘
│
┌──────▼──────┐
│ SHARED │
│ MEMORY │
└─────────────┘
Types:
| Type | Description | Access Time |
|------|-------------|-------------|
| UMA (Uniform Memory Access) | All processors have equal access time to memory | Uniform |
| NUMA (Non-Uniform Memory Access) | Access time varies based on memory location | Non-uniform |
| COMA (Cache-Only Memory Access) | All local memories treated as cache | Varies |
Advantages:
- Simple programming model (shared variables)
- No explicit communication needed
- Easy data sharing
Disadvantages:
- Limited scalability (memory bandwidth bottleneck)
- Cache coherence overhead
- Costly hardware for large systems
Distributed Memory Systems
Each processor has its own private memory; processors communicate via message passing.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ CPU 0 │ │ CPU 1 │ │ CPU 2 │
│ Local Mem 0 │ │ Local Mem 1 │ │ Local Mem 2 │
└──────┬───────┘ └──────┬───────┘ └──────┬───────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌──────▼──────┐
│ INTERCONNECT │
│ NETWORK │
└─────────────┘
Advantages:
- Highly scalable
- No memory contention
- Cost-effective (commodity hardware)
Disadvantages:
- Complex programming (explicit message passing)
- Communication overhead
- Data distribution challenges
Comparison
| Feature | Shared Memory | Distributed Memory |
|---|---|---|
| Memory | Common address space | Private per processor |
| Communication | Read/Write shared variables | Message passing |
| Scalability | Limited (typically <100 processors) | High (thousands to millions) |
| Programming | Easier (OpenMP) | Harder (MPI) |
| Cost | Higher (specialized hardware) | Lower (commodity) |
| Cache Coherence | Required | Not applicable |
| Examples | Multi-core workstation, SMP | Cluster, MPP, grid |
4. GPU Computing
GPU Architecture
GPU (Graphics Processing Unit): Massively parallel processor designed for graphics rendering, now widely used for general-purpose computing (GPGPU).
┌─────────────────────────────────────────────┐
│ GPU ARCHITECTURE │
├─────────────────────────────────────────────┤
│ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
│ │ SM │ │ SM │ │ SM │ │ SM │ ... │
│ │(Str-│ │ │ │ │ │ │ │
│ │ eam │ │ │ │ │ │ │ │
│ │Mult-│ │ │ │ │ │ │ │
│ │proc)│ │ │ │ │ │ │ │
│ │32 │ │32 │ │32 │ │32 │ │
│ │CUDA │ │CUDA │ │CUDA │ │CUDA │ │
│ │cores│ │cores│ │cores│ │cores│ │
│ └─────┘ └─────┘ └─────┘ └─────┘ │
├─────────────────────────────────────────────┤
│ SHARED MEMORY / L1 CACHE │
├─────────────────────────────────────────────┤
│ L2 CACHE │
├─────────────────────────────────────────────┤
│ GLOBAL MEMORY (VRAM) │
└─────────────────────────────────────────────┘
NVIDIA CUDA Architecture
| Component | Description |
|---|---|
| CUDA Core | Single floating-point/integer processing unit |
| Streaming Multiprocessor (SM) | Group of CUDA cores (32-128 per SM) |
| Warp | Group of 32 threads executing same instruction (SIMT) |
| Thread Block | Group of threads executing on same SM |
| Grid | Collection of thread blocks |
| Global Memory | GPU's main memory (high bandwidth, high latency) |
| Shared Memory | On-chip memory shared by threads in a block |
| Registers | Fastest, private to each thread |
CUDA Execution Model
Host (CPU) ──▶ Kernel Launch ──▶ Grid of Thread Blocks
│
┌───────┼───────┐
▼ ▼ ▼
Block₀ Block₁ Block₂
│ │ │
┌──┴──┐ ┌──┴──┐ ┌──┴──┐
│Warp │ │Warp │ │Warp │
│0,1..│ │0,1..│ │0,1..│
└─────┘ └─────┘ └─────┘
CUDA Memory Hierarchy (Fastest to Slowest)
- Registers — per-thread, fastest
- Shared Memory — per-block, ~100x faster than global
- L1/L2 Cache — automatic caching
- Global Memory — all threads, highest latency
- Constant/Texture Memory — cached read-only
OpenCL (Open Computing Language)
- Open standard for parallel programming across heterogeneous platforms
- Works on GPUs (NVIDIA, AMD, Intel), CPUs, FPGAs, DSPs
- Platform Model: Host + Devices (Compute Units → Processing Elements)
- Execution Model: Kernels executed as NDRange (N-dimensional index space)
- Memory Model: Private, Local, Global, Constant memory
CUDA vs OpenCL
| Feature | CUDA | OpenCL |
|---|---|---|
| Vendor | NVIDIA only | Cross-platform |
| Hardware | NVIDIA GPUs | GPUs, CPUs, FPGAs, DSPs |
| Maturity | More mature, better tools | Open standard |
| Performance | Optimized for NVIDIA | Varies by vendor |
| Ecosystem | cuDNN, cuBLAS, TensorRT | Standard libraries |
| Programming | C/C++ extension | C-based kernel language |
GPU Computing Applications
- Deep learning training and inference
- Scientific simulations (molecular dynamics, weather)
- Image and video processing
- Cryptocurrency mining
- Financial modeling
- Medical imaging (CT, MRI reconstruction)
5. Cluster Computing
Definition
A cluster is a group of interconnected computers (nodes) that work together as a single system.
Architecture
┌──────────────────────────────────────────────┐
│ CLUSTER ARCHITECTURE │
├──────────────────────────────────────────────┤
│ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │ Node 0 │ │ Node 1 │ │ Node 2 │ ... │
│ │(Master)│ │(Worker)│ │(Worker)│ │
│ └───┬────┘ └───┬────┘ └───┬────┘ │
│ │ │ │ │
│ └──────────┼──────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ HIGH-SPEED │ │
│ │ INTERCONNECT│ │
│ │(InfiniBand/ │ │
│ │ Ethernet) │ │
│ └─────────────┘ │
├──────────────────────────────────────────────┤
│ Shared Storage (NFS, Lustre, GPFS) │
└──────────────────────────────────────────────┘
Types of Clusters
| Type | Description | Use Case |
|---|---|---|
| High Availability (HA) | Redundant nodes for failover | Mission-critical applications |
| Load Balancing | Distribute workload across nodes | Web servers, databases |
| High Performance (HPC) | Parallel processing for computation | Scientific computing, simulation |
| Storage Cluster | Distributed file systems | Big data, cloud storage |
Key Components
| Component | Description |
|---|---|
| Compute Nodes | Worker machines performing computation |
| Master/Head Node | Manages job scheduling and coordination |
| Interconnect | High-speed network (InfiniBand, 10/100 GbE) |
| Storage | Shared parallel file system (Lustre, GPFS, BeeGFS) |
| Scheduler | Job management (SLURM, PBS, LSF) |
India's HPC Initiatives
- PARAM series (C-DAC): PARAM Siddhi-AI (rank 63 in TOP500)
- NSM (National Supercomputing Mission): Deploying supercomputers across academic/research institutions
- Pratyush and Mihir: Weather and climate supercomputers
6. Grid Computing
Definition
Grid computing coordinates geographically distributed, heterogeneous resources across multiple administrative domains to solve large-scale computational problems.
Architecture
┌──────────────────────────────────────────────┐
│ GRID ARCHITECTURE │
├──────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Site A │ │ Site B │ │ Site C │ │
│ │(Univ.) │ │(Lab) │ │(Industry)│ │
│ │Resources │ │Resources │ │Resources │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └─────────────┼─────────────┘ │
│ │ │
│ ┌──────▼──────┐ │
│ │ GRID │ │
│ │ MIDDLEWARE │ │
│ │ (Globus, │ │
│ │ gLite) │ │
│ └─────────────┘ │
└──────────────────────────────────────────────┘
Grid vs Cluster
| Feature | Cluster | Grid |
|---|---|---|
| Location | Single location | Geographically distributed |
| Ownership | Single organization | Multiple organizations |
| Homogeneity | Homogeneous | Heterogeneous |
| Network | Dedicated high-speed | Internet/WAN |
| Coupling | Tightly coupled | Loosely coupled |
| Middleware | MPI, job scheduler | Globus Toolkit, UNICORE |
| Use Case | Parallel computing | Collaborative research |
Grid Middleware
- Globus Toolkit: Security, data management, resource management, information services
- gLite: Used in European Grid Infrastructure (EGI)
- UNICORE: Uniform Interface to Computing Resources
Grid Projects
- SETI@home: Search for extraterrestrial intelligence (volunteer computing)
- LHC Computing Grid (WLCG): Process data from Large Hadron Collider
- EGI: European Grid Infrastructure
- IndiaGrid: Indian grid computing initiative
7. Cloud Computing for HPC
Service Models in HPC Context
| Model | Description | HPC Example |
|---|---|---|
| IaaS (Infrastructure as a Service) | Virtual machines, storage, networking | AWS EC2 HPC instances, Azure H-series |
| PaaS (Platform as a Service) | Development platforms, tools | AWS ParallelCluster, Azure CycleCloud |
| SaaS (Software as a Service) | Ready-to-use applications | Simulation software on cloud (ANSYS Cloud) |
HPC in the Cloud
- Elasticity: Scale up/down based on workload
- Pay-per-use: No upfront hardware investment
- Specialized instances: GPU instances (AWS p4d, Azure NCv3), high-memory instances
- Challenges: Data transfer costs, latency, security, vendor lock-in
Cloud vs Traditional HPC
| Feature | Traditional HPC | Cloud HPC |
|---|---|---|
| Cost | High CAPEX | OPEX, pay-per-use |
| Scalability | Fixed capacity | Elastic |
| Access | Limited to organization | Global access |
| Customization | Full control | Limited by provider |
| Performance | Optimized hardware | Variable (noisy neighbors) |
| Best For | Steady, large workloads | Bursty, variable workloads |
8. Parallel Programming Models
8.1 MPI (Message Passing Interface)
| Feature | Description |
|---|---|
| Type | Message passing library |
| Memory Model | Distributed memory |
| Standard | MPI Forum (MPI-1: 1994, MPI-2: 1997, MPI-3: 2012, MPI-4: 2021) |
| Languages | C, C++, Fortran |
| Communication | Point-to-point and collective |
Key MPI Functions:
MPI_Init(&argc, &argv); // Initialize MPI
MPI_Comm_size(MPI_COMM_WORLD, &size); // Get number of processes
MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get process rank
// Point-to-point
MPI_Send(buf, count, datatype, dest, tag, comm);
MPI_Recv(buf, count, datatype, source, tag, comm, &status);
// Collective
MPI_Bcast(buf, count, datatype, root, comm); // Broadcast
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm); // Reduce
MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm);
MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm); // All-reduce
MPI_Finalize(); // Finalize MPI
Communication Patterns:
| Pattern | Description |
|---------|-------------|
| Point-to-point | Send/receive between two processes |
| Broadcast | One process sends to all |
| Scatter | One process distributes data to all |
| Gather | All processes send data to one |
| Reduce | Combine data from all using operation (sum, max, etc.) |
| All-to-all | Every process sends to every other |
8.2 OpenMP (Open Multi-Processing)
| Feature | Description |
|---|---|
| Type | Shared memory parallel programming |
| Memory Model | Shared memory |
| Approach | Compiler directives (pragmas) |
| Languages | C, C++, Fortran |
| Parallelism | Fork-join model |
Key OpenMP Directives:
#include <omp.h>
// Parallel region
#pragma omp parallel
{
int tid = omp_get_thread_num();
printf("Hello from thread %d\n", tid);
}
// Parallel for
#pragma omp parallel for
for (int i = 0; i < N; i++) {
a[i] = b[i] + c[i];
}
// Reduction
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += a[i];
}
// Critical section
#pragma omp critical
{
shared_variable++;
}
// Barrier
#pragma omp barrier
OpenMP Clauses:
| Clause | Description |
|--------|-------------|
| shared(var) | Variable shared among all threads |
| private(var) | Each thread has its own copy |
| reduction(op:var) | Combine values using operation |
| schedule(type,chunk) | Control loop iteration distribution |
| num_threads(n) | Set number of threads |
| nowait | Remove implicit barrier |
8.3 MapReduce
(Detailed in Big Data Analysis chapter)
| Feature | Description |
|---|---|
| Type | Distributed data processing model |
| Phases | Map → Shuffle → Reduce |
| Framework | Apache Hadoop, Spark |
| Use Case | Large-scale batch processing |
MPI vs OpenMP
| Feature | MPI | OpenMP |
|---|---|---|
| Memory Model | Distributed | Shared |
| Scalability | Thousands of nodes | Single node (multi-core) |
| Programming | Explicit message passing | Compiler directives |
| Data Transfer | Explicit (send/receive) | Implicit (shared memory) |
| Fault Tolerance | Process-level | Thread-level |
| Best For | Large clusters | Multi-core processors |
Hybrid MPI+OpenMP
- MPI for inter-node communication
- OpenMP for intra-node parallelism
- Reduces MPI communication overhead
- Common in modern HPC applications
9. Performance Metrics
9.1 Speedup
Speedup (S): Ratio of sequential execution time to parallel execution time.
S(p) = T(1) / T(p)
Where:
- T(1) = execution time on single processor
- T(p) = execution time on p processors
Ideal Speedup: S(p) = p (linear speedup)
9.2 Efficiency
Efficiency (E): Measures how effectively processors are utilized.
E(p) = S(p) / p = T(1) / (p × T(p))
- E = 1 (100%): Perfect utilization
- E < 1: Overhead (communication, synchronization, idle time)
9.3 Amdahl's Law
Statement: The speedup of a program using multiple processors is limited by the sequential fraction of the program.
S(p) = 1 / (f + (1-f)/p)
Where:
- f = fraction of program that is sequential (0 ≤ f ≤ 1)
- p = number of processors
- (1-f) = fraction that can be parallelized
Maximum Speedup: As p → ∞, S(max) = 1/f
Example: If 10% of a program is sequential (f = 0.1):
- Maximum speedup = 1/0.1 = 10x (regardless of number of processors!)
- With 100 processors: S = 1/(0.1 + 0.9/100) = 1/0.109 ≈ 9.17x
Implication: Even small serial fractions severely limit speedup.
9.4 Gustafson's Law
Statement: When problem size scales with the number of processors, speedup can be nearly linear.
S(p) = p - f × (p - 1)
Where:
- f = fraction of time spent in serial code (in parallel execution)
- p = number of processors
Key Insight: Amdahl's Law assumes fixed problem size; Gustafson's Law assumes scaled problem size.
Example: If serial fraction is 5% (f = 0.05) with 100 processors:
- S = 100 - 0.05 × 99 = 100 - 4.95 = 95.05x
9.5 Comparison
| Aspect | Amdahl's Law | Gustafson's Law |
|---|---|---|
| Problem Size | Fixed | Scales with processors |
| Serial Fraction | Fixed fraction of total code | Fraction of parallel execution time |
| Speedup Limit | Bounded by 1/f | Nearly linear |
| Perspective | Strong scaling | Weak scaling |
| Realism | Pessimistic | More optimistic |
9.6 Other Metrics
| Metric | Formula | Description |
|---|---|---|
| Throughput | Tasks / Time | Work completed per unit time |
| Latency | Time per task | Time to complete one task |
| FLOPS | Floating-point operations per second | Computational speed |
| Parallel Overhead | T(p) × p - T(1) | Extra work due to parallelism |
| Scalability | How performance changes with p | Strong vs weak scaling |
Strong vs Weak Scaling
| Type | Description | Metric |
|---|---|---|
| Strong Scaling | Fixed problem size, increase processors | Speedup, efficiency |
| Weak Scaling | Problem size grows with processors | Efficiency (should stay constant) |
10. Load Balancing
Definition
Distributing computational work evenly across all processors to minimize idle time and maximize throughput.
Types
| Type | Description | When to Use |
|---|---|---|
| Static | Work distributed before execution | Known, uniform workload |
| Dynamic | Work distributed during execution | Unknown, varying workload |
Static Load Balancing
- Block Distribution: Divide data into equal contiguous blocks
- Cyclic Distribution: Round-robin assignment of work items
- Block-Cyclic: Hybrid approach (block size < total/p)
Block: [0,1,2,3] [4,5,6,7] [8,9,10,11] → P0, P1, P2
Cyclic: [0,3,6,9] [1,4,7,10] [2,5,8,11] → P0, P1, P2
Block-Cyc: [0,1,6,7] [2,3,8,9] [4,5,10,11] → P0, P1, P2
Dynamic Load Balancing
- Centralized: Master distributes work to workers (master-worker model)
- Distributed: Processes exchange work among themselves
- Work Stealing: Idle processors "steal" work from busy ones
- Diffusion: Work migrates from heavily loaded to lightly loaded processors
Challenges
- Communication overhead
- Load imbalance detection
- Migration cost
- Synchronization
11. Interconnection Networks
Network Topologies
| Topology | Description | Diameter | Bisection Width |
|---|---|---|---|
| Bus | Single shared medium | 1 | 1 |
| Ring | Nodes in a circle | n/2 | 2 |
| Mesh | 2D grid of nodes | 2(√n - 1) | √n |
| Torus | Mesh with wrap-around edges | √n | 2√n |
| Hypercube | n-dimensional cube (2^k nodes) | k = log₂n | n/2 |
| Fat Tree | Tree with increasing bandwidth at higher levels | 2log₂n | n/2 |
| Crossbar | Full connectivity switch | 1 | n/2 |
| Butterfly | Multi-stage interconnection | log₂n | n/2 |
Hypercube (n-cube)
- Nodes: 2^k where k = dimension
- Degree: Each node connected to k neighbors
- Diameter: k = log₂(n)
- Example: 3-cube has 8 nodes, each connected to 3 others
Fat Tree
- Multi-level tree with increasing bandwidth at higher levels
- Used in: Modern supercomputers (e.g., many TOP500 systems)
- Advantage: Non-blocking communication, scalable
- Types: k-ary n-tree, butterfly fat tree
Interconnect Technologies
| Technology | Bandwidth | Latency | Use Case |
|---|---|---|---|
| InfiniBand HDR | 200 Gbps | ~0.5 μs | HPC clusters |
| InfiniBand NDR | 400 Gbps | ~0.3 μs | Next-gen HPC |
| Ethernet 100GbE | 100 Gbps | ~10 μs | General clusters |
| Omni-Path | 100 Gbps | ~1 μs | Intel HPC fabric |
| Cray Slingshot | 200 Gbps | ~1 μs | Cray/HP systems |
12. Vector Processing
Definition
Vector processors operate on one-dimensional arrays (vectors) of data using pipelined functional units.
Architecture
┌──────────────────────────────────────────────┐
│ VECTOR PROCESSOR │
├──────────────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Vector │ │ Vector │ │ Vector │ │
│ │ Register │ │ Register │ │ Register │ │
│ │ V0-V31 │ │ V0-V31 │ │ V0-V31 │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ ┌────▼────┐ ┌────▼────┐ ┌────▼────┐ │
│ │Pipeline │ │Pipeline │ │Pipeline │ │
│ │ Add │ │Multiply │ │ Load/ │ │
│ │ Unit │ │ Unit │ │ Store │ │
│ └─────────┘ └─────────┘ └─────────┘ │
└──────────────────────────────────────────────┘
Key Concepts
- Vector Length (VL): Number of elements processed simultaneously
- Chime: Time to complete one vector operation (pipeline depth)
- Strip Mining: Breaking long vectors into chunks that fit vector registers
- Vectorization: Converting scalar loops to vector operations
Vector vs Scalar
| Feature | Scalar | Vector |
|---|---|---|
| Operation | One element at a time | Multiple elements simultaneously |
| Instruction Count | High (loop overhead) | Low (one vector instruction) |
| Pipeline Utilization | Lower | Higher (deep pipeline) |
| Examples | x86 scalar ops | Cray, NEC SX, modern SIMD |
Modern Vector Processing
- Intel AVX-512: 512-bit SIMD (16 single-precision floats per instruction)
- ARM SVE (Scalable Vector Extension): Variable vector length
- RISC-V V Extension: Vector extension for RISC-V
13. Pipelining in HPC
Instruction Pipelining
Breaks instruction execution into stages, allowing multiple instructions to be in different stages simultaneously.
Classic 5-Stage Pipeline:
IF → ID → EX → MEM → WB
(Instruction Fetch → Decode → Execute → Memory → Write Back)
Speedup: Ideally equal to number of stages (5x for 5-stage)
Hazards:
| Type | Description | Solution |
|------|-------------|----------|
| Data Hazard | Instruction depends on result of previous | Forwarding, stalling |
| Control Hazard | Branch changes program flow | Branch prediction, delay slots |
| Structural Hazard | Resource conflict | Duplicate resources |
Vector Pipelining
- Functional units are pipelined (e.g., FP add takes 6 cycles but accepts new input every cycle)
- Chaining: Output of one vector operation fed directly as input to another
- Example: Cray-1 had pipelined functional units with chaining
Superpipelining
- Divides pipeline into more stages (finer granularity)
- Higher clock frequency
- More hazards to manage
Superscalar
- Multiple instructions issued per clock cycle
- Multiple functional units operating in parallel
- Dynamic scheduling (out-of-order execution)
14. Fault Tolerance in HPC
Importance
- Large systems have high failure rates (thousands of components)
- A system with 10,000 nodes and MTBF of 1000 hours fails every 6 minutes on average
- Long-running applications need protection against failures
Techniques
| Technique | Description | Overhead |
|---|---|---|
| Checkpoint/Restart | Periodically save application state; restart from last checkpoint on failure | I/O overhead |
| Replication | Run multiple copies of tasks | 2x+ resource usage |
| RAID | Redundant disk arrays for data protection | Storage overhead |
| ECC Memory | Error-correcting code for memory | ~12.5% overhead |
| Process Migration | Move processes from failing nodes | Migration time |
| Algorithm-Based Fault Tolerance (ABFT) | Encode redundancy in algorithm | Computation overhead |
Checkpointing Strategies
| Strategy | Description |
|---|---|
| Coordinated | All processes synchronize and checkpoint together |
| Uncoordinated | Each process checkpoints independently |
| Communication-induced | Combine coordinated and uncoordinated |
| Incremental | Only save changed data since last checkpoint |
| Multi-level | Use different storage (RAM, SSD, disk) for different checkpoint frequencies |
15. Benchmarking
LINPACK
- Solves dense system of linear equations (Ax = b)
- Measures FLOPS (Floating-point Operations Per Second)
- Used for TOP500 supercomputer ranking
- HPL (High Performance LINPACK): Standard TOP500 benchmark
- Reports: Rmax (achieved), Rpeak (theoretical maximum)
SPEC (Standard Performance Evaluation Corporation)
- SPEC CPU: Integer and floating-point performance
- SPEC MPI: MPI parallel performance
- SPEC ACCEL: Accelerator (GPU) performance
- SPECpower: Performance per watt
- Results normalized against reference machine
Other Benchmarks
| Benchmark | Measures | Use Case |
|---|---|---|
| HPCG | Conjugate gradient (memory-bound) | Complements LINPACK |
| Graph 500 | Graph traversal (data-intensive) | Big data analytics |
| IO-500 | I/O bandwidth and IOPS | Storage performance |
| NAS Parallel Benchmarks | Computational fluid dynamics | Scientific computing |
| STREAM | Memory bandwidth | Memory subsystem |
| TPC benchmarks | Database performance | Transaction processing |
India's Supercomputers in TOP500
| System | Peak Performance | Institution |
|---|---|---|
| PARAM Siddhi-AI | 5.3 PFLOPS | C-DAC |
| Pratyush | 4.0 PFLOPS | IITM (Weather) |
| Mihir | 2.8 PFLOPS | NCMRWF (Weather) |
16. Quantum Computing Basics
Classical vs Quantum Computing
| Feature | Classical | Quantum |
|---|---|---|
| Basic Unit | Bit (0 or 1) | Qubit (0, 1, or superposition) |
| State | Definite | Probabilistic |
| Operations | Logic gates | Quantum gates |
| Parallelism | Limited | Massive (via superposition) |
| Best For | General-purpose | Specific problems (factoring, search, simulation) |
Key Quantum Concepts
| Concept | Description |
|---|---|
| Qubit | Quantum bit; can be in superposition of |
| Superposition | Qubit exists in multiple states simultaneously |
| Entanglement | Qubits correlated; measuring one instantly affects the other |
| Quantum Gates | Operations on qubits (Hadamard, CNOT, Pauli-X/Y/Z) |
| Quantum Circuit | Sequence of quantum gates applied to qubits |
| Measurement | Collapses superposition to definite state (0 or 1) |
| Decoherence | Loss of quantum properties due to environmental interaction |
Quantum Algorithms
| Algorithm | Purpose | Speedup |
|---|---|---|
| Shor's | Integer factorization | Exponential (breaks RSA) |
| Grover's | Unstructured search | Quadratic (√N vs N) |
| VQE | Variational quantum eigensolver | Chemistry simulation |
| QAOA | Quantum approximate optimization | Combinatorial optimization |
Quantum Hardware Approaches
| Approach | Company | Qubits |
|---|---|---|
| Superconducting | IBM, Google | 1000+ (IBM Condor) |
| Trapped Ion | IonQ, Quantinuum | 32+ |
| Photonic | Xanadu, PsiQuantum | Variable |
| Topological | Microsoft | In development |
India's Quantum Initiatives
- National Mission on Quantum Technologies (NM-QT): ₹6000 crore (2023)
- Quantum Computer Development: ISRO, IISc, RRI involvement
- Quantum Key Distribution (QKD): DRDO demonstrated
- PARAM Shavak: Quantum computing simulator
Key Formulas Summary
| Concept | Formula |
|---|---|
| Speedup | S(p) = T(1) / T(p) |
| Efficiency | E(p) = S(p) / p |
| Amdahl's Law | S = 1 / (f + (1-f)/p) |
| Gustafson's Law | S = p - f(p-1) |
| Amdahl's Max Speedup | S_max = 1/f (as p→∞) |
| FLOPS | Floating-point operations / second |
| Parallel Overhead | T(p) × p - T(1) |
| Strong Scaling | Fixed problem, increase processors |
| Weak Scaling | Problem scales with processors |
Exam Tips
- Flynn's Taxonomy — SISD, SIMD, MISD, MIMD with examples
- Amdahl's Law vs Gustafson's Law — fixed vs scaled problem size
- Shared vs Distributed Memory — UMA, NUMA, MPP
- MPI vs OpenMP — message passing vs shared memory
- CUDA architecture — threads, blocks, grids, warps, memory hierarchy
- GPU vs CPU — throughput-oriented vs latency-oriented
- Cluster vs Grid — tightly vs loosely coupled
- Load balancing — static vs dynamic
- Interconnection networks — hypercube, mesh, fat tree
- Fault tolerance — checkpoint/restart, ECC, replication
- Benchmarking — LINPACK, SPEC, HPCG, Graph 500
- Quantum computing — qubit, superposition, entanglement, Shor's and Grover's algorithms
- IaaS, PaaS, SaaS in HPC context
Practice Questions
11 MCQs for High Performance Computing with detailed explanations.
Q1. Regarding the following concept: 'Transistor count doubles ~every 2 years, but single-thread performance gains are...', which statement is correct?
- A. Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
- B. This concept applies only to analog systems and not digital ones
- C. This approach has been deprecated in all modern implementations
- D. This is defined exclusively at the physical layer of system design
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A — Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
-.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q2. Regarding the following concept: ':
| Type | Instruction Streams | Data Streams | Description | Example |
|------...', which statement is correct?
- A. This approach has been deprecated in all modern implementations
- B. This is defined exclusively at the physical layer of system design
- C. This concept applies only to analog systems and not digital ones
- D. :
| Type | Instruction Streams | Data Streams | Description | Example |
|---|---|---|---|---|
✅ Correct Answer: Option D
Explanation:
The correct answer is Option D — :
| Type | Instruction Streams | Data Streams | Description | Example |
|---|---|---|---|---|
| . |
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q3. Regarding the following concept: 'instruction streams...', which statement is correct?
- A. This approach has been deprecated in all modern implementations
- B. instruction streams
- C. This concept applies only to analog systems and not digital ones
- D. This is defined exclusively at the physical layer of system design
✅ Correct Answer: Option B
Explanation:
The correct answer is Option B — instruction streams.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q4. Regarding the following concept: 'Physical limits:...', which statement is correct?
- A. This approach has been deprecated in all modern implementations
- B. This is defined exclusively at the physical layer of system design
- C. This concept applies only to analog systems and not digital ones
- D. Physical limits:
✅ Correct Answer: Option D
Explanation:
The correct answer is Option D — Physical limits:.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q5. Regarding the following concept: 'Big data, scientific simulations, AI require massive computation
-...', which statement is correct?
- A. This approach has been deprecated in all modern implementations
- B. This concept applies only to analog systems and not digital ones
-
C. Big data, scientific simulations, AI require massive computation
- D. This is defined exclusively at the physical layer of system design
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C — Big data, scientific simulations, AI require massive computation
-.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q6. Regarding the following concept: '- Moore's Law: Transistor count doubles ~every 2 years, but single-thread perfor...', which statement is correct?
- A. This approach has been deprecated in all modern implementations
- B. This is defined exclusively at the physical layer of system design
- C. - Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing
- D. This concept applies only to analog systems and not digital ones
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C — - Moore's Law: Transistor count doubles ~every 2 years, but single-thread performance gains are slowing.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q7. Which of the following best describes Parallel computing?
- A. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed c
- B. sequential (f = 0.1):
- C. 5% (f = 0.05) with 100 processors:
- D. utilized.
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A — the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller parts that can be processed c.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q8. Regarding the following concept: '| Single | Single | Sequential computer; one instruction operates on one data it...', which statement is correct?
- A. This concept applies only to analog systems and not digital ones
- B. | Single | Single | Sequential computer; one instruction operates on one data item at a time | Traditional single-core CPU |
| - C. This approach has been deprecated in all modern implementations
- D. This is defined exclusively at the physical layer of system design
✅ Correct Answer: Option B
Explanation:
The correct answer is Option B — | Single | Single | Sequential computer; one instruction operates on one data item at a time | Traditional single-core CPU |
|.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q9. Which of the following best describes Example: If 10% of a program?
- A. sequential (f = 0.1):
- B. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller
- C. 5% (f = 0.05) with 100 processors:
- D. utilized.
✅ Correct Answer: Option A
Explanation:
The correct answer is Option A — sequential (f = 0.1):.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q10. Which of the following best describes Example: If serial fraction?
- A. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller
- B. sequential (f = 0.1):
- C. 5% (f = 0.05) with 100 processors:
- D. utilized.
✅ Correct Answer: Option C
Explanation:
The correct answer is Option C — 5% (f = 0.05) with 100 processors:.
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option B — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
Q11. Which of the following best describes cluster?
- A. utilized.
- B. a group of interconnected computers (nodes) that work together as a single system.
- C. sequential (f = 0.1):
- D. the simultaneous use of multiple processing elements to solve a computational problem, dividing the problem into smaller
✅ Correct Answer: Option B
Explanation:
The correct answer is Option B — a group of interconnected computers (nodes) that work together as a single system..
This concept is covered under High Performance Computing in the CBDT Assistant Director Systems syllabus. The answer is established through standard definitions and widely accepted principles in the field.
Why other options are incorrect:
- Option A — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option C — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.
- Option D — This option is factually incorrect or describes a concept from a different domain, making it an invalid choice for this question.