Advanced Computer Architecture Unit 2 – Instruction Parallelism and Pipelining
Instruction parallelism and pipelining are key techniques for boosting processor performance. These methods allow multiple instructions to be executed simultaneously, increasing throughput and efficiency. By breaking down instruction execution into stages and overlapping them, processors can handle more work in less time.
Pipelining divides instruction execution into stages like fetch, decode, and execute. This assembly-line approach lets processors work on multiple instructions at once. However, challenges such as data dependencies and branch mispredictions must be addressed to maintain smooth pipeline flow and maximize performance gains.
Fundamentals of Instruction Parallelism
Instruction parallelism exploits the potential for instructions to be executed simultaneously
Increases the overall throughput and performance of a processor by utilizing multiple functional units
Requires the identification of independent instructions that can be executed in parallel without data dependencies
Instruction-level parallelism (ILP) is a measure of how many instructions can be executed concurrently in a program
Compiler techniques (loop unrolling, software pipelining) and hardware techniques (out-of-order execution, superscalar) are used to exploit ILP
Amdahl's Law states that the speedup of a program is limited by the fraction of the program that cannot be parallelized (a worked example follows this list)
Flynn's Taxonomy classifies computer architectures based on the number of concurrent instruction and data streams (SISD, SIMD, MISD, MIMD)
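To make Amdahl's Law concrete, here is a minimal Python sketch; the function name and the numbers are illustrative, not from the source.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when a fraction of the work is sped up n_units times."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)

print(amdahl_speedup(0.9, 4))          # ~3.08x with 4 units
print(amdahl_speedup(0.9, 1_000_000))  # ~10x: the 10% serial part caps speedup
```

However many functional units are added, the 10% serial fraction bounds the overall speedup at 1/0.1 = 10x.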
Pipelining Basics and Concepts
Pipelining is a technique that divides the execution of an instruction into multiple stages
Enables overlapping execution of multiple instructions, similar to an assembly line
Each stage of the pipeline performs a specific task (fetch, decode, execute, memory access, write-back)
Instruction pipeline increases the overall throughput of the processor by reducing the average execution time per instruction
Pipeline registers are used to store intermediate results between pipeline stages
Ideal CPI (Cycles Per Instruction) in a pipelined processor approaches 1, indicating that one instruction is completed every clock cycle (illustrated in the sketch after this list)
Pipeline depth refers to the number of stages in the pipeline and affects the clock frequency and instruction latency
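The throughput gain is easy to quantify: in an ideal k-stage pipeline with no stalls, N instructions complete in k + (N - 1) cycles rather than k * N. A minimal sketch with illustrative numbers:

```python
def pipelined_cycles(n_instructions, depth):
    # First instruction takes `depth` cycles to flow through; each later one
    # completes one cycle after its predecessor (ideal, hazard-free pipeline).
    return depth + (n_instructions - 1)

def unpipelined_cycles(n_instructions, depth):
    return n_instructions * depth

n, k = 1000, 5
print(pipelined_cycles(n, k))                             # 1004 cycles
print(unpipelined_cycles(n, k))                           # 5000 cycles
print(unpipelined_cycles(n, k) / pipelined_cycles(n, k))  # ~4.98x, near depth k
```

For large N the speedup approaches the pipeline depth k, which is why effective CPI approaches 1.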
Pipeline Hazards and Mitigation Strategies
Pipeline hazards are situations that prevent the next instruction from executing during its designated clock cycle
Structural hazards occur when hardware resources (memory, register file) are required by multiple instructions simultaneously
Mitigated by duplicating hardware resources or using separate instruction and data memories
Data hazards arise when an instruction depends on the result of a previous instruction that has not yet completed
Mitigated by forwarding (bypassing) results between pipeline stages or stalling the pipeline until the dependency is resolved, as the sketch after this list illustrates
Control hazards occur when the outcome of a branch instruction is not known, causing subsequent instructions to be fetched incorrectly
Mitigated by branch prediction techniques (static, dynamic) and speculative execution
Instruction scheduling and compiler optimizations can help reduce pipeline hazards by reordering instructions
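To ground the data-hazard discussion, here is a simplified sketch. It assumes a classic 5-stage pipeline where the register file writes in the first half of a cycle and reads in the second, so a dependent instruction stalls two cycles without forwarding, while forwarding removes all RAW stalls except the one-cycle load-use case. The instruction encoding is invented for the example, and only adjacent instruction pairs are checked for brevity.

```python
# Each instruction: (opcode, dest_reg, src_regs) -- an invented toy encoding.
program = [
    ("lw",  "r1", ["r0"]),        # load r1 from memory
    ("add", "r2", ["r1", "r3"]),  # uses r1 right after the load: load-use hazard
    ("sub", "r4", ["r2", "r5"]),  # RAW on r2, but EX->EX forwarding covers it
]

def stall_cycles(program, forwarding=True):
    """Count stalls from RAW hazards between adjacent instructions only."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[1] not in curr[2]:      # no RAW hazard on this adjacent pair
            continue
        if forwarding:
            # Only a load followed immediately by a use stalls (one cycle),
            # because the loaded value is not available until the end of MEM.
            stalls += 1 if prev[0] == "lw" else 0
        else:
            # Wait for write-back: two stall cycles with a register file that
            # writes in the first half of a cycle and reads in the second.
            stalls += 2
    return stalls

print(stall_cycles(program, forwarding=True))   # 1 (the load-use stall)
print(stall_cycles(program, forwarding=False))  # 4
```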
Superscalar and VLIW Architectures
Superscalar architectures issue multiple instructions per clock cycle from a single instruction stream
Dynamically identify independent instructions and dispatch them to multiple functional units (a toy version of this issue check is sketched after this list)
Require complex hardware for instruction scheduling, dependency checking, and out-of-order execution
Very Long Instruction Word (VLIW) architectures use a fixed-length instruction format with multiple operation fields
VLIW instructions specify multiple independent operations that can be executed in parallel
Rely on the compiler to perform instruction scheduling and dependency analysis statically
Tradeoffs between hardware complexity (superscalar) and compiler complexity (VLIW)
Examples of superscalar processors: Intel Core, AMD Ryzen; VLIW-style processors: Intel Itanium (EPIC) and the TI C6000 DSP family
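As a toy illustration of the dependency checking a superscalar front end performs, the sketch below groups instructions into in-order, dual-issue packets; real hardware also tracks functional-unit availability, memory dependences, and much more. The instruction format is invented for the example.

```python
def dual_issue_schedule(program):
    """Group instructions into issue packets of up to two, in program order.

    program: list of (dest, srcs) tuples. The second slot of a packet must
    not read the first slot's destination (RAW) or write the same register
    (WAW) within the packet.
    """
    packets, i = [], 0
    while i < len(program):
        first = program[i]
        if i + 1 < len(program):
            second = program[i + 1]
            independent = first[0] not in second[1] and first[0] != second[0]
            if independent:
                packets.append([first, second])
                i += 2
                continue
        packets.append([first])
        i += 1
    return packets

# r3 depends on r1, so the first pair cannot dual-issue; the next two can.
prog = [("r1", ["r2"]), ("r3", ["r1", "r4"]), ("r5", ["r6"]), ("r7", ["r8"])]
for packet in dual_issue_schedule(prog):
    print([dest for dest, _ in packet])  # ['r1'], ['r3', 'r5'], ['r7']
```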
Branch Prediction and Speculative Execution
Branch prediction techniques aim to minimize the impact of control hazards in pipelined processors
Static branch prediction uses fixed heuristics (always taken, always not taken) based on instruction type or branch direction
Dynamic branch prediction uses runtime information (branch history tables of saturating counters, two-level adaptive predictors) to make predictions, as sketched after this list
Branch target buffer (BTB) caches the target addresses of recently executed branches to avoid pipeline stalls
Speculative execution allows the processor to fetch and execute instructions along the predicted path before the branch outcome is known
If the branch prediction is incorrect, the speculatively executed instructions are discarded, and the pipeline is flushed
Branch prediction accuracy is crucial for maintaining high performance in pipelined processors
Advanced branch prediction techniques (neural branch prediction, perceptron-based predictors) improve prediction accuracy
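Here is a minimal sketch of the classic two-bit saturating-counter scheme underlying many dynamic predictors; a real branch history table is indexed by PC bits, whereas a plain dictionary stands in for it here.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0-1 predict not-taken,
    2-3 predict taken; two consecutive mispredictions flip the prediction."""

    def __init__(self):
        self.counters = {}  # branch PC -> counter state in [0, 3]

    def predict(self, pc):
        return self.counters.get(pc, 1) >= 2   # default: weakly not-taken

    def update(self, pc, taken):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop-closing branch: taken 9 times, then not taken once, repeated 10 times.
bp, correct, total = TwoBitPredictor(), 0, 0
for _ in range(10):
    for outcome in [True] * 9 + [False]:
        correct += bp.predict(0x400) == outcome
        bp.update(0x400, outcome)
        total += 1
print(f"accuracy: {correct / total:.0%}")  # 89%: loop exits cost one miss each
```

The hysteresis is the point: a one-bit predictor would mispredict twice per loop execution (the exit and the re-entry), while the two-bit counter mispredicts only once.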
Memory System Support for Pipelining
Memory hierarchy design plays a crucial role in supporting pipelined execution
Caches (L1, L2, L3) provide fast access to frequently used instructions and data, reducing memory access latency
Cache miss penalties can stall the pipeline, degrading performance (quantified in the example after this list)
Techniques like cache prefetching, non-blocking caches, and out-of-order memory accesses help hide cache miss latency
Memory disambiguation techniques (load-store queues, memory dependence prediction) resolve memory dependencies in out-of-order execution
Instruction cache and data cache are often separated to avoid conflicts and improve bandwidth
Memory consistency models (sequential consistency, weak ordering) define the ordering constraints for memory operations in pipelined systems
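The pipeline-level cost of the memory hierarchy is often summarized by the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty. A small sketch with illustrative numbers:

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time, in cycles.
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers: 1-cycle L1 hit, 5% L1 miss rate.
print(amat(1, 0.05, 20))  # 2.0 cycles with a flat 20-cycle miss penalty

# Nesting extends the formula to two levels: the L1 miss penalty is itself
# the AMAT of an L2 with a 10-cycle hit, 20% miss rate, 100-cycle memory.
print(amat(1, 0.05, amat(10, 0.20, 100)))  # 2.5 cycles
```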
Performance Metrics and Analysis
Performance metrics for pipelined processors include throughput, latency, and speedup
Throughput represents the number of instructions completed per unit time (instructions per cycle, IPC)
Latency is the time taken to complete a single instruction from start to finish
Speedup is the ratio of the execution time of a non-pipelined processor to that of a pipelined processor
Pipeline stalls and hazards impact the actual performance, reducing the achieved IPC below the ideal value (see the CPI sketch after this list)
Average instruction execution time is affected by the pipeline depth, stage delays, and stall cycles
Amdahl's Law limits the overall speedup achievable through pipelining based on the fraction of non-pipelined execution
Performance analysis tools (hardware counters, simulation, profiling) help identify bottlenecks and optimize pipelined systems
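Tying the metrics together, here is a small sketch that computes effective CPI, IPC, and speedup over an unpipelined design; it assumes an ideal base CPI of 1, and the event frequencies and penalties are invented for the example.

```python
def effective_cpi(base_cpi, stall_events):
    """stall_events: list of (stalls per instruction, penalty in cycles)."""
    return base_cpi + sum(freq * penalty for freq, penalty in stall_events)

# Assumed workload: 20% loads, each with a 1-cycle load-use stall, plus 15%
# branches mispredicted 10% of the time at a 3-cycle flush penalty.
cpi = effective_cpi(1.0, [(0.20, 1), (0.15 * 0.10, 3)])
print(f"CPI = {cpi:.3f}")                                      # 1.245
print(f"IPC = {1 / cpi:.3f}")                                  # ~0.803
print(f"speedup vs. unpipelined (depth 5) = {5 / cpi:.2f}x")   # ~4.02x
```

Note how the speedup lands below the pipeline depth of 5: every stall cycle per instruction pushes the achieved IPC further under the ideal value of 1.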
Advanced Techniques and Future Trends
Superpipelining increases the number of pipeline stages to achieve higher clock frequencies
Simultaneous multithreading (SMT) allows multiple threads to execute concurrently on a single core's pipeline, sharing its functional units
Speculative multithreading speculatively executes multiple threads in parallel, exploiting thread-level parallelism
Trace caches store decoded instructions in the order of program execution, reducing decode and fetch latencies
Decoupled architectures separate the instruction fetch and execution units, allowing them to operate independently
Dataflow architectures execute instructions based on data availability rather than sequential program order
Reconfigurable architectures (FPGAs) allow the hardware to be customized for specific application requirements
Future trends in pipelining include adaptive pipeline depths, dynamic resource allocation, and hybrid architectures combining different parallelism techniques