Out-of-order execution and register renaming are advanced techniques that boost processor performance. These methods allow instructions to be executed in a different order than the program sequence while maintaining data dependencies, increasing instruction-level parallelism and reducing pipeline stalls.
These techniques enable processors to execute independent instructions simultaneously, hide memory latency, and speculate on branch outcomes. By using a larger set of physical registers and tracking instruction order with a reorder buffer, processors can eliminate false dependencies and maintain precise exception handling.
Out-of-order execution allows instructions to be executed in a different order than the program sequence while maintaining data dependencies
Register renaming eliminates false dependencies (write-after-read and write-after-write) by mapping architectural registers to a larger set of physical registers
Instruction-level parallelism (ILP) is exploited by executing independent instructions simultaneously on multiple functional units
Speculation and branch prediction enable the processor to fetch and execute instructions before knowing if they are needed
Precise exceptions ensure that the processor state can be restored to a known good state if an exception occurs during out-of-order execution
This is achieved by maintaining a reorder buffer (ROB) that tracks the original program order
Completed instructions are retired from the ROB in program order
The commit stage finalizes the results of instructions and updates the architectural state once all older instructions have committed
Motivation and Benefits
Out-of-order execution improves performance by reducing pipeline stalls caused by data dependencies and resource conflicts
Allows the processor to continue executing instructions even if some instructions are blocked due to long-latency operations (cache misses)
Increases the utilization of functional units by executing independent instructions in parallel
Hides memory latency by overlapping memory accesses with other computations
Enables the processor to speculatively execute instructions based on predicted branches
If the prediction is correct, the speculative work is useful and improves performance
If the prediction is incorrect, the speculative work is discarded, and the processor rolls back to a known good state
Reduces the impact of pipeline hazards (data, control, and structural) on performance
Provides a higher instruction throughput and reduces the average cycles per instruction (CPI)
Out-of-Order Execution Basics
Instructions are fetched and decoded in program order but executed based on data dependencies and resource availability
Instructions are placed into a reservation station or issue queue after decoding
The reservation station holds instructions until their operands are ready and a functional unit is available
A dependency check is performed to ensure that instructions with data dependencies are executed in the correct order
Independent instructions can be issued and executed out of order, allowing for parallel execution on multiple functional units
A reorder buffer (ROB) is used to track the original program order and maintain precise exceptions
Instructions are allocated an entry in the ROB when they are decoded
Completed instructions are marked as done in the ROB but not retired until all previous instructions have completed
A commit stage retires instructions in program order, updating the architectural state and freeing resources
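A minimal sketch of this flow, in Python with made-up instruction tuples and latencies: instructions complete whenever their operands are ready, but the ROB retires them strictly in program order.

    # Each instruction: (dest, srcs, latency_cycles); names and latencies are illustrative.
    program = [
        ("r1", [], 3),       # long-latency load into r1 (e.g. a cache miss)
        ("r2", ["r1"], 1),   # add that depends on the load result
        ("r3", [], 1),       # independent instruction
    ]

    # ROB entries are allocated in program order at decode time.
    rob = [{"idx": i, "dest": d, "srcs": s, "lat": lat, "start": None, "done": False}
           for i, (d, s, lat) in enumerate(program)]
    ready = set()            # registers whose values have been produced
    head = 0                 # oldest not-yet-retired ROB entry

    for cycle in range(1, 8):
        # Issue/execute: any waiting entry whose sources are ready may begin.
        for e in rob:
            if e["start"] is None and all(s in ready for s in e["srcs"]):
                e["start"] = cycle
        # Completion may happen out of program order.
        for e in rob:
            if e["start"] is not None and not e["done"] and cycle - e["start"] + 1 >= e["lat"]:
                e["done"] = True
                ready.add(e["dest"])
                print(f"cycle {cycle}: instruction {e['idx']} completes")
        # Retirement is strictly in program order, which keeps exceptions precise.
        while head < len(rob) and rob[head]["done"]:
            print(f"cycle {cycle}: instruction {rob[head]['idx']} retires")
            head += 1

Running this, the independent instruction completes in cycle 1 but is not retired until the older load and add ahead of it have retired.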
Register Renaming Techniques
Register renaming eliminates false dependencies caused by the limited number of architectural registers
False dependencies include write-after-read (WAR) and write-after-write (WAW) dependencies
Architectural registers are mapped to a larger set of physical registers
This allows multiple instructions that write the same architectural register to be in flight at once without creating false dependencies between them
Two main techniques for register renaming: explicit and implicit
Explicit renaming uses a rename table to map architectural registers to physical registers
The rename table is updated when instructions are decoded and retired
Implicit renaming uses a reorder buffer (ROB) to track the latest value of each architectural register
The ROB entry number serves as the physical register identifier
Register renaming is performed in the decode/rename stage; the mapping becomes architectural at commit (freeing the old physical register or ROB entry) and is rolled back on a misprediction or exception
Checkpointing is used to save the state of the rename table or ROB at specific points (branches) to enable quick recovery from mispredictions
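A minimal sketch of explicit renaming with a rename table and a free list of physical registers (names and pool size are illustrative); the second write to r1 below would be a WAW hazard on architectural registers, but it receives a fresh physical register, so the false dependency disappears.

    from collections import deque

    free_list = deque(f"p{i}" for i in range(8))   # physical register pool
    rename_table = {}                              # architectural -> physical mapping

    def rename(dest, srcs):
        # Look up the current mappings for the sources, then give the destination
        # a fresh physical register so later writers cannot collide with it.
        phys_srcs = [rename_table.get(s, s) for s in srcs]   # unmapped sources keep their name
        phys_dest = free_list.popleft()
        rename_table[dest] = phys_dest
        return phys_dest, phys_srcs

    # I1: r1 = load ...   I2: r2 = r1 + 4   I3: r1 = r3 * 2   (WAW with I1, WAR with I2)
    for dest, srcs in [("r1", []), ("r2", ["r1"]), ("r1", ["r3"])]:
        print(dest, srcs, "->", rename(dest, srcs))

    # Checkpointing for branch recovery is simply a saved copy of the map:
    checkpoint = dict(rename_table)                # taken when a branch is predicted
    # ...restore rename_table from checkpoint on a misprediction

After renaming, I3 writes p2 while I1 wrote p0, so the WAW and WAR hazards are gone; the true dependency of I2 on I1 is preserved because I2's source maps to p0.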
Hardware Implementation
Out-of-order execution and register renaming require additional hardware components compared to in-order processors
Key components include:
Reservation stations or issue queues to hold instructions waiting for execution
Reorder buffer (ROB) to track the original program order and maintain precise exceptions
Physical register file (PRF) to store the renamed registers and enable parallel execution
Rename table or mapping mechanism to map architectural registers to physical registers
Wakeup and select logic to determine when instructions are ready to execute and issue them to functional units (sketched after this list)
Functional units are typically organized into execution clusters (integer, floating-point, load/store) to minimize routing complexity
A common data bus (CDB) is used to broadcast results from functional units to reservation stations and the ROB
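A minimal sketch of the wakeup and select step over a small reservation station, assuming result tags are broadcast on a common data bus (entry contents and unit counts are illustrative).

    # Each source operand is [tag, ready]; the tag names the producer of the value.
    station = [
        {"op": "add", "srcs": [["p1", False], ["p2", True]]},   # waiting on p1
        {"op": "mul", "srcs": [["p3", True],  ["p4", True]]},   # all operands ready
    ]

    def wakeup(broadcast_tag):
        # CDB broadcast: mark every waiting source that matches the tag as ready.
        for entry in station:
            for src in entry["srcs"]:
                if src[0] == broadcast_tag:
                    src[1] = True

    def select(num_units):
        # Pick up to num_units entries whose sources are all ready and issue them.
        issued = [e for e in station if all(ready for _, ready in e["srcs"])][:num_units]
        for e in issued:
            station.remove(e)
        return issued

    print(select(2))     # only the mul issues, because its operands are ready
    wakeup("p1")         # a functional unit finishes and broadcasts tag p1
    print(select(2))     # now the add wakes up and issues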
Speculation and branch prediction require additional hardware
Branch target buffer (BTB) to predict branch targets and enable early fetching of instructions
Branch history table (BHT) to predict the direction of branches based on past behavior
Speculative state management to track and discard speculative work if predictions are incorrect
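A minimal sketch of a branch history table built from 2-bit saturating counters (table size and indexing scheme are illustrative).

    BHT_ENTRIES = 1024                     # illustrative table size
    bht = [1] * BHT_ENTRIES                # 2-bit counters, start at "weakly not taken"

    def index(pc):
        return (pc >> 2) % BHT_ENTRIES     # simple PC-based indexing

    def predict(pc):
        return bht[index(pc)] >= 2         # counter values 2 and 3 predict "taken"

    def update(pc, taken):
        i = index(pc)
        bht[i] = min(3, bht[i] + 1) if taken else max(0, bht[i] - 1)

    # A loop-closing branch at one PC: after two taken outcomes the counter
    # saturates and the remaining iterations are predicted correctly.
    pc = 0x4000
    for outcome in [True, True, True, True, False]:
        print("predict taken:", predict(pc), " actual:", outcome)
        update(pc, outcome)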
Performance Impact
Out-of-order execution and register renaming significantly improve performance compared to in-order processors
Allows for better utilization of functional units and reduces pipeline stalls due to dependencies
Enables the processor to hide memory latency by overlapping memory accesses with other computations
Increases the instruction-level parallelism (ILP) by executing independent instructions simultaneously
Reduces the impact of pipeline hazards (data, control, and structural) on performance
Provides a higher instruction throughput and reduces the average cycles per instruction (CPI)
CPI can approach 1, and on superscalar designs that complete multiple instructions per cycle it can drop below 1, given sufficient ILP and functional units
Performance gains depend on the application characteristics and the available ILP
Applications with more independent instructions and fewer dependencies benefit more from out-of-order execution
Branch prediction accuracy is critical for performance, as mispredictions result in discarded speculative work and pipeline flushes
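A small worked example (all numbers illustrative) of how misprediction cost shows up in the effective CPI.

    base_cpi        = 0.5    # ideal CPI of a wide out-of-order core
    branch_fraction = 0.20   # fraction of instructions that are branches
    mispredict_rate = 0.05   # fraction of branches predicted incorrectly
    flush_penalty   = 15     # cycles lost per misprediction (pipeline refill)

    # Effective CPI = base CPI + extra stall cycles per instruction
    effective_cpi = base_cpi + branch_fraction * mispredict_rate * flush_penalty
    print(effective_cpi)     # 0.5 + 0.20 * 0.05 * 15 = 0.65, i.e. a 30% slowdown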
Challenges and Limitations
Out-of-order execution and register renaming add complexity to the processor design and verification
Increased hardware cost due to additional components (reservation stations, ROB, PRF, rename logic)
Power consumption and heat dissipation increase with the added complexity and hardware
Scalability challenges as the instruction window size and the number of physical registers increase
Larger instruction windows and physical register files can increase the latency of wakeup and select logic
Memory dependencies and long-latency operations (cache misses) can still limit the achievable performance
Branch mispredictions can result in wasted speculative work and pipeline flushes, reducing performance
Precise exception handling becomes more challenging with out-of-order execution
Processor state must be saved and restored correctly to ensure precise exceptions
Debugging and performance analysis become more difficult due to the non-deterministic execution order
Real-World Applications
Out-of-order execution and register renaming are used in most modern high-performance processors (x86, ARM, POWER)
Examples of processors using out-of-order execution:
Intel Core series (i3, i5, i7, i9) processors
AMD Ryzen processors
ARM Cortex-A series processors (A72, A75, A76)
IBM POWER processors (POWER9, POWER10)
Out-of-order execution is particularly beneficial for applications with high instruction-level parallelism (ILP)
Scientific simulations and numerical computations
Video and image processing
Cryptography and encryption algorithms
Compilers and software optimization techniques can be used to expose more ILP and improve the performance of out-of-order processors
Loop unrolling, software pipelining, and instruction scheduling
Profile-guided optimization (PGO) to identify frequently executed code paths and optimize them for out-of-order execution
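A sketch of the multiple-accumulator form of loop unrolling, written in Python only to show the shape of the transformation (the speedup comes from the hardware overlapping the independent chains, and the reassociated additions may round slightly differently): the rolled loop is one long dependence chain through total, while the unrolled version creates two independent chains an out-of-order core can execute in parallel.

    def dot_rolled(a, b):
        total = 0.0
        for i in range(len(a)):
            total += a[i] * b[i]          # every iteration depends on the previous total
        return total

    def dot_unrolled(a, b):
        t0 = t1 = 0.0                     # two independent accumulator chains
        n = len(a) - len(a) % 2
        for i in range(0, n, 2):
            t0 += a[i] * b[i]             # chain 0
            t1 += a[i + 1] * b[i + 1]     # chain 1, independent of chain 0
        for i in range(n, len(a)):        # leftover element when the length is odd
            t0 += a[i] * b[i]
        return t0 + t1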
Out-of-order execution has been a key enabler for the performance improvements in processors over the past few decades
Delivers higher instructions per cycle (IPC) and better utilization of hardware resources at a given clock frequency
Enables the development of more complex and demanding applications