



CS448





### Scoreboarding

- One way to implement the control needed for forwarding
- Scoreboard stores the state of the pipeline
  - What stage each instruction is in
  - Status of each destination register, source register
  - Can determine if there is a hazard and which stage's result must be forwarded to which other stage
    - Controls forwarding via multiplexer selection


- If the state of the pipeline is incomplete
  - The pipeline stalls and we get pipeline bubbles
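
The multiplexer-selection idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Stage` record, the `forward_select` function, and the mux labels `"EX/MEM"`, `"MEM/WB"`, and `"REGFILE"` are hypothetical names, not part of DLX or of the slides.

```python
from typing import NamedTuple, Optional


class Stage(NamedTuple):
    """Scoreboard entry for one pipeline latch (hypothetical layout)."""
    dest: Optional[int]   # destination register number, or None
    writes_reg: bool      # True if the instruction will write `dest`


def forward_select(src: int, ex_mem: Stage, mem_wb: Stage) -> str:
    """Choose the ALU input mux setting for source register `src`."""
    # R0 is hardwired to zero in DLX, so it never needs forwarding.
    if src == 0:
        return "REGFILE"
    # Prefer the youngest producer: the instruction one stage ahead.
    if ex_mem.writes_reg and ex_mem.dest == src:
        return "EX/MEM"
    if mem_wb.writes_reg and mem_wb.dest == src:
        return "MEM/WB"
    return "REGFILE"


# ADD R1,R2,R3 is in MEM (result held in EX/MEM) while LW R4,0(R1) is in EX.
add_in_mem = Stage(dest=1, writes_reg=True)
empty = Stage(dest=None, writes_reg=False)
print(forward_select(1, add_in_mem, empty))  # EX/MEM
```

Checking the EX/MEM latch first matters when both latches hold a write to the same register: the younger value is the one the consumer must see.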

### Another Data Hazard Example

- What are the hazards here?
  - ADD R1, R2, R3
  - LW R4, 0(R1)
  - SW 12(R1), R4
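- Need forwarding between different units, not only from a unit's output back to its own input (e.g., the loaded R4 must be forwarded to the data memory input for the SW)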

### Data Hazard Classification

- Three types of data hazards
- Instruction i comes before instruction j
  - RAW : Read After Write
    - j tries to read a source before i writes it, so j incorrectly gets the old value. Solve via forwarding.
  - WAW : Write After Write
    - j tries to write an operand before it is written by i, so we end up writing values in the wrong order
    - Only occurs if we have writes in multiple stages
      - Not a problem with DLX integer instructions
      - We'll see this when we do floating point

### Data Hazard Classification

- WAR : Write After Read
  - j tries to write a destination before it is read by i, so i incorrectly gets the new value
  - For this to happen we need a pipeline that writes results early in the pipeline, while other instructions read a source later in the pipeline
  - Can this happen in DLX?
  - This problem led to a flaw in the VAX
- RAR : Read After Read
  - Is this a hazard?
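
The classification can be expressed as set intersections on the registers each instruction reads and writes. The sketch below is an illustration only: `Instr` and `classify` are hypothetical names, and the function reports potential hazards (dependences). Whether one causes a stall depends on the pipeline. RAR is not reported because two reads cannot conflict.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    text: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def classify(i: Instr, j: Instr) -> Set[str]:
    """Dependence types between earlier instruction i and later instruction j."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")   # j reads what i writes
    if i.writes & j.writes:
        kinds.add("WAW")   # both write the same register
    if i.reads & j.writes:
        kinds.add("WAR")   # j overwrites what i still has to read
    return kinds


add = Instr("ADD R1, R2, R3", frozenset({"R2", "R3"}), frozenset({"R1"}))
lw = Instr("LW R4, 0(R1)", frozenset({"R1"}), frozenset({"R4"}))
sw = Instr("SW 12(R1), R4", frozenset({"R1", "R4"}), frozenset())

# Every pair in the ADD/LW/SW example is a RAW dependence.
for i, j in [(add, lw), (add, sw), (lw, sw)]:
    print(i.text, "->", j.text, sorted(classify(i, j)))
```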

### Forwarding is not Infallible

- Unfortunately, forwarding does not handle all cases, e.g.:
  - LW R1, 0(R2)
  - SUB R4, R1, R5
  - AND R6, R1, R7
  - OR R8, R1, R9


- The loaded value of R1 is not available until the end of MEM, but SUB (the second instruction) needs it at the start of its EX (ALU) stage




### Data Hazard Stall

- Need hardware (pipeline interlock) to detect the data hazard and introduce a vertical pipeline bubble
- Other stalls possible too
  - Cache miss: stall until data available

| Instruction      | 1  | 2  | 3  | 4     | 5   | 6   | 7   | 8   | 9  |
|------------------|----|----|----|-------|-----|-----|-----|-----|----|
| *Without stall*  |    |    |    |       |     |     |     |     |    |
| LW R1, 0(R2)     | IF | ID | EX | MEM   | WB  |     |     |     |    |
| SUB R4, R1, R5   |    | IF | ID | EX    | MEM | WB  |     |     |    |
| AND R6, R1, R7   |    |    | IF | ID    | EX  | MEM | WB  |     |    |
| OR R8, R1, R9    |    |    |    | IF    | ID  | EX  | MEM | WB  |    |
| *With stall*     |    |    |    |       |     |     |     |     |    |
| LW R1, 0(R2)     | IF | ID | EX | MEM   | WB  |     |     |     |    |
| SUB R4, R1, R5   |    | IF | ID | stall | EX  | MEM | WB  |     |    |
| AND R6, R1, R7   |    |    | IF | stall | ID  | EX  | MEM | WB  |    |
| OR R8, R1, R9    |    |    |    | stall | IF  | ID  | EX  | MEM | WB |
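
The stalled timing can be reproduced with a small model of the interlock. This is a sketch under assumptions, not the DLX control logic: it assumes full forwarding, a five-stage pipeline, and that a loaded value can feed EX only in the cycle after the load's MEM stage. `pipeline_rows` and the tuple layout are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_rows(program):
    """Return one {cycle: label} dict per instruction.

    `program` is a list of (text, dest, sources, is_load) tuples.
    """
    rows = []
    frozen = set()     # cycles in which the interlock holds IF and ID
    load_ready = {}    # register -> first cycle a loaded value can feed EX
    prev_if = 0        # cycle in which the previous instruction was fetched
    for text, dest, sources, is_load in program:
        row = {}
        cycle = prev_if + 1
        for stage in STAGES:
            # Wait out cycles frozen by an older instruction, and freeze new
            # ones while a loaded source is not yet available to EX.
            while cycle in frozen or (
                stage == "EX"
                and any(load_ready.get(r, 0) > cycle for r in sources)
            ):
                frozen.add(cycle)
                row[cycle] = "stall"
                cycle += 1
            row[cycle] = stage
            if stage == "IF":
                prev_if = cycle
            if stage == "MEM" and is_load:
                load_ready[dest] = cycle + 1   # forwardable after MEM
            cycle += 1
        if not is_load:
            load_ready.pop(dest, None)         # ALU results never stall here
        rows.append(row)
    return rows


program = [
    ("LW  R1, 0(R2)",  "R1", ["R2"],       True),
    ("SUB R4, R1, R5", "R4", ["R1", "R5"], False),
    ("AND R6, R1, R7", "R6", ["R1", "R7"], False),
    ("OR  R8, R1, R9", "R8", ["R1", "R9"], False),
]
rows = pipeline_rows(program)
for (text, *_), row in zip(program, rows):
    cells = [row.get(c, "") for c in range(1, max(row) + 1)]
    print(f"{text:<16}" + " ".join(f"{c:<5}" for c in cells))
```

The single frozen cycle (cycle 4) shows up in every younger instruction's row, which is the vertical bubble in the diagram.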



### Compiler Scheduling Example

- A = B + C; D = E + F
  - LW R1, B
  - LW R2, C
  - ADD R3, R1, R2  $\leftarrow$  Need to stall for R2
  - SW A, R3
  - LW R4, E
  - LW R5, F
  - ADD R6, R4, R5  $\leftarrow$  Need to stall for R5
  - SW D, R6


### Compiler Scheduling Example

- A = B + C; D = E + F
  - LW R1, B
  - LW R2, C
  - LW R4, E
  - ADD R3, R1, R2
  - LW R5, F
  - SW A, R3
  - ADD R6, R4, R5
  - SW D, R6
- Instructions swapped so that no load is immediately followed by a use of its result: no stall
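
A quick way to check a schedule is to count load-use pairs. The sketch below assumes a one-cycle load delay with full forwarding, so the only stalls come from an instruction that reads the register written by the load immediately before it. It also assumes an unscheduled order in which each ADD directly follows its second load. `load_use_stalls` and the tuple layout are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction reading the register
    written by the load immediately before it."""
    stalls = 0
    for (op, dest, _), (_, _, sources) in zip(program, program[1:]):
        if op == "LW" and dest in sources:
            stalls += 1
    return stalls


# (opcode, destination register, source registers); memory names omitted.
unscheduled = [
    ("LW", "R1", []), ("LW", "R2", []),
    ("ADD", "R3", ["R1", "R2"]), ("SW", None, ["R3"]),
    ("LW", "R4", []), ("LW", "R5", []),
    ("ADD", "R6", ["R4", "R5"]), ("SW", None, ["R6"]),
]
scheduled = [
    ("LW", "R1", []), ("LW", "R2", []), ("LW", "R4", []),
    ("ADD", "R3", ["R1", "R2"]), ("LW", "R5", []),
    ("SW", None, ["R3"]),
    ("ADD", "R6", ["R4", "R5"]), ("SW", None, ["R6"]),
]
print(load_use_stalls(unscheduled), load_use_stalls(scheduled))  # 2 0
```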

### Compiler Scheduling a Big Help

- Percentage of loads causing stalls with DLX
  - TeX
    - Unscheduled 65%
    - Scheduled 25%
  - SPICE
    - Unscheduled 42%
    - Scheduled 14%
  - GCC
    - Unscheduled 54%
    - Scheduled 31%



### **Control Hazards**

- Big hit in performance: can reduce pipeline efficiency by more than half
- To reduce the clock cycles in a branch stall:
  - Find out whether the branch is taken or not taken earlier in the pipeline
    - Avoids longer stalls of everything else in the pipeline
  - Compute the taken PC earlier
    - Lets us fetch the next instruction with fewer stalls





### Software-Based Branch Penalty Reduction

- Design ISA to reduce branch penalty
  - DLX's BNEZ and BEQZ test a register against zero, which allows the branch condition to be known during the ID stage
- Branch Prediction
  - Compute likelihood of branching vs. not branching, automatically fetch the most likely target
  - Can be difficult; we need to know branch target in advance

### Branch Behavior

- How often are branches taken?
- For DLX from chapter 2:
  - 17% branches
  - 3% jumps or calls
- Taken vs. Not varies with instruction use
  - If-then statement taken about 50% of the time
  - Branches in loops taken 90% of the time
  - Flag test branches taken very rarely
- Overall, 67% of conditional branches taken on average
  - This is bad: in our typical case we keep fetching the subsequent instructions into the pipeline, so taking the branch results in a pipeline stall

### Dealing with Branches

- Several options for dealing with branches
  - 1. Pipeline stall until branch target known (previous case we examined)
  - 2. Continue fetching as if we won't take the branch, but then invalidate the instructions if we do take the branch

| Instruction                | 1  | 2  | 3    | 4    | 5    | 6    | 7   | 8   | 9  |
|----------------------------|----|----|------|------|------|------|-----|-----|----|
| Untaken branch instruction | IF | ID | EX   | MEM  | WB   |      |     |     |    |
| Instruction i + 1          |    | IF | ID   | EX   | MEM  | WB   |     |     |    |
| Instruction i + 2          |    |    | IF   | ID   | EX   | MEM  | WB  |     |    |
| Instruction i + 3          |    |    |      | IF   | ID   | EX   | MEM | WB  |    |
| Instruction i + 4          |    |    |      |      | IF   | ID   | EX  | MEM | WB |
|                            |    |    |      |      |      |      |     |     |    |
| Taken branch instruction   | IF | ID | EX   | MEM  | WB   |      |     |     |    |
| Instruction i + 1          |    | IF | idle | idle | idle | idle |     |     |    |
| Branch target              |    |    | IF   | ID   | EX   | MEM  | WB  |     |    |
| Branch target + 1          |    |    |      | IF   | ID   | EX   | MEM | WB  |    |
| Branch target + 2          |    |    |      |      | IF   | ID   | EX  | MEM | WB |
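
The cost of the first two options can be compared with a little arithmetic. The sketch below uses the 17% branch frequency and 67% taken fraction from the branch-behavior figures, and assumes a one-cycle penalty (the single idle slot in the taken-branch diagram). The function name and parameters are hypothetical.

```python
def branch_stalls_per_instruction(branch_freq, taken_frac, penalty,
                                  predict_not_taken):
    """Average stall cycles that branches add to each instruction."""
    if predict_not_taken:
        # Only taken branches invalidate the fall-through instruction.
        return branch_freq * taken_frac * penalty
    # Stalling on every branch pays the penalty every time.
    return branch_freq * penalty


always_stall = branch_stalls_per_instruction(0.17, 0.67, 1, False)
not_taken = branch_stalls_per_instruction(0.17, 0.67, 1, True)
print(f"CPI when always stalling:  {1 + always_stall:.3f}")
print(f"CPI with predict-not-taken: {1 + not_taken:.3f}")
```

Because most branches are taken, predicting not-taken recovers only about a third of the branch stall cycles under these assumptions.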

### Dealing with Branches

- 3. Always fetch the branch target
  - After all, most branches are taken
  - Can't do in DLX because we don't know the target in advance of the branch outcome
  - Other architectures could precompute the target before the outcome



- Put a NOP in the delay slot if we can't find anything to fill it

### Delayed Branch with One Delay Slot



- Instruction in delay slot always executed
- Another branch instruction not allowed to be in the delay slot
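
One simple way a compiler might fill the slot is to move the instruction just before the branch into it. The sketch below is an assumption-laden illustration, not the scheme from the book: it only considers the immediately preceding instruction, and `Instr`, `NOP`, and `fill_delay_slot` are hypothetical names.

```python
from typing import FrozenSet, List, NamedTuple


class Instr(NamedTuple):
    text: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]
    is_branch: bool = False


NOP = Instr("NOP", frozenset(), frozenset())


def fill_delay_slot(block: List[Instr]) -> List[Instr]:
    """`block` ends with a branch; return it with one delay slot filled."""
    *body, branch = block
    if body:
        cand = body[-1]
        # The candidate may move past the branch only if the branch does not
        # read its result, the branch's own write (e.g. a link register)
        # does not touch it, and it is not itself a branch.
        independent = not (cand.writes & branch.reads) and not (
            branch.writes & (cand.reads | cand.writes)
        )
        if not cand.is_branch and independent:
            return body[:-1] + [branch, cand]
    # Nothing safe to move: fall back to a NOP.
    return block + [NOP]


add = Instr("ADD R1, R2, R3", frozenset({"R2", "R3"}), frozenset({"R1"}))
beqz = Instr("BEQZ R2, L", frozenset({"R2"}), frozenset(), is_branch=True)
print([i.text for i in fill_delay_slot([add, beqz])])
```

Since the delay-slot instruction always executes, moving an independent instruction from before the branch preserves the program's meaning whether or not the branch is taken.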



### Delay Slot Effectiveness

- The book covers variations on the scheme described here, e.g., nullifying the delay-slot instruction if the branch is not taken
- On benchmarks
  - Delay slot allowed branch hazards to be hidden 70% of the time
  - About 20% of delay slots filled with NOPs
  - Delay slots we can't easily fill: when target is another branch
- Philosophically, are delay slots good?
  - No longer hides the pipeline implementation from the programmer (although a compiler can still hide it)
  - Does allow for compiler optimizations, other schemes don't
  - Not very effective with modern machines that have deep pipelines, too difficult to fill multiple delay slots


### Performance of Branch Schemes

- We can simulate the four schemes on DLX
- Given CPI = 1 as the ideal:

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

- Results: delayed branch slightly better

[Table: branch penalty per conditional branch, penalty per unconditional branch, average branch penalty per branch, and effective CPI with branch stalls, for integer and FP benchmarks]
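
The speedup formula is easy to evaluate directly. In this sketch the pipeline depth of 5 matches the five DLX stages and the 17% branch frequency comes from the branch-behavior figures. The penalties of 0, 1, and 3 cycles are illustrative assumptions, not measured results. `pipeline_speedup` is a hypothetical name.

```python
def pipeline_speedup(depth, branch_freq, branch_penalty):
    """Speedup over an unpipelined machine, assuming an ideal CPI of 1
    and branch stalls as the only source of extra cycles."""
    return depth / (1 + branch_freq * branch_penalty)


for penalty in (0, 1, 3):
    speedup = pipeline_speedup(5, 0.17, penalty)
    print(f"penalty {penalty} cycle(s): speedup {speedup:.2f}")
```

With no branch penalty the speedup equals the pipeline depth. Each extra cycle of average branch penalty reduces it further.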