Pipelining processor

Pipeline datapath

In a multicycle datapath we knew that we have different stages set up. However, the entire datapath is exclusively used by 1 instruction, which will be slow. In pipelining, there will be multiple instructions, in fact five instructions, run in parallel in datapath, i.e. adding parallelism to multicycle computer using the same five stages.

Preliminary example

This is an example about the following instructions:

1
2
lw $10, $20($1)
sub $11, $2, $3

In clockcycle 1, the first instruction is in the instruction fetch stage. As soon as the first instruction goes to the instruction decoding stage, the second instruction shows up in the first stage. At the fifth clockcycle, the first instruction finishes at the write-back stage, while the second instruction stops at memory stage. This can be done correctly by using the intermediate registers.

1554183167564

Stages with intermediate registers

1554183280511

Now then, every stage of the pipeline with contain a different instruction. Also, one new instruction begins every clock cycle. Then one completed instruction will occur every clock cycle.

All necessary information with respect to one instruction will follow the intermediate register in the datapath. Also note that we will not need the write register number until the fifth stage. That is to say, the write-register-number will be carried through from stage 1 to 5.

Pipeline Control

Instructions need all the control bits set for their execution to work. The instruction control lines follow the instrcution through the pipeline:

1554183668365

We need to extend these intermediate registers to pass control signals down to where they are used.

First implementation: branch at MEM

Example of using R-format instruction

1554184000677

At the end of the clock cycle, PC will be updated to $PC+4$ and all the pipeline registers are also updated at the end of every clock cycle.

1554184211865

In the instruction decoding stage, the OPCODE is used by the control unit to generate all the control bits, and meanwhile, data are read from register file. The control bits specifically for write-back, memory, and execution stages are stored along with the register data in the pipeline register ID/EX. Particularly, for add instruction, instruction[15-11] indicates the destination register number, which should be carried.

1554184813163

In the execution stage, the EX-info stored in the previous pipeline register will be used up, so the subsequent register will no longer carry it. However, the other carried control-bits and data are passed on.

1554184851043

In MEM stage, add does not really have any work to do, but it must sit idle in this stage and pass on all other information after using MEM-info.

1554184970584

The WB part of the MEM/WB is the RegWrite control-bit. The below part includes the result from ALU and the destination register number, which will all be used to write-back.

Example of using branch instruction

1554185235556

In the MEM stage, the branch target address will override PC such that the next instruction after the OR instruction will fetch correctly.

In the next clock cycle:

1554185391882

The correct instruction after beq is lw. That means or, sw, and would be incorrectly executed, so what do we do? Flush them out! This brings about the concern of pipeline hazards.

Pipeline hazards

Data hazards

1554185661452

  • The first instruction is sub $2, $1, $3. It can only update the value in register 2 at the end of the fifth clock cycles. However, the second instruction sub $12, $2, $5 has to use the value in register 2 at the third clock cycle. Unless we had a time machine, sub $12, $2, $5 would only be able to use the old value. This gives rise to a control hazard.

  • For sw, it is perfectly fine since register 2 will have been updated successfully by the time it is needed for sw.

  • To solve the data hazard between sub and add, we allow read to occur second, write to occur first.

How to handle a data hazard?

We have two solutions:

  1. Stalling: we add a stall to an instruction that needs a value from a previous instruction by using NOPs, which is an instruction that has no effects. We insert them to force a gap between an instruction that has a data hazard with a previous instruction.

    1554186449087

  2. A better solution is to use data forwarding.

    • How to detect a data hazard? Check current instructions $rs and $rt registers: did they depend on the $rd or $rt of the instruction before it, OR 2 instructions before it.

    • We will have a forwarding unit to solve almost all data hazards:

      1554186689718

      with that, the value in destination register will be forwarded as shown

      1554186724849

    • However, things are more complicated if there is a load-use hazard.

Load-use hazard

Consider the following instructions:

1
2
3
100 lw $2, 100($1)
104 add $6, $3, $3
108 and $13, $6, $2

The data for register 2 is NOT ready until the end of MEM stage for lw instruction. The data will be forwarded back to an instruction TWO steps behind. In this case, and is two steps behind lw, so it will be forwarded the data from lw.

1554187353326

Now, consider the following instructions:

1
2
3
100 lw $2, 100($1)
104 add $6, $2, $3
108 and $13, $6, $2

Note add is such an instruction that immediately follows a lw and needs the data from it. This gives rise to a load-use hazard!

1554187472509

Data is not available for instruction directly behind the lw. We have to introduce one clock cycle in between the two instructions lw $2, 100($1) and add $6, $2, $3. Nevertheless, this stall is not meant to be added manually. The datapath will implement this stall if load-use hazard is detected, as shown:

1554188281368

This hardware that automatically stalls the datapath when there is a load-use hazard is a load-use stall.

Forwarding and other instructions

Case #1

1
2
add $2, $3, $4
sw $2, 100($3)

For the sw and I-format instructions, forwarding in the datapath will resolve any data hazards with these instructions.

Case #2

1
2
add $2, $3, $4
bne $2, $0, -5

This is also solved with forwarding in the hardware. Note that the branch instruction is in MEM stage now.

Load-use stall

1554188950420

We have to introduce a stall in between lw and and. This has a hardware solution, i.e., detect load-use hazard and implement a hardware stall:

Detection

  • The instruction in ID/EX register must be lw, which we can check the MemRead signal.
  • Either source of instruction in IF/ID must use the same register as lw rt register.

Stalling*

  • Prevent PC and IF/ID registers from changing
  • This stalls instructions in IF and ID stages.
  • For the NOP instruction, set all control signals to 0.

1554189310374

Branch Control hazards

Always, control hazards heppen because of branch instructions. Flow of instruction sequence changes by them. By the time we are able to figure out whether branch is to be taken, we would have already allowed some instructions that are not needed, and they need to be removed.

Let’s start with branch in MEM.

How many instructions have begun before branch is determined? As before, when branch instruction is just about to go to WB stage, that is at the end of the fourth clock cycle, PC gets updated. There is a new instruction fetched at the fifth stage and this fetching is correct. However, the instructions in between the first and the fifth clock cycles are erounously fetched.

Solutions

  1. Insert three NOPs manually.
  2. Add hardware to FLUSH three instructions that we have begun erounously.

Flushing or wastefully inserting NOPs will decrease the efficiency given that there are many branches possible in program. We would like to advance branching into ID stage. Let’s consider branch in ID.

Branch in ID

image-20190402111306379

With the ‘+’ and ‘=’, branch can be decided in ID stage. The control bits are shown as follows

image-20190402111412035

Consider the following code:

1
2
3
4
0 sub
4 lw
8 beq -3
12 addi

image-20190402112022557

image-20190402112031596

image-20190402112039972

Since we don’t want the instruction addi, we need to flush it. We have a hardware for that, i.e. branch flushing.

image-20190402112312265

The to-be-flushed instruction is at IF stage and the flush is already performed, so that in the next stage, it essentially becomes a NOP, by clearing out all the instruction bits in IF/ID:

image-20190402112352792

This addi is often called the Branch Delay Slot.

Summary

  1. Suppose Branch is in ID stage. MIPS can solve this problem by rearranging the code or adding NOPs. Alternatively, use Branch Flushing to zero all control bits of the instrcution in IF/ID register to flush IF stage.
  2. Suppose Branch is in MEM stage. We either do code rearrangement or add three NOPs. Alternatively, branching flushing is needed.

Branch Data Hazard

1
2
add $2, $3, $5
beq $2, $4, -5

This looks like something we had solved. Recall that forwarding only takes place at the end of MEM stage. Here, the ALU resule won’t be ready until the end of EX, and the result won’t be forwarded until the MEM stage. However, beq could not wait that long. The branching decision must be taken in ID stage, that is exactly when add is in EX stage. At that time, forwarding can’t happen, because of the fact that forwarding unit handles forwarding only in MEM stage. This gives rise to a branch data hazard. This needs fixing by introducing a stall. There will be a hardware that passes the correct data to ID stage and implements the necessary stall.