In a multicycle datapath we knew that we have different stages set up. However, the entire datapath is exclusively used by 1 instruction, which will be slow. In pipelining, there will be multiple instructions, in fact five instructions, run in parallel in datapath, i.e. adding parallelism to multicycle computer using the same five stages.
This is an example about the following instructions:
lw $10, $20($1)
sub $11, $2, $3
In clockcycle 1, the first instruction is in the instruction fetch stage. As soon as the first instruction goes to the instruction decoding stage, the second instruction shows up in the first stage. At the fifth clockcycle, the first instruction finishes at the write-back stage, while the second instruction stops at memory stage. This can be done correctly by using the intermediate registers.
Now then, every stage of the pipeline with contain a different instruction. Also, one new instruction begins every clock cycle. Then one completed instruction will occur every clock cycle.
All necessary information with respect to one instruction will follow the intermediate register in the datapath. Also note that we will not need the write register number until the fifth stage. That is to say, the write-register-number will be carried through from stage 1 to 5.
Instructions need all the control bits set for their execution to work. The instruction control lines follow the instrcution through the pipeline:
We need to extend these intermediate registers to pass control signals down to where they are used.
At the end of the clock cycle, PC will be updated to $PC+4$ and all the pipeline registers are also updated at the end of every clock cycle.
In the instruction decoding stage, the OPCODE is used by the control unit to generate all the control bits, and meanwhile, data are read from register file. The control bits specifically for write-back, memory, and execution stages are stored along with the register data in the pipeline register ID/EX. Particularly, for
instruction[15-11] indicates the destination register number, which should be carried.
In the execution stage, the EX-info stored in the previous pipeline register will be used up, so the subsequent register will no longer carry it. However, the other carried control-bits and data are passed on.
In MEM stage,
add does not really have any work to do, but it must sit idle in this stage and pass on all other information after using MEM-info.
The WB part of the MEM/WB is the
RegWrite control-bit. The below part includes the result from ALU and the destination register number, which will all be used to write-back.
In the MEM stage, the branch target address will override PC such that the next instruction after the OR instruction will fetch correctly.
In the next clock cycle:
The correct instruction after
lw. That means
and would be incorrectly executed, so what do we do? Flush them out! This brings about the concern of pipeline hazards.
The first instruction is
sub $2, $1, $3. It can only update the value in register 2 at the end of the fifth clock cycles. However, the second instruction
sub $12, $2, $5has to use the value in register 2 at the third clock cycle. Unless we had a time machine,
sub $12, $2, $5would only be able to use the old value. This gives rise to a control hazard.
sw, it is perfectly fine since register 2 will have been updated successfully by the time it is needed for
To solve the data hazard between
add, we allow read to occur second, write to occur first.
We have two solutions:
Stalling: we add a stall to an instruction that needs a value from a previous instruction by using
NOPs, which is an instruction that has no effects. We insert them to force a gap between an instruction that has a data hazard with a previous instruction.
A better solution is to use data forwarding.
How to detect a data hazard? Check current instructions
$rtregisters: did they depend on the
$rtof the instruction before it, OR 2 instructions before it.
We will have a forwarding unit to solve almost all data hazards:
with that, the value in destination register will be forwarded as shown
- However, things are more complicated if there is a load-use hazard.
Consider the following instructions:
100 lw $2, 100($1)
104 add $6, $3, $3
108 and $13, $6, $2
The data for register 2 is NOT ready until the end of MEM stage for
lw instruction. The data will be forwarded back to an instruction TWO steps behind. In this case,
and is two steps behind
lw, so it will be forwarded the data from
Now, consider the following instructions:
100 lw $2, 100($1)
add is such an instruction that immediately follows a
lw and needs the data from it. This gives rise to a load-use hazard!
Data is not available for instruction directly behind the
lw. We have to introduce one clock cycle in between the two instructions
lw $2, 100($1) and
add $6, $2, $3. Nevertheless, this stall is not meant to be added manually. The datapath will implement this stall if load-use hazard is detected, as shown:
This hardware that automatically stalls the datapath when there is a load-use hazard is a load-use stall.
add $2, $3, $4
sw and I-format instructions, forwarding in the datapath will resolve any data hazards with these instructions.
add $2, $3, $4
This is also solved with forwarding in the hardware. Note that the branch instruction is in MEM stage now.
We have to introduce a stall in between
and. This has a hardware solution, i.e., detect load-use hazard and implement a hardware stall:
- The instruction in ID/EX register must be
lw, which we can check the
- Either source of instruction in IF/ID must use the same register as
- Prevent PC and IF/ID registers from changing
- This stalls instructions in IF and ID stages.
- For the NOP instruction, set all control signals to 0.
Always, control hazards heppen because of branch instructions. Flow of instruction sequence changes by them. By the time we are able to figure out whether branch is to be taken, we would have already allowed some instructions that are not needed, and they need to be removed.
Let’s start with branch in MEM.
How many instructions have begun before branch is determined? As before, when branch instruction is just about to go to WB stage, that is at the end of the fourth clock cycle, PC gets updated. There is a new instruction fetched at the fifth stage and this fetching is correct. However, the instructions in between the first and the fifth clock cycles are erounously fetched.
- Insert three NOPs manually.
- Add hardware to FLUSH three instructions that we have begun erounously.
Flushing or wastefully inserting NOPs will decrease the efficiency given that there are many branches possible in program. We would like to advance branching into ID stage. Let’s consider branch in ID.
With the ‘+’ and ‘=’, branch can be decided in ID stage. The control bits are shown as follows
Consider the following code:
Since we don’t want the instruction
addi, we need to flush it. We have a hardware for that, i.e. branch flushing.
The to-be-flushed instruction is at IF stage and the flush is already performed, so that in the next stage, it essentially becomes a NOP, by clearing out all the instruction bits in IF/ID:
addi is often called the Branch Delay Slot.
- Suppose Branch is in ID stage. MIPS can solve this problem by rearranging the code or adding NOPs. Alternatively, use Branch Flushing to zero all control bits of the instrcution in IF/ID register to flush IF stage.
- Suppose Branch is in MEM stage. We either do code rearrangement or add three NOPs. Alternatively, branching flushing is needed.
add $2, $3, $5
This looks like something we had solved. Recall that forwarding only takes place at the end of MEM stage. Here, the ALU resule won’t be ready until the end of
EX, and the result won’t be forwarded until the
MEM stage. However,
beq could not wait that long. The branching decision must be taken in ID stage, that is exactly when
add is in
EX stage. At that time, forwarding can’t happen, because of the fact that forwarding unit handles forwarding only in
MEM stage. This gives rise to a branch data hazard. This needs fixing by introducing a stall. There will be a hardware that passes the correct data to ID stage and implements the necessary stall.