This is what we covered in the first class -- slides 1 to 25 of lecture 18.

Architecture discussion starts with the ISA. For NVIDIA, PTX is a virtual ISA (it's not the real ISA, but it's close; for example, it allows the use of an unlimited number of registers). NVCC compiles it down to the true ISA, called SASS, as it produces the binary. Different binaries, corresponding to different compute capabilities, are included in the executable. When the program executes, it picks the right version of SASS for the GPU platform, or uses just-in-time compilation of the PTX if the platform is newer than any included binary. Look at slide 11 for examples of PTX. It's a RISC instruction set. Notice the use of predication for conditional branches.

GPGPU Architecture:
-------------------
The front end of an SIMT core (in green on slides 18 and 19) implements the fetch, decode, and scheduling components of the GPU architecture -- it multiplexes the different warps on top of the available SIMD data path.

Fetch/Decode Stage (slide 20):
------------------------------
The pipeline starts with the fetch/decode stage (slide 20). Instructions are scheduled from an instruction buffer (I-Buffer), which has a fixed number of instruction slots for every warp. So, we can only bring more instructions into the I-Buffer for a given warp if that warp has empty slots there. Each I-Buffer slot has a valid bit (v) which, if set, indicates that the slot holds a valid instruction. It also has a ready bit (r), which indicates whether the instruction is ready to issue or is still waiting on a dependency tracked by the scoreboard (explained soon). For a warp to have an available slot in the I-Buffer, and therefore to be a candidate to fetch more instructions, it must have at least one slot empty (v bit is 0). All warps with availability in the I-Buffer are candidates for fetching instructions. The v-bits are communicated to the fetch arbitration logic (see the arrow marked "to fetch" going left from the I-Buffer, which is also the arrow marked Valid[1:N] going into the fetch logic).
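To make the fetch eligibility rule concrete, here is a tiny Python sketch (an illustrative model only, not the real hardware; the names and slot count are invented): a warp is a fetch candidate exactly when at least one of its I-Buffer slots has v == 0.

```python
# Toy model of I-Buffer fetch eligibility (illustrative; names are made up).
def fetch_candidates(ibuffer):
    """ibuffer: dict mapping warp_id -> list of slot dicts with a 'v' (valid) bit.
    A warp can fetch more instructions iff at least one slot is empty (v == 0)."""
    return [w for w, slots in ibuffer.items()
            if any(slot['v'] == 0 for slot in slots)]

# Example: warp 0 has both slots full, warp 1 has an empty slot.
ibuf = {
    0: [{'v': 1}, {'v': 1}],   # both slots hold valid instructions
    1: [{'v': 1}, {'v': 0}],   # one empty slot -> candidate for fetch
}
print(fetch_candidates(ibuf))   # -> [1]
```

The fetch arbiter then picks one warp from this candidate list each cycle.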
The first scheduler then looks at the warps that have room based on the valid bits, and selects one of them to fetch instructions for from the I-Cache. If the instructions are in the I-Cache, they are fetched, decoded, and placed in the I-Buffer. If not, we issue the fetch of the instruction from main memory and continue operation (we don't wait for it). The warp will later be scheduled again, at which point we will have an I-Cache hit.

Instruction Issue:
------------------
Starting with slide 21, note that a warp is eligible to issue an instruction if it has a valid and ready (according to the scoreboard) instruction in the I-Buffer. Remember that this is an in-order pipeline, so we are talking about the first instruction in program order being valid and ready; we don't issue instructions out of order. Here the second scheduler (the warp scheduler) comes into play to decide which warp to run next. Schedulers such as greedy-then-oldest (GTO) and odd/even scheduling are used here. We will talk about scheduling in more detail later.

Scoreboard:
-----------
GPUs work in order and issue one or two instructions at a time (the odd/even scheduler can issue two instructions at a time, but from different warps). Scoreboarding keeps track of dependencies among instructions from the same warp. A warp issues one instruction in each cycle that the scheduler selects it to execute. However, since instruction execution can take multiple cycles, two instructions from the same warp can be active at the same time. Scoreboarding keeps track of dependencies to make sure we do not allow an instruction to start executing if it depends on a previous instruction that is still executing. The ready bit is set by the hardware scoreboard (slides 22 to 24). To simplify, a scoreboard can simply track the state of each register for each warp. If the warp issues an instruction whose destination is a register, we mark that register to say that the value in it is not available.
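This per-register bookkeeping can be sketched in a few lines of Python (an invented model for one warp, not the real hardware, which is just bitmaps): a register set holds the "not available" marks, and an instruction may issue only if none of its source or destination registers are marked.

```python
# Toy per-register scoreboard for one warp (illustrative only).
class Scoreboard:
    def __init__(self):
        self.pending = set()  # registers whose values are not yet available

    def can_issue(self, srcs, dst):
        # Block on RAW (a source is pending) and WAW (the dest is pending).
        return not ((set(srcs) | {dst}) & self.pending)

    def issue(self, srcs, dst):
        assert self.can_issue(srcs, dst)
        self.pending.add(dst)       # mark destination as not available

    def complete(self, dst):
        self.pending.discard(dst)   # writeback: value is now available

sb = Scoreboard()
sb.issue(srcs=['r2', 'r1'], dst='r3')      # add r3, r2, r1
print(sb.can_issue(['r3', 'r4'], 'r5'))    # -> False (RAW on r3)
sb.complete('r3')
print(sb.can_issue(['r3', 'r4'], 'r5'))    # -> True
```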
This way, if an instruction comes later that needs this value, it knows to wait until the value is generated by the instruction that set this bit. You can see that this simple approach takes care of both Read After Write (RAW) and Write After Write (WAW) dependencies. To illustrate, here is an example:

    add r3, r2, r1   // r3 = r2 + r1
    sub r5, r3, r4   // RAW on r3
    add r5, r2, r1   // WAW on r5

After the first instruction issues, we mark r3 as unavailable. When the sub instruction arrives, it cannot issue since r3 is not ready. After the first instruction completes, sub can now read the new value of r3 and issue, marking its destination register r5 as unavailable. The third instruction cannot issue until then, since it also writes to r5 (WAW). (What do we do about WAR dependencies?) They are not a problem in this case because we issue in order: the earlier read has already read its register values by the time we issue the write that follows it.

Implementation of the scoreboard:
---------------------------------
It's very expensive to implement the scoreboard as described above. We would need a bit for every register of every warp, and they all have to be updated as we operate. So, we take a shortcut. We track only a limited number of dependencies for every warp -- let's say 6. If we run out of slots, we may delay issuing instructions that need a dependency slot even when they are otherwise ready. Six was chosen to balance overhead against the likelihood of blocking instructions because we don't have enough space in the scoreboard. We need to add a scoreboard table that identifies which register each of the 6 entries is tracking. See the example on slide 24 for how the implementation works.
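The limited-entry shortcut can be sketched the same way (again an invented model, not the actual microarchitecture): instead of a bit per register, each warp gets a small table of outstanding destination registers, and an instruction stalls either on a real dependency or, structurally, when the table is full.

```python
# Toy limited-entry scoreboard for one warp: track at most max_entries
# outstanding destination registers instead of one bit per register.
class LimitedScoreboard:
    def __init__(self, max_entries=6):   # 6 as in the lecture's example
        self.max_entries = max_entries
        self.entries = set()             # registers with outstanding writes

    def can_issue(self, srcs, dst):
        # Dependency check, as before (RAW/WAW)...
        if (set(srcs) | {dst}) & self.entries:
            return False
        # ...plus a structural stall: no free entry left to track dst.
        return len(self.entries) < self.max_entries

    def issue(self, srcs, dst):
        assert self.can_issue(srcs, dst)
        self.entries.add(dst)

    def complete(self, dst):
        self.entries.discard(dst)        # free the entry at writeback

sb = LimitedScoreboard(max_entries=2)    # tiny table, to show the stall
sb.issue(['r1', 'r2'], 'r3')
sb.issue(['r1', 'r2'], 'r4')
print(sb.can_issue(['r1', 'r2'], 'r5'))  # -> False: table full, no true dependency
sb.complete('r3')
print(sb.can_issue(['r1', 'r2'], 'r5'))  # -> True
```

Note the second stall above is purely structural: r5 has no dependency on r3 or r4, yet it waits because the table has no free entry -- exactly the trade-off the 6-entry choice balances.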