A critical part of the SIMT model (and of the SIMD model as well!) is enabling threads to execute conditional statements based on local data, which can differ from thread to thread. This is the behavior that leads to thread divergence, and it is supported by the SIMT stack. The stack sends the target PC to the fetch unit (to control which instruction is fetched next) and the active mask to the issue unit (to control which lanes of the warp are active). When we hit a conditional branch, different threads may evaluate the condition differently (taken for some threads, not-taken for others). Thread divergence is handled by keeping track of an active mask for every control flow path: a bit vector with a 1 for every thread that is active on the corresponding path. While that path is being executed, only the threads with a 1 in the corresponding bit execute its instructions.

To minimize the impact of control divergence, threads should reconverge at the earliest possible point once the divergence is done (even though it is possible to reconverge later, or even at the end of the program). This earliest reconvergence point is called the immediate post-dominator in control flow graph/compiler terminology. The active mask at the post-dominator covers the active masks of all the control flow paths that meet there; in other words, it has a 1 for every thread that is active on any of the divergent paths that reconverge at that point.

To support nested divergence and reconvergence, the SIMT stack implementation is explained on slide 25. When we hit a divergence point, we push on the stack: (1) the current active mask and the PC of the reconvergence point; (2) the active mask, PC, and reconvergence PC for every divergent path. We start executing one of the paths with its specified active mask until we reach the reconvergence PC. At that point, that path is done and is popped off the stack.
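The push/pop behavior just described (and completed in the next paragraph) can be sketched as a small simulation. This is a toy model of the slide-25 mechanism, not any vendor's actual hardware; the function and field names are my own, and masks are plain integers used as bit vectors.

```python
# Toy model of a SIMT reconvergence stack. Each stack entry is
# (next PC, active mask, reconvergence PC); masks are bit vectors
# with one bit per thread of the warp.

def branch(stack, active_mask, taken_mask, taken_pc, fallthru_pc, reconv_pc):
    """On a divergent branch: push the reconvergence entry first, then one
    entry per divergent path. The top of the stack executes next."""
    not_taken_mask = active_mask & ~taken_mask
    # (1) reconvergence entry: original mask, resume at the post-dominator
    stack.append((reconv_pc, active_mask, None))
    # (2) one entry per divergent path
    if taken_mask:
        stack.append((taken_pc, taken_mask, reconv_pc))
    if not_taken_mask:
        stack.append((fallthru_pc, not_taken_mask, reconv_pc))

# 4-thread warp, all active (mask 0b1111); threads 0 and 2 take the branch.
stack = []
branch(stack, 0b1111, 0b0101, taken_pc=0x20, fallthru_pc=0x30, reconv_pc=0x40)

pc, mask, reconv = stack.pop()       # execute the not-taken path first
assert (pc, mask) == (0x30, 0b1010)
pc, mask, reconv = stack.pop()       # it reached 0x40, so pop: taken path
assert (pc, mask) == (0x20, 0b0101)
pc, mask, reconv = stack.pop()       # finally the reconvergence entry
assert (pc, mask) == (0x40, 0b1111)  # full mask restored -- reconverged
```

Real implementations differ in details (e.g., updating the top-of-stack entry in place rather than popping and re-pushing), but the invariant is the same: when only the bottom entry remains, the warp is reconverged with its original mask.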
We then set the active mask to that of the next path (the new top of the stack) and execute from its PC until it, too, hits the reconvergence PC. Once all the paths are done, the stack holds only the original active mask and the PC of the reconvergence point -- we are reconverged!

In practice, this is implemented using predication. We went through an example on slides 27 and 28. Predication support in the instruction set works as follows. If an instruction is preceded by @P, its execution is predicated on the value of the predicate register P (i.e., a thread executes it only if its corresponding bit in P is true). P is a register with one bit per thread. Typically P is set by a preceding setp instruction, which evaluates the specified comparison in every thread. For example, the setp.neq.s32 instruction compares RD0 and the immediate 0 and sets P1 in each thread where they are not equal. If an instruction ends with a *OP suffix, this is an unconditional operation on the active mask/active mask stack: *Push pushes onto the stack, *Comp complements the mask bits, and *Pop pops a value off the stack. This is a simplified example, and I don't expect you to know the details of the implementation -- it's enough to understand the conceptual implementation on slide 25.

Register File:
--------------

Next, we discussed the register file. Once instructions are ready to issue, they must get their operands from the register file. The register file is very large, and it would need at least 4 ports if we want to read three operands (as required by multiply-and-add) and write one in a single cycle. Each port is also very wide, since it needs to move 32 register values at a time. This is very expensive for a register file of this size. So, to simplify the design, we implement the register file as a multi-banked structure, with each bank being single-ported. If we are lucky and the four operands map to different banks, we get the effect of a multi-ported register file.
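Circling back to the predication mechanism from slides 27 and 28 for a moment: the @P/setp interaction can be sketched as a tiny per-thread simulation. This is a software toy model of the idea, not exact SASS semantics; the register values and function name are made up for illustration.

```python
# Toy model of per-thread predication in a 4-thread warp.

def setp_neq(regs, imm):
    """Like setp.neq: compute one predicate bit per thread by comparing
    that thread's register value against an immediate."""
    return [r != imm for r in regs]

R0 = [0, 3, 0, 7]        # per-thread values of a source register
R1 = [10, 10, 10, 10]    # per-thread values of a destination register

P1 = setp_neq(R0, 0)     # threads 1 and 3 have nonzero R0
assert P1 == [False, True, False, True]

# "@P1 add R1, R1, 1": every thread sees the instruction, but only
# threads with their P1 bit set write the result back.
R1 = [r + 1 if p else r for r, p in zip(R1, P1)]
assert R1 == [10, 11, 10, 11]
```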
However, if the register operands map to the same bank, the accesses have to be serialized. This serialization has two implications. First, even when an instruction is ready to issue, its operands may not be available immediately, so it has to stay in place until the operands have been read by the operand collector logic. This is similar to the reservation stations in Tomasulo's algorithm for scheduling instruction execution in out-of-order processors. Second, access serialization can slow down execution. If we allow accesses from different warps to proceed together, and carefully schedule them so that accesses to different banks happen in parallel, performance improves significantly. This scheduling is also done by the operand collector. Slide 35 shows an example.

AMD Southern Islands:
---------------------

Finally, we talked about an optimization that AMD implements in their pipeline. The main idea is that device code sometimes uses scalar variables; if these scalar operations are executed across all lanes of a warp, utilization is very low. So, AMD adds a scalar ALU to the SIMT core, with its own instructions, to improve the efficiency of handling scalars. This concludes the basic architecture overview. In the next note, we will start discussing research directions.
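As a small appendix to the register file discussion above: the effect of bank conflicts, and of the operand collector interleaving reads from different warps, can be sketched with a toy cycle counter. The bank mapping (register number modulo the bank count), the greedy one-read-per-bank-per-cycle policy, and all names here are my own simplifications, not the actual hardware design.

```python
# Toy model of a banked, single-ported register file with an
# operand-collector-style scheduler.

NUM_BANKS = 4

def schedule(requests):
    """requests: list of (warp_id, reg) operand reads. Each bank serves at
    most one read per cycle; non-conflicting reads (possibly from different
    warps) issue together. Returns the number of cycles used."""
    pending = list(requests)
    cycles = 0
    while pending:
        cycles += 1
        banks_used = set()
        deferred = []
        for warp, reg in pending:
            bank = reg % NUM_BANKS        # assumed mapping: reg mod banks
            if bank in banks_used:
                deferred.append((warp, reg))   # conflict: serialize
            else:
                banks_used.add(bank)
        pending = deferred
    return cycles

# One warp, three source operands in distinct banks: one cycle.
assert schedule([("A", 0), ("A", 1), ("A", 2)]) == 1
# All three operands in bank 0: fully serialized.
assert schedule([("A", 0), ("A", 4), ("A", 8)]) == 3
# Two warps, conflicts within each warp but not across warps:
# interleaving finishes in 2 cycles instead of 2 + 2 = 4.
assert schedule([("A", 0), ("A", 4), ("B", 1), ("B", 5)]) == 2
```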