We discussed a number of active research directions in GPGPU architecture. The first direction is supporting control divergence. We started (slide 12, lecture notes 19) by looking at how applications perform when we reduce the warp size to 4 without changing the number of available ALUs. As you might expect, some applications perform better: these are control-divergent applications that experience less divergence with the smaller warp size, which shows that handling control divergence can significantly improve performance. Surprisingly, some applications perform worse. This occurs due to the loss of memory coalescing in those applications. This is also an important observation: whatever we do should not come at the expense of exacerbating other bottlenecks.

The first solution we looked at is Dynamic Warp Scheduling (DWS) from MICRO 2007. At a high level, the idea is to exploit the GPU lanes left idle by control divergence by letting threads from other warps (in the same kernel and at the same program counter) use those idle lanes. One limitation is that a thread cannot change its GPU lane. On slide 14, threads 2 and 3 (in lanes 2 and 3, respectively) can be scheduled with threads 5 and 8 (from other warps) because threads 6 and 7 are inactive for this control flow branch; threads 2 and 3 use the same lanes that threads 6 and 7 would have used. With DWS, we manage to increase ALU utilization and reduce the number of instructions needed to complete the branch-divergent part of the code. However, three unexpected problems arise. First, since the scheduler favors the largest group of threads with the same PC, a small group of threads can starve, repeatedly delayed in favor of a larger group. Eventually this harms performance, for example when we hit a barrier and have to wait for the starved threads to catch up, increasing run time. Second, DWS can cause a loss of memory coalescing even in code segments that have no branch divergence (see slide 16 for an example). Finally, DWS can break the implicit synchrony within a warp, since the threads of a warp are no longer guaranteed to execute in lockstep (i.e., the same instruction at the same time); slide 17 shows an example of code that gets broken by DWS. Thread Block Compaction mitigates (rather than fully solves) these issues by forcing reconvergence after a branch-divergent code segment. Starvation is reduced because the barrier at the end of the divergent segment forces all threads to catch up; since convergent code is executed without compaction, memory coalescing is preserved; and the threads run in lockstep in convergent code segments. We discussed other solutions briefly (up to slide 30).

The second research direction is warp scheduling, primarily to improve memory locality but also for other purposes (e.g., criticality). The main insight is that the scheduler, by determining which warps execute, influences the memory access pattern presented to the memory system. Having more warps to schedule from improves core utilization: when one or more warps block waiting for memory or synchronization, others are available to keep the GPU working. However, given the severely limited cache space (here we assume the L1, but this applies to the L2 and lower levels as well), too many active warps can cause cache misses and reduce performance. The baseline schedulers are round robin (RR among the available warps) and greedy-then-oldest (GTO: keep scheduling from one warp until it stalls due to a cache miss or other event, then switch to the oldest remaining warp).
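To make the two baseline policies concrete, here is a minimal sketch of the selection logic. The Warp fields, the single "ready" flag, and the pick_* helpers are illustrative assumptions, not a real hardware interface; warp ids equal their oldest-first index.

```cpp
#include <vector>

struct Warp {
    bool ready;     // false while the warp is stalled on memory, a barrier, etc.
    bool finished;  // true once the warp has retired all of its instructions
};

// Round robin: start from the warp after the one that issued last cycle and
// take the first warp that is ready. Returns -1 if nothing can issue.
int pick_round_robin(const std::vector<Warp>& warps, int last_issued) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int idx = (last_issued + i) % n;
        if (warps[idx].ready && !warps[idx].finished) return idx;
    }
    return -1;
}

// Greedy-then-oldest: keep issuing from the current warp until it stalls or
// finishes, then fall back to the oldest ready warp.
int pick_gto(const std::vector<Warp>& warps, int current) {
    if (current >= 0 && warps[current].ready && !warps[current].finished)
        return current;  // greedy: stay on the same warp
    for (int idx = 0; idx < static_cast<int>(warps.size()); ++idx)
        if (warps[idx].ready && !warps[idx].finished) return idx;  // oldest first
    return -1;
}

int main() {
    // Warp 0 (the oldest) is stalled; warps 1 and 2 are ready.
    std::vector<Warp> warps = {{false, false}, {true, false}, {true, false}};
    int rr  = pick_round_robin(warps, /*last_issued=*/0);  // picks warp 1
    int gto = pick_gto(warps, /*current=*/0);              // warp 0 stalled, so oldest ready: warp 1
    return (rr == 1 && gto == 1) ? 0 : 1;
}
```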
GTO favors scheduling from just enough warps to hide memory latency. There are also two-level variants of both policies, where the scheduler issues only from the warps in the first level; occasionally, warps from the second level are moved into the first level and start getting scheduled.

CCWS (slide 40) explicitly takes the size of the cache into account. It works by estimating the amount of cache space each warp needs and scheduling warps such that their collective need does not exceed the cache capacity; this principle underlies several of the schedulers we looked at. CCWS estimates the cache needs by keeping a small cache of recent victims for each warp. If a warp experiences a miss and the address matches a tag in its victim cache, that means the miss was caused by a line that was evicted recently while the warp still needed it, so the system increases its estimate of the cache space needed by that warp. At any given time, we schedule warps such that their collective cache requirement does not exceed the cache space (a small sketch of this accounting appears at the end of these notes). We saw two variations on this general idea. The first, static wavefront limiting, profiles the execution to estimate the cache need of each warp and provides that estimate to the scheduler. Limitations of this approach (and of profiling in general) include, in addition to the need to profile, the inability to account for input-dependent behavior or to track warps whose usage varies over time. The second variant is Divergence-Aware Warp Scheduling (DAWS), which has the compiler estimate divergence and maps that divergence to a cache-size requirement for each warp. DAWS focuses on loops and classifies memory accesses within loops as divergent or non-divergent. For divergent loads, it estimates the control divergence and the size of the active mask, from which it estimates the number of cache lines needed for that load (one cache line for every active thread). Thus it replaces the reactive component of CCWS, or the profile-driven estimate of static wavefront limiting, with a more intelligent and proactive estimate from the compiler.

We also quickly overviewed some other schedulers. Priority-based cache allocation (slide 50) observes that limiting the number of warps to what fits in the L1 leaves most of the L2 unused. It therefore extends CCWS with another level of warps that are scheduled but do not cache their data in the L1. The benefit of having these extra warps active is that they increase utilization when the other warps stall and keep more warps moving, reducing the impact of barriers (where we would otherwise have to wait for warps that have not yet been scheduled to catch up). On slide 51, we talked about criticality-aware scheduling, where we estimate the critical path of the computation and give more resources (scheduling slots and cache space) to the warps on that path. From here, we very briefly discussed three other research directions: (1) coherent CPU-GPU memory abstractions; (2) synchronization support and transactional memory; and (3) power efficiency for GPGPUs.
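As promised above, here is a minimal sketch of the footprint-accounting idea behind CCWS. This is illustrative only, not the published design: the victim-buffer depth, the L1 size, the per-warp fields, and the greedy oldest-first selection (real CCWS uses a score-based cutoff) are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

constexpr std::size_t kVictimDepth = 8;  // per-warp victim-tag entries (assumed)
constexpr int kL1Lines = 256;            // total L1 data-cache lines (assumed)

struct WarpState {
    std::deque<uint64_t> victim_tags;  // tags of lines recently evicted for this warp
    int est_lines_needed = 1;          // estimated working set, in cache lines
};

// Record an eviction of one of this warp's lines.
void on_evict(WarpState& w, uint64_t tag) {
    w.victim_tags.push_back(tag);
    if (w.victim_tags.size() > kVictimDepth) w.victim_tags.pop_front();
}

// On a miss, check the victim buffer: a hit there means the warp recently lost
// a line it still needed, so its estimated footprint grows.
void on_miss(WarpState& w, uint64_t tag) {
    for (uint64_t v : w.victim_tags)
        if (v == tag) { ++w.est_lines_needed; return; }
}

// Greedily pick warps, oldest first, whose combined estimates fit in the L1.
std::vector<int> schedulable(const std::vector<WarpState>& warps) {
    std::vector<int> eligible;
    int used = 0;
    for (std::size_t i = 0; i < warps.size(); ++i) {
        if (used + warps[i].est_lines_needed <= kL1Lines) {
            eligible.push_back(static_cast<int>(i));
            used += warps[i].est_lines_needed;
        }
    }
    return eligible;
}

int main() {
    std::vector<WarpState> warps(4);
    // Drive warp 0 through evict-then-miss events to grow its estimate to 120.
    for (uint64_t tag = 0; tag < 119; ++tag) {
        on_evict(warps[0], tag);
        on_miss(warps[0], tag);  // each miss hits the victim buffer here
    }
    // Assume the other warps' estimates have settled at these values.
    warps[1].est_lines_needed = 100;
    warps[2].est_lines_needed = 60;
    warps[3].est_lines_needed = 40;
    // Only warps 0 and 1 fit together (120 + 100 <= 256); 2 and 3 are held back.
    for (int w : schedulable(warps)) std::cout << "warp " << w << " may issue\n";
    return 0;
}
```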