We discussed a number of active research directions in GPGPU architecture. The first direction is supporting control divergence. We started (slide 12, lecture notes 19) by looking at how applications perform when we reduce the warp size to 4 without changing the number of available ALUs. As you might expect, some applications perform better: these are control-divergent applications that experience less divergence with the smaller warp size, which shows that handling control divergence can significantly improve performance. Surprisingly, some applications perform worse. This occurs due to the loss of memory coalescing in those applications. This is also an important observation: whatever we do should not come at the expense of exacerbating other bottlenecks.

The first solution we looked at is Dynamic Warp Scheduling (DWS) from MICRO 2007. At a high level, the idea is to exploit the GPU lanes left idle by control divergence by letting threads from other warps (in the same kernel and at the same program counter) use those idle lanes. One limitation is that a thread cannot change its GPU lane. On slide 14, threads 2 and 3 (in lanes 2 and 3, respectively) can be scheduled with threads 5 and 8 (from other warps) because threads 6 and 7 are inactive for this control flow branch; threads 2 and 3 use the same lanes that threads 6 and 7 would have used. With DWS, we manage to increase ALU utilization and reduce the number of instructions needed to complete the branch-divergent part of the code. However, three unexpected problems arise. First, since the scheduler favors the largest group of threads with the same PC, a small group of threads can starve, repeatedly delayed in favor of a larger group. Eventually this harms performance, for example when we hit a barrier and have to wait for the starved threads to catch up, increasing run time. Second, DWS can cause a loss of memory coalescing even in code segments that have no branch divergence (see slide 16 for an example). Finally, DWS can break the implicit synchrony within a warp, since the threads of a warp are no longer guaranteed to execute in lockstep (i.e., the same instruction at the same time); slide 17 shows an example of code that gets broken by DWS. Thread Block Compaction mitigates (rather than fully solves) these issues by forcing reconvergence after a branch-divergent code segment. Starvation is reduced because the barrier at the end of the divergent segment forces all threads to catch up; since convergent code is executed without compaction, memory coalescing is preserved; and the threads run in lockstep in convergent code segments. We discussed other solutions briefly (up to slide 30).

The second research direction is warp scheduling, primarily to improve memory locality but also for other purposes (e.g., criticality). The main insight is that the scheduler, by determining which warps execute, influences the memory access pattern presented to the memory system. Having more warps to schedule from improves core utilization: when one or more warps block waiting for memory or synchronization, others are available to keep the GPU working. However, given the severely limited cache space (here we assume the L1, but this applies to the L2 and lower levels as well), too many active warps can cause cache misses and reduce performance. The baseline schedulers are round robin (RR among the available warps) and greedy-then-oldest (GTO: keep scheduling from one warp until it stalls due to a cache miss or other event, then switch to the oldest remaining warp).
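To make the two baseline policies concrete, here is a minimal sketch of the selection logic. The Warp fields, the single "ready" flag, and the pick_* helpers are illustrative assumptions, not a real hardware interface; warp ids equal their oldest-first index.

```cpp
#include <vector>

struct Warp {
    bool ready;     // false while the warp is stalled on memory, a barrier, etc.
    bool finished;  // true once the warp has retired all of its instructions
};

// Round robin: start from the warp after the one that issued last cycle and
// take the first warp that is ready. Returns -1 if nothing can issue.
int pick_round_robin(const std::vector<Warp>& warps, int last_issued) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int idx = (last_issued + i) % n;
        if (warps[idx].ready && !warps[idx].finished) return idx;
    }
    return -1;
}

// Greedy-then-oldest: keep issuing from the current warp until it stalls or
// finishes, then fall back to the oldest ready warp.
int pick_gto(const std::vector<Warp>& warps, int current) {
    if (current >= 0 && warps[current].ready && !warps[current].finished)
        return current;  // greedy: stay on the same warp
    for (int idx = 0; idx < static_cast<int>(warps.size()); ++idx)
        if (warps[idx].ready && !warps[idx].finished) return idx;  // oldest first
    return -1;
}

int main() {
    // Warp 0 (the oldest) is stalled; warps 1 and 2 are ready.
    std::vector<Warp> warps = {{false, false}, {true, false}, {true, false}};
    int rr  = pick_round_robin(warps, /*last_issued=*/0);  // picks warp 1
    int gto = pick_gto(warps, /*current=*/0);              // warp 0 stalled, so oldest ready: warp 1
    return (rr == 1 && gto == 1) ? 0 : 1;
}
```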
GTO favors scheduling from just enough warps to hide memory latency. There are also two-level variants of both policies, where the scheduler issues only from the warps in the first level; occasionally, warps from the second level are moved into the first level and start getting scheduled.

CCWS (slide 40) explicitly takes the size of the cache into account. It works by estimating the amount of cache space each warp needs and scheduling warps such that their collective need does not exceed the cache capacity; this principle underlies several of the schedulers we looked at. CCWS estimates the cache needs by keeping a small cache of recent victims for each warp. If a warp experiences a miss and the address matches a tag in its victim cache, that means the miss was caused by a line that was evicted recently while the warp still needed it, so the system increases its estimate of the cache space needed by that warp. At any given time, we schedule warps such that their collective cache requirement does not exceed the cache space (a small sketch of this accounting appears at the end of these notes). We saw two variations on this general idea. The first, static wavefront limiting, profiles the execution to estimate the cache need of each warp and provides that estimate to the scheduler. Limitations of this approach (and of profiling in general) include, in addition to the need to profile, the inability to account for input-dependent behavior or to track warps whose usage varies over time. The second variant is Divergence-Aware Warp Scheduling (DAWS), which has the compiler estimate divergence and maps that divergence to a cache-size requirement for each warp. DAWS focuses on loops and classifies memory accesses within loops as divergent or non-divergent. For divergent loads, it estimates the control divergence and the size of the active mask, from which it estimates the number of cache lines needed for that load (one cache line for every active thread). Thus it replaces the reactive component of CCWS, or the profile-driven estimate of static wavefront limiting, with a more intelligent and proactive estimate from the compiler.

We also quickly overviewed some other schedulers. Priority-based cache allocation (slide 50) observes that limiting the number of warps to what fits in the L1 leaves most of the L2 unused. It therefore extends CCWS with another level of warps that are scheduled but do not cache their data in the L1. The benefit of having these extra warps active is that they increase utilization when the other warps stall and keep more warps moving, reducing the impact of barriers (where we would otherwise have to wait for warps that have not yet been scheduled to catch up). On slide 51, we talked about criticality-aware scheduling, where we estimate the critical path of the computation and give more resources (scheduling slots and cache space) to the warps on that path. From here, we very briefly discussed three other research directions: (1) coherent CPU-GPU memory abstractions; (2) synchronization support and transactional memory; and (3) power efficiency for GPGPUs.
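As promised above, here is a minimal sketch of the footprint-accounting idea behind CCWS. This is illustrative only, not the published design: the victim-buffer depth, the L1 size, the per-warp fields, and the greedy oldest-first selection (real CCWS uses a score-based cutoff) are all assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

constexpr std::size_t kVictimDepth = 8;  // per-warp victim-tag entries (assumed)
constexpr int kL1Lines = 256;            // total L1 data-cache lines (assumed)

struct WarpState {
    std::deque<uint64_t> victim_tags;  // tags of lines recently evicted for this warp
    int est_lines_needed = 1;          // estimated working set, in cache lines
};

// Record an eviction of one of this warp's lines.
void on_evict(WarpState& w, uint64_t tag) {
    w.victim_tags.push_back(tag);
    if (w.victim_tags.size() > kVictimDepth) w.victim_tags.pop_front();
}

// On a miss, check the victim buffer: a hit there means the warp recently lost
// a line it still needed, so its estimated footprint grows.
void on_miss(WarpState& w, uint64_t tag) {
    for (uint64_t v : w.victim_tags)
        if (v == tag) { ++w.est_lines_needed; return; }
}

// Greedily pick warps, oldest first, whose combined estimates fit in the L1.
std::vector<int> schedulable(const std::vector<WarpState>& warps) {
    std::vector<int> eligible;
    int used = 0;
    for (std::size_t i = 0; i < warps.size(); ++i) {
        if (used + warps[i].est_lines_needed <= kL1Lines) {
            eligible.push_back(static_cast<int>(i));
            used += warps[i].est_lines_needed;
        }
    }
    return eligible;
}

int main() {
    std::vector<WarpState> warps(4);
    // Drive warp 0 through evict-then-miss events to grow its estimate to 120.
    for (uint64_t tag = 0; tag < 119; ++tag) {
        on_evict(warps[0], tag);
        on_miss(warps[0], tag);  // each miss hits the victim buffer here
    }
    // Assume the other warps' estimates have settled at these values.
    warps[1].est_lines_needed = 100;
    warps[2].est_lines_needed = 60;
    warps[3].est_lines_needed = 40;
    // Only warps 0 and 1 fit together (120 + 100 <= 256); 2 and 3 are held back.
    for (int w : schedulable(warps)) std::cout << "warp " << w << " may issue\n";
    return 0;
}
```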