Lab 2 - Reduction

DUE Monday, May 18

Overview

The objective of this assignment is to implement an optimized reduction kernel and analyze basic architectural performance properties.

Optimized Reduction

In the first step, you will implement a reduction kernel with optimized thread indexing. Recall that we discussed two types of reduction in class: naive and optimized. The naive kernel suffers from significant warp divergence because of its thread indexing. The optimized version avoids most of that divergence. Your thread indexing should behave as shown on slide 33 of the Reduction slides.
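
To make the difference in thread indexing concrete, here is a small host-side sketch (plain C++, not the kernel itself) that lists which thread IDs are active at each reduction step. It assumes the common textbook formulation: the naive kernel activates thread t when t % (2 * stride) == 0 with the stride doubling, and the optimized kernel activates thread t when t < stride with the stride halving. The function names are hypothetical, and the version on the course slides may differ in detail.

```cpp
#include <vector>

// Active thread IDs per step for an interleaved ("naive") reduction.
// Assumption: thread t works when t % (2 * stride) == 0, stride = 1, 2, 4, ...
std::vector<std::vector<int>> active_threads_naive(int block_threads) {
    std::vector<std::vector<int>> steps;
    for (int stride = 1; stride < block_threads; stride *= 2) {
        std::vector<int> active;
        for (int t = 0; t < block_threads; ++t) {
            if (t % (2 * stride) == 0) active.push_back(t);
        }
        steps.push_back(active);
    }
    return steps;
}

// Active thread IDs per step for a contiguous ("optimized") reduction.
// Assumption: thread t works when t < stride, stride = N/2, N/4, ..., 1.
std::vector<std::vector<int>> active_threads_optimized(int block_threads) {
    std::vector<std::vector<int>> steps;
    for (int stride = block_threads / 2; stride > 0; stride /= 2) {
        std::vector<int> active;
        for (int t = 0; t < stride; ++t) active.push_back(t);
        steps.push_back(active);
    }
    return steps;
}
```

For an 8-thread block, the naive pattern activates threads {0, 2, 4, 6}, then {0, 4}, then {0}, while the optimized pattern activates {0, 1, 2, 3}, then {0, 1}, then {0}. Both do the same amount of work per step; the optimized pattern simply keeps the active threads next to each other.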

For this lab, we will be using GitHub Classroom.
Please join the classroom by clicking the following link: https://classroom.github.com/a/GSLjLGtI
Once you join the classroom, a private GitHub repository containing the starter code will be created automatically.
Use git clone to copy the starter code to Bender.
After cloning the git repository, you should have 5 files: kernel.cu, main.cu, Makefile, support.cu, and support.h.
The size of the input array is a command-line argument; if none is given, the default is 1 million. Note that the kernel only has to accumulate the partial sums into an array, which is copied back to the host and verified to check the final answer. To ensure consistent grading, do not change the srand seed value.
Complete the Naive and Optimized reduction kernels by adding your code to kernel.cu. No changes to main.cu should be necessary.
Verify that the code works.
Bonus: Both versions compute partial sums and return a small array of sums for the CPU to finish. As a bonus, implement a version that completes the sum on the GPU and returns only the final sum.
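
The host-side half of the steps above (finishing the partial sums on the CPU and checking the result) could look roughly like the sketch below. It is an illustration only, not the starter code in main.cu: the function names, the relative tolerance, and the elements-per-block figure used in the usage note are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Number of partial sums (one per thread block), assuming each block
// reduces `elems_per_block` input elements. Uses ceiling division so a
// final, partially filled block is still counted.
std::size_t num_partial_sums(std::size_t n, std::size_t elems_per_block) {
    return (n + elems_per_block - 1) / elems_per_block;
}

// Finish the reduction on the host by adding up the per-block partial sums.
// Accumulating in double limits rounding error before converting back.
float finish_on_host(const std::vector<float>& partial) {
    double total = 0.0;
    for (float p : partial) total += p;
    return static_cast<float>(total);
}

// Compare GPU and CPU results with a relative tolerance, because
// floating-point addition is not associative and the two sides add
// the elements in different orders.
bool close_enough(float gpu, float cpu, float rel_tol = 1e-3f) {
    return std::fabs(gpu - cpu) <= rel_tol * std::fmax(1.0f, std::fabs(cpu));
}
```

With the default input size of 1,000,000 and a hypothetical 512 elements per block, num_partial_sums returns 1954, so the array copied back to the host is small compared with the input.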

Answer the following questions:

Assume we run the reduction with an input size of 1,000,000. Note that some of these questions are conceptual and can be answered without the programming assignment.

For the naive reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
For the optimized reduction kernel, how many steps execute without divergence? How many steps execute with divergence?
Why do GPGPUs suffer from warp divergence?
Which kernel performed better? Implement both naive and optimized code to compare the timing of each kernel.
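
For the timing comparison in the last question, one possible approach is a small host-side helper like the sketch below, which averages the wall-clock time of a callable over several runs. The helper name is hypothetical. Because kernel launches are asynchronous, the callable you pass in should launch the kernel and then wait for it to finish (for example with cudaDeviceSynchronize()); otherwise only the launch overhead is measured. CUDA events (cudaEventRecord and cudaEventElapsedTime) are an alternative way to time kernels.

```cpp
#include <chrono>

// Average wall-clock time, in milliseconds, of calling `work` `repeats` times.
// `work` is any callable taking no arguments; for a GPU kernel it should
// include the synchronization needed to wait for the kernel to complete.
template <typename F>
double average_ms(F&& work, int repeats) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const auto stop = clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

Running each kernel several times and averaging smooths out run-to-run noise; use the same input size and block size for both kernels so the comparison is fair.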

Submission

Commit and push your completed Optimized Reduction code to the GitHub repository. (You only need to modify kernel.cu.)
Answer the questions above in a report document included in the repository. Name the report report.pdf, report.txt, report.docx, or similar.

Please include your name in the report.