CS236 Win14: Project Description
The goal of this project is to use the MapReduce framework to implement a well-known database algorithm. Hadoop, a popular open-source implementation of the MapReduce framework, is used to parallelize the algorithm's execution across several machines.
What you need to implement in this project
You are required to implement the Skyline computation algorithm using the MapReduce paradigm. Your solution should leverage the data parallelism provided by the MapReduce execution platform. Assume that the data is horizontally partitioned and that each mapper instance processes one partition. For example, if the input data set looks like:
ID | attr1 | attr2 |
---|---|---|
0 | 3 | 2 |
1 | 6 | 4 |
2 | 1 | 1 |
3 | 8 | 8 |
then horizontal partitioning means that, with two mappers, some records (say id=0 and id=2) are processed by one mapper and the remaining records by the other.
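As a minimal illustration of horizontal partitioning, the sketch below splits the four example records between two mappers. The `id % num_mappers` rule and the name `horizontal_partition` are assumptions chosen only to reproduce the split described above; in Hadoop the assignment of records to mappers follows from the input splits, not from a rule like this.

```python
# Toy records from the table above: (ID, attr1, attr2).
records = [(0, 3, 2), (1, 6, 4), (2, 1, 1), (3, 8, 8)]


def horizontal_partition(rows, num_mappers):
    """Assign whole records to mappers; each record keeps all of its attributes."""
    partitions = [[] for _ in range(num_mappers)]
    for row in rows:
        # Assumed rule, for illustration only: round-robin on the ID.
        partitions[row[0] % num_mappers].append(row)
    return partitions


partitions = horizontal_partition(records, 2)
for mapper_id, part in enumerate(partitions):
    print(mapper_id, part)
```

Every record lands in exactly one partition with all of its columns intact, which is what distinguishes horizontal from vertical partitioning.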
Dataset description
The dataset consists of records with 26 attributes each. You can find more information about the dataset file format here. When computing a skyline, each attribute value is either minimized or maximized. For this project, you should minimize/maximize the attributes according to the following table (note that attributes not mentioned in the table are irrelevant to the skyline computation):
Maximize | Minimize |
---|---|
TEMP | STP |
DEWP | WDSP |
SLP | MXSPD |
MAX | GUST |
MIN | |
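The dominance test implied by the table can be sketched as follows. This is a minimal sketch, assuming the table is read as maximizing TEMP, DEWP, SLP, MAX, and MIN and minimizing STP, WDSP, MXSPD, and GUST, and assuming each record has already been parsed into a dict of numbers keyed by attribute name. The names `MAX_ATTRS`, `MIN_ATTRS`, and `dominates` are hypothetical, and missing values are not handled.

```python
# Assumed reading of the table: attributes to maximize and to minimize.
MAX_ATTRS = ("TEMP", "DEWP", "SLP", "MAX", "MIN")
MIN_ATTRS = ("STP", "WDSP", "MXSPD", "GUST")


def dominates(a, b):
    """Return True if record a is at least as good as record b on every
    skyline attribute and strictly better on at least one."""
    strictly_better = False
    for name in MAX_ATTRS:
        if a[name] < b[name]:
            return False  # a is worse on a maximized attribute
        if a[name] > b[name]:
            strictly_better = True
    for name in MIN_ATTRS:
        if a[name] > b[name]:
            return False  # a is worse on a minimized attribute
        if a[name] < b[name]:
            strictly_better = True
    return strictly_better
```

A record belongs to the skyline exactly when no other record dominates it; two records that each win on some attribute are incomparable, and both may appear in the result.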
Do not assume that there is any index built on the data. You should not try to implement the index-based approaches discussed in class. Instead, think of simple, yet parallelizable approaches to implement a skyline. Some ideas appear in the original skyline paper below.
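One simple index-free scheme, shown here only as a sketch and not as the required solution, is to compute a local skyline in each mapper and merge the local skylines in a reducer. This works because a record in the global skyline cannot be dominated within its own partition, so it always survives the local step. The example uses the four toy records from the earlier table and assumes, for illustration only, that both attributes are minimized; `skyline`, `map_phase`, and `reduce_phase` are hypothetical names, and plain Python functions stand in for Hadoop mapper and reducer classes.

```python
def dominates(a, b):
    """Assumption for this toy example: every attribute is minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(records):
    """Nested-loop skyline over (id, attrs) pairs; no index is used."""
    window = []
    for rec in records:
        if any(dominates(kept[1], rec[1]) for kept in window):
            continue  # rec is dominated by a record already kept
        # Drop kept records that the new record dominates, then keep it.
        window = [kept for kept in window if not dominates(rec[1], kept[1])]
        window.append(rec)
    return window


def map_phase(partition):
    """Each mapper emits only the local skyline of its partition."""
    return skyline(partition)


def reduce_phase(local_skylines):
    """The reducer merges the local skylines into the global skyline."""
    candidates = [rec for local in local_skylines for rec in local]
    return skyline(candidates)


partitions = [
    [(0, (3, 2)), (2, (1, 1))],  # records seen by mapper 0
    [(1, (6, 4)), (3, (8, 8))],  # records seen by mapper 1
]
local = [map_phase(p) for p in partitions]
result = reduce_phase(local)
print(result)  # prints [(2, (1, 1))]
```

The merge step here runs in a single reducer, so this sketch by itself does not show how the work would be spread over 2 or 4 reducers; that part of the design is left open.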
Hadoop environment
For development purposes we recommend that you use the Cloudera Hadoop VM.
Each group will be provided with an account on a Hadoop cluster. However, we recommend using the cluster only for your final scalability tests.
What you should submit
You must write a report with all your high-level algorithms, results, findings, errors/problems/bugs, as well as a detailed description of your source code.
Please show the pseudocode for both the mapper and the reducer components, and your Hadoop configuration, including the number of mappers and reducers. For the experiments, please show the total running time and also the mapping and reducing times separately. Run with at least 1, 2, and 4 reducers to show the scalability of your approach, and report each running time as the average of at least 3 runs.
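The averaging described above could be organized as in the following sketch. It assumes the map, reduce, and total times of each run have already been collected by hand (for example from the job output) into a dict keyed by reducer count; `summarize` is a hypothetical helper, and the numbers in `placeholder_runs` are placeholders, not measurements.

```python
from statistics import mean


def summarize(runs):
    """runs maps a reducer count to a list of
    (map_seconds, reduce_seconds, total_seconds) tuples, one per run."""
    summary = {}
    for reducers, timings in sorted(runs.items()):
        if len(timings) < 3:
            raise ValueError(f"need at least 3 runs for {reducers} reducer(s)")
        # zip(*timings) regroups the tuples into one column per phase.
        map_t, reduce_t, total_t = (mean(col) for col in zip(*timings))
        summary[reducers] = {"map": map_t, "reduce": reduce_t, "total": total_t}
    return summary


# PLACEHOLDER values only, to show the expected input shape.
placeholder_runs = {
    1: [(1.0, 2.0, 3.0), (2.0, 3.0, 5.0), (3.0, 4.0, 7.0)],
}
print(summarize(placeholder_runs))
```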
You must also report the answer: the set of objectIDs forming the skyline, where objectID=concat(STN, ‘ _’, YEAR, ‘ _’, MODA).
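Building the objectID might look like the sketch below. It assumes the separator is a single underscore with no surrounding spaces and that STN, YEAR, and MODA are kept as the strings read from the input, so that any leading zeros survive. The function name `object_id` and the sample field values are hypothetical.

```python
def object_id(stn, year, moda, sep="_"):
    """Concatenate STN, YEAR, and MODA; sep is an assumed plain underscore."""
    # str() keeps fields that are already strings unchanged, leading zeros included.
    return sep.join((str(stn), str(year), str(moda)))


# Hypothetical field values, for illustration only.
print(object_id("123456", "2013", "0101"))  # prints 123456_2013_0101
```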
Along with your report (in PDF format), submit your source code, a README file, and run scripts that explain how to run your code and reproduce your experiments.
The project is to be done in groups of two students. In the document, explicitly enumerate the tasks that each member of your group was responsible for.
Deadline for the project
The deadline for this project is Friday, March 21, 11:59pm. Please submit a tar.gz archive named “cs236w14_username1_username2” (username=your NetID) to me with the subject “[cs236]project submission”.