Assignments
Introductory projects
We will have a set of introductory projects in which the fundamentals are learned. Students may work in teams of two.
MapReduce word count (PDF)
Spark Twitter hashtags
Cassandra queries
Dataset List
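As a rough illustration of what the word count project involves, the sketch below imitates the map, shuffle, and reduce steps in plain single-process Python. It is not the assignment's reference solution and does not use Hadoop or Spark; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample lines are assumptions made only for this example.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input; a real job would read files from distributed storage.
sample = ["the quick brown fox", "The lazy dog and the fox"]
counts = word_count(sample)
print(counts)
```

Counting Twitter hashtags follows the same pattern if the map step emits only tokens that start with "#".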
Course Project
Each project will have 3 parts:
Collect data. You may pick any dataset (or combination of datasets) you like; the only constraint is that it must be at least 5 GB. Examples are:
Crawl the Web to get Web pages
Collect social network data, e.g., using the Twitter streaming API, Google+ API, or Instagram API (check that you are able to get more than 5 GB)
Crawl the Web to get images
Get a public dataset, e.g., from http://archive.ics.uci.edu/ml/datasets.html, tweets, http://www.kaggle.com, http://data.gov, or another source. If you use such an existing dataset, then the next two parts of the project should be more sophisticated to compensate for this convenience.
Preprocess or analyze your data in a distributed (parallel) way using Spark, and store the output in the key-value store Cassandra. E.g.,
Build a text index (see http://en.wikipedia.org/wiki/Inverted_index) for Web pages
Locate shapes in images or any other analysis on images. You may use existing source code for image analysis and adapt it to work in Hadoop.
If your data is tabular, compute average income by zip code, or other group-by queries (tabular data are unlikely to be more than 50 GB, so you may want to combine them with other data)
Find the most popular hashtags on Twitter for every day, or build a spatial index that holds a list of tweets for each city.
Build a Web interface (use your favorite web programming framework) to explore the preprocessed or analyzed data. E.g. (corresponding to the preprocessing tasks above):
Allow searching pages by keyword. You could use Lucene as the back end, or build a simpler string matching algorithm from scratch.
Search images by shape
Do a simple OLAP-style exploration of the data; view average incomes on a map
View the most popular hashtags for each city on a map
Display heatmap of
The specifications can be adjusted if you speak with me about the project.
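To make the text index and keyword search examples above more concrete, here is a minimal single-process sketch of an inverted index with AND-style keyword lookup. It is only an illustration under stated assumptions: in an actual project the index would be built in a distributed way and stored in Cassandra, and the function names (tokenize, build_inverted_index, search) and the example.com pages are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it.

    pages is a dict of url -> page text (a stand-in for crawled data).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical pages used only for illustration.
pages = {
    "http://example.com/a": "Spark runs distributed jobs",
    "http://example.com/b": "Cassandra stores key-value data",
    "http://example.com/c": "Spark can write data to Cassandra",
}
index = build_inverted_index(pages)
print(search(index, "spark cassandra"))
```

Storing one row per term, with the list of matching URLs as its value, is one possible way to lay this index out in a key-value store; the Web interface would then answer a query by reading one row per query term and intersecting the lists.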