Assignments

Introductory projects

We will have a set of introductory projects where the fundamentals are learned. Students may work in teams of two.

  1. MapReduce word count pdf

  2. Spark Twitter hashtags

  3. Cassandra queries
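As a warm-up for the first introductory project, the word count pattern can be sketched in plain Python. This is only an illustration of the map, shuffle, and reduce steps on a single machine, not the assignment solution; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample lines are assumptions made up for this sketch.

```python
import re
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word, 1) for word in re.findall(r"[a-z0-9']+", line.lower())]


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

In a real MapReduce or Spark job the shuffle is done by the framework across machines; here it is an ordinary dictionary so the data flow is easy to follow.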

Dataset List

  • And many more. This site has a comprehensive list: https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/

Course Project

Each project will have 3 parts:

  1. Collect data. You may pick any dataset (or combination of datasets) you like; the only constraint is that it must be at least 5 GB. Examples are:

    1. Crawl the Web to get Web pages

    2. Social network data, e.g., use the Twitter streaming API, Google+ API, or Instagram API (check whether you are able to get more than 5 GB)

    3. Crawl the Web to get images

    4. Get a public dataset, e.g., from http://archive.ics.uci.edu/ml/datasets.html, tweets, http://www.kaggle.com, http://data.gov, or another source. If you use such an existing dataset, then the next two parts of the project should be more sophisticated to compensate for this convenience.

  2. Preprocess or analyze your data in a distributed (parallel) way using Spark, and store the output in the key-value store Cassandra. E.g.,

    1. Build a text index (see http://en.wikipedia.org/wiki/Inverted_index) for Web pages

    2. Locate shapes in images or any other analysis on images. You may use existing source code for image analysis and adapt it to work in Hadoop.

    3. If your data is tabular, compute average income by zip code, or other group-by queries (tabular data are unlikely to be more than 50 GB, so you may want to combine them with other data)

    4. Find the most popular hashtags in Twitter for every day, or build a spatial index that, for each city, has a list of tweets.

  3. Build a Web interface (use your favorite web programming framework) to explore the preprocessed or analyzed data. E.g. (corresponding to the above preprocessing tasks),

    1. Allow searching pages by keyword. You could use Lucene as the back-end, or build a simpler string-matching algorithm from scratch.

    2. Search images by shape

    3. Do a simple OLAP-style exploration of the data; view average incomes on a map

    4. View the most popular hashtags for each city on a map

    5. Display heatmap of
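To make the hashtag task from part 2 more concrete, here is a minimal single-machine sketch of finding the most popular hashtags for every day. It assumes tweets arrive as `(day, text)` pairs, which is a made-up input format for illustration; the function name `top_hashtags_per_day` and the sample tweets are likewise hypothetical. In the actual project the same grouping and counting would be expressed as a distributed Spark job, with the result written to Cassandra.

```python
import re
from collections import Counter, defaultdict

# A hashtag is a '#' followed by one or more word characters.
HASHTAG = re.compile(r"#(\w+)")


def top_hashtags_per_day(tweets, k=3):
    """Return {day: [(hashtag, count), ...]} with the k most frequent tags per day.

    tweets: iterable of (day, text) pairs (assumed format for this sketch).
    Hashtags are lowercased so '#Spark' and '#spark' count as the same tag.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        for tag in HASHTAG.findall(text):
            counts[day][tag.lower()] += 1
    return {day: counter.most_common(k) for day, counter in counts.items()}


if __name__ == "__main__":
    sample = [
        ("2016-02-12", "Learning #Spark and #Cassandra"),
        ("2016-02-12", "#spark is fast"),
        ("2016-02-13", "#BigData #bigdata #spark"),
    ]
    for day, tags in sorted(top_hashtags_per_day(sample, k=2).items()):
        print(day, tags)
```

Keying the counts by day mirrors how the results could be laid out in a key-value store: one row per day, holding that day's ranked hashtags.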

The specifications can be adjusted if you speak with me about the project.
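For the text-index task in part 2 and the keyword search in part 3, the following is a small sketch of an inverted index with an AND-style lookup. It is an in-memory stand-in only: the page URLs, the `tokenize`, `build_inverted_index`, and `search` names, and the choice to require all query terms are assumptions for illustration. A real project would build the index with Spark and serve lookups from Cassandra or Lucene.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the set of page URLs that contain it.

    pages: dict of url -> page text (assumed format for this sketch).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the sorted URLs of pages that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


if __name__ == "__main__":
    pages = {
        "a.html": "Spark and Cassandra",
        "b.html": "Spark streaming",
        "c.html": "Cassandra queries",
    }
    index = build_inverted_index(pages)
    print(search(index, "spark"))
    print(search(index, "spark cassandra"))
```

Storing one entry per term, with the list of matching pages as its value, maps naturally onto a key-value store, which is why this structure suits the Cassandra step.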