CS 167 - Introduction to Big-data
Time: Tuesday & Thursday - 2:00 PM to 3:20 PM
Location: Zoom. Check iLearn for Zoom link
Instructor: Ahmed Eldawy
Office Hours: Monday & Thursday 11:00 - 11:50 AM (Zoom link on iLearn)
TA: Payas Rajan
Office Hours: Monday and Wednesday 2:00 - 3:00 PM (Zoom link on iLearn)
TA (MSOL): Xin Zhang
Office Hours: Monday and Tuesday 6:00 - 7:30 PM (Zoom link on iLearn)
Syllabus
Textbook: Learning Spark: Lightning-Fast Data Analytics (2nd Edition) by Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee.
CS 167 covers the data management and systems aspects of big data platforms such as Hadoop, Spark, and AsterixDB. In this course, you will learn how data is stored in a distributed file system and how queries run in parallel. The course covers the following topics:
- An overview of big data management systems
- Distributed big-data storage
- Programming models in big data (e.g., MapReduce and RDD)
- Column-based storage and analytics on big data
- Big spatial data
- Document Databases
- Machine learning on big data
- Big-data Visualization
Grade Breakdown
- (10%) Active class participation (Quizzes and activities)
- (15%) Assignments
- (30%) Labs
- (15%) Mid-term 1
- (15%) Mid-term 2
- (15%) Mid-term 3
Grading Scheme
Grade | Points |
---|---|
A+ | [97,100] |
A | [92,97[ |
A- | [90,92[ |
B+ | [87,90[ |
B | [83,87[ |
B- | [80,83[ |
C+ | [77,80[ |
C | [73,77[ |
C- | [70,73[ |
D+ | [67,70[ |
D | [63,67[ |
D- | [50,63[ |
F | [0,50[ |
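To make the half-open interval notation concrete, here is a minimal Python sketch of how a final score could be turned into a letter grade. It rests on assumptions the syllabus does not state: every component is scored on a 0-100 scale, the final score is the weighted sum given by the Grade Breakdown, and no rounding is applied before the lookup. The names `WEIGHTS`, `final_score`, and `letter_grade` are hypothetical and chosen only for illustration.

```python
import bisect

# Weights from the Grade Breakdown, in percent (they sum to 100).
WEIGHTS = {
    "participation": 10,
    "assignments": 15,
    "labs": 30,
    "midterm1": 15,
    "midterm2": 15,
    "midterm3": 15,
}

# Inclusive lower bound of each interval in the Grading Scheme, ascending,
# paired position by position with the corresponding letter.
LOWER_BOUNDS = [0, 50, 63, 67, 70, 73, 77, 80, 83, 87, 90, 92, 97]
LETTERS = ["F", "D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def final_score(scores):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def letter_grade(points):
    """Map points in [0, 100] to a letter; each lower bound is inclusive."""
    if not 0 <= points <= 100:
        raise ValueError("points must be in [0, 100]")
    # bisect_right counts how many lower bounds are <= points.
    return LETTERS[bisect.bisect_right(LOWER_BOUNDS, points) - 1]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    example = {
        "participation": 100,
        "assignments": 90,
        "labs": 95,
        "midterm1": 80,
        "midterm2": 85,
        "midterm3": 90,
    }
    total = final_score(example)
    print(total, letter_grade(total))  # prints: 90.25 A-
```

Under this reading, a score of exactly 92 falls in [92,97[ and maps to A, while 91.99 maps to A-.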
Schedule
Date | Topic | Reading | Material |
---|---|---|---|
Tuesday 3/30 | Introduction to Big Data | Slides | |
Thursday 4/1 | A tour of big-data systems | Slides | |
Tuesday 4/6 | Hadoop Distributed File System (HDFS) | HDFS Architecture | Slides |
Thursday 4/8 | Hadoop Distributed File System (HDFS) | Class Activity | |
Tuesday 4/13 | Big-data Processing | Slides | |
Thursday 4/15 | MapReduce Computation | ||
Tuesday 4/20 | Resilient Distributed Datasets (RDD) | ||
Thursday 4/22 | Mid-term 1 | ||
Tuesday 4/27 | Resilient Distributed Datasets (RDD) | ||
Thursday 4/29 | Spark SQL | Slides | |
Tuesday 5/4 | Machine Learning Meets Big Data | Intro to ML, Basic ML algorithms | Slides |
Thursday 5/6 | Machine Learning Meets Big Data | ||
Tuesday 5/11 | Big Spatial Data | Slides | |
Thursday 5/13 | Mid-term 2 | ||
Tuesday 5/18 | Big Spatial Data | ||
Thursday 5/20 | Semi-structured data storage/Parquet | JSON introduction, Dremel Made Simple with Parquet | Slides |
Tuesday 5/25 | NoSQL and Document Databases/MongoDB | ||
Thursday 5/27 | NoSQL and Document Databases/MongoDB | Slides | |
Tuesday 6/1 | LSM Tree, Course Review & Next Steps | Slides | |
Thursday 6/3 | Mid-term 3 | ||
Labs
# | Topic | Due Date | Instructions |
---|---|---|---|
#1 | Development Setup | 4/5/2021 | Instructions |
#2 | HDFS | 4/12/2021 | Instructions |
#3 | MapReduce | 4/19/2021 | Instructions |
#4 | Spark Java | 4/26/2021 | Instructions |
#5 | Spark Scala | 5/3/2021 | Instructions |
#6 | Spark SQL | 5/10/2021 | Instructions |
Assignments
# | Topic | Due Date |
---|---|---|
#1 | HDFS | |
#2 | MapReduce | 5/4/2021 at 2:00 PM Pacific Time |
#3 | Spark RDD/SQL | Thursday, 5/13/2021 at 2:00 PM Pacific Time (Before class) |
#4 | | |
#5 | | |