Experience

Amazon Applied Scientist Internship

Topic 1: Analyze Amazon Day customer behavior and explore machine learning opportunities (2020 Summer)

  • Amazon Day Delivery Program;
  • Data collection and customer behavior analysis: SQL and Spark;
  • Customer acquisition model: Random Forest and XGBoost;
  • Evaluation methods: ROC curve, AUC, precision-recall, feature importance;

Topic 2: Analyze Amazon Day customer behavior and explore machine learning opportunities (2019 Summer)

  • Data collection and data analysis with SQL and PySpark;
  • Customer acquisition model implemented with Random Forest and XGBoost;

Teaching Assistant at UC Riverside

  • CS 010, CS 012, CS 014 (Topic: Data Structures and Algorithms)
  • CS 005, CS 006 (Topic: Introduction to Computer Science)

Course Projects

Web Developer

Topic: Design of search engine for Wikipedia (demo: WiKiSearch)

  • Crawl web pages: multiple threads to increase speed; noise removal to optimize database storage;
  • Index web pages: Hadoop MapReduce for higher indexing throughput;
  • Scoring algorithms: BM25, proximity, and PageRank;
  • User interface design: MVC framework on Node.js;
  • Database: SQLite to store web pages, MongoDB to store the index;

Machine Learning and Data Mining

Topic: Label prompt for Yelp images

  • Collect images and labels from the Yelp website.
  • Train classifiers with k-NN, SVM, and artificial neural networks.
  • Tune classifier hyperparameters using cross-validation.

Research

Non-linear Computing Lab, UC Riverside

Topic: Machine learning-assisted resource management in computing systems

The widespread adoption of the Internet of Things and latency-critical applications has fueled the development of edge colocation data centers (a.k.a. edge colocations), small-scale data centers in distributed locations. Because these facilities have limited resources and must deliver low latency, we explore machine learning approaches to resource management in edge computing systems. We study resource management from the perspectives of both the data center operator and its users (including attackers).

First, we propose battery-assisted power management for edge data centers that accounts for computing performance and thermal behavior under significant workload fluctuations. In particular, workload fluctuations allow the battery to be recharged frequently and made available for temporary capacity boosts. However, using batteries can overload the data center cooling system, which is sized to match the capacity of the power system. We design a novel power management solution, DeepPM, that exploits the UPS battery and the cold air inside the edge data center as energy storage to boost performance. DeepPM uses deep reinforcement learning (DRL) to learn the data center's thermal behavior online in a model-free manner and applies it on the fly to determine the power allocation that optimizes latency performance without overheating the data center.

Next, we study the vulnerabilities and thermal attack opportunities that arise from the mismatch between power load and cooling load in edge colocation data centers. We discover that sharing cooling systems exposes edge colocations to cooling load injection attacks (called thermal attacks), which, if left unchecked, may create thermal emergencies and even trigger system outages. Importantly, an attacker can launch thermal attacks by leveraging the emerging architecture of built-in batteries integrated with servers, which can conceal the attacker's actual server power (or cooling load). We consider both one-shot attacks (which aim to create system outages) and repeated attacks (which aim to cause frequent thermal emergencies). For repeated attacks, we present a foresighted attack strategy that uses reinforcement learning to learn, on the fly, good timing for attacks based on the battery state and the benign tenants' load.

Topic: Calibration and accuracy monitoring for deep neural networks on operational datasets

  • Increasing trustworthiness of deep neural networks via accuracy monitoring.
  • Calibrating deep neural network classifiers on out-of-distribution datasets.