Yujia Zhai

Software Engineer at NVIDIA
yujiazhai94@gmail.com

GitHub | LinkedIn | Portfolio


Short Bio

I am a Software Engineer at NVIDIA Corporation. Previously, I obtained my PhD in Computer Science from the University of California, Riverside. Before my PhD study in Computer Science, I earned MS and BS degrees from Duke University and the University of Science and Technology of China (USTC), both in Chemistry.

My technical skill set spans CUDA, DPC++, SYCL, x86 assembly, SIMD, OpenMP, and MPI, and I specialize in performance optimization for math libraries on CPUs, GPUs, and heterogeneous platforms. I interned with the XPU Architecture Research team at Intel Corporation, working on GPU-accelerated homomorphic encryption. In Spring 2022, I worked on GPU-accelerated ML systems with the AML-MLsys team at ByteDance/TikTok USA. In Summer 2022, I was an engineering intern with the Fast Kernels team at NVIDIA Corporation, working on the CUTLASS project.


Education

Ph.D. Computer Science, University of California, Riverside, 2018 - 2023

M.S. Theoretical and Computational Chemistry, Duke University, 2016 - 2018

B.S. Chemistry, University of Science and Technology of China, 2012 - 2016


Experience

  • Summer 2022: Deep Learning Library Performance Software Engineer Intern, Fast Kernels Team, NVIDIA Corp.
  • Spring 2022: PhD Research Intern, Data-AML @ ByteDance/TikTok Inc.
  • Summer 2021: GPU Compute and Deep Learning Software Engineer Intern @ Intel Corp.
  • 2018-2023: Research Assistant, SuperLab @ UC Riverside

Selected Projects

    Optimizing GEMM/GEMV on x86 CPUs and NVIDIA GPUs

    My PhD research focuses on low-level performance optimization for math libraries and is rooted in highly efficient, hand-tuned GEMM/GEMV. Here I break down some common strategies that closed-source commercial libraries (Intel oneMKL, NVIDIA cuBLAS) adopt for GEMM/GEMV optimization; a minimal blocking sketch follows the list below.

  • On NVIDIA GPUs (tested on NVIDIA RTX 2080 Super, TU102)
  • On Intel CPUs (tested on Intel Xeon Gold W2255, Cascade Lake)
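
To give a flavor of what these strategies build on, here is a minimal, illustrative C++ sketch of loop blocking for GEMM. The block size and layout are placeholders chosen for the example, not tuned parameters from oneMKL, cuBLAS, or my own kernels; production code layers operand packing, register blocking, SIMD/Tensor Core instructions, and prefetching on top of this idea.

```cpp
#include <cstddef>

// Minimal cache-blocked SGEMM sketch: C += A * B, row-major, with n assumed
// divisible by BLK for brevity.
constexpr std::size_t BLK = 64;  // illustrative block size, not a tuned value

void sgemm_blocked(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ib = 0; ib < n; ib += BLK)
        for (std::size_t kb = 0; kb < n; kb += BLK)
            for (std::size_t jb = 0; jb < n; jb += BLK)
                // Work on one tile at a time so operands stay hot in cache.
                for (std::size_t i = ib; i < ib + BLK; ++i)
                    for (std::size_t k = kb; k < kb + BLK; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jb; j < jb + BLK; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```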

    Accelerating Homomorphic Encryption on Intel GPUs [Paper]

  • The first-ever SYCL-based GPU backend for Microsoft SEAL APIs.
  • The first HE library based on the CKKS scheme optimized for Intel GPUs.
  • Optimized at the instruction, algorithmic, and application levels to accelerate our HE library.
  • Our NTT implementation reaches up to 85.7% of the theoretical peak performance on the latest Intel GPUs (a simplified sketch of the core NTT loop follows below).
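
For context, the hot loop of an NTT is a series of modular butterfly passes. The sketch below is a generic, CPU-side radix-2 NTT over the toy prime 998244353, written only to illustrate the operation being optimized; it is not the SYCL/DPC++ implementation, the CKKS parameter set, or the modular-arithmetic tricks used in the library.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Modular exponentiation: base^exp mod m (uses the GCC/Clang __uint128_t extension).
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t r = 1 % m;
    base %= m;
    for (; exp; exp >>= 1) {
        if (exp & 1) r = static_cast<uint64_t>(__uint128_t(r) * base % m);
        base = static_cast<uint64_t>(__uint128_t(base) * base % m);
    }
    return r;
}

// In-place iterative radix-2 NTT over Z_p with p = 998244353 (3 is a primitive root).
// a.size() must be a power of two (and divide 2^23 for this prime).
void ntt(std::vector<uint64_t>& a) {
    const uint64_t p = 998244353, g = 3;
    const std::size_t n = a.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly passes: the loop that GPU kernels parallelize and fuse.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const uint64_t wlen = pow_mod(g, (p - 1) / len, p);
        for (std::size_t i = 0; i < n; i += len) {
            uint64_t w = 1;
            for (std::size_t j = 0; j < len / 2; ++j) {
                const uint64_t u = a[i + j];
                const uint64_t v = static_cast<uint64_t>(__uint128_t(a[i + j + len / 2]) * w % p);
                a[i + j] = (u + v) % p;
                a[i + j + len / 2] = (u + p - v) % p;
                w = static_cast<uint64_t>(__uint128_t(w) * wlen % p);
            }
        }
    }
}
```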

    FT-BLAS: A High-Performance BLAS Implementation With Online Fault Tolerance [Paper]

  • A brand-new BLAS implementation (Level-1/2/3) featuring Intel AVX-512 instructions, comparable to or faster than state-of-the-art BLAS libraries on the latest Intel CPUs.
  • Encoded fault-tolerance codes into assembly kernels, adding negligible overhead (0.5%-3%) over the baseline (see the checksum sketch after this list).
  • The fault-tolerant library remains comparable to or faster than commercial libraries (MKL/OpenBLAS/BLIS) both with and without runtime computing errors injected.
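
As a rough illustration of the checksum idea behind this kind of online fault tolerance, the sketch below guards a plain GEMV with an algorithm-based (ABFT-style) consistency test: the column checksum of A, dotted with x, must match the sum of the output y. It is a simplified stand-in of my own devising, not the AVX-512 assembly encoding or the error-location and correction logic of FT-BLAS.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Checksum-guarded GEMV sketch: compute y = A * x (A is m x n, row-major) and
// verify it with an ABFT-style test. Returns true if the checksum matches,
// false if a silent computational error is suspected (a real library would
// then locate and correct the fault or recompute).
bool gemv_with_checksum(const std::vector<double>& A,
                        const std::vector<double>& x,
                        std::vector<double>& y,
                        std::size_t m, std::size_t n) {
    // Encode: column checksum c[j] = sum over rows of A[i][j].
    std::vector<double> c(n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[j] += A[i * n + j];

    // Compute y = A * x.
    y.assign(m, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];

    // Verify: sum(y) should equal c . x up to floating-point roundoff.
    double lhs = 0.0, rhs = 0.0;
    for (std::size_t i = 0; i < m; ++i) lhs += y[i];
    for (std::size_t j = 0; j < n; ++j) rhs += c[j] * x[j];
    return std::fabs(lhs - rhs) <= 1e-9 * (std::fabs(lhs) + std::fabs(rhs) + 1.0);
}
```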

Selected Awards

  • Outstanding Teaching Assistant Award, University of California, Riverside, 2019 & 2020.
  • Dean's Distinguished Fellowship, University of California, Riverside, 2018.
  • Outstanding Teaching Assistant Award (3/825), USTC, 2016.
  • Outstanding Undergraduate Award, USTC, 2015.
  • Lan-Ying LIN Fellowship, USTC, 2014.
  • Xue-Zhou WU Fellowship, USTC, 2013.
  • Outstanding Freshmen Award, USTC, 2012.
  • Silver Medal, Chinese Chemistry Olympiad (CChO), 2011.
  • First Prize (4th), Chinese Chemistry Olympiad in Provinces (Anhui), 2011.

Personal

  • I have been a fan of the Boston Celtics for more than a decade.
  • I was born and raised in Maanshan, a small and beautiful city in eastern Anhui Province, P. R. China.
  • I like Paul Pierce, Jay Chou, and Faye Wong.