Yujia Zhai

Software Engineer at NVIDIA
yujiazhai94@gmail.com

GitHub | LinkedIn | Portfolio


Short Bio

I am a Software Engineer at NVIDIA Corporation. Previously, I obtained my PhD in Computer Science from the University of California, Riverside. Before my PhD study in Computer Science, I earned MS and BS degrees from Duke University and the University of Science and Technology of China (USTC), both in Chemistry.

My technical skill set spans CUDA, DPC++, SYCL, x86 assembly, SIMD, OpenMP, and MPI, and I specialize in performance optimization for math libraries on CPUs, GPUs, and heterogeneous platforms. I interned with the XPU Architecture Research team at Intel Corporation, working on GPU-accelerated homomorphic encryption. In Spring 2022, I worked on GPU-accelerated ML systems with the AML-MLsys team at ByteDance/TikTok USA. In Summer 2022, I was an engineering intern with the Fast Kernels team at NVIDIA Corporation, working on the CUTLASS project.


Education

Ph.D. Computer Science, University of California, Riverside, 2018 - 2023

M.S. Theoretical and Computational Chemistry, Duke University, 2016 - 2018

B.S. Chemistry, University of Science and Technology of China, 2012 - 2016


Experience

  • Summer 2022: Deep Learning Library Performance Software Engineer Intern, Fast Kernels Team, NVIDIA Corp.
  • Spring 2022: PhD Research Intern, Data-AML @ ByteDance/TikTok Inc.
  • Summer 2021: GPU Compute and Deep Learning Software Engineer Intern @ Intel Corp.
  • 2018-2023: Research Assistant, SuperLab @ UC Riverside

Selected Projects

    Optimizing GEMM/GEMV on x86 CPUs and NVIDIA GPUs

    My PhD research focuses on low-level performance optimization for math libraries and is rooted in highly efficient, hand-tuned GEMM/GEMV. Here I break down some common strategies that closed-source commercial libraries (Intel oneMKL, NVIDIA cuBLAS) adopt for GEMM/GEMV optimization; a minimal blocking sketch follows the list below.

  • On NVIDIA GPUs (tested on NVIDIA RTX 2080 Super, TU102)
  • On Intel CPUs (tested on Intel Xeon Gold W2255, Cascade Lake)
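
To give a flavor of what these strategies build on, here is a minimal, illustrative C++ sketch of loop blocking for GEMM. The block size and layout are placeholders chosen for the example, not tuned parameters from oneMKL, cuBLAS, or my own kernels; production code layers operand packing, register blocking, SIMD/Tensor Core instructions, and prefetching on top of this idea.

```cpp
#include <cstddef>

// Minimal cache-blocked SGEMM sketch: C += A * B, row-major, with n assumed
// divisible by BLK for brevity.
constexpr std::size_t BLK = 64;  // illustrative block size, not a tuned value

void sgemm_blocked(const float* A, const float* B, float* C, std::size_t n) {
    for (std::size_t ib = 0; ib < n; ib += BLK)
        for (std::size_t kb = 0; kb < n; kb += BLK)
            for (std::size_t jb = 0; jb < n; jb += BLK)
                // Work on one tile at a time so operands stay hot in cache.
                for (std::size_t i = ib; i < ib + BLK; ++i)
                    for (std::size_t k = kb; k < kb + BLK; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jb; j < jb + BLK; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```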

    Accelerating Homomorphic Encryption on Intel GPUs [Paper]

  • The first-ever SYCL-based GPU backend for Microsoft SEAL APIs.
  • The first HE library based on the CKKS scheme optimized for Intel GPUs.
  • Optimized at the instruction, algorithmic, and application levels to accelerate our HE library.
  • Our NTT implementation reaches up to 85.7% of the theoretical peak performance on the latest Intel GPUs (a simplified sketch of the core NTT loop follows below).
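
For context, the hot loop of an NTT is a series of modular butterfly passes. The sketch below is a generic, CPU-side radix-2 NTT over the toy prime 998244353, written only to illustrate the operation being optimized; it is not the SYCL/DPC++ implementation, the CKKS parameter set, or the modular-arithmetic tricks used in the library.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Modular exponentiation: base^exp mod m (uses the GCC/Clang __uint128_t extension).
uint64_t pow_mod(uint64_t base, uint64_t exp, uint64_t m) {
    uint64_t r = 1 % m;
    base %= m;
    for (; exp; exp >>= 1) {
        if (exp & 1) r = static_cast<uint64_t>(__uint128_t(r) * base % m);
        base = static_cast<uint64_t>(__uint128_t(base) * base % m);
    }
    return r;
}

// In-place iterative radix-2 NTT over Z_p with p = 998244353 (3 is a primitive root).
// a.size() must be a power of two (and divide 2^23 for this prime).
void ntt(std::vector<uint64_t>& a) {
    const uint64_t p = 998244353, g = 3;
    const std::size_t n = a.size();
    // Bit-reversal permutation.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    // Butterfly passes: the loop that GPU kernels parallelize and fuse.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const uint64_t wlen = pow_mod(g, (p - 1) / len, p);
        for (std::size_t i = 0; i < n; i += len) {
            uint64_t w = 1;
            for (std::size_t j = 0; j < len / 2; ++j) {
                const uint64_t u = a[i + j];
                const uint64_t v = static_cast<uint64_t>(__uint128_t(a[i + j + len / 2]) * w % p);
                a[i + j] = (u + v) % p;
                a[i + j + len / 2] = (u + p - v) % p;
                w = static_cast<uint64_t>(__uint128_t(w) * wlen % p);
            }
        }
    }
}
```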

    FT-BLAS: A High-Performance BLAS Implementation With Online Fault Tolerance [Paper]

  • A brand-new BLAS implementation (Level-1/2/3) featuring Intel AVX-512 instructions, comparable to or faster than state-of-the-art BLAS libraries on the latest Intel CPUs.
  • Encoded fault-tolerance codes into assembly kernels, adding negligible overhead (0.5%-3%) over the baseline (see the checksum sketch after this list).
  • The fault-tolerant library remains comparable to or faster than commercial libraries (MKL/OpenBLAS/BLIS) both with and without runtime computing errors injected.
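
As a rough illustration of the checksum idea behind this kind of online fault tolerance, the sketch below guards a plain GEMV with an algorithm-based (ABFT-style) consistency test: the column checksum of A, dotted with x, must match the sum of the output y. It is a simplified stand-in of my own devising, not the AVX-512 assembly encoding or the error-location and correction logic of FT-BLAS.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Checksum-guarded GEMV sketch: compute y = A * x (A is m x n, row-major) and
// verify it with an ABFT-style test. Returns true if the checksum matches,
// false if a silent computational error is suspected (a real library would
// then locate and correct the fault or recompute).
bool gemv_with_checksum(const std::vector<double>& A,
                        const std::vector<double>& x,
                        std::vector<double>& y,
                        std::size_t m, std::size_t n) {
    // Encode: column checksum c[j] = sum over rows of A[i][j].
    std::vector<double> c(n, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            c[j] += A[i * n + j];

    // Compute y = A * x.
    y.assign(m, 0.0);
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            y[i] += A[i * n + j] * x[j];

    // Verify: sum(y) should equal c . x up to floating-point roundoff.
    double lhs = 0.0, rhs = 0.0;
    for (std::size_t i = 0; i < m; ++i) lhs += y[i];
    for (std::size_t j = 0; j < n; ++j) rhs += c[j] * x[j];
    return std::fabs(lhs - rhs) <= 1e-9 * (std::fabs(lhs) + std::fabs(rhs) + 1.0);
}
```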

Selected Awards

  • Outstanding Teaching Assistant Award, University of California, Riverside, 2019 & 2020.
  • Dean's Distinguished Fellowship, University of California, Riverside, 2018.
  • Outstanding Teaching Assistant Award (3/825), USTC, 2016.
  • Outstanding Undergraduate Award, USTC, 2015.
  • Lan-Ying LIN Fellowship, USTC, 2014.
  • Xue-Zhou WU Fellowship, USTC, 2013.
  • Outstanding Freshmen Award, USTC, 2012.
  • Silver Medal, Chinese Chemistry Olympiad (CChO), 2011.
  • First Prize (4th), Chinese Chemistry Olympiad in Provinces (Anhui), 2011.

Personal

  • I have been a fan of the Boston Celtics for more than a decade.
  • I was born and raised in Maanshan, a small and beautiful city in eastern Anhui Province, P. R. China.
  • I like Paul Pierce, Jay Chou, and Faye Wong.