Projects | Jidong Zhai

mTuner

With the growing importance of personalized large language models (LLMs) and fine-tuning techniques, parameter-efficient fine-tuning (PEFT) has emerged as a mainstream approach, offering reduced computational and storage demands compared to full-parameter fine-tuning.

HypeReca

Making high-quality recommendations is important in online applications. To improve user satisfaction and the effectiveness of advertising, deep learning-based recommender models (DLRM) are widely studied and deployed. Training these models on massive data demands increasing computation power, commonly provided by a cluster of numerous GPUs.

QFactory

Quantization is a critical technique for accelerating large language models. To achieve tangible speedups, weight dequantization must be performed on-the-fly, necessitating tailored quantized kernels for various quantization algorithms and precision formats.
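
As a rough illustration of what on-the-fly weight dequantization means, the sketch below assumes 4-bit weights packed two per byte with a single shared scale and zero point. It is not QFactory's kernel design, and the helper names pack_int4 and dequant_dot are hypothetical. Each weight is unpacked and rescaled inside the dot-product loop, so no full-precision copy of the weights is ever stored.

```python
def pack_int4(values):
    """Pack integers in [0, 15] two per byte, low nibble first (assumed layout)."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    return bytes(
        (values[i] & 0xF) | ((values[i + 1] & 0xF) << 4)
        for i in range(0, len(values), 2)
    )


def dequant_dot(packed, scale, zero_point, x):
    """Dot product of packed 4-bit weights with activations x.

    Each weight is dequantized as (q - zero_point) * scale right before use.
    """
    acc = 0.0
    for i, xi in enumerate(x):
        byte = packed[i // 2]
        q = (byte >> 4) if i % 2 else (byte & 0xF)  # pick high or low nibble
        acc += (q - zero_point) * scale * xi
    return acc


packed_weights = pack_int4([3, 15, 0, 8])
result = dequant_dot(packed_weights, scale=0.5, zero_point=8, x=[1.0, 2.0, 3.0, 4.0])
```

A real kernel would fuse this unpacking into a tiled GPU matrix multiplication; the loop here only shows where the dequantization step sits.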

MixQ

Quantizing outliers is the main challenge when quantizing activation tensors during inference. Previous work has shown that the outliers are located in fixed channels. However, few of these works identify the regularity of outliers when decoding tokens.
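
The observation that outliers sit in fixed channels can be pictured with a small sketch. It assumes activations are given as a list of rows (one per token) and that a channel counts as an outlier channel when any of its absolute values exceeds a chosen threshold. The function names and the threshold are illustrative assumptions, not MixQ's actual method.

```python
def find_outlier_channels(activations, threshold):
    """Return indices of channels (columns) whose max |value| exceeds threshold."""
    num_channels = len(activations[0])
    return [
        c for c in range(num_channels)
        if max(abs(row[c]) for row in activations) > threshold
    ]


def split_by_channel(activations, outliers):
    """Split activations into normal channels and outlier channels.

    In a mixed-precision scheme, the normal part could be quantized to low
    precision while the outlier part is kept in higher precision.
    """
    outlier_set = set(outliers)
    normal = [[v for c, v in enumerate(row) if c not in outlier_set]
              for row in activations]
    high = [[row[c] for c in outliers] for row in activations]
    return normal, high


acts = [[0.1, 50.0, -0.3],
        [0.2, -60.0, 0.4]]
outliers = find_outlier_channels(acts, threshold=6.0)
normal_part, outlier_part = split_by_channel(acts, outliers)
```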

MagPy

Real-world deep learning programs are often developed with dynamic programming languages like Python, which usually have complex features, such as built-in functions and dynamic typing. These programs typically execute in eager mode, where tensor operators run without compilation, resulting in poor performance.

WiseGraph

Graph Neural Network (GNN) has emerged as an important workload for learning on graphs. With the size of graph data and the complexity of GNN model architectures increasing, developing an efficient GNN system grows more important.

GraphSet

GraphSet is a pattern-aware graph mining system supporting both CPU and GPU. GraphSet achieves high performance by proposing a set-based equivalent transformation approach to optimize pattern-aware graph mining applications, which can leverage classic set properties to eliminate most control flows and reduce computation overhead exponentially.
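
To give a sense of what a set-based formulation of pattern matching looks like, the sketch below counts triangles by intersecting neighbor sets instead of nesting loops with per-vertex checks. It is a textbook illustration only, not GraphSet's transformation, and count_triangles is a hypothetical name.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph using neighbor-set intersections."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:
                # Common neighbors of u and v close a triangle; requiring
                # w > v counts each triangle exactly once (u < v < w).
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total


# Complete graph on four vertices, which contains four triangles.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
triangles = count_triangles(k4_edges)
```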

EinNet

Boosting the execution performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world applications. However, existing approaches to optimizing the tensor computation of DNNs only consider transformations representable by a fixed set of predefined tensor operators, resulting in a highly restricted optimization space.

Vapro

Performance variance is a serious problem for parallel applications, which can cause performance degradation and make applications’ behavior hard to understand. Therefore, detecting and diagnosing performance variance are of crucial importance for users and application developers.

SmartMoE

Deep neural networks are growing larger to gain stronger model ability, and training them consumes enormous computation resources. Sparsely activated models have been increasingly proposed and deployed to reduce training costs while enlarging model size.

APE

Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive computation capability, they often only support limited native data types. To mitigate this problem, previous works have explored software emulation for certain data types, which provides some compensation for hardware limitations.

FreeTensor

A language and compiler for irregular tensor programs.

BaGuaLu

Large-scale pretrained AI models have shown state-of-the-art accuracy in a series of important applications. As the size of pretrained AI models grows dramatically each year in an effort to achieve higher accuracy, training such models requires massive computing and memory capabilities, which accelerates the convergence of AI and HPC.

ZZ Crosstalk Suppression

Noise is a significant obstacle to quantum computing, and ZZ crosstalk is one of the most destructive types of noise affecting superconducting qubits. Previous approaches to suppressing ZZ crosstalk have mainly relied on specific chip designs that can complicate chip fabrication and aggravate decoherence.

PerFlow

PerFlow is a domain-specific framework for performance analysis of parallel applications, which significantly reduces the burden of implementing performance analysis tasks.

FasterMoE

While FastMoE enables distributed MoE model training using PyTorch, it suffers from inefficiency because of load imbalance and poor communication performance. Other state-of-the-art systems for MoE, such as GShard from Google and BASE Layers from Facebook, share the same issues.

Crosstalk Mitigation

Crosstalk is one of the major types of noise in quantum computers. To design high-fidelity quantum gates and large-scale quantum computers, effectively suppressing crosstalk is becoming increasingly important. Previous approaches to mitigating crosstalk rely on either hardware strategies, which are applicable only on limited platforms, or software techniques, which cannot fully exploit instruction parallelism.

Elan

Showing a promising future in improving resource utilization and accelerating training, elastic deep learning training has been attracting more and more attention recently. Nevertheless, existing approaches to provide elasticity have certain limitations.

CYPRESS

Communication traces are increasingly important, both for parallel applications’ performance analysis/optimization, and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing a prohibitive volume of communication traces.

ScalAna

ScalAna is an automatic tool for scaling loss detection. It uses hybrid static-dynamic analysis to reduce costs and leverages graph algorithms to identify the root causes of scaling issues.

GNN Optimizations

Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but how to do so is still only preliminarily understood.

AIPerf

AIPerf is an end-to-end benchmark suite utilizing automated machine learning (AutoML). It represents real AI scenarios, and adapts automatically to machines of various scales. The AutoML pipeline (including NAS, HPO, etc.

FastMoE

Mixture-of-Experts (MoE) presents a strong potential in enlarging the size of language models to trillions of parameters. However, training trillion-scale MoE requires algorithm and system co-design for a well-tuned high-performance distributed training system.

GraphPi

Graph pattern matching, which aims to discover structural patterns in graphs, is considered one of the most fundamental graph mining problems in many real applications. Despite previous efforts, existing systems face two main challenges.

HyQuas

Quantum computing has shown its strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both research and development of quantum computing.

PET

High-performance tensor programs are critical for efficiently deploying deep neural network (DNN) models in real-world tasks. Existing frameworks optimize tensor programs by applying fully equivalent transformations, which maintain equivalence on every element of output tensors.

Spindle

Memory monitoring is of critical use in understanding applications and evaluating systems. Due to the dynamic nature of programs’ memory accesses, common practice today leaves large amounts of address examination and data recording to runtime, at the cost of substantial performance overhead (and large storage time/space consumption if memory traces are collected).

UniQ

Quantum circuit simulation is critical for verifying quantum computers. Given exponential complexity in the simulation, existing simulators use different architectures to accelerate the simulation. However, due to the variety of both simulation methods and modern architectures, it is challenging to design a high-performance yet portable simulator.