MixQ
Quantizing outliers is the main challenge in quantizing activation tensors for inference. Previous work has shown that outliers are located in fixed channels, but few studies identify the regularity of outliers during token decoding.
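As a rough illustration of the idea (a minimal NumPy sketch, not MixQ's actual algorithm; the threshold and layout are assumptions), one can keep the few outlier channels in high precision and quantize the remaining channels to INT8 per channel:

import numpy as np

def quantize_mixed(acts, outlier_thresh=6.0):
    """Split channels into outlier (kept in FP16) and regular (INT8) sets.

    acts: (tokens, channels) activation matrix.
    """
    # A channel is an "outlier channel" if any activation in it is large.
    is_outlier = np.abs(acts).max(axis=0) > outlier_thresh
    regular = acts[:, ~is_outlier]
    # Symmetric per-channel INT8 quantization for the regular channels.
    scales = np.abs(regular).max(axis=0) / 127.0
    scales[scales == 0] = 1.0
    q = np.clip(np.round(regular / scales), -127, 127).astype(np.int8)
    return q, scales, acts[:, is_outlier].astype(np.float16), is_outlier

acts = np.random.randn(8, 16).astype(np.float32)
acts[:, 3] *= 20.0                      # inject a fixed outlier channel
q, s, outliers, mask = quantize_mixed(acts)
print(mask.nonzero()[0])                # -> [3]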
MagPy
Real-world deep learning programs are often developed with dynamic programming languages like Python, which usually have complex features, such as built-in functions and dynamic typing. These programs typically execute in eager mode, where tensor operators run without compilation, resulting in poor performance.
WiseGraph
Graph Neural Network (GNN) has emerged as an important workload for learning on graphs. As the size of graph data and the complexity of GNN model architectures increase, developing an efficient GNN system becomes increasingly important.
GraphSet
GraphSet is a pattern-aware graph mining system supporting both CPU and GPU. GraphSet achieves high performance by proposing a set-based equivalent transformation approach to optimize pattern-aware graph mining applications, which can leverage classic set properties to eliminate most control flows and reduce computation overhead exponentially.
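To see why set-based formulations help (a toy sketch, not GraphSet's actual transformation rules), consider triangle counting: expressing the innermost loop as a set intersection |N(u) ∩ N(v)| replaces per-neighbor membership tests and branches with a single set operation:

# Undirected graph as neighbor sets; triangles: {0,1,2} and {1,2,3}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}

def triangles(adj):
    count = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each edge once
                # One set intersection replaces an inner loop with branches.
                count += sum(1 for w in adj[u] & adj[v] if w > v)
    return count

print(triangles(adj))  # -> 2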
EinNet
Boosting the execution performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world applications. However, existing approaches to optimizing the tensor computation of DNNs only consider transformations representable by a fixed set of predefined tensor operators, resulting in a highly restricted optimization space.
Vapro
Performance variance is a serious problem for parallel applications, which can cause performance degradation and make applications’ behavior hard to understand. Therefore, detecting and diagnosing performance variance are of crucial importance for users and application developers.
SmartMoE
Deep neural networks are growing ever larger in pursuit of stronger model capability, consuming enormous computational resources to train. Sparsely activated models are increasingly proposed and deployed to enlarge model size while reducing training costs.
APE
Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive computation capability, they often only support limited native data types. To mitigate this problem, previous works have explored software emulation for certain data types, which provides some compensation for hardware limitations.
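A classic flavor of such emulation (a minimal sketch under assumed parameters, not APE's actual scheme) is to split each operand into a high and a low narrow-precision part and accumulate the partial products in a wider accumulator, recovering much of the accuracy lost to the narrow native type:

import numpy as np

def split32(x):
    hi = np.float32(x)                    # narrow "native" part
    lo = np.float32(x - np.float64(hi))   # residual held in a second value
    return hi, lo

a, b = 1.0000001234567, 3.1415926535
ah, al = split32(a)
bh, bl = split32(b)
# Accumulate the four partial products in a wider accumulator.
approx = (np.float64(ah) * bh + np.float64(ah) * bl
          + np.float64(al) * bh + np.float64(al) * bl)
print(abs(approx - a * b))   # far smaller error than float32(a) * float32(b)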
FreeTensor
A language and compiler for irregular tensor programs.
BaGuaLu
Large-scale pretrained AI models have shown state-of-the-art accuracy in a series of important applications. As the size of pretrained AI models grows dramatically each year in an effort to achieve higher accuracy, training such models requires massive computing and memory capabilities, which accelerates the convergence of AI and HPC.
ZZ Crosstalk Suppression
Noise is a significant obstacle to quantum computing, and ZZ crosstalk is one of the most destructive types of noise affecting superconducting qubits. Previous approaches to suppressing ZZ crosstalk have mainly relied on specialized chip designs, which can complicate chip fabrication and aggravate decoherence.
PerFlow
PerFlow is a domain-specific framework for performance analysis of parallel applications, which significantly reduces the burden of implementing performance analysis tasks.
FasterMoE
While FastMoE enables distributed MoE model training using PyTorch, it suffers from inefficiency caused by load imbalance and poor communication performance. Other state-of-the-art systems for MoE, such as GShard from Google and BASE Layers from Facebook, share the same issues.
Crosstalk Mitigation
Crosstalk is one of the major types of noise in quantum computers. To design high-fidelity quantum gates and large-scale quantum computers, effectively suppressing crosstalk is becoming increasingly important. Previous approaches to mitigate crosstalk rely on either hardware strategies, which are only applicable on limited platforms, or software techniques, which, however, cannot fully explore instruction parallelism.
Elan
Elastic deep learning training, which promises better resource utilization and faster training, has been attracting increasing attention recently. Nevertheless, existing approaches to providing elasticity have notable limitations.
CYPRESS
Communication traces are increasingly important, both for performance analysis and optimization of parallel applications, and for designing next-generation HPC systems. Meanwhile, the problem size and the execution scale on supercomputers keep growing, producing prohibitive volumes of communication traces.
ScalAna
ScalAna is an automatic tool for scaling loss detection. It uses hybrid static-dynamic analysis to reduce overhead and leverages graph algorithms to locate the root causes of scaling issues.
GNN Optimizations
Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but how to do so remains only preliminarily understood.
AIPerf
AIPerf is an end-to-end benchmark suite utilizing automated machine learning (AutoML). It represents real AI scenarios and auto-adaptively scales to machines of various sizes. The AutoML pipeline includes neural architecture search (NAS), hyperparameter optimization (HPO), and more.
FastMoE
Mixture-of-Experts (MoE) shows strong potential for enlarging language models to trillions of parameters. However, training trillion-scale MoE models requires algorithm-system co-design to build a well-tuned, high-performance distributed training system.
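For context, the computation pattern such a system must distribute looks roughly like the following top-1 gated layer (a plain-PyTorch sketch; the class and parameter names are illustrative, not FastMoE's API):

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)])

    def forward(self, x):                    # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        weight, idx = scores.max(dim=-1)     # route each token to 1 expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e                   # tokens routed to expert e
            if sel.any():
                out[sel] = weight[sel].unsqueeze(1) * expert(x[sel])
        return out

moe = TinyMoE()
print(moe(torch.randn(8, 64)).shape)         # torch.Size([8, 64])

In distributed training, the experts live on different devices, so the per-expert token batches above become all-to-all exchanges, which is where load imbalance and communication cost arise.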
Graphpi
Graph pattern matching, which aims to discover structural patterns in graphs, is considered one of the most fundamental graph mining problems in many real applications. Despite previous efforts, existing systems face two main challenges.
HyQuas
Quantum computing has shown its strong potential in solving certain important problems. Due to the intrinsic limitations of current real quantum computers, quantum circuit simulation still plays an important role in both research and development of quantum computing.
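The core operation such simulators accelerate is applying gate matrices to an exponentially large amplitude vector. A minimal single-gate step (illustrative only, unrelated to HyQuas's hybrid methods) looks like:

import numpy as np

def apply_1q(state, gate, k, n):
    # View the 2^n state as (2^(n-k-1), 2, 2^k) so that the middle axis
    # is qubit k (least-significant bit is k = 0), then contract it.
    psi = state.reshape(2**(n - k - 1), 2, 2**k)
    return np.einsum('ab,ibj->iaj', gate, psi).reshape(-1)

n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                               # |000>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = apply_1q(state, H, k=0, n=n)
print(np.abs(state)**2)                      # 0.5 on |000> and on |001>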
PET
High-performance tensor programs are critical for efficiently deploying deep neural network (DNN) models in real-world tasks. Existing frameworks optimize tensor programs by applying fully equivalent transformations, which maintain equivalence on every element of output tensors.
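A toy example of a fully equivalent transformation (illustrative, not taken from PET): rewriting a matrix product via transposition preserves every output element exactly, which is the equivalence condition existing frameworks enforce:

import numpy as np

A = np.random.randn(4, 5)
B = np.random.randn(5, 3)
# The transformed program must match on every element of the output.
assert np.allclose(A @ B, (B.T @ A.T).T)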
Spindle
Memory monitoring is of critical use in understanding applications and evaluating systems. Due to the dynamic nature of programs' memory accesses, common practice today performs large amounts of address examination and data recording at runtime, at the cost of substantial performance overhead (and large storage time/space consumption if memory traces are collected).
UniQ
Quantum circuit simulation is critical for verifying quantum computers. Given the exponential complexity of the simulation, existing simulators use different architectures to accelerate it. However, due to the variety of both simulation methods and modern architectures, it is challenging to design a high-performance yet portable simulator.