Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems

Jidong Zhai, Liyan Zheng, Jinghan Sun, Feng Zhang, Xiongchao Tang, Xuehai Qian, Bingsheng He, Wei Xue, Wenguang Chen, Weimin Zheng

December 2022

Abstract

Performance variation in parallel and distributed systems is an increasingly challenging problem. The runtimes of different executions can vary greatly even with a fixed number of computing nodes, and many HPC applications on supercomputers exhibit such variance. This not only makes execution times unpredictable, but also makes the system's behavior unintuitive. Efficient online detection of performance variation remains an open problem in HPC research. To address it, we propose an approach, called vSensor, for detecting variations in system performance. The key finding of this study is that the source code of a program can represent its runtime performance better than an external detector can. Specifically, many HPC applications contain code snippets with fixed workload patterns, e.g., an invariant workload or a linearly growing workload. This observation allows us to automatically identify these workload-related snippets and use them to detect variations in performance. We evaluate vSensor on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. With 4,096 processes, the average overhead is less than 6% for fixed-workload v-sensors. Using vSensor, we identify a problematic node with slow memory that degrades the performance of the program by 21%. We also detect a serious network performance issue that slows down an HPC kernel on the Tianhe-2A system by 3.37 times.

Type

Journal article

Publication

IEEE Transactions on Parallel and Distributed Systems

Jidong Zhai

Associate Professor
(Distinguished Research Fellow, Ph.D. Advisor)

Liyan Zheng

Ph.D. Student

Wenguang Chen

Professor
