Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems

Abstract

Variations in the performance of parallel and distributed systems are becoming increasingly challenging. The runtimes of different executions can vary greatly even with a fixed number of computing nodes. Many HPC applications on supercomputers exhibit such variance. This not only leads to unpredictable execution times, but also renders the system’s behavior unintuitive. The efficient online detection of variations in performance is an open problem in HPC research. To solve it, we propose an approach, called vSensor, to detect variations in the performance of systems. The key finding of this study is that the source code of programs can better represent performance at runtime than an external detector. Specifically, many HPC applications contain code snippets that are fixed workload patterns of execution, e.g., the workload of an invariant quantity and a linearly growing workload. This observation allows us to automatically identify these snippets of workload-related code and use them to detect variations in performance. We evaluate vSensor on the Tianhe-2A system with a large number of parallel applications, and the results indicate that it can efficiently identify variations in system performance. The average overhead of 4,096 processes is less than 6% for fixed-workload v-sensors. We identify a problematic node with slow memory by using vSensor that degrades the performance of the program by 21%. A serious issue with network performance is also detected that slows down the Tianhe-2A system by 3.37 times for an HPC kernel.

Publication
IEEE Transactions on Parallel and Distributed Systems
Jidong Zhai
Jidong Zhai
Associate Professor
(特别研究员、博士生导师)
Liyan Zheng
Liyan Zheng
Ph.D. Student

zly

Wenguang Chen
Wenguang Chen
Professor
(教授)