Vapro: Performance Variance Detection and Diagnosis for Production-Run Parallel Applications

Abstract

Performance variance is a serious problem for parallel applications: it degrades performance and makes application behavior hard to understand. Detecting and diagnosing performance variance is therefore crucial for users and application developers. However, previous detection approaches either incur excessive overhead that hurts application performance, or rely on nontrivial source code analysis that is impractical for production-run parallel applications.

In this work, we propose Vapro, a performance variance detection and diagnosis framework for production-run parallel applications. Our approach is based on the key observation that most parallel applications contain code snippets that are repeatedly executed with a fixed workload, and these snippets can be used for performance variance detection. To identify these snippets effectively at runtime, even without program source code, we introduce the State Transition Graph (STG) to track program execution and then conduct lightweight workload analysis on the STG to locate variance. To diagnose the detected variance, Vapro uses a progressive diagnosis method based on a hybrid model that combines variance breakdown and statistical analysis. Results show that the performance overhead of Vapro is only 1.38% on average. Vapro can detect variance in real applications caused by hardware bugs, memory, and I/O. After fixing the detected variance, the standard deviation of the execution time is reduced by up to 73.5%. Compared with the state-of-the-art variance detection tool based on source code analysis, Vapro achieves 30.0% higher detection coverage.
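The core observation in the abstract, that repeated executions of a snippet with the same workload should take about the same time, can be illustrated with a small sketch. This is not Vapro's implementation: the record layout, the `detect_variance` name, the use of an explicit workload value as the grouping key, and the 0.8 threshold are all assumptions made for illustration, and elapsed times are assumed to be positive.

```python
from collections import defaultdict


def detect_variance(records, threshold=0.8):
    """Flag executions that run slower than their fastest fixed-workload peer.

    records: iterable of (snippet_id, workload, elapsed_seconds) tuples.
    Executions sharing the same (snippet_id, workload) key are assumed to do
    the same amount of work, so their elapsed times should be comparable.
    Both the record layout and the threshold are hypothetical.
    """
    groups = defaultdict(list)
    for index, (snippet_id, workload, elapsed) in enumerate(records):
        groups[(snippet_id, workload)].append((index, elapsed))

    flagged = []
    for key, runs in groups.items():
        if len(runs) < 2:
            continue  # a single execution has nothing to be compared against
        fastest = min(elapsed for _, elapsed in runs)
        for index, elapsed in runs:
            normalized = fastest / elapsed  # 1.0 means as fast as the best run
            if normalized < threshold:
                flagged.append((index, key, normalized))
    return sorted(flagged)


records = [
    ("loop_a", 1000, 1.0),
    ("loop_a", 1000, 1.02),
    ("loop_a", 1000, 2.0),   # same workload as above, but twice as slow
    ("loop_a", 2000, 2.0),   # different workload, so not comparable
    ("loop_b", 500, 0.5),
]
print(detect_variance(records))
```

Only the third record is flagged here: it shares its snippet and workload with two faster executions, whereas the fourth record takes just as long but does a different amount of work and so has no peers to be compared with.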

Publication
Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Liyan Zheng
Ph.D. Student


Jidong Zhai
Associate Professor
(Distinguished Research Fellow, Ph.D. Advisor)
Yuyang Jin
Postdoc
Wenguang Chen
Professor