Abstract

Performance analysis is essential for understanding the behavior of large-scale parallel applications on modern supercomputers. Current performance analysis techniques are based on either profiling or tracing. Profiling incurs low runtime overhead but misses information needed to identify underlying bottlenecks, while tracing incurs unacceptable overhead at large scales.