NestedMP: Enabling cache-aware thread mapping for nested parallel shared memory applications

Abstract

Exploiting multiple levels of parallelism benefits a wide range of applications, because a typical server already has tens of processor cores. As the number of cores per machine keeps growing rapidly, efficient support for nested parallelism will become increasingly important. We observe that different task-core mapping schemes may result in significant performance differences, because modern HPC servers are NUMA multi-core systems. It is therefore important to control the task-core mapping for nested parallelism. However, the thread-count management mechanisms in current parallel programming models, such as OpenMP, do not give runtime systems enough information to make optimized decisions. As a result, current nested parallel applications often suffer from suboptimal task-core mapping and significant performance loss. To address this problem, we propose NestedMP, a set of directives that extends OpenMP. NestedMP specifies the number of threads of each nested parallel branch in a declarative way, which lets the runtime system see the whole task tree and make a locality-aware task-core mapping. We have implemented NestedMP in GCC 4.8.2 and tested its performance on a 4-way 8-core SandyBridge server. The results show that NestedMP improves performance significantly over GCC’s OpenMP implementation.
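
To make the task-core mapping problem concrete, the following is a minimal Python sketch, not NestedMP's actual algorithm (the abstract does not describe it). It models the 4-socket, 8-cores-per-socket machine mentioned above and compares two hypothetical mapping policies for a task tree whose outer branches each spawn a known number of inner threads. The function names, the branch sizes, and the simple first-fit policy are all assumptions made for illustration.

```python
from typing import List


def scatter_mapping(branch_sizes: List[int], sockets: int,
                    cores_per_socket: int) -> List[List[int]]:
    """Hypothetical baseline: hand out cores round-robin across sockets,
    ignoring which nested branch each thread belongs to."""
    if sum(branch_sizes) > sockets * cores_per_socket:
        raise ValueError("sketch assumes at most one thread per core")
    mapping = []
    next_thread = 0
    for size in branch_sizes:
        cores = []
        for _ in range(size):
            socket = next_thread % sockets   # rotate over sockets
            slot = next_thread // sockets    # next free slot on that socket
            cores.append(socket * cores_per_socket + slot)
            next_thread += 1
        mapping.append(cores)
    return mapping


def locality_aware_mapping(branch_sizes: List[int], sockets: int,
                           cores_per_socket: int) -> List[List[int]]:
    """Hypothetical locality-aware policy: knowing every branch size up
    front, place each branch on a single socket when one has room."""
    if sum(branch_sizes) > sockets * cores_per_socket:
        raise ValueError("sketch assumes at most one thread per core")
    free = [list(range(s * cores_per_socket, (s + 1) * cores_per_socket))
            for s in range(sockets)]
    mapping = []
    for size in branch_sizes:
        # First socket with enough free cores for the whole branch.
        home = next((f for f in free if len(f) >= size), None)
        if home is not None:
            cores = [home.pop(0) for _ in range(size)]
        else:
            # No single socket fits: spill over the remaining free cores.
            cores = []
            for f in free:
                while f and len(cores) < size:
                    cores.append(f.pop(0))
        mapping.append(cores)
    return mapping


def sockets_spanned(cores: List[int], cores_per_socket: int) -> int:
    """Number of distinct sockets (NUMA nodes) a branch's threads touch."""
    return len({c // cores_per_socket for c in cores})


if __name__ == "__main__":
    branches = [8, 8, 8, 8]  # assumed: 4 outer branches, 8 inner threads each
    for policy in (scatter_mapping, locality_aware_mapping):
        spans = [sockets_spanned(c, 8) for c in policy(branches, 4, 8)]
        print(policy.__name__, spans)
```

In this toy model, the locality-aware policy can only keep a branch on one socket because the thread count of every branch is known before any thread is placed, which is the kind of whole-tree information the abstract says declarative directives expose to the runtime.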

Publication
Parallel Computing
Wenguang Chen
Professor