Published: Wednesday, 19 October 2016
Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management by Andi Drebes.
26/10/16 in Kilburn L.T. 1.5 at 14:00
Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). In this talk, we show that with NUMA-aware task and data placement it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. The presented data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best-effort basis. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% local memory accesses on a 192-core system with 24 NUMA nodes, up to 5x higher performance than NUMA-aware hierarchical work-stealing, and up to 5.6x higher performance than static interleaved allocation.
The second part of the talk introduces Aftermath, a graphical tool for the trace-based analysis of task-parallel and loop-parallel programs. We show how Aftermath enabled the development and implementation of the task and data placement heuristics and we illustrate the impact of the heuristics on data locality and performance with trace files from the experimental evaluation.
Andi Drebes is a post-doctoral research associate at the University of Manchester. He received his MSc and PhD both from Université Pierre et Marie Curie (Paris, France). His research focuses on the dynamic optimization of task-parallel applications for NUMA machines and performance analysis of parallel programs. He has contributed optimizations for NUMA-aware scheduling and NUMA-aware memory allocation to the OpenStream run-time system and he is the lead developer of Aftermath, a tool for the trace-based performance analysis of task-parallel and loop-parallel OpenMP and OpenStream programs.