2018-… rOpenMP: A Resilient Parallel Programming Model for Heterogeneous Systems

June 3rd, 2019

Resilience, i.e., obtaining a correct solution in a timely and efficient manner, is one of the key challenges in extreme-scale supercomputing. Extreme heterogeneity, i.e., using multiple, and potentially configurable, types of processors, accelerators, and memory/storage in a single computing platform, adds significant complexity to the supercomputer hardware/software ecosystem. Errors and failures reported by such heterogeneous hardware need to be handled by the appropriate software component to enable efficient masking, recovery, and avoidance with little burden on the user.

This project takes a first step toward resilience in leadership-class supercomputers with extreme heterogeneity. It performs research to enable fine-grain resilience for systems accelerated by graphics processing units (GPUs), such as ORNL’s Summit, that is more efficient than traditional application-level checkpoint/restart. The approach centers on a novel concept for Quality of Service (QoS) and corresponding extensions for the OpenMP parallel programming model. This project develops (1) error and failure models, (2) software resilience strategies and protection domains, (3) OpenMP QoS language extensions for resilience, (4) OpenMP QoS runtime extensions and policies for resilience, and (5) a proof-of-concept prototype demonstrating these capabilities on Summit.
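To make the contrast with application-level checkpoint/restart concrete, the following is a minimal sketch in C of what fine-grain recovery around a single offloaded region could look like when written by hand with standard OpenMP. It is not the project's QoS extension, whose syntax is not given here. The function names (`scale_offload`, `scale_with_retry`), the checksum test, and the retry limit are all assumptions chosen for illustration. Only the one target region is re-executed when the check fails, rather than the whole application being rolled back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: absolute value without pulling in the math library. */
static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Hypothetical helper: host-side sum used as a cheap consistency check. */
static double sum_d(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += v[i];
    }
    return s;
}

/* Scale a vector inside a standard OpenMP target region.
   If the compiler has no OpenMP (offload) support, the pragma is ignored
   and the loop simply runs on the host. */
static void scale_offload(const double *in, double *out, size_t n, double factor)
{
    #pragma omp target teams distribute parallel for map(to: in[0:n]) map(from: out[0:n])
    for (size_t i = 0; i < n; ++i) {
        out[i] = factor * in[i];
    }
}

/* Assumed recovery policy: re-run only the offloaded region until the
   checksum of the result matches the expected value.
   Returns the number of attempts used (1..max_attempts),
   or 0 if every attempt failed the check. */
static int scale_with_retry(const double *in, double *out, size_t n,
                            double factor, int max_attempts)
{
    assert(in != NULL && out != NULL && max_attempts > 0);

    const double expected = factor * sum_d(in, n);

    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        scale_offload(in, out, n, factor);

        /* Relative tolerance absorbs floating-point rounding differences. */
        const double diff = abs_d(sum_d(out, n) - expected);
        if (diff <= 1e-9 * (1.0 + abs_d(expected))) {
            return attempt;
        }
    }
    return 0;
}
```

A caller would invoke `scale_with_retry(in, out, n, 2.0, 3)` and treat a return value of 0 as an unrecovered failure. In this hand-written form, detection and recovery are the programmer's responsibility; the project's stated aim is to move that burden into the programming model and runtime.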

The ultimate goal is to make fault resilience an integral part of the supercomputer hardware/software ecosystem, such that the burden for providing it is on the system by design and not on the user as an afterthought.

Funding Sources

Participating Institutions

Peer-reviewed Workshop Publications


  1. Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany.