About Me

January 10th, 2020

Dr. Christian Engelmann is a Senior Scientist and the Intelligent Systems and Facilities Group Leader at Oak Ridge National Laboratory, the US Department of Energy’s (DOE) largest multiprogram science and technology laboratory, with an annual budget of $2.2 billion. He has more than 20 years of experience in software research and development for extreme-scale high-performance computing (HPC) systems, with a strong funding and publication record. In collaboration with other laboratories and universities, Dr. Engelmann’s research solves computer science challenges in HPC software, such as scalability, dependability, energy efficiency, and portability.

His primary expertise is in HPC resilience, i.e., providing efficiency and correctness in the presence of faults, errors, and failures through avoidance, masking, and recovery. Dr. Engelmann is a leading expert in HPC resilience and was a member of the DOE Technical Council on HPC Resilience from 2013 to 2015. He received the 2015 DOE Early Career Award for research in resilience design patterns for extreme-scale HPC. His secondary expertise is in lightweight simulation of future-generation extreme-scale supercomputers with millions of processing units, studying the impact of hardware and software properties on the key HPC system design factors: performance, resilience, and power consumption.

Dr. Engelmann’s ongoing research program targets computer science challenges in machine-in-the-loop operational intelligence (OI) for smart systems, instruments and facilities. Leveraging operational data analytics in a control loop, machine-in-the-loop OI maximizes productivity and minimizes costs through adaptive autonomous operation. Application areas in HPC are: optimizing (i) scientific application performance and productivity, (ii) system performance and productivity and (iii) system and center operational costs and productivity. Application areas in federated instruments, laboratories and facilities include: (1) autonomous operation of instruments and laboratories, (2) optimizing the orchestration and utilization of federated instruments, laboratories and facilities, and (3) autonomous operation of federated instruments, laboratories and facilities. Some of the computer science research challenges are: (i) identification and collection of relevant operational data, (ii) combining offline with online data analytics, learning and decision making using artificial intelligence, (iii) understanding and modeling the involved trade-offs for decision making, (iv) design of experiments, and (v) leveraging community software tools for reusability and maintainability.

Dr. Engelmann earned a Dipl.-Ing. (FH), a German engineering degree and M.Sc. equivalent, in Computer Systems Engineering from the University of Applied Sciences Berlin, Germany, in 2001, an M.Sc. in Computer Science from the University of Reading, UK, also in 2001 as a conjoint degree, and a Ph.D. in Computer Science from the University of Reading in 2008. He is a Senior Member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE). He is also a Member of the Society for Industrial and Applied Mathematics (SIAM) and the Advanced Computing Systems Association (USENIX).

Download the NSF-style 2-page bio. Download the full list of publications. Resume available upon request.

Contact Information

engelmannc@computer.org | engelmannc@ornl.gov
P.O. Box 2008, Oak Ridge, TN 37831-6164, USA
Tel.: +1 (865) 574-3132
Fax: +1 (865) 576-5491

View Christian Engelmann's profile on LinkedIn
View Christian Engelmann's profile on Google Scholar
DBLP: Christian Engelmann
ORCID iD: orcid.org/0000-0003-4365-6416
Scopus ID: 18037364000

Professional Accomplishments

13 Research grants ($29.45M, 5 as lead investigator)
8 Current direct reports
8 Co-advised M.Sc. theses
4 Mentored summer faculty
56 Invited talks and seminars
110 Peer-reviewed articles/papers: 11 journal articles, 55 conference papers, 44 workshop papers
12 Peer-reviewed conference posters
3,715 Publication citations (H-index: 29, i10-index: 66)
Erdős number: 3
160 Committees at 44 conference series
54 Journal article and book proposal reviews
Awards: 2015 US Department of Energy Early Career Research Award

In the News

2018-11-19: HPCwire: What’s New in HPC Research: Thrill for Big Data, Scaling Resilience and More
2018-08-05: insideHPC: Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems
2015-07-15: ASCR Discovery: Mounting a charge. Early-career awardees attack exascale computing on two fronts: Power and resilience
2015-07-15: HPCwire: Tackling Power and Resilience at Exascale
2015-07-15: ComputerWorld Australia: Supercomputers face growing resilience problems

Most Cited Peer-reviewed Publications

  1. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123). 455 citations.
  2. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, May 1, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573. 342 citations.
  3. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472). 293 citations.
  4. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). 204 citations.
  5. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515). 159 citations.
  6. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). 112 citations.
  7. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419). 101 citations.
  8. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253). 97 citations.
  9. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. 96 citations.
  10. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569). 80 citations.

Most Recent Peer-reviewed Publications

  1. Saurabh Hukerikar and Christian Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, Perth, Australia, December 1-4, 2020. IEEE Computer Society, Los Alamitos, CA, USA. To appear.
  2. George Ostrouchov, Don Maxwell, Rizwan Ashraf, Christian Engelmann, Mallikarjun Shankar, and James Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, Atlanta, GA, USA, November 15-20, 2020. ACM Press, New York, NY, USA. Preliminary acceptance rate 17.9% (68/380). To appear.
  3. Mohit Kumar and Christian Engelmann. Models for Resilience Design Patterns. In Proceedings of the 33rd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2020: 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2020, Atlanta, GA, USA, November 11, 2020. IEEE Computer Society, Los Alamitos, CA, USA. Acceptance rate 66.7% (6/9). To appear.
  4. Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, pages 392-407, Warsaw, Poland, August 24-28, 2020. Springer Verlag, Berlin, Germany. ISBN 978-3-030-57674-5. DOI 10.1007/978-3-030-57675-2_25. Acceptance rate 24.5% (39/159).
  5. Piyush Sao, Christian Engelmann, Srinivas Eswar, Oded Green, and Richard Vuduc. Self-stabilizing Connected Components. In Proceedings of the 32nd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2019: 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, pages 50-59, Denver, CO, USA, November 22, 2019. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-6013-9. DOI 10.1109/FTXS49593.2019.00011. Acceptance rate 60.0% (6/10).
  6. Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, pages 78-93, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany. ISBN 978-3-030-28595-1. DOI 10.1007/978-3-030-28596-8_6.
  7. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007. Acceptance rate 45.0% (9/20).
  8. Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 39-48, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00008. Acceptance rate 45.0% (9/20).
  9. Byung Hoon (Hoony) Park, Yawei Hui, Swen Boehm, Rizwan Ashraf, Christian Engelmann, and Christopher Layton. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log. In Proceedings of the 19th IEEE International Conference on Cluster Computing (Cluster) 2018: 5th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018, pages 571-579, Belfast, UK, September 10, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-8319-4. ISSN 2168-9253. DOI 10.1109/CLUSTER.2018.00073.
  10. Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus for Extreme Parallelism. International Journal of High Performance Computing Applications (IJHPCA), volume 32, number 5, pages 729-743, September 1, 2018. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342017690910.