Publications

Peer-reviewed Journal Papers

  1. Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik Göddeke, Marco Heisig, Fabienne Jézéquel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortí, Francesco Rizzi, Ulrich Rüde, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thönnes, Andreas Wagner, and Barbara Wohlmuth. Resiliency in Numerical Algorithm Design for Extreme Scale Simulations. International Journal of High Performance Computing Applications (IJHPCA), volume 36, number 2, pages 251-285, March 1, 2022. SAGE Publications. ISSN 1094-3420. DOI 10.1177/10943420211055188. Abstract Publication BibTeX Citation
  2. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Study of Interconnect Errors, Network Congestion, and Applications Characteristics for Throttle Prediction on a Large Scale HPC System. Journal of Parallel and Distributed Computing (JPDC), volume 153, pages 29-43, July 1, 2021. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2021.03.001. Abstract Publication BibTeX Citation
  3. Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Epidemic Failure Detection and Consensus for Extreme Parallelism. International Journal of High Performance Computing Applications (IJHPCA), volume 32, number 5, pages 729-743, September 1, 2018. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342017690910. Abstract Publication BibTeX Citation
  4. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Journal of Supercomputing Frontiers and Innovations (JSFI), volume 4, number 3, pages 4-42, October 1, 2017. South Ural State University Chelyabinsk, Russia. ISSN 2409-6008. DOI 10.14529/jsfi170301. Abstract Publication BibTeX Citation
  5. Christian Engelmann and Thomas Naughton. A New Deadlock Resolution Protocol and Message Matching Algorithm for the Extreme-scale Simulator. Concurrency and Computation: Practice and Experience, volume 28, number 12, pages 3369-3389, August 1, 2016. John Wiley & Sons, Inc. ISSN 1532-0634. DOI 10.1002/cpe.3805. Abstract Publication BibTeX Citation
  6. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A. Chien, Paul Coteus, Nathan A. Debardeleben, Pedro Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. International Journal of High Performance Computing Applications (IJHPCA), volume 28, number 2, pages 127-171, May 1, 2014. SAGE Publications. ISSN 1094-3420. DOI 10.1177/1094342014522573. Abstract Publication BibTeX Citation
  7. Christian Engelmann. Scaling To A Million Cores And Beyond: Using Light-Weight Simulation to Understand The Challenges Ahead On The Road To Exascale. Future Generation Computer Systems (FGCS), volume 30, number 0, pages 59-65, January 1, 2014. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2013.04.014. Abstract Publication BibTeX Citation
  8. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Journal of Parallel and Distributed Computing (JPDC), volume 72, number 2, pages 254-267, February 1, 2012. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2011.10.009. Abstract Publication BibTeX Citation
  9. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-Level Virtualization Research at Oak Ridge National Laboratory. Future Generation Computer Systems (FGCS), volume 26, number 3, pages 304-307, March 1, 2010. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0167-739X. DOI 10.1016/j.future.2009.07.001. Abstract Publication BibTeX Citation
  10. Xubin (Ben) He, Li Ou, Christian Engelmann, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for High Availability Parallel File Systems. Journal of Parallel and Distributed Computing (JPDC), volume 69, number 12, pages 961-973, December 1, 2009. Elsevier B.V, Amsterdam, The Netherlands. ISSN 0743-7315. DOI 10.1016/j.jpdc.2009.08.004. Abstract Publication BibTeX Citation
  11. Xubin (Ben) He, Li Ou, Martha J. Kosa, Stephen L. Scott, and Christian Engelmann. A Unified Multiple-Level Cache for High Performance Cluster Storage Systems. International Journal of High Performance Computing and Networking (IJHPCN), volume 5, number 1-2, pages 97-109, November 14, 2007. Inderscience Publishers, Geneve, Switzerland. ISSN 1740-0562. DOI 10.1504/IJHPCN.2007.015768. Abstract Publication BibTeX Citation
  12. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, December 1, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X. DOI 10.4304/jcp.1.8.43-54. Abstract Publication BibTeX Citation
  13. Christian Engelmann, Stephen L. Scott, David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai (Box) Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha G. Shet, and Ponnuswamy (Saday) Sadayappan. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. ACM SIGOPS Operating Systems Review (OSR), volume 40, number 2, pages 63-72, April 1, 2006. ACM Press, New York, NY, USA. ISSN 0163-5980. DOI 10.1145/1131322.1131337. Abstract Publication BibTeX Citation

Peer-reviewed Conference Papers

  1. Christian Engelmann and Suhas Somnath. Science Use Case Design Patterns for Autonomous Experiments. In Proceedings of the 28th European Conference on Pattern Languages of Programs (EuroPLoP) 2023, pages 1-14, Kloster Irsee, Germany, July 5-9, 2023. ACM Press, New York, NY, USA. ISBN 979-8-4007-0040-8. DOI 10.1145/3628034.3628060. Abstract Publication BibTeX Citation
  2. Christian Engelmann, Olga Kuchar, Swen Boehm, Michael J. Brim, Thomas Naughton, Suhas Somnath, Scott Atchley, Jack Lange, Ben Mintz, and Elke Arenholz. The INTERSECT Open Federated Architecture for the Laboratory of the Future. In Communications in Computer and Information Science (CCIS): Accelerating Science and Engineering Discoveries Through Integrated Research Infrastructure for Experiment, Big Data, Modeling and Simulation. 18th Smoky Mountains Computational Sciences & Engineering Conference (SMC) 2022, pages 173-190, August 24-25, 2022. Springer, Cham. ISBN 978-3-031-23605-1. DOI 10.1007/978-3-031-23606-8_11. Acceptance rate 32.4% (24/74). Abstract Publication Presentation BibTeX Citation
  3. Saurabh Hukerikar and Christian Engelmann. PLEXUS: A Pattern-Oriented Runtime System Architecture for Resilient Extreme-Scale High-Performance Computing Systems. In Proceedings of the 25th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC) 2020, pages 31-39, Perth, Australia, December 1-4, 2020. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-8004-5. ISSN 1555-094X. DOI 10.1109/PRDC50213.2020.00014. Acceptance rate 40.9% (18/44). Abstract Publication BibTeX Citation
  4. George Ostrouchov, Don Maxwell, Rizwan Ashraf, Christian Engelmann, Mallikarjun Shankar, and James Rogers. GPU Lifetimes on Titan Supercomputer: Survival Analysis and Reliability. In Proceedings of the 33rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2020, pages 41:1-14, Atlanta, GA, USA, November 15-20, 2020. ACM Press, New York, NY, USA. ISBN 9781728199986. DOI 10.1109/SC41405.2020.00045. Acceptance rate 25.1% (95/378). Abstract Publication Presentation BibTeX Citation
  5. Haewon Jeong, Yaoqing Yang, Christian Engelmann, Vipul Gupta, Tze Meng Low, Pulkit Grover, Viveck Cadambe, and Kannan Ramchandran. 3D Coded SUMMA: Communication-Efficient and Robust Parallel Matrix Multiplication. In Lecture Notes in Computer Science: Proceedings of the 26th European Conference on Parallel and Distributed Computing (Euro-Par) 2020, pages 392-407, Warsaw, Poland, August 24-28, 2020. Springer Verlag, Berlin, Germany. ISBN 978-3-030-57674-5. DOI 10.1007/978-3-030-57675-2_25. Acceptance rate 24.5% (39/159). Abstract Publication Presentation BibTeX Citation
  6. Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, and Devesh Tiwari. Understanding and Analyzing Interconnect Errors and Network Congestion on a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 107-114, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00023. Acceptance rate 27.2% (62/228). Abstract Publication BibTeX Citation
  7. Bin Nie, Ji Xue, Saurabh Gupta, Tirthak Patel, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System. In Proceedings of the 48th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2018, pages 95-106, Luxembourg City, Luxembourg, June 25-28, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-5596-2. ISSN 2158-3927. DOI 10.1109/DSN.2018.00022. Acceptance rate 27.2% (62/228). Abstract Publication BibTeX Citation
  8. Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing. In Proceedings of the 9th ACM/SPEC International Conference on Performance Engineering (ICPE) 2018, pages 80-87, Berlin, Germany, April 9-13, 2018. ACM Press, New York, NY, USA. ISBN 978-1-4503-5095-2. DOI 10.1145/3184407.3184421. Acceptance rate 23.7% (14/59). Abstract Publication Presentation BibTeX Citation
  9. Rizwan Ashraf, Saurabh Hukerikar, and Christian Engelmann. Shrink or Substitute: Handling Process Failures in HPC Systems using In-situ Recovery. In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2018, pages 178-185, Cambridge, UK, March 21-23, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-4975-6. ISSN 2377-5750. DOI 10.1109/PDP2018.2018.00032. Acceptance rate 29.3% (27/92). Abstract Publication Presentation BibTeX Citation
  10. Saurabh Gupta, Tirthak Patel, Christian Engelmann, and Devesh Tiwari. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications. In Proceedings of the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, pages 44:1-44:12, Denver, CO, USA, November 12-17, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5114-0. DOI 10.1145/3126908.3126937. Acceptance rate 18.7% (61/327). Abstract Publication Presentation BibTeX Citation
  11. Bin Nie, Ji Xue, Saurabh Gupta, Christian Engelmann, Evgenia Smirni, and Devesh Tiwari. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS) 2017, pages 22-31, Banff, AB, Canada, September 20-22, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2764-8. ISSN 2375-0227. DOI 10.1109/MASCOTS.2017.12. Acceptance rate 30.95% (26/84). Abstract Publication BibTeX Citation
  12. Saurabh Hukerikar and Christian Engelmann. A Pattern Language for High-Performance Computing Resilience. In Proceedings of the 22nd European Conference on Pattern Languages of Programs (EuroPLoP) 2017, pages 12:1-12:16, Kloster Irsee, Germany, July 12-16, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-4848-5. DOI 10.1145/3147704.3147718. Abstract Publication BibTeX Citation
  13. Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Benchmark Generation and Simulation at Extreme Scale. In Proceedings of the 20th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2016, pages 9-18, London, UK, September 21-23, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5090-3506-9. ISSN 1550-6525. DOI 10.1109/DS-RT.2016.18. Acceptance rate 42.0% (21/50). Best paper candidate. Abstract Publication Presentation BibTeX Citation
  14. Saurabh Hukerikar and Christian Engelmann. Havens: Explicit Reliable Memory Regions for HPC Applications. In Proceedings of the 20th IEEE High Performance Extreme Computing Conference (HPEC) 2016, pages 1-6, Waltham, MA, USA, September 13-15, 2016. IEEE Computer Society, Los Alamitos, CA, USA. DOI 10.1109/HPEC.2016.7761593. Abstract Publication Presentation BibTeX Citation
  15. Kun Tang, Devesh Tiwari, Saurabh Gupta, Ping Huang, QiQi Lu, Christian Engelmann, and Xubin He. Power-Capping Aware Checkpointing: On the Interplay Among Power-Capping, Temperature, Reliability, Performance, and Energy. In Proceedings of the 46th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) 2016, pages 311-322, Toulouse, France, June 28 – July 1, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 2158-3927. DOI 10.1109/DSN.2016.36. Acceptance rate 22.4% (58/259). Abstract Publication BibTeX Citation
  16. David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Mini-Ckpts: Surviving OS Failures in Persistent Memory. In Proceedings of the 30th ACM International Conference on Supercomputing (ICS) 2016, pages 7:1-7:14, Istanbul, Turkey, June 1-3, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4361-9. DOI 10.1145/2925426.2926295. Acceptance rate 24.2% (43/178). Abstract Publication Presentation BibTeX Citation
  17. Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Franck Cappello, Christian Engelmann, and Marc Snir. Reducing Waste in Extreme Scale Systems Through Introspective Analysis. In Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016, pages 212-221, Chicago, IL, USA, May 23-27, 2016. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1530-2075. DOI 10.1109/IPDPS.2016.100. Acceptance rate 23.0% (114/496). Abstract Publication Presentation BibTeX Citation
  18. Christian Engelmann and Thomas Naughton. Supporting the Development of Soft-Error Resilient Message Passing Applications using Simulation. In Proceedings of the 13th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2016, Innsbruck, Austria, February 15-16, 2016. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-979-0. DOI 10.2316/P.2016.834-005. Abstract Publication Presentation BibTeX Citation
  19. Amogh Katti, Giuseppe Di Fatta, Thomas Naughton, and Christian Engelmann. Scalable and Fault Tolerant Failure Detection and Consensus. In Proceedings of the 22nd European MPI Users' Group Meeting (EuroMPI) 2015, pages 13:1-13:9, Bordeaux, France, September 21-24, 2015. ACM Press, New York, NY, USA. ISBN 978-1-4503-3795-3. DOI 10.1145/2802658.2802660. Acceptance rate 48.3% (14/29). Abstract Publication Presentation BibTeX Citation
  20. Christian Engelmann and Thomas Naughton. A Network Contention Model for the Extreme-scale Simulator. In Proceedings of the 34th IASTED International Conference on Modelling, Identification and Control (MIC) 2015, Innsbruck, Austria, February 17-18, 2015. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-975-2. DOI 10.2316/P.2015.826-043. Abstract Publication Presentation BibTeX Citation
  21. Christian Engelmann and Thomas Naughton. Improving the Performance of the Extreme-scale Simulator. In Proceedings of the 18th IEEE/ACM International Symposium on Distributed Simulation and Real Time Applications (DS-RT) 2014, pages 198-207, Toulouse, France, October 1-3, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-6143-6. ISSN 1550-6525. DOI 10.1109/DS-RT.2014.32. Best paper candidate. Abstract Publication Presentation BibTeX Citation
  22. Thomas Naughton, Christian Engelmann, Geoffroy Vallée, and Swen Böhm. Supporting the Development of Resilient Message Passing Applications using Simulation. In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2014, pages 271-278, Turin, Italy, February 12-14, 2014. IEEE Computer Society, Los Alamitos, CA, USA. ISSN 1066-6192. DOI 10.1109/PDP.2014.74. Acceptance rate 32.6% (73/224). Abstract Publication Presentation BibTeX Citation
  23. Geoffroy Vallée, Thomas Naughton, Swen Böhm, and Christian Engelmann. A Runtime Environment for Supporting Research in Resilient HPC System Software & Tools. In Proceedings of the 1st International Symposium on Computing and Networking – Across Practical Development and Theoretical Research – (CANDAR) 2013, pages 213-219, Matsuyama, Japan, December 4-6, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4799-2795-1. DOI 10.1109/CANDAR.2013.38. Acceptance rate 35.8% (28/78). Abstract Publication Presentation BibTeX Citation
  24. Christian Engelmann. Investigating Operating System Noise in Extreme-Scale High-Performance Computing Systems using Simulation. In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2013, Innsbruck, Austria, February 11-13, 2013. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-943-1. DOI 10.2316/P.2013.795-010. Abstract Publication Presentation BibTeX Citation
  25. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. In Proceedings of the 25th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2012, pages 78:1-78:12, Salt Lake City, UT, USA, November 10-16, 2012. ACM Press, New York, NY, USA. ISBN 978-1-4673-0804-5. DOI 10.1109/SC.2012.49. Acceptance rate 21.2% (100/472). Abstract Publication Presentation BibTeX Citation
  26. James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining Partial Redundancy and Checkpointing for HPC. In Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS) 2012, pages 615-626, Macau, SAR, China, June 18-21, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4685-8. ISSN 1063-6927. DOI 10.1109/ICDCS.2012.56. Acceptance rate 13.8% (71/515). Abstract Publication Presentation BibTeX Citation
  27. Chao Wang, Sudharshan S. Vazhkudai, Xiaosong Ma, Fei Meng, Youngjae Kim, and Christian Engelmann. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. In Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2012, pages 957-968, Shanghai, China, May 21-25, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4675-9. DOI 10.1109/IPDPS.2012.90. Acceptance rate 20.7% (118/569). Abstract Publication Presentation BibTeX Citation
  28. Swen Böhm and Christian Engelmann. File I/O for MPI Applications in Redundant Execution Scenarios. In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2012, pages 112-119, Garching, Germany, February 15-17, 2012. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4633-9. ISSN 1066-6192. DOI 10.1109/PDP.2012.22. Abstract Publication Presentation BibTeX Citation
  29. Swen Böhm and Christian Engelmann. xSim: The Extreme-Scale Simulator. In Proceedings of the International Conference on High Performance Computing and Simulation (HPCS) 2011, pages 280-286, Istanbul, Turkey, July 4-8, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-61284-383-4. DOI 10.1109/HPCSim.2011.5999835. Acceptance rate 28.1% (48/171). Abstract Publication Presentation BibTeX Citation
  30. Christian Engelmann and Swen Böhm. Redundant Execution of HPC Applications with MR-MPI. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2011, pages 31-38, Innsbruck, Austria, February 15-17, 2011. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-864-9. DOI 10.2316/P.2011.719-031. Abstract Publication Presentation BibTeX Citation
  31. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Checkpointing for MPI Jobs in HPC Environments. In Proceedings of the 16th IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2010, pages 524-533, Shanghai, China, December 8-10, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4307-9. DOI 10.1109/ICPADS.2010.48. Acceptance rate 29.6% (77/188). Abstract Publication Presentation BibTeX Citation
  32. Min Li, Sudharshan S. Vazhkudai, Ali R. Butt, Fei Meng, Xiaosong Ma, Youngjae Kim, Christian Engelmann, and Galen Shipman. Functional Partitioning to Optimize End-to-End Performance on Many-Core Architectures. In Proceedings of the 23rd IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2010, pages 1-12, New Orleans, LA, USA, November 13-19, 2010. ACM Press, New York, NY, USA. ISBN 978-1-4244-7559-9. DOI 10.1109/SC.2010.28. Acceptance rate 19.8% (50/253). Abstract Publication Presentation BibTeX Citation
  33. Swen Böhm, Christian Engelmann, and Stephen L. Scott. Aggregation of Real-Time System Monitoring Data for Analyzing Large-Scale Parallel and Distributed Computing Environments. In Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications (HPCC) 2010, pages 72-78, Melbourne, Australia, September 1-3, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4214-0. DOI 10.1109/HPCC.2010.32. Acceptance rate 19.1% (58/304). Abstract Publication Presentation BibTeX Citation
  34. Antonina Litvinova, Christian Engelmann, and Stephen L. Scott. A Proactive Fault Tolerance Framework for High-Performance Computing. In Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2010, Innsbruck, Austria, February 16-18, 2010. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-783-3. DOI 10.2316/P.2010.676-024. Abstract Publication Presentation BibTeX Citation
  35. Narate Taerat, Nichamon Naksinehaboon, Clayton Chandler, James Elliott, Chokchai (Box) Leangsuksun, George Ostrouchov, Stephen L. Scott, and Christian Engelmann. Blue Gene/L Log Analysis and Time to Interrupt Estimation. In Proceedings of the 4th International Conference on Availability, Reliability and Security (ARES) 2009, pages 173-180, Fukuoka, Japan, March 16-19, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-3572-2. DOI 10.1109/ARES.2009.105. Acceptance rate 25.0% (40/160). Abstract Publication BibTeX Citation
  36. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Evaluating the Shared Root File System Approach for Diskless High-Performance Computing Systems. In Proceedings of the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA, March 9-12, 2009. Abstract Publication Presentation BibTeX Citation
  37. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 252-257, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.31. Acceptance rate 42.0% (58/138). Abstract Publication Presentation BibTeX Citation
  38. Alessandro Valentini, Christian Di Biagio, Fabrizio Batino, Guido Pennella, Fabrizio Palma, and Christian Engelmann. High Performance Computing with Harness over InfiniBand. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, pages 151-154, Weimar, Germany, February 18-20, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3544-9. ISSN 1066-6192. DOI 10.1109/PDP.2009.64. Acceptance rate 42.0% (58/138). Abstract Publication BibTeX Citation
  39. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. In Proceedings of the 8th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, pages 189-194, Innsbruck, Austria, February 16-18, 2009. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-784-0. Abstract Publication Presentation BibTeX Citation
  40. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, pages 1-12, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9. DOI 10.1145/1413370.1413414. Acceptance rate 21.3% (59/277). Abstract Publication Presentation BibTeX Citation
  41. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active Replication for Dependent Services. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 260-267, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.64. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation
  42. Geoffroy R. Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai (Box) Leangsuksun, Thomas Naughton, and Stephen L. Scott. A Framework For Proactive Fault Tolerance. In Proceedings of the 3rd International Conference on Availability, Reliability and Security (ARES) 2008, pages 659-664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3102-1. DOI 10.1109/ARES.2008.171. Acceptance rate 21.1% (40/190). Abstract Publication Presentation BibTeX Citation
  43. Björn Könning, Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Virtualized Environments for the Harness High Performance Computing Workbench. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 133-140, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.14. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation
  44. Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, Hong H. Ong, and Stephen L. Scott. System-level Virtualization for High Performance Computing. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2008, pages 636-643, Toulouse, France, February 13-15, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3089-5. DOI 10.1109/PDP.2008.85. Acceptance rate 40% (83/207). Abstract Publication Presentation BibTeX Citation
  45. Li Ou, Christian Engelmann, Xubin (Ben) He, Xin Chen, and Stephen L. Scott. Symmetric Active/Active Metadata Service for Highly Available Cluster Storage Systems. In Proceedings of the 19th IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS) 2007, Cambridge, MA, USA, November 19-21, 2007. ACTA Press, Calgary, AB, Canada. ISBN 978-0-88986-703-1. Acceptance rate 49%. Abstract Publication Presentation BibTeX Citation
  46. Emanuele Di Saverio, Marco Cesati, Christian Di Biagio, Guido Pennella, and Christian Engelmann. Distributed Real-Time Computing with Harness. In Lecture Notes in Computer Science: Proceedings of the 14th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2007, pages 281-288, Paris, France, September 30 – October 3, 2007. Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. ISSN 0302-9743. DOI 10.1007/978-3-540-75416-9_39. Abstract Publication Presentation BibTeX Citation
  47. Li Ou, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. A Fast Delivery Protocol for Total Order Broadcasting. In Proceedings of the 16th IEEE International Conference on Computer Communications and Networks (ICCCN) 2007, pages 730-734, Honolulu, HI, USA, August 13-16, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-42441-251-8. ISSN 1095-2055. DOI 10.1109/ICCCN.2007.4317904. Acceptance rate 29.1% (160/550). Abstract Publication Presentation BibTeX Citation
  48. Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1145/1274971.1274978. Acceptance rate 23.6% (29/123). Abstract Publication Presentation BibTeX Citation
  49. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. On Programming Models for Service-Level High Availability. In Proceedings of the 2nd International Conference on Availability, Reliability and Security (ARES) 2007, pages 999-1006, Vienna, Austria, April 10-13, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2775-2. DOI 10.1109/ARES.2007.109. Acceptance rate 28.3% (60/212). Abstract Publication Presentation BibTeX Citation
  50. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1. DOI 10.1109/IPDPS.2007.370307. Acceptance rate 26% (109/419). Abstract Publication Presentation BibTeX Citation
  51. Kai Uhlemann, Christian Engelmann, and Stephen L. Scott. JOSHUA: Symmetric Active/Active Replication for Highly Available HPC Job and Resource Management. In Proceedings of the 8th IEEE International Conference on Cluster Computing (Cluster) 2006, pages 1-10, Barcelona, Spain, September 25-28, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 1-4244-0328-6. ISSN 1552-5244. DOI 10.1109/CLUSTR.2006.311855. Acceptance rate 33.1% (42/127). Abstract Publication Presentation BibTeX Citation
  52. Ronald Baumann, Christian Engelmann, and George A. (Al) Geist. A Parallel Plug-in Programming Paradigm. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on High Performance Computing and Communications (HPCC) 2006, pages 823-832, Munich, Germany, September 13-15, 2006. Springer Verlag, Berlin, Germany. ISBN 978-3-540-39368-9. ISSN 0302-9743. DOI 10.1007/11847366_85. Abstract Publication Presentation BibTeX Citation
  53. Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8. DOI 10.1145/1183401.1183433. Acceptance rate 26.2% (37/141). Abstract Publication Presentation BibTeX Citation
  54. Daniel I. Okunbor, Christian Engelmann, and Stephen L. Scott. Exploring Process Groups for Reliability, Availability and Serviceability of Terascale Computing Systems. In Proceedings of the 2nd International Conference on Computer Science and Information Systems 2006, Athens, Greece, June 19-21, 2006. Abstract Publication BibTeX Citation
  55. Kshitij Limaye, Chokchai (Box) Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard M. Libby, and Kasidit Chanchio. Job-Site Level Fault Tolerance for Cluster and Grid Environments. In Proceedings of the 7th IEEE International Conference on Cluster Computing (Cluster) 2005, pages 1-9, Boston, MA, USA, September 26-30, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7803-9486-0. ISSN 1552-5244. DOI 10.1109/CLUSTR.2005.347043. Acceptance rate 39.6% (45/138). Abstract Publication BibTeX Citation
  56. Hertong Song, Chokchai (Box) Leangsuksun, Raja Nassar, Yudan Liu, Christian Engelmann, and Stephen L. Scott. UML-based Beowulf Cluster Availability Modeling. In International Conference on Software Engineering Research and Practice (SERP) 2005, pages 161-167, Las Vegas, NV, USA, June 27-30, 2005. CSREA Press. ISBN 1-932415-49-1. BibTeX Citation
  57. Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743. DOI 10.1007/11428831_39. Acceptance rate 35%. Abstract Publication Presentation BibTeX Citation

Peer-reviewed Workshop Papers

  1. Mohit Kumar and Christian Engelmann. RDPM: An Extensible Tool for Resilience Design Patterns Modeling. In Lecture Notes in Computer Science: Proceedings of the 27th European Conference on Parallel and Distributed Computing (Euro-Par) 2021 Workshops: 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 283-297, Lisbon, Portugal, August 30, 2021. Springer Verlag, Berlin, Germany. ISBN 978-3-031-06155-4. DOI 10.1007/978-3-031-06156-1_23. Acceptance rate 66.7% (4/6). Abstract Publication BibTeX Citation
  2. Mohit Kumar and Christian Engelmann. Models for Resilience Design Patterns. In Proceedings of the 33rd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2020: 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2020, pages 21-30, Atlanta, GA, USA, November 11, 2020. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7381-1080-6. DOI 10.1109/FTXS51974.2020.00008. Acceptance rate 66.7% (6/9). Abstract Publication Presentation BibTeX Citation
  3. Piyush Sao, Christian Engelmann, Srinivas Eswar, Oded Green, and Richard Vuduc. Self-stabilizing Connected Components. In Proceedings of the 32nd International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2019: 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019, pages 50-59, Denver, CO, USA, November 22, 2019. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-6013-9. DOI 10.1109/FTXS49593.2019.00011. Acceptance rate 60.0% (6/10). Abstract Publication Presentation BibTeX Citation
  4. Christian Engelmann, Geoffroy R. Vallée, and Swaroop Pophale. Concepts for OpenMP Target Offload Resilience. In Lecture Notes in Computer Science: Proceedings of the 15th International Workshop on OpenMP (IWOMP) 2019, pages 78-93, Auckland, New Zealand, September 11-13, 2019. Springer Verlag, Berlin, Germany. ISBN 978-3-030-28595-1. DOI 10.1007/978-3-030-28596-8_6. Abstract Publication Presentation BibTeX Citation
  5. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Analyzing HPC System Status using the LogSCAN Platform. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 29-38, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00007. Acceptance rate 45.0% (9/20). Abstract Publication Presentation BibTeX Citation
  6. Rizwan Ashraf and Christian Engelmann. Analyzing the Impact of System Reliability Events on Applications in the Titan Supercomputer. In Proceedings of the 31st International Conference on High Performance Computing, Networking, Storage and Analysis (SC) Workshops 2018: 8th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2018, pages 39-48, Dallas, TX, USA, November 16, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-7281-0222-1. DOI 10.1109/FTXS.2018.00008. Acceptance rate 45.0% (9/20). Abstract Publication Presentation BibTeX Citation
  7. Byung Hoon (Hoony) Park, Yawei Hui, Swen Boehm, Rizwan Ashraf, Christian Engelmann, and Christopher Layton. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log. In Proceedings of the 19th IEEE International Conference on Cluster Computing (Cluster) 2018: 5th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2018, pages 571-579, Belfast, UK, September 10, 2018. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-8319-4. ISSN 2168-9253. DOI 10.1109/CLUSTER.2018.00073. Abstract Publication Presentation BibTeX Citation
  8. Rizwan Ashraf and Christian Engelmann. Performance Efficient Multiresilience using Checkpoint Recovery in Iterative Algorithms. In Lecture Notes in Computer Science: Proceedings of the 24th European Conference on Parallel and Distributed Computing (Euro-Par) 2018 Workshops: 11th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 813-825, Turin, Italy, August 28, 2018. Springer Verlag, Berlin, Germany. ISBN 978-3-030-10549-5. DOI 10.1007/978-3-030-10549-5_63. Acceptance rate 50.0% (4/8). Abstract Publication Presentation BibTeX Citation
  9. Byung Hoon (Hoony) Park, Saurabh Hukerikar, Christian Engelmann, and Ryan Adamson. Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale. In Proceedings of the 18th IEEE International Conference on Cluster Computing (Cluster) 2017: 4th Workshop on Monitoring and Analysis for High Performance Systems Plus Applications (HPCMASPA) 2017, pages 758-765, Honolulu, HI, USA, September 5, 2017. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-5386-2327-5. ISSN 2168-9253. DOI 10.1109/CLUSTER.2017.113. Abstract Publication Presentation BibTeX Citation
  10. Saurabh Hukerikar and Christian Engelmann. Pattern-based Modeling of High-Performance Computing Resilience. In Lecture Notes in Computer Science: Proceedings of the 23rd European Conference on Parallel and Distributed Computing (Euro-Par) 2017 Workshops: 10th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 557-568, Santiago de Compostela, Spain, August 29, 2017. Springer Verlag, Berlin, Germany. ISBN 978-3-319-75177-1. DOI 10.1007/978-3-319-75178-8_45. Acceptance rate 66.7% (4/6). Abstract Publication Presentation BibTeX Citation
  11. Saurabh Hukerikar, Rizwan Ashraf, and Christian Engelmann. Towards New Metrics for High-Performance Computing Resilience. In Proceedings of the 26th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2017: 7th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2017, pages 23-30, Washington, D.C., USA, June 26-30, 2017. ACM Press, New York, NY, USA. ISBN 978-1-4503-5001-3. DOI 10.1145/3086157.3086163. Acceptance rate 83.3% (5/6). Abstract Publication Presentation BibTeX Citation
  12. Saurabh Hukerikar and Christian Engelmann. Language Support for Reliable Memory Regions. In Lecture Notes in Computer Science: Proceedings of the 29th International Workshop on Languages and Compilers for Parallel Computing, pages 73-87, Rochester, NY, USA, September 28-30, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-52708-6. ISSN 0302-9743. DOI 10.1007/978-3-319-52709-3_6. Acceptance rate 76.9% (20/26). Abstract Publication Presentation BibTeX Citation
  13. Thomas Naughton, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. A Cooperative Approach to Virtual Machine Based Fault Injection. In Lecture Notes in Computer Science: Proceedings of the 22nd European Conference on Parallel and Distributed Computing (Euro-Par) 2016 Workshops: 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 671-682, Grenoble, France, August 23, 2016. Springer Verlag, Berlin, Germany. ISBN 978-3-319-58943-5. ISSN 0302-9743. DOI 10.1007/978-3-319-58943-5_54. Acceptance rate 55.6% (5/9). Abstract Publication Presentation BibTeX Citation
  14. Zachary Parchman, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and David E. Bernholdt. Adding Fault Tolerance to NPB Benchmarks Using ULFM. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC) 2016: 6th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2016, pages 19-26, Kyoto, Japan, May 31 – June 4, 2016. ACM Press, New York, NY, USA. ISBN 978-1-4503-4349-7. DOI 10.1145/2909428.2909429. Acceptance rate 85.7% (6/7). Abstract Publication Presentation BibTeX Citation
  15. Thomas Naughton, Garry Smith, Christian Engelmann, Geoffroy Vallée, Ferrol Aderholdt, and Stephen L. Scott. What is the right balance for performance and isolation with virtualization in HPC? In Lecture Notes in Computer Science: Proceedings of the 20th European Conference on Parallel and Distributed Computing (Euro-Par) 2014 Workshops: 7th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 570-581, Porto, Portugal, August 25, 2014. Springer Verlag, Berlin, Germany. ISBN 978-3-319-14325-5. ISSN 0302-9743. DOI 10.1007/978-3-319-14325-5_49. Acceptance rate 60.0% (6/10). Abstract Publication Presentation BibTeX Citation
  16. Christian Engelmann and Thomas Naughton. Toward a Performance/Resilience Tool for Hardware/Software Co-Design of High-Performance Computing Systems. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP) 2013: 4th International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 962-971, Lyon, France, October 2, 2013. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-5117-3. ISSN 0190-3918. DOI 10.1109/ICPP.2013.114. Abstract Publication Presentation BibTeX Citation
  17. Mahesh Lagadapati, Frank Mueller, and Christian Engelmann. Tools for Simulation and Benchmark Generation at Exascale. In Proceedings of the 7th Parallel Tools Workshop, pages 19-24, Dresden, Germany, September 3-4, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-319-08143-4. DOI 10.1007/978-3-319-08144-1_2. Abstract Publication Presentation BibTeX Citation
  18. Thomas Naughton, Swen Böhm, Christian Engelmann, and Geoffroy Vallée. Using Performance Tools to Support Experiments in HPC Resilience. In Lecture Notes in Computer Science: Proceedings of the 19th European Conference on Parallel and Distributed Computing (Euro-Par) 2013 Workshops: 6th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 727-736, Aachen, Germany, August 26, 2013. Springer Verlag, Berlin, Germany. ISBN 978-3-642-54419-4. ISSN 0302-9743. DOI 10.1007/978-3-642-54420-0_71. Acceptance rate 87.5% (7/8). Abstract Publication Presentation BibTeX Citation
  19. Ian S. Jones and Christian Engelmann. Simulation of Large-Scale HPC Architectures. In Proceedings of the 40th International Conference on Parallel Processing (ICPP) 2011: 2nd International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI), pages 447-456, Taipei, Taiwan, September 13-19, 2011. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-4511-0. ISSN 1530-2016. DOI 10.1109/ICPPW.2011.44. Abstract Publication Presentation BibTeX Citation
  20. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011 Workshops, Part II: 4th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, pages 251-261, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29740-3. DOI 10.1007/978-3-642-29740-3_29. Acceptance rate 60.0% (12/20). Abstract Publication BibTeX Citation
  21. Thomas Naughton, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. A Case for Virtual Machine based Fault Injection in a High-Performance Computing Environment. In Lecture Notes in Computer Science: Proceedings of the 17th European Conference on Parallel and Distributed Computing (Euro-Par) 2011: 5th Workshop on System-level Virtualization for High Performance Computing (HPCVirt), pages 234-243, Bordeaux, France, August 29 – September 2, 2011. Springer Verlag, Berlin, Germany. ISBN 978-3-642-29737. DOI 10.1007/978-3-642-29737-3_27. Abstract Publication Presentation BibTeX Citation
  22. Christian Engelmann and Frank Lauer. Facilitating Co-Design for Extreme-Scale Systems Through Lightweight Simulation. In Proceedings of the 12th IEEE International Conference on Cluster Computing (Cluster) 2010: 1st Workshop on Application/Architecture Co-design for Extreme-scale Computing (AACEC), pages 1-8, Hersonissos, Crete, Greece, September 20-24, 2010. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-8395-2. DOI 10.1109/CLUSTERWKSP.2010.5613113. Abstract Publication Presentation BibTeX Citation
  23. George Ostrouchov, Thomas Naughton, Christian Engelmann, Geoffroy R. Vallée, and Stephen L. Scott. Nonparametric Multivariate Anomaly Analysis in Support of HPC Resilience. In Proceedings of the 5th IEEE International Conference on e-Science (e-Science) 2009: Workshop on Computational Science, pages 80-85, Oxford, UK, December 9-11, 2009. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-1-4244-5946-9. DOI 10.1109/ESCIW.2009.5407992. Abstract Publication Presentation BibTeX Citation
  24. Thomas Naughton, Wesley Bland, Geoffroy R. Vallée, Christian Engelmann, and Stephen L. Scott. Fault Injection Framework for System Resilience Evaluation – Fake Faults for Finding Future Failures. In Proceedings of the 18th International Symposium on High Performance Distributed Computing (HPDC) 2009: 2nd Workshop on Resiliency in High Performance Computing (Resilience) 2009, pages 23-28, Munich, Germany, June 9, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-587-1. DOI 10.1145/1552526.1552530. Abstract Publication Presentation BibTeX Citation
  25. Anand Tikotekar, Hong H. Ong, Sadaf Alam, Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and Stephen L. Scott. Performance Comparison of Two Virtual Machine Scenarios Using an HPC Application – A Case Study Using Molecular Dynamics Simulations. In Proceedings of the 3rd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2009, in conjunction with the 4th ACM SIGOPS European Conference on Computer Systems (EuroSys) 2009, pages 33-40, Nuremberg, Germany, March 30, 2009. ACM Press, New York, NY, USA. ISBN 978-1-60558-465-2. DOI 10.1145/1519138.1519143. Abstract Publication Presentation BibTeX Citation
  26. Geoffroy R. Vallée, Thomas Naughton, Hong H. Ong, Anand Tikotekar, Christian Engelmann, Wesley Bland, Ferrol Aderholdt, and Stephen L. Scott. Virtual System Environments. In Communications in Computer and Information Science: Proceedings of the 2nd DMTF Academic Alliance Workshop on Systems and Virtualization Management: Standards and New Technologies (SVM) 2008, pages 72-83, Munich, Germany, October 21-22, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-540-88707-2. ISSN 1865-0929. DOI 10.1007/978-3-540-88708-9_7. Abstract Publication BibTeX Citation
  27. Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, and Stephen L. Scott. An Analysis of HPC Benchmark Applications in Virtual Machine Environments. In Lecture Notes in Computer Science: Proceedings of the 14th European Conference on Parallel and Distributed Computing (Euro-Par) 2008: 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC) 2008, pages 63-71, Las Palmas de Gran Canaria, Spain, August 26-29, 2008. Springer Verlag, Berlin, Germany. ISBN 978-3-642-00954-9. DOI 10.1007/978-3-642-00955-6. Abstract Publication Presentation BibTeX Citation
  28. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2008: Workshop on Resiliency in High Performance Computing (Resilience) 2008, pages 813-818, Lyon, France, May 19-22, 2008. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 978-0-7695-3156-4. DOI 10.1109/CCGRID.2008.78. Abstract Publication Presentation BibTeX Citation
  29. Xin Chen, Benjamin Eckart, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. An Online Controller Towards Self-Adaptive File System Availability and Performance. In Proceedings of the 5th High Availability and Performance Workshop (HAPCW) 2008, in conjunction with the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, April 3-4, 2008. Abstract Publication Presentation BibTeX Citation
  30. Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong H. Ong, Christian Engelmann, Stephen L. Scott, and Anthony M. Filippi. Effects of Virtualization on a Scientific Application – Running a Hyperspectral Radiative Transfer Code on Virtual Machines. In Proceedings of the 2nd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2008, in conjunction with the 3rd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2008, pages 16-23, Glasgow, UK, March 31, 2008. ACM Press, New York, NY, USA. ISBN 978-1-60558-120-0. DOI 10.1145/1435452.1435455. Abstract Publication Presentation BibTeX Citation
  31. Christian Engelmann, Hong H. Ong, and Stephen L. Scott. Middleware in Modern High Performance Computing System Architectures. In Lecture Notes in Computer Science: Proceedings of the 7th International Conference on Computational Science (ICCS) 2007, Part II: 4th Special Session on Collaborative and Cooperative Environments (CCE) 2007, pages 784-791, Beijing, China, May 27-30, 2007. Springer Verlag, Berlin, Germany. ISBN 3-5407-2585-5. ISSN 0302-9743. DOI 10.1007/978-3-540-72586-2_111. Abstract Publication Presentation BibTeX Citation
  32. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Transparent Symmetric Active/Active Replication for Service-Level High Availability. In Proceedings of the 7th IEEE International Symposium on Cluster Computing and the Grid (CCGrid) 2007: 7th International Workshop on Global and Peer-to-Peer Computing (GP2PC) 2007, pages 755-760, Rio de Janeiro, Brazil, May 14-17, 2007. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2833-3. DOI 10.1109/CCGRID.2007.116. Abstract Publication Presentation BibTeX Citation
  33. Christian Engelmann, Stephen L. Scott, Hong H. Ong, Geoffroy R. Vallée, and Thomas Naughton. Configurable Virtualized System Environments for High Performance Computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, March 20, 2007. Abstract Publication Presentation BibTeX Citation
  34. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006. Abstract Publication Presentation BibTeX Citation
  35. Li Ou, Xin Chen, Xubin (Ben) He, Christian Engelmann, and Stephen L. Scott. Achieving Computational I/O Efficiency in a High Performance Cluster Using Multicore Processors. In Proceedings of the 4th High Availability and Performance Workshop (HAPCW) 2006, in conjunction with the 7th Los Alamos Computer Science Institute (LACSI) Symposium 2006, Santa Fe, NM, USA, October 17, 2006. Abstract Publication Presentation BibTeX Citation
  36. Christian Engelmann and George A. (Al) Geist. RMIX: A Dynamic, Heterogeneous, Reconfigurable Communication Framework. In Lecture Notes in Computer Science: Proceedings of the 6th International Conference on Computational Science (ICCS) 2006, Part II: 3rd Special Session on Collaborative and Cooperative Environments (CCE) 2006, pages 573-580, Reading, UK, May 28-31, 2006. Springer Verlag, Berlin, Germany. ISBN 3-540-34381-4. ISSN 0302-9743. DOI 10.1007/11758525_77. Abstract Publication Presentation BibTeX Citation
  37. Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Active/Active Replication for Highly Available HPC System Services. In Proceedings of the 1st International Conference on Availability, Reliability and Security (ARES) 2006: 1st International Workshop on Frontiers in Availability, Reliability and Security (FARES) 2006, pages 639-645, Vienna, Austria, April 20-22, 2006. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2567-9. DOI 10.1109/ARES.2006.23. Abstract Publication Presentation BibTeX Citation
  38. Christian Engelmann and Stephen L. Scott. Concepts for High Availability in Scientific High-End Computing. In Proceedings of the 3rd High Availability and Performance Workshop (HAPCW) 2005, in conjunction with the 6th Los Alamos Computer Science Institute (LACSI) Symposium 2005, Santa Fe, NM, USA, October 11, 2005. Abstract Publication Presentation BibTeX Citation
  39. Christian Engelmann and Stephen L. Scott. High Availability for Ultra-Scale High-End Scientific Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005. Abstract Publication Presentation BibTeX Citation
  40. Chokchai (Box) Leangsuksun, Venkata K. Munganuru, Tong Liu, Stephen L. Scott, and Christian Engelmann. Asymmetric Active-Active High Availability for High-end Computing. In Proceedings of the 2nd International Workshop on Operating Systems, Programming Environments and Management Tools for High-Performance Computing on Clusters (COSET-2) 2005, in conjunction with the 19th ACM International Conference on Supercomputing (ICS) 2005, Cambridge, MA, USA, June 19, 2005. Abstract Publication Presentation BibTeX Citation
  41. Christian Engelmann and George A. (Al) Geist. A Lightweight Kernel for the Harness Metacomputing Framework. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2005: 14th Heterogeneous Computing Workshop (HCW) 2005, Denver, CO, USA, April 4, 2005. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-2312-9. ISSN 1530-2075. DOI 10.1109/IPDPS.2005.34. Abstract Publication Presentation BibTeX Citation
  42. Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. High Availability through Distributed Control. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004. Abstract Publication Presentation BibTeX Citation
  43. Xubin (Ben) He, Li Ou, Stephen L. Scott, and Christian Engelmann. A Highly Available Cluster Storage System using Scavenging. In Proceedings of the 2nd High Availability and Performance Workshop (HAPCW) 2004, in conjunction with the 5th Los Alamos Computer Science Institute (LACSI) Symposium 2004, Santa Fe, NM, USA, October 12, 2004. Abstract Publication Presentation BibTeX Citation
  44. Christian Engelmann and George A. (Al) Geist. A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform. In Proceedings of the Challenges of Large Applications in Distributed Environments Workshop (CLADE) 2003, in conjunction with the 12th IEEE International Symposium on High Performance Distributed Computing (HPDC) 2003, page 47, Seattle, WA, USA, June 21, 2003. IEEE Computer Society, Los Alamitos, CA, USA. ISBN 0-7695-1984-9. DOI xpls/abs_all.jsp?arnumber=4159902. Abstract Publication Presentation BibTeX Citation
  45. Christian Engelmann, Stephen L. Scott, and George A. (Al) Geist. Distributed Peer-to-Peer Control in Harness. In Lecture Notes in Computer Science: Proceedings of the 2nd International Conference on Computational Science (ICCS) 2002, Part II: Workshop on Global and Collaborative Computing, pages 720-727, Amsterdam, The Netherlands, April 21-24, 2002. Springer Verlag, Berlin, Germany. ISBN 3-540-43593-X. ISSN 0302-9743. DOI content/l537ujfwt8yta2dp. Abstract Publication Presentation BibTeX Citation

Peer-reviewed Conference Posters

  1. Christian Engelmann, Swen Boehm, Michael Brim, Jack Lange, Thomas Naughton, Patrick Widener, Ben Mintz, and Rohit Srivastava. INTERSECT: The Open Federated Architecture for the Laboratory of the Future. Poster at the 52nd International Conference on Parallel Processing (ICPP) 2023, Salt Lake City, UT, USA, August 7-10, 2023. Abstract Publication BibTeX Citation
  2. Christian Engelmann and Mohit Kumar. Resilience Design Patterns: A Structured Modeling Approach of Resilience in Computing Systems. Poster at the Workshop on Modeling and Simulation of Systems and Applications (ModSim) 2022, Seattle, WA, USA, August 10-12, 2022. Abstract Publication BibTeX Citation
  3. Yawei Hui, Rizwan Ashraf, Byung Hoon (Hoony) Park, and Christian Engelmann. Real-Time Assessment of Supercomputer Status by a Comprehensive Informative Metric through Streaming Processing. Poster at the 6th IEEE International Conference on Big Data (BigData) 2018, Seattle, WA, USA, December 10-13, 2018. Abstract Publication BibTeX Citation
  4. Yawei Hui, Byung Hoon (Hoony) Park, and Christian Engelmann. A Comprehensive Informative Metric for Summarizing HPC System Status. Poster at the 8th IEEE Symposium on Large Data Analysis and Visualization in conjunction with the 8th IEEE Vis 2018, Berlin, Germany, October 21, 2018. Abstract Publication BibTeX Citation
  5. Christian Engelmann and Rizwan Ashraf. Modeling and Simulation of Extreme-Scale Systems for Resilience by Design. Poster at the Workshop on Modeling and Simulation of Systems and Applications, Seattle, WA, USA, August 15-17, 2018. Abstract Publication BibTeX Citation
  6. Onkar Patil, Saurabh Hukerikar, Frank Mueller, and Christian Engelmann. Exploring Use Cases for Non-Volatile Memories in Support of HPC Resilience. Poster at the 30th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2017, Denver, CO, USA, November 12-17, 2017. Abstract Publication BibTeX Citation
  7. David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, and Kurt Ferreira. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011. Abstract BibTeX Citation
  8. David Fiala, Kurt Ferreira, Frank Mueller, and Christian Engelmann. A Tunable, Software-based DRAM Error Detection and Correction Library for HPC. Poster at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 12-18, 2011. Abstract BibTeX Citation
  9. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Abstract Publication BibTeX Citation
  10. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-level Virtualization for High-Performance Computing. Poster at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Abstract Publication BibTeX Citation
  11. Stephen L. Scott, Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, and Jyothish Varma. A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA, February 14-18, 2009. Abstract Publication BibTeX Citation
  12. George A. (Al) Geist, Christian Engelmann, Jack J. Dongarra, George Bosilca, Magdalena M. Sławińska, and Jarosław K. Sławiński. The Harness Workbench: Unified and Adaptive Access to Diverse High-Performance Computing Platforms. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation
  13. Stephen L. Scott, Christian Engelmann, Hong H. Ong, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, George Ostrouchov, Chokchai (Box) Leangsuksun, Nichamon Naksinehaboon, Raja Nassar, Mihaela Paun, Frank Mueller, Chao Wang, Arun B. Nagarajan, Jyothish Varma, Xubin (Ben) He, Li Ou, and Xin Chen. Resiliency for High-Performance Computing Systems. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation
  14. Stephen L. Scott, Geoffroy R. Vallée, Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong H. Ong. System-level Virtualization for High-Performance Computing. Poster at the 1st High-Performance Computer Science Week (HPCSW) 2008, Denver, CO, USA, March 30 – April 5, 2008. Abstract Publication BibTeX Citation

White Papers

  1. Ryan Adamson and Christian Engelmann. Cybersecurity and Privacy for Instrument-to-Edge-to-Center Scientific Computing Ecosystems. White paper accepted at the U.S. Department of Energy's ASCR Workshop on Cybersecurity and Privacy for Scientific Computing Ecosystems, November 3-5, 2021. Abstract Publication BibTeX Citation
  2. Mingyan Li, Robert A. Bridges, Pablo Moriano, Christian Engelmann, Feiyi Wang, and Ryan Adamson. Toward Effective Security/Reliability Situational Awareness via Concurrent Security-or-Fault Analytics. White paper accepted at the U.S. Department of Energy's ASCR Workshop on Cybersecurity and Privacy for Scientific Computing Ecosystems, November 3-5, 2021. Abstract Publication BibTeX Citation
  3. Hal Finkel, Pete Beckman, Christian Engelmann, Shantenu Jha, and Jack Lange. Research Opportunities in Operating Systems for Scientific Edge Computing. White paper by the U.S. Department of Energy's ASCR Roundtable Discussions on Operating-Systems Research 2021, January 25, 2021. Abstract Publication BibTeX Citation
  4. Hal Finkel, Pete Beckman, Ron Brightwell, Rudi Eigenmann, Christian Engelmann, Roberto Gioiosa, Kamil Iskra, Shantenu Jha, Jack Lange, Tapasya Patki, and Kevin Pedretti. Research Opportunities in Operating Systems for High-Performance Scientific Computing. White paper by the U.S. Department of Energy's ASCR Roundtable Discussions on Operating-Systems Research 2021, January 25, 2021. Abstract Publication BibTeX Citation
  5. Christian Engelmann. Resilience by Codesign (and not as an Afterthought). White paper accepted at the U.S. Department of Energy's Workshop on Reimagining Codesign 2021, March 16-18, 2021. Abstract Publication Presentation BibTeX Citation
  6. Petar Radojkovic, Manolis Marazakis, Paul Carpenter, Reiley Jeyapaul, Dimitris Gizopoulos, Martin Schulz, Adria Armejach, Eduard Ayguade, François Bodin, Ramon Canal, Franck Cappello, Fabien Chaix, Guillaume Colin de Verdiere, Said Derradji, Stefano Di Carlo, Christian Engelmann, Ignacio Laguna, Miquel Moreto, Onur Mutlu, Lazaros Papadopoulos, Olly Perks, Manolis Ploumidis, Behzad Salami, Yanos Sazeides, Dimitrios Soudris, Yiannis Sourdis, Per Stenstrom, Samuel Thibault, Will Toms, and Osman Unsal. Towards Resilient EU HPC Systems: A Blueprint. White paper by the European HPC resilience initiative, April 9, 2020. Abstract Publication BibTeX Citation
  7. Christian Engelmann, Rizwan Ashraf, and Saurabh Hukerikar. Extreme Heterogeneity with Resilience by Design (and not as an Afterthought). White paper accepted at the U.S. Department of Energy's Extreme Heterogeneity Virtual Workshop 2018, January 23-24, 2018. Abstract Publication BibTeX Citation
  8. Devesh Tiwari, Saurabh Gupta, and Christian Engelmann. Lightweight, Actionable Analytical Tools Based on Statistical Learning for Efficient System Operations. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Systems & Applications (ModSim) 2016, August 10-12, 2016. Abstract Publication Presentation BibTeX Citation
  9. Christian Engelmann and Thomas Naughton. A Hardware/Software Performance/Resilience/Power Co-Design Tool for Extreme-scale Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2013, September 18-19, 2013. Abstract Publication Presentation BibTeX Citation
  10. Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi, Pavan Balaji, Bill Carlson, Andrew A. Chien, Pedro Diniz, Christian Engelmann, Rinku Gupta, Fred Johnson, Jim Belak, Pradip Bose, Franck Cappello, Paul Coteus, Nathan A. DeBardeleben, Mattan Erez, Saverio Fazzari, Al Geist, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing Failures in Exascale Computing. Workshop report, August 4-11, 2013. Publication BibTeX Citation
  11. Al Geist, Bob Lucas, Marc Snir, Shekhar Borkar, Eric Roman, Mootaz Elnozahy, Bert Still, Andrew Chien, Robert Clay, John Wu, Christian Engelmann, Nathan DeBardeleben, Rob Ross, Larry Kaplan, Martin Schulz, Mike Heroux, Sriram Krishnamoorthy, Lucy Nowell, Abhinav Vishnu, and Lee-Ann Talley. U.S. Department of Energy Fault Management Workshop. Workshop report for the U.S. Department of Energy, June 6, 2012. Abstract Publication BibTeX Citation
  12. Christian Engelmann and Thomas Naughton. A Performance/Resilience/Power Co-design Tool for Extreme-scale High-Performance Computing. White paper accepted at the U.S. Department of Energy's Workshop on Modeling & Simulation of Exascale Systems & Applications (ModSim) 2012, August 9-10, 2012. Abstract Publication BibTeX Citation
  13. Christian Engelmann, Geoffroy R. Vallée, Thomas Naughton, and Frank Mueller. Dynamic Self-Aware Runtime Software for Exascale Systems. White paper for the U.S. Department of Energy's Exascale Operating Systems and Runtime Technical Council, July 1, 2012. Abstract Publication Presentation BibTeX Citation
  14. Geoffroy R. Vallée, Thomas Naughton, Christian Engelmann, and David E. Bernholdt. Unified Execution Environment. White paper for the U.S. Department of Energy's Exascale Operating Systems and Runtime Technical Council, July 1, 2012. Publication BibTeX Citation
  15. Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann, and Bill Harrod. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path-Forward for Research and Development. White paper for the U.S. National Science Foundation's High-end Computing Program, December 1, 2009. Publication BibTeX Citation

Technical Reports

  1. Christian Engelmann and Suhas Somnath. INTERSECT Architecture Specification: Use Case Design Patterns (Version 0.9). Technical Report, ORNL/TM-2023/3133, Oak Ridge National Laboratory, Oak Ridge, TN, USA, September 1, 2023. DOI 10.2172/2229218. Abstract Publication BibTeX Citation
  2. Christian Engelmann, Rizwan Ashraf, Saurabh Hukerikar, Mohit Kumar, and Piyush Sao. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 2.0). Technical Report, ORNL/TM-2022/2809, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2022. DOI 10.2172/1922296. Abstract Publication BibTeX Citation
  3. Michael Brim and Christian Engelmann. INTERSECT Architecture Specification: Microservice Architecture (Version 0.5). Technical Report, ORNL/TM-2022/2715, Oak Ridge National Laboratory, Oak Ridge, TN, USA, September 1, 2022. DOI 10.2172/1902805. Abstract Publication BibTeX Citation
  4. Christian Engelmann and Suhas Somnath. INTERSECT Architecture Specification: Use Case Design Patterns (Version 0.5). Technical Report, ORNL/TM-2022/2681, Oak Ridge National Laboratory, Oak Ridge, TN, USA, September 1, 2022. DOI 10.2172/1896984. Abstract Publication BibTeX Citation
  5. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.2). Technical Report, ORNL/TM-2017/745, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2017. DOI 10.2172/1436045. Abstract Publication BibTeX Citation
  6. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.1). Technical Report, ORNL/TM-2016/767, Oak Ridge National Laboratory, Oak Ridge, TN, USA, December 1, 2016. DOI 10.2172/1345793. Abstract Publication BibTeX Citation
  7. Saurabh Hukerikar and Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (Version 1.0). Technical Report, ORNL/TM-2016/687, Oak Ridge National Laboratory, Oak Ridge, TN, USA, October 1, 2016. DOI 10.2172/1338552. Abstract Publication BibTeX Citation
  8. David Fiala, Frank Mueller, Christian Engelmann, Kurt Ferreira, Ron Brightwell, and Rolf Riesen. Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing. Technical Report, ORNL/TM-2012/227, Oak Ridge National Laboratory, Oak Ridge, TN, USA, June 1, 2012. Abstract Publication BibTeX Citation
  9. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Hybrid Full/Incremental Checkpoint/Restart for MPI Jobs in HPC Environments. Technical Report, ORNL/TM-2010/162, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010. Abstract Publication BibTeX Citation
  10. Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration and Back Migration in HPC Environments. Technical Report, ORNL/TM-2010/161, Oak Ridge National Laboratory, Oak Ridge, TN, USA, August 1, 2010. Abstract Publication BibTeX Citation

Datasets

  1. Woong Shin, Vladyslav Oles, Anna Schmedding, George Ostrouchov, Evgenia Smirni, Christian Engelmann, and Feiyi Wang. OLCF Summit Supercomputer GPU Snapshots During Double-Bit Errors and Normal Operations. Dataset, April 20, 2023. DOI 10.13139/OLCF/1970187. Abstract BibTeX Citation

Talks and Lectures

  1. Christian Engelmann. The Interconnected Science Ecosystem (INTERSECT). Invited talk at the Hartree Centre, Science and Technology Facilities Council, Daresbury, UK, October 4, 2023. Abstract Presentation BibTeX Citation
  2. Christian Engelmann. The Interconnected Science Ecosystem (INTERSECT) Architecture. Invited talk at the 20th Smoky Mountains Computational Sciences & Engineering Conference (SMC), Knoxville, TN, USA, August 21-23, 2023. Abstract Presentation BibTeX Citation
  3. Christian Engelmann. The Interconnected Science Ecosystem (INTERSECT) Architecture. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, July 10, 2023. Abstract Presentation BibTeX Citation
  4. Christian Engelmann. The Interconnected Science Ecosystem (INTERSECT) Architecture. Invited talk at the 1st Ecosystems for Smart Autonomous Interconnected Labs (E-SAIL) Workshop, held in conjunction with the 38th ISC High Performance (ISC) 2023, Hamburg, Germany, May 25, 2023. Abstract Presentation BibTeX Citation
  5. Christian Engelmann. Designing Smart and Resilient Extreme-Scale Systems. Invited talk at the 20th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2022, Seattle, WA, USA, February 23-26, 2022. Abstract Presentation BibTeX Citation
  6. Ben Mintz, Christian Engelmann, Elke Arenholz, and Ryan Coffee. Enabling Self-Driven Experiments for Science through an Interconnected Science Ecosystem (INTERSECT). Panel at the 17th Smoky Mountains Computational Sciences & Engineering Conference (SMC), October 20, 2021. BibTeX Citation
  7. Christian Engelmann. Faults, Errors and Failures in Extreme-Scale Supercomputers. Keynote talk at the 14th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 27th European Conference on Parallel and Distributed Computing (Euro-Par) 2021, Lisbon, Portugal, August 30, 2021. Abstract Presentation BibTeX Citation
  8. Christian Engelmann. The Resilience Problem in Extreme Scale Computing: Experiences and the Path Forward. Invited talk at the SIAM Conference on Computational Science and Engineering (CSE) 2021, Fort Worth, TX, USA, March 1-5, 2021. Abstract Presentation BibTeX Citation
  9. Christian Engelmann. Smart and Resilient Extreme-Scale Systems. Invited talk at the Workshop on Resilience in High Performance Computing (RESILIENTHPC), held in conjunction with the European Network on High-performance Embedded Architecture and Compilation (HiPEAC) Conference 2021, Budapest, Hungary, January 19, 2021. Abstract Presentation BibTeX Citation
  10. Christian Engelmann. The Resilience Problem in Extreme Scale Computing. Invited talk at the 19th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2020, Seattle, WA, USA, February 12-15, 2020. Abstract Presentation BibTeX Citation
  11. Christian Engelmann. Resilience in Parallel Programming Environments. Invited talk at the 8th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Tokyo, Japan, October 30-31, 2019. Abstract Presentation BibTeX Citation
  12. Christian Engelmann. Resilience by Design (and not as an Afterthought). Invited talk at the 23rd Workshop on Distributed Supercomputing (SOS) 2019, Asheville, NC, USA, March 26-29, 2019. Abstract Presentation BibTeX Citation
  13. Christian Engelmann. Resilience for Extreme Scale Systems: Understanding the Problem. Invited talk at the SIAM Conference on Computational Science and Engineering (CSE) 2019, Spokane, WA, USA, February 25 – March 1, 2019. Abstract Presentation BibTeX Citation
  14. Christian Engelmann and Rizwan Ashraf. Modeling and Simulation of Extreme-Scale Systems for Resilience by Design. Invited talk at the Workshop on Modeling and Simulation of Systems and Applications, Seattle, WA, USA, August 15-17, 2018. Abstract Presentation BibTeX Citation
  15. Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the Platform for Advanced Scientific Computing (PASC) Conference 2018, Basel, Switzerland, July 2-4, 2018. Abstract Presentation BibTeX Citation
  16. Christian Engelmann. Characterizing Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 6th Accelerated Data Analytics and Computing (ADAC) Institute Workshop, Zurich, Switzerland, June 20-21, 2018. Abstract Presentation BibTeX Citation
  17. Christian Engelmann. Pattern-based Modeling of Fail-stop and Soft-error Resilience for Iterative Linear Solvers. Invited talk at the 18th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2018, Tokyo, Japan, March 7-10, 2018. Abstract Presentation BibTeX Citation
  18. Christian Engelmann. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale. Invited talk at the 18th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2018, Tokyo, Japan, March 7-10, 2018. Abstract Presentation BibTeX Citation
  19. Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the SIAM Annual Meeting (AM) 2017, Pittsburgh, PA, USA, July 10-14, 2017. Abstract Presentation BibTeX Citation
  20. Christian Engelmann. Characterizing Faults, Errors and Failures in Extreme-Scale Computing Systems. Invited talk at the International Supercomputing Conference (ISC) 2017, Frankfurt am Main, Germany, June 16-22, 2017. Abstract Presentation BibTeX Citation
  21. Christian Engelmann. A Catalog of Faults, Errors, and Failures in Extreme-Scale Systems. Invited talk at the 12th Scheduling for Large Scale Systems Workshop (SLSSW) 2017, Knoxville, TN, USA, May 24-26, 2017. Abstract Presentation BibTeX Citation
  22. Christian Engelmann. The Missing High-Performance Computing Fault Model. Invited talk at the 17th SIAM Conference on Parallel Processing for Scientific Computing (PP) 2016, Paris, France, April 12-15, 2016. Abstract Presentation BibTeX Citation
  23. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the United States Naval Academy, Annapolis, MD, USA, February 18, 2016. Abstract Presentation BibTeX Citation
  24. Christian Engelmann. Toward A Fault Model And Resilience Design Patterns For Extreme Scale Systems. Keynote talk at the 8th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, held in conjunction with the 21st European Conference on Parallel and Distributed Computing (Euro-Par) 2015, Vienna, Austria, August 24-28, 2015. Abstract Presentation BibTeX Citation
  25. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the 19th Workshop on Distributed Supercomputing (SOS) 2015, Park City, UT, USA, March 2-5, 2015. Abstract Presentation BibTeX Citation
  26. Christian Engelmann. xSim: The Extreme-scale Simulator. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 23, 2015. Abstract Presentation BibTeX Citation
  27. Christian Engelmann. Supporting the Development of Resilient Message Passing Applications using Simulation. Invited talk at the Dagstuhl Seminar on Resilience in Exascale Computing, Schloss Dagstuhl, Wadern, Germany, September 28 – October 1, 2014. Abstract Presentation BibTeX Citation
  28. Christian Engelmann. Resilience Challenges and Solutions for Extreme-Scale Supercomputing. Invited talk at the Technical University of Dresden, Dresden, Germany, September 3, 2013. Abstract Presentation BibTeX Citation
  29. Christian Engelmann. Fault Tolerance Session. Invited talk at the ExaChallenge Symposium, Dublin, Ireland, October 16-17, 2012. Presentation BibTeX Citation
  30. Christian Engelmann. High-End Computing Resilience: Analysis of Issues Facing the HEC Community and Path Forward for Research and Development. Invited talk at the Argonne National Laboratory (ANL) Institute of Computing in Science (ICiS) Summer Workshop Week on Addressing Failures in Exascale Computing, Park City, UT, USA, August 4-11, 2012. Abstract Presentation BibTeX Citation
  31. Christian Engelmann. Resilience for Permanent, Transient, and Undetected Errors. Invited talk at the 16th Workshop on Distributed Supercomputing (SOS) 2012, Santa Barbara, CA, USA, March 12-15, 2012. Abstract Presentation BibTeX Citation
  32. Christian Engelmann. Scaling To A Million Cores And Beyond: A Basic Understanding Of The Challenges Ahead On The Road To Exascale. Invited talk at the 1st International Workshop on Extreme Scale Parallel Architectures and Systems (ESPAS) 2012, in conjunction with the 7th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC) 2012, Paris, France, January 24, 2012. Abstract Presentation BibTeX Citation
  33. Christian Engelmann. Resilient Software for ExaScale Computing. Invited talk at the Birds of a Feather Session on Resilient Software for ExaScale Computing at the 24th IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2011, Seattle, WA, USA, November 17, 2011. Abstract Presentation BibTeX Citation
  34. Christian Engelmann. Resilience and Hardware/Software Co-design for Extreme-Scale Supercomputing. Seminar at the Barcelona Supercomputing Center, Barcelona, Spain, July 27, 2011. Abstract Presentation BibTeX Citation
  35. Christian Engelmann. Scalable HPC System Monitoring. Invited talk at the 3rd HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2010, in conjunction with the 3rd Los Alamos Computer Science Symposium (LACSS) 2010, Santa Fe, NM, USA, October 13, 2010. Abstract Presentation BibTeX Citation
  36. Christian Engelmann. Beyond Application-Level Checkpoint/Restart – Advanced Software Approaches for Fault Resilience. Talk at the 39th SPEEDUP Workshop on High Performance Computing, Zurich, Switzerland, September 6, 2010. Presentation BibTeX Citation
  37. Christian Engelmann and Stephen L. Scott. Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond. Talk at the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS) Workshop, in conjunction with the USENIX Federated Conferences Week (USENIX) 2010, Boston, MA, USA, June 22, 2010. Abstract Presentation BibTeX Citation
  38. Christian Engelmann. Resilience Challenges at the Exascale. Talk at the 14th Workshop on Distributed Supercomputing (SOS) 2010, Savannah, GA, USA, March 8-11, 2010. Abstract Presentation BibTeX Citation
  39. Christian Engelmann and Stephen L. Scott. HPC System Software Research at Oak Ridge National Laboratory. Seminar at the Leibniz Rechenzentrum (LRZ), Garching, Germany, February 22, 2010. Abstract Presentation BibTeX Citation
  40. Christian Engelmann. High-Performance Computing Research Internship and Appointment Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, December 14, 2009. Abstract Presentation BibTeX Citation
  41. Christian Engelmann. JCAS – IAA Simulation Efforts at Oak Ridge National Laboratory. Invited talk at the IAA Workshop on HPC Architectural Simulation (HPCAS), Boulder, CO, USA, September 1-2, 2009. Presentation BibTeX Citation
  42. Christian Engelmann. Modeling Techniques Towards Resilience. Invited talk at the National HPC Workshop on Resilience 2009, Arlington, VA, USA, August 12-14, 2009. Presentation BibTeX Citation
  43. Christian Engelmann. System Resilience Research at ORNL in the Context of HPC. Invited talk at the Institut National de Recherche en Informatique et en Automatique (INRIA), Rennes, France, May 15, 2009. Abstract Presentation BibTeX Citation
  44. Christian Engelmann. High-Performance Computing Research and MSc Internship Opportunities at Oak Ridge National Laboratory. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 11, 2009. Abstract Presentation BibTeX Citation
  45. Christian Engelmann. Modular Redundancy for Soft-Error Resilience in Large-Scale HPC Systems. Invited talk at the Dagstuhl Seminar on Fault Tolerance in High-Performance Computing and Grids, Schloss Dagstuhl, Wadern, Germany, May 3-8, 2009. Abstract Presentation BibTeX Citation
  46. Christian Engelmann. Proactive Fault Tolerance Using Preemptive Migration. Invited talk at the 3rd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2009, Cancun, Mexico, April 22-24, 2009. Abstract Presentation BibTeX Citation
  47. Christian Engelmann. Resiliency. Panel at the 13th Workshop on Distributed Supercomputing (SOS) 2009, Hilton Head, SC, USA, March 9-12, 2009. BibTeX Citation
  48. Christian Engelmann. High-Performance Computing Research at Oak Ridge National Laboratory. Invited talk at the Reading Annual Computational Science Workshop, Reading, United Kingdom, December 8, 2008. Abstract Presentation BibTeX Citation
  49. Christian Engelmann. Modular Redundancy in HPC Systems: Why, Where, When and How? Invited talk at the 1st HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC 2008, in conjunction with the 1st Los Alamos Computer Science Symposium (LACSS) 2008, Santa Fe, NM, USA, October 15, 2008. Abstract Presentation BibTeX Citation
  50. Christian Engelmann. Resiliency for High-Performance Computing. Invited talk at the 2nd Collaborative and Grid Computing Technologies Workshop (CGCTW) 2008, Cancun, Mexico, April 10-12, 2008. Abstract Presentation BibTeX Citation
  51. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Seminar at the Laboratoire d'Analyse et d'Architecture des Systèmes, Centre National de la Recherche Scientifique, Toulouse, France, February 11, 2008. Abstract Presentation BibTeX Citation
  52. Christian Engelmann. Service-Level High Availability in Parallel and Distributed Systems. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 10, 2007. Abstract Presentation BibTeX Citation
  53. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Khon Kaen, Thailand, June 8, 2007. Abstract Presentation BibTeX Citation
  54. Christian Engelmann. Advanced Fault Tolerance Solutions for High Performance Computing. Invited talk at the Workshop on Trends, Technologies and Collaborative Opportunities in High Performance and Grid Computing (WTTC) 2007, Bangkok, Thailand, June 4-5, 2007. Abstract Presentation BibTeX Citation
  55. Christian Engelmann. Operating System Research at ORNL: System-level Virtualization. Seminar at the Institute of Graphics and Parallel Processing, Johannes Kepler University, Linz, Austria, April 10, 2007. Abstract Presentation BibTeX Citation
  56. Christian Engelmann. Towards High Availability for High-Performance Computing System Services: Accomplishments and Limitations. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, March 14, 2007. Abstract Presentation BibTeX Citation
  57. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, June 9, 2006. Abstract Presentation BibTeX Citation
  58. Stephen L. Scott and Christian Engelmann. Advancing Reliability, Availability and Serviceability for High-Performance Computing. Seminar at the Institute of Graphics and Parallel Processing, Johannes Kepler University, Linz, Austria, April 19, 2006. Abstract Presentation BibTeX Citation
  59. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, October 18, 2005. Abstract Presentation BibTeX Citation
  60. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Mathematics and Computer Science, Fayetteville State University, Fayetteville, NC, USA, September 26, 2005. Abstract Presentation BibTeX Citation
  61. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Department of Computer Science, University of Reading, Reading, United Kingdom, May 13, 2005. Abstract Presentation BibTeX Citation
  62. Christian Engelmann. High Availability for Ultra-Scale High-End Scientific Computing. Seminar at the Center for Entrepreneurship and Information Technology, Louisiana Tech University, Ruston, LA, USA, April 15, 2005. Abstract Presentation BibTeX Citation
  63. Christian Engelmann. Diskless Checkpointing on Super-scale Architectures – Applied to the Fast Fourier Transform. Invited talk at the 11th SIAM Conference on Parallel Processing for Scientific Computing (SIAM PP) 2004, San Francisco, CA, USA, February 25, 2004. Abstract Presentation BibTeX Citation
  64. Christian Engelmann. Super-scalable Algorithms – Next Generation Supercomputing on 100,000 and more Processors. Seminar at the Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA, January 29, 2004. Abstract Presentation BibTeX Citation
  65. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Seminar at the Department of Computer Science, North Carolina State University, Raleigh, NC, USA, February 11, 2004. Abstract Presentation BibTeX Citation

Co-advised Theses

  1. Ian S. Jones. Simulation of Large Scale Architectures on High Performance Computers. Master’s thesis, Department of Computer Science, University of Reading, UK, October 22, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  2. Swen Böhm. Development of a RAS Framework for HPC Environments: Realtime Data Reduction of Monitoring Data. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  3. Frank Lauer. Simulation of Advanced Large-Scale HPC Architectures. Master’s thesis, Department of Computer Science, University of Reading, UK, March 12, 2010. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  4. Antonina Litvinova. RAS Framework Engine Prototype. Master’s thesis, Department of Computer Science, University of Reading, UK, September 22, 2009. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory); George Bosilca (University of Tennessee, Knoxville). Abstract Publication Presentation BibTeX Citation
  5. Björn Könning. Virtualized Environments for the Harness Workbench. Master’s thesis, Department of Computer Science, University of Reading, UK, March 14, 2007. Thesis research performed at Oak Ridge National Laboratory. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  6. Matthias Weber. High Availability for the Lustre File System. Master’s thesis, Department of Computer Science, University of Reading, UK, March 14, 2007. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  7. Ronald Baumann. Design and Development of Prototype Components for the Harness High-Performance Computing Workbench. Master’s thesis, Department of Computer Science, University of Reading, UK, March 6, 2006. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  8. Kai Uhlemann. High Availability for High-End Scientific Computing. Master’s thesis, Department of Computer Science, University of Reading, UK, March 6, 2006. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist and Christian Engelmann (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation

Theses

  1. Christian Engelmann. Symmetric Active/Active High Availability for High-Performance Computing System Services. PhD thesis, Department of Computer Science, University of Reading, UK, December 8, 2008. Thesis research performed at Oak Ridge National Laboratory. Advisor: Prof. Vassil N. Alexandrov (University of Reading). Abstract Publication Presentation BibTeX Citation
  2. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Master’s thesis, Department of Computer Science, University of Reading, UK, July 7, 2001. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany. Advisors: Prof. Vassil N. Alexandrov (University of Reading); George A. (Al) Geist (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation
  3. Christian Engelmann. Distributed Peer-to-Peer Control for Harness. Master’s thesis, Department of Engineering I, Technical College for Engineering and Economics (FHTW) Berlin, Germany, February 23, 2001. Thesis research performed at Oak Ridge National Laboratory. Double diploma in conjunction with the Department of Computer Science, University of Reading, UK. Advisors: Prof. Uwe Metzler (Technical College for Engineering and Economics (FHTW) Berlin); George A. (Al) Geist (Oak Ridge National Laboratory). Abstract Publication Presentation BibTeX Citation

Symbols: Abstract, Publication, Presentation, BibTeX Citation