A Checkpoint and Restart Service Specification for Open MPI

dc.contributor.author: Hursey, Joshua; Squyres, Jeffrey; Lumsdaine, Andrew
dc.date.accessioned: 2025-11-12T21:07:18Z
dc.date.available: 2025-11-12T21:07:18Z
dc.date.issued: 2006-07
dc.description.abstract: HPC systems are growing in both complexity and size, increasing the opportunity for system failures. Checkpoint and restart techniques are one of many fault tolerance techniques developed for such adverse runtime conditions. Because of the variety of available approaches for checkpoint and restart, HPC system libraries, such as MPI, seeking to incorporate these techniques would benefit greatly from a portable, extensible checkpoint and restart framework. This paper presents a specification for such a framework in Open MPI that allows for the integration of a variety of checkpoint/restart systems and protocols. The modular design of the framework allows researchers to contribute to specialized areas without requiring knowledge of the entirety of the code base.
dc.identifier.uri: https://hdl.handle.net/2022/34474
dc.relation.ispartofseries: Indiana University Computer Science Technical Reports; TR635
dc.rights: This work is protected by copyright unless stated otherwise.
dc.rights.uri:
dc.title: A Checkpoint and Restart Service Specification for Open MPI

Files

Original bundle

Name: TR635.pdf
Size: 166.18 KB
Format: Adobe Portable Document Format