dc.contributor.author Moye, Roger en
dc.date.accessioned 2016-11-14T16:17:27Z en
dc.date.available 2016-11-14T16:17:27Z en
dc.date.issued 2016-11-14 en
dc.identifier.uri en
dc.description.abstract Maintaining a high-performance computing (HPC) infrastructure in an academic research environment is a daunting task. With lean budgets and limited staff, the need for a self-healing cluster becomes all the more important. It is possible to achieve nearly 100% uptime on HPC compute nodes by using job scheduler features that pre-emptively terminate jobs before they cause problems on HPC systems, or that prevent new jobs from running when a potential problem already exists. This frees systems administrators to work on tasks other than cluster recovery. en
dc.language.iso en_US en
dc.relation.ispartofseries HPCSYSPROS16; 1 en
dc.rights This content is released under the Creative Commons Attribution 3.0 Unported license. This license includes the following terms: You are free to share (to copy, distribute and transmit the work) and to remix (to adapt the work) under the following conditions: attribution - you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work. en
dc.rights.uri en
dc.subject HPC; Job Schedulers; Compute Node Availability; System Uptime en
dc.title Increasing HPC Resiliency Leads to Greater Productivity en
dc.type Article en
