Increasing HPC Resiliency Leads to Greater Productivity

Loading...
Thumbnail Image

Other Version

External File or Record

Can’t use the file because of accessibility barriers? Contact us

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Maintaining a high-performance computing (HPC) infrastructure in an academic research environment is a daunting task. Coupled with lean budgets and limited staff, the need for a self-healing cluster becomes all the more important. It is possible to achieve nearly 100% uptime on HPC compute nodes by utilizing job scheduling features that will pre-emptively terminate jobs before they cause problems on HPC systems, or prevent new jobs from running should a potential problem already exist, thereby freeing up time for the systems administrators to work on tasks other than cluster recovery.

Series and Number:

HPCSYSPROS16; 1

EducationalLevel:

Is Based On:

Target Name:

Teaches:

Table of Contents

Description

Citation

Journal

DOI

Rights

This content is released under the Creative Commons Attribution 3.0 Unported license (http://creativecommons.org/licenses/by/3.0/). This license includes the following terms: You are free to share - to copy, distribute and transmit the work and to remix - to adapt the work under the following conditions: attribution - you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). For any reuse or distribution, you must make clear to others the license terms of this work.