HPC System Acceptance: Controlled Chaos
(2016-11-14) Peltz Jr., Paul; Fields, Parks
Over the last six decades, Los Alamos National Laboratory (LANL) has acquired, accepted, and integrated over 100 new HPC systems, from MANIAC in 1952 to Trinity in 2016. These systems range from small clusters to large supercomputers. Each type of system has its own challenges and having a well established and proven test, acceptance, and integration plan is valuable to the site and vendor to expedite the process. The topic of systems acceptance itself is quite broad, and for the purposes of this paper, it will be mostly focused on the system’s software and hardware components. Some discussion will be given to performance testing as well, but the purpose of this paper is to help HPC System Administrators with the acceptance process.

Cluster Computing with OpenHPC
(2016-11-14) Schulz, Karl W.; Baird, C. Reese; Brayford, David; Georgiou, Yiannis; Kurtzer, Gregory M.; Simmel, Derek; Sterling, Thomas; Sundararajan, Nirmala; Van Hensbergen, Eric
OpenHPC is a newly formed, community-based project that is providing an integrated collection of HPC-centric software components that can be used to implement a full-featured reference HPC compute resource. Components span the entire HPC software ecosystem including provisioning and system administration tools, resource management, I/O services, development tools, numerical libraries, and performance analysis tools. Common clustering tools and scientific libraries are distributed as pre-built and validated binaries and are meant to seamlessly layer on top of existing Linux distributions. The architecture of OpenHPC is intentionally modular to allow end users to pick and choose from the provided components, as well as to foster a community of open contribution. This paper presents an overview of the underlying community vision, governance structure, packaging conventions, build and release infrastructure and validation methodologies.

Blue Waters Resource Management and Job Scheduling Best Practices
(2016-11-14) Islam, Sharif; Bode, Brett; Enos, Jeremy
This paper describes resource management and job scheduling best practices learned from operating Blue Waters (a petascale Cray XE+XK supercomputer with 26,864 compute nodes) since April 2013. We will describe various aspects of such operation while focusing on the challenges experienced while maintaining a large, shared computational resource such as Blue Waters.

Account Management of a Large-Scale HPC Resource
(2016-11-14) Bode, Brett; Bouvet, Tim; Enos, Jeremy; Islam, Sharif
Blue Waters is the largest system that Cray has built, and it operates in a very open network environment. This paper will discuss the design of the Blue Waters logical administrative network and how that design provides a secure and reliable environment that separates the user and administrative access paths. The paper will then describe how accounts and other user and project information are provisioned efficiently across its 27,000+ nodes.

Increasing HPC Resiliency Leads to Greater Productivity
(2016-11-14) Moye, Roger
Maintaining a high-performance computing (HPC) infrastructure in an academic research environment is a daunting task. Coupled with lean budgets and limited staff, the need for a self-healing cluster becomes all the more important. It is possible to achieve nearly 100% uptime on HPC compute nodes by utilizing job scheduling features that will pre-emptively terminate jobs before they cause problems on HPC systems, or prevent new jobs from running should a potential problem already exist, thereby freeing up time for the systems administrators to work on tasks other than cluster recovery.