Stay Up to Date with HPCC System Status
Staying up to date with HPCC system status helps users ensure a smooth workflow and determine the cause of issues that can arise during maintenance.
What does system downtime mean for HPCC users? The announcement for ICER’s most recent downtime, on August 17th, 2021, stated:
All interactive access will be disabled (including SSH, OpenOnDemand, Globus Online endpoints, and SMB), and no jobs that would overlap this maintenance window will run.
Interactive access is not affected before the scheduled downtime. However, ICER received tickets from users asking why their jobs could not run up to seven days before the downtime. They reported that their jobs’ status was “PENDING” and that the reason given by Slurm was “(ReqNodeNotAvail, Reserved for maintenance)”. This happens because the time these jobs requested overlapped with the system downtime. Slurm holds such jobs until after the scheduled downtime and only starts jobs that can finish before it begins, which avoids forcibly terminating unfinished jobs when the downtime starts. In this situation, users do not need to do anything: the held jobs will be scheduled and started once the system update is complete and the Slurm scheduler has resumed.
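The overlap rule described above can be illustrated with a short Python sketch. This is a simplified model of the decision, not Slurm's actual implementation. The function names, the 06:00 maintenance start time, and the example submission time are all hypothetical values chosen for illustration; only the August 17th, 2021 date comes from the announcement.

```python
from datetime import datetime, timedelta


def overlaps_maintenance(now, walltime, maintenance_start):
    """Return True if a job started at `now` with the requested `walltime`
    would still be running when the maintenance window begins."""
    return now + walltime > maintenance_start


def max_walltime(now, maintenance_start):
    """Longest walltime a job started at `now` could request and still
    finish before maintenance begins (zero if maintenance has started)."""
    return max(maintenance_start - now, timedelta(0))


# Hypothetical values: the start hour and the submission time are assumptions.
maintenance_start = datetime(2021, 8, 17, 6, 0)
now = datetime(2021, 8, 12, 9, 0)

short_job = timedelta(hours=4)  # finishes well before the downtime
long_job = timedelta(days=7)    # would still be running on August 17th

print(overlaps_maintenance(now, short_job, maintenance_start))  # False
print(overlaps_maintenance(now, long_job, maintenance_start))   # True
print(max_walltime(now, maintenance_start))                     # 4 days, 21:00:00
```

Under these assumptions, the four-hour job could start right away, while the seven-day job would stay pending until after the downtime. A user who needs a job to start before the downtime could, in this model, request a walltime no longer than the value returned by `max_walltime`. On a Slurm cluster, the actual reservation window can usually be inspected with `scontrol show reservation`.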
As a friendly reminder, users can check the system status announcements to see when the system becomes available after the update. Beyond scheduled downtime, users can also check system status whenever they observe abnormal behavior in interactive access or job scheduling. The three channels for system status announcements are listed below: