Checklist to Improve Your Job’s Scheduling Time
Here is a checklist to help improve your job’s scheduling time:
1. Run the command “sq <NetID>” or “squeue -u <NetID>” to show your jobs in the queue. If a job is Pending, the last column shows the reason provided by SLURM. Here are common reasons and their meanings:
- a. Priority: The job is queued behind higher-priority jobs. For information on how job priority is computed, review Job Priority Factors.
- b. Resources: The job is waiting for resources to become available. You can use the powertools command “node_status” to see the current status of each node. For an overview of resources, see Cluster Resources.
- c. Dependency: The job is waiting for the jobs it depends on to complete.
- d. JobHeldAdmin: The job is held by a system administrator. Contact a system administrator for assistance using the Contact Forms.
- e. JobHeldUser: The job is held by you. Run “scontrol release <job_id>” to release the hold.
- f. JobArrayTaskLimit: You have reached the job array task limit set by the fairshare policy. The job waits for other array tasks to complete.
- g. QOSMaxCpuPerUserLimit: You have reached the maximum number of CPUs per user allowed by the fairshare policy.
- h. QOSMaxJobsPerUserLimit: You have reached the maximum number of jobs per user allowed by the fairshare policy.
- i. BadConstraints: The job’s constraints cannot be satisfied. Constraints (e.g., a particular node type, number of nodes, or CPUs per node) control where and how your job runs. Check the job script to ensure the requested resources actually exist.
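The pending-reason check in step 1 can be sketched as follows. The sample output below is hypothetical (real output comes from running “squeue -u $USER” on the cluster); it is included only so the parsing step can be shown without a live SLURM installation:

```shell
# Hypothetical sample of `squeue -u $USER` output; for pending (PD)
# jobs, the reason appears in the last column:
sample='JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
123456 general job1 netid PD 0:00 1 (Priority)
123457 general job2 netid PD 0:00 2 (Resources)'

# Print the job ID and pending reason for each PD job:
reasons=$(echo "$sample" | awk '$5 == "PD" {print $1, $NF}')
echo "$reasons"
```

On the cluster, “scontrol show job <job_id>” gives a more detailed view of a single job, including its Reason field.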
2. Jobs submitted during peak usage times may wait longer. Check the ICER Dashboard to see the current system load, and compare your expected wait against the “Queue Times” data, which reports the average queue time of the last 100 completed jobs of similar type; adjust your expectations accordingly.
3. Review the HPCC fairshare policy. If your jobs wait in the queue longer than another user’s similar jobs, the cause may be your recent heavy resource usage. You can find the FAIRSHARE contribution to a job’s priority by running the command "sprio -u $USER" and comparing the fairshare values.
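The fairshare comparison can be sketched as follows. The sample output is hypothetical, and the actual column layout of “sprio” varies by site configuration (see “man sprio”):

```shell
# Hypothetical sample of `sprio -u $USER` output; column order is
# site-specific, so check the header on your cluster:
sample='JOBID PARTITION USER PRIORITY AGE FAIRSHARE JOBSIZE QOS
123456 general netid 11475 1000 475 10000 0'

# Pull out the FAIRSHARE column (6th in this sample); a low value
# relative to other users suggests heavy recent usage:
fairshare=$(echo "$sample" | awk 'NR > 1 {print $6}')
echo "fairshare contribution: $fairshare"
```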
4. Here are some tricks that may help jobs get scheduled faster:
- a. Request a walltime under 4 hours. Short jobs can be scheduled on buyin nodes as they become available.
- b. Avoid requesting high-demand resources. High demand for certain resources, such as GPUs, increases queue times due to limited availability. Weigh the trade-off between queue time and execution time when deciding whether a job should run with or without a GPU.
- c. Avoid overestimating needed resources. A more accurate estimate of the required time, number of CPUs and GPUs, and gigabytes of memory can reduce queue times, since larger jobs are harder to schedule.
- d. Avoid unnecessary constraints in the job script. Fewer constraints give SLURM more flexibility in finding available resources, leading to faster scheduling.
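A job script combining these tips might look like the following sketch. All resource values and the program name are placeholders, not recommendations for any particular workload:

```shell
#!/bin/bash
# Example job script applying the tips above (all values are illustrative):
#SBATCH --time=03:59:00      # under 4 hours, so buyin nodes are eligible
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4    # request only what the program actually uses
#SBATCH --mem=8G             # based on measured usage, not a high guess
# No --constraint or --gres=gpu lines unless the job truly needs them.

srun ./my_program            # placeholder executable
```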