SLURM Job Failure Due to the "oom" Error
Many users have asked us how to deal with error messages such as:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=xxxxx.batch cgroup.
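In the above message, oom stands for "out of memory". It means that your job used more memory than it was allocated, so the kernel killed one or more of its processes. To fix this, request more memory in your job script via --mem (memory per node) or --mem-per-cpu (memory per allocated CPU). The two options are mutually exclusive, so use only one of them.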
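As an illustration, a job script with an explicit memory request might look like the sketch below. The job name, CPU count, and memory sizes are placeholder values chosen for this example, not recommendations; adjust them to your own workload.

```shell
#!/bin/bash
#SBATCH --job-name=oom-example   # placeholder job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # placeholder CPU count
#SBATCH --mem=8G                 # total memory per node (placeholder size)
# Alternative to --mem (do not combine the two):
##SBATCH --mem-per-cpu=2G        # memory per allocated CPU

# Inside a job, SLURM exports the --mem request (in MB) as SLURM_MEM_PER_NODE.
echo "Memory requested per node (MB): ${SLURM_MEM_PER_NODE:-not set}"

# Your actual workload would be launched here.
```

With --mem-per-cpu, the total memory scales with the number of CPUs you request, which can be convenient when the memory need grows with the number of threads.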
To estimate a suitable memory request for your next submission, it helps to look at the resource usage of the failed job. To do so, you can use the following powertools command:
js -j <job ID>
In the output, look for the "MaxRSS" entry, which reports the maximum amount of memory used by your job. Set your new memory request somewhat above this value so that the job has some headroom.
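The arithmetic of turning a MaxRSS value into a new request can be sketched with a small shell helper. The function name suggest_mem_mb and the default 20% headroom are assumptions made for this example, not a site policy. It also assumes the MaxRSS value carries a K, M, or G suffix, as in the output of the standard SLURM accounting command shown in the comment.

```shell
# The same MaxRSS field is also available from the standard SLURM tool:
#   sacct -j <job ID> --format=JobID,JobName,ReqMem,MaxRSS,State

# Hypothetical helper: convert a MaxRSS value such as "3500000K" into a
# --mem value in megabytes, with extra headroom (default 20%, an assumption).
suggest_mem_mb() {
    local maxrss="$1"
    local headroom_percent="${2:-20}"
    awk -v v="$maxrss" -v h="$headroom_percent" 'BEGIN {
        unit = toupper(substr(v, length(v), 1))
        num  = substr(v, 1, length(v) - 1)
        if      (unit == "K") mb = num / 1024
        else if (unit == "M") mb = num
        else if (unit == "G") mb = num * 1024
        else { print "expected a K, M, or G suffix" > "/dev/stderr"; exit 1 }
        need = mb * (1 + h / 100)
        rounded = int(need)
        if (rounded < need) rounded++      # round up to a whole megabyte
        printf "%dM\n", rounded
    }'
}

suggest_mem_mb 3500000K    # prints 4102M
```

The printed value can be used directly in the job script, for example as --mem=4102M. Because memory usage is sampled periodically, MaxRSS can underestimate a short spike, so a job that was killed very quickly may need a larger margin than this default.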