Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for both large and small Linux clusters. It provides the combined functionality of Moab and Torque: queues, accounts, reservations, limits, preemption, job priority, and many other facets of those scheduling systems have near-identical counterparts in Slurm.
There are a few key differences to be aware of between Moab/Torque and Slurm:
- Terminology
- This is largely the same with a few key differences
- Moab Queues are referred to as Partitions in Slurm
- PBS parameters in Moab job scripts are analogous to SBATCH parameters in Slurm
- Scheduler Policy
...
- Resource requests are enforced through cgroups
- Nodes, cores/tasks, memory, walltime, account, and queue information must be specified in job scripts
- No default values are set for these resources
- "Burst" job submission is simplified in Slurm
- One central burst queue is used, and user account and QOS no longer need to be specified in your job script
- Commands
- Functions from multiple Moab/Torque commands are typically combined in Slurm commands
- `qsub -> sbatch`
- `qsub/pbsdsh -> srun`
- `qstat/showq -> squeue`
- `checknode/showbf -> sinfo`
- `checkjob/mschedctl -> scontrol`
- Command Examples
- Here are a few examples of equivalent commands between the two schedulers
- `qsub test.sh -> sbatch test.sh`
- `showq -u <uid> -> squeue -u <uid>`
- `checkjob <job_id> -> scontrol show job <job_id>`
- `showbf -f gpu -> sinfo -p gpu`
- `qsub -I -A cades-birthright -W group_list=birthright -q gpu -> srun -A birthright -p gpu --pty /bin/bash`
## Slurm Challenge 1
### Updating Job Scripts - Adapting PBS Scripts
These modifications must be made to existing PBS job scripts to make them compatible with Slurm (a minimal sketch of an adapted script follows the list):
- `$PBS_O_WORKDIR -> $SLURM_SUBMIT_DIR`
  - Environment variables such as `PBS_O_WORKDIR` will need to be replaced with their Slurm equivalents, or defined manually
- `-A birthright-burst -> -A birthright`
  - The account used to submit the job may or may not need to be updated. Valid Slurm account names can be found using this command:
  - `sacctmgr show assoc where user=<uid> format=account`
- `-q gpu -> -q gpu`
  - The queue that the job is submitted to may need to be updated. Valid queue names can be found with the `sinfo` command.
- `-l walltime=<time>`
  - A maximum walltime request is required
- `-l mem=<number>[unit]`
  - A memory request is required
- `-l nodes=1:ppn=1`
  - Nodes and ppn requests are required
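For reference, the sketch below shows roughly what an adapted “Hello World” script could look like after these changes. The account, queue, and resource values are illustrative placeholders rather than the provided solution; note that `sbatch` will normally also interpret `#PBS` directives, which is what makes this minimal-change approach possible.
```
#!/bin/bash
#PBS -A birthright
#PBS -q gpu
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -l mem=10gb

# $PBS_O_WORKDIR is not set by Slurm; use the Slurm equivalent instead
cd $SLURM_SUBMIT_DIR

echo "Hello World"
```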
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. This directory contains ex1_job_script.pbs, an example “Hello World” PBS job script
3. Make a copy of the example script and name it ex1_job_script.sbatch
4. Using the previous slide as a reference, update the job script to run under Slurm
- Add a walltime request of 10 minutes to the script
- Add a memory request of 10 gigabytes to the script
- Change the queue name from gpu to testing
- The testing queue is a limited queue for short-running test jobs
5. After updating the script, try to submit it using this command:
```
sbatch ex1_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
The solution to this challenge is available in the solutions folder under ex1_job_script.sbatch
## Slurm Challenge 2
### Updating Job Scripts - Rewriting PBS Scripts
The recommended method of adapting PBS job scripts to Slurm is to re-write them using SBATCH parameters. The example below shows how to translate the most frequently used parameters:
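A representative translation is sketched below; the option names are standard PBS and Slurm, while the job name, account, partition, email address, and resource values are illustrative placeholders.
```
# PBS directive                  # SBATCH equivalent
#PBS -N my_job                   #SBATCH -J my_job
#PBS -A birthright               #SBATCH -A birthright
#PBS -q gpu                      #SBATCH -p gpu
#PBS -l nodes=2:ppn=16           #SBATCH --nodes=2
                                 #SBATCH --ntasks-per-node=16
#PBS -l walltime=01:00:00        #SBATCH --time=01:00:00
#PBS -l mem=32gb                 #SBATCH --mem=32g
#PBS -M <your_email>             #SBATCH --mail-user=<your_email>
#PBS -m abe                      #SBATCH --mail-type=ALL
```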
1. Switch to your terminal that is logged in to or-slurm-login.ornl.gov
2. This directory contains ex2_job_script.pbs, an example PBS job script for running Quantum Espresso
3. Make a copy of the example script and name it ex2_job_script.sbatch
4. Using the previous two slides as a reference, translate the job script from PBS to SBATCH
5. After converting the job script, try to submit it using this command:
```
sbatch ex2_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
About this challenge:
- Quantum Espresso is a suite of electronic-structure calculation and materials modeling tools. The job script and data files used in the challenge are slightly modified for this training, but are meant to demonstrate how you could run these programs on the CADES condos for a real production run.
### Run a program across multiple cores and multiple nodes
1. Switch to your terminal that is logged in to or-slurm-login.ornl.gov
2. Update the `-A`, `-p`, and `--mail-user` values in the script and submit it using this command: `sbatch ex2_job_script.sbatch`
3. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
About this challenge:
- The previous example demonstrated how to run a job on one core and one node. This example extends the previous one to run a program across multiple cores and multiple nodes, and also demonstrates a few of the environment variables Slurm provides; a minimal sketch follows.
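The sketch below shows the general shape of such a multi-node script. It is not the provided Quantum Espresso solution: the account, partition, resource values, and the `./my_mpi_program` executable are placeholders.
```
#!/bin/bash
#SBATCH -A birthright
#SBATCH -p gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem=32g
#SBATCH --time=01:00:00
#SBATCH --mail-user=<your_email>

cd $SLURM_SUBMIT_DIR

# A few of the environment variables Slurm sets for every job
echo "Job ID:      $SLURM_JOB_ID"
echo "Node list:   $SLURM_JOB_NODELIST"
echo "Node count:  $SLURM_JOB_NUM_NODES"
echo "Total tasks: $SLURM_NTASKS"

# srun launches one copy of the program per allocated task, across all nodes
srun ./my_mpi_program
```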
The solution to this challenge is available in the solutions folder under ex2_job_script.sbatch
## Slurm Challenge 3
### Updating Job Scripts - Job Arrays
Jobs that use job arrays can easily be adapted to work in Slurm. An example PBS job array script is sketched below:
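This sketch uses illustrative account, queue, and resource values, and a hypothetical input file naming scheme.
```
#!/bin/bash
#PBS -A birthright
#PBS -q gpu
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
#PBS -l mem=1gb
# Run ten copies of this job, with array indices 1 through 10
#PBS -t 1-10

cd $PBS_O_WORKDIR
echo "Processing input_${PBS_ARRAYID}.dat"
```
In Slurm, `#PBS -t 1-10` becomes `#SBATCH --array=1-10`, and the per-task index variable `$PBS_ARRAYID` becomes `$SLURM_ARRAY_TASK_ID`.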
1. Update the `-A`, `-p`, and `--mail-user` values in the script and submit it using this command:
```
sbatch ex3_job_script.sbatch
```
2. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
3. Observe the output
The solution to this challenge is available in the solutions folder under ex3_job_script.sbatch