Commit 843c2be0 authored by Dietz, Colin

Added Slurm examples and solutions

parent 689bd290
This has instructions for the Slurm QE challenges.
# Slurm Challenges
## What is Slurm?
Slurm is a job scheduler and resource management program with the combined functionality of Moab and Torque. Queues, accounts, reservations, limits, preemption, job priority, and many other facets of scheduling in Moab and Torque work nearly identically in Slurm.
There are a few key differences to be aware of between Moab/Torque and Slurm:
* Terminology
** Terminology is largely the same, with a few key differences
*** Moab Queues are referred to as Partitions in Slurm
*** PBS parameters in Moab job scripts are analogous to SBATCH parameters in Slurm
* Scheduler Policy
** Resources that are not requested are not allocated
*** Resource requests are enforced through cgroups
** Nodes, cores/tasks, memory, walltime, account, and queue information must be specified in job scripts (see the minimal script sketch after this list)
*** No default values are set for these resources
** "Burst" job submission is simplified in Slurm
*** One central burst queue is used, and a separate burst account and QOS no longer need to be specified in your job script
* Commands
** Functions from multiple Moab/Torque commands are typically combined in Slurm commands
*** `qsub -> sbatch`
*** `qsub/pbsdsh -> srun`
*** `qstat/showq -> squeue`
*** `checknode/showbf -> sinfo`
*** `checkjob/mschedctl -> scontrol`
* Command Examples
** Here are a few examples of equivalent commands between the two schedulers:
*** `qsub test.sh -> sbatch test.sh`
*** `showq -u <uid> -> squeue -u <uid>`
*** `checkjob <job_id> -> scontrol show job <job_id>`
*** `showbf -f gpu -> sinfo -p gpu`
*** `qsub -I -A cades-birthright -W group_list=birthright -q gpu -> srun -A birthright -p gpu --pty /bin/bash`
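
To make the "no defaults" policy concrete, below is a minimal sketch of a Slurm job script that supplies every required request. The account and partition names are placeholders; substitute the values reported by `sacctmgr` and `sinfo` on your condo.

```
#!/bin/bash
#SBATCH -J minimal-example       # job name
#SBATCH -A <account>             # account (placeholder; list yours with sacctmgr)
#SBATCH -p <partition>           # partition, i.e. queue (placeholder; list with sinfo)
#SBATCH -N 1                     # number of nodes
#SBATCH -n 32                    # number of tasks
#SBATCH -c 1                     # cores per task
#SBATCH --mem=10g                # memory per node
#SBATCH -t 00:10:00              # maximum walltime

cd $SLURM_SUBMIT_DIR
srun hostname                    # placeholder workload; prints the allocated node's name
```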
## Slurm Challenge 1
### Updating Job Scripts - Adapting PBS Scripts
These modifications must be made to existing PBS job scripts to make them compatible with Slurm:
* `$PBS_O_WORKDIR -> $SLURM_SUBMIT_DIR`
** Environment variables such as PBS_O_WORKDIR will need to be replaced with Slurm equivalents, or defined manually
* `-A birthright-burst -> -A birthright`
** The account used to submit the job may or may not need to be updated. Valid Slurm account names can be found using this command:
*** `sacctmgr show assoc where user=<uid> format=account`
* `-q gpu -> -q gpu`
** The queue that the job is submitted to may need to be updated. Valid queue names can be found with the `sinfo` command; see the lookup examples after this list.
* `-l walltime=<time>`
** A maximum walltime request is required
* `-l mem=<number>[unit]`
** A memory request is required
* `-l nodes=1:ppn=1`
** Nodes and ppn requests are required
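
The account and queue lookups referenced above can be run directly on the login node; `<uid>` is a placeholder for your user ID:

```
# List the Slurm accounts your user is associated with
sacctmgr show assoc where user=<uid> format=account

# List all partitions (queues) and their current state
sinfo

# Show a single partition, e.g. the gpu partition
sinfo -p gpu
```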
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example1/
```
** This directory contains ex1_job_script.pbs, an example “Hello World” PBS job script
3. Make a copy of the example script and name it ex1_job_script.sbatch
4. Using the previous section as a reference, update the job script to run under Slurm
** Add a walltime request of 10 minutes to the script
** Add a memory request of 10 gigabytes to the script
** Change the queue name from gpu to testing
*** The testing queue is a limited queue for short-running test jobs
5. After updating the script, try to submit it using this command:
```
sbatch ex1_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
The solution to this challenge is available in the solutions folder under ex1_job_script.sbatch
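
Once `sbatch` accepts the script, you can confirm the job ran using the commands introduced earlier; `<uid>` and `<job_id>` are placeholders:

```
# Check whether the job is pending or running
squeue -u <uid>

# Show the job's details (state, allocated resources, exit code)
scontrol show job <job_id>

# By default, Slurm writes the job's output to slurm-<job_id>.out in the submit directory
cat slurm-<job_id>.out
```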
## Slurm Challenge 2
### Updating Job Scripts - Rewriting PBS Scripts
The recommended method of adapting PBS job scripts to Slurm is to re-write them using SBATCH parameters. The example below shows how to translate the most frequently used parameters:
```
#!/bin/bash ............................. #!/bin/bash
#PBS -N hello-world-example ............. #SBATCH -J hello-world-example
#PBS -A birthright-burst ................ #SBATCH -A birthright
#PBS -W group_list=cades-birthright .....
#PBS -l qos=burst .......................
#PBS -q batch ........................... #SBATCH -p burst
                                          #SBATCH -N 1
#PBS -l nodes=1:ppn=32 .................. #SBATCH -n 32
                                          #SBATCH -c 1
#PBS -l mem=10g ......................... #SBATCH --mem=10g
#PBS -l walltime=00:10:00 ............... #SBATCH -t 00:10:00
echo "Hello World" ...................... echo "Hello World"
```
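
Reading down the right-hand column, the fully translated script looks like this (your account and partition may differ):

```
#!/bin/bash
#SBATCH -J hello-world-example
#SBATCH -A birthright
#SBATCH -p burst
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=10g
#SBATCH -t 00:10:00

echo "Hello World"
```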
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example2/
```
** This directory contains ex2_job_script.pbs, an example PBS job script for running Quantum Espresso
3. Make a copy of the example script and name it ex2_job_script.sbatch
4. Using the previous two sections as a reference, translate the job script from PBS to SBATCH
5. After converting the job script, try to submit it using this command:
```
sbatch ex2_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
About this challenge:
* Quantum Espresso is a suite of electronic-structure calculation and materials modeling tools. The job script and data files used in the challenge are slightly modified for this training, but they are meant to demonstrate how you could run these programs on the CADES condos for a real production run.
The solution to this challenge is available in the solutions folder under ex2_job_script.sbatch
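
Note that only the scheduler directives need translating; the body of the script carries over essentially unchanged. For this example, the body loads the Quantum Espresso environment modules used in this training and launches `pw.x` under MPI, roughly as follows:

```
module purge
module load PE-intel
module load QE

cd $SLURM_SUBMIT_DIR
mpirun pw.x -in ../data/in
```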
## Slurm Challenge 3
### Updating Job Scripts - Job Arrays
Job scripts that use job arrays can easily be adapted to work in Slurm. An example PBS job array script is given below:
```
#!/bin/bash
#PBS -N spring-training-ex3
#PBS -A birthright-burst
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
#PBS -t 0-1%2
module purge
module load PE-intel
module load QE
cd $PBS_O_WORKDIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$PBS_ARRAYID]}"
```
* Important Notes
** Slurm uses the `#SBATCH -a` parameter to specify job arrays, and it uses the same syntax as PBS for the array index range and slot limit (see the sketch after this list)
** The Slurm equivalent of `PBS_ARRAYID` is `SLURM_ARRAY_TASK_ID`
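
Putting those notes together with the translations from Challenge 2, a Slurm version of the script above would look roughly like the following sketch (the burst partition is taken from the earlier translation example and may differ on your condo):

```
#!/bin/bash
#SBATCH -J spring-training-ex3
#SBATCH -A birthright
#SBATCH -p burst
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=10g
#SBATCH -t 00:10:00
#SBATCH -a 0-1%2

module purge
module load PE-intel
module load QE

cd $SLURM_SUBMIT_DIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$SLURM_ARRAY_TASK_ID]}"
```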
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example3/
```
** This directory contains ex3_job_script.pbs, an example PBS job script for running Quantum Espresso with a job array
3. Make a copy of the example script and name it ex3_job_script.sbatch
4. Using the previous section, as well as the solution to Slurm Challenge 2, translate the ex3_job_script.sbatch job script from PBS to SBATCH
5. After converting the job script, try to submit it using this command:
```
sbatch ex3_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
The solution to this challenge is available in the solutions folder under ex3_job_script.sbatch
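
After submitting, each array index shows up as its own task; a quick way to watch them and find their output (`<uid>` is a placeholder):

```
# Each array task appears in the queue as <job_id>_<index>
squeue -u <uid>

# By default, each task writes its output to slurm-<job_id>_<index>.out
ls slurm-*_*.out
```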
@@ -7,11 +7,13 @@
nstep = 2
tstress = .false.,
tprnfor = .false.,
outdir = './',
wfcdir = './',
pseudo_dir = '../data/pseudo/'
! Change the value of outdir to your Lustre directory to save the output data
outdir = '/dev/shm'
! Delete the disk_io line in a real run to use the default setting
disk_io = 'none'
max_seconds = 120
! Set max_seconds to several minutes before your walltime to ensure all output is saved
max_seconds = 60
/
&system
ibrav = 0
......
@@ -7,11 +7,13 @@
nstep = 2
tstress = .false.,
tprnfor = .false.,
outdir = './',
wfcdir = './',
pseudo_dir = '../data/pseudo/'
! Change the value of outdir to your Lustre directory to save the output data
outdir = '/dev/shm'
! Delete the disk_io line in a real run to use the default setting
disk_io = 'none'
max_seconds = 120
! Set max_seconds to several minutes before your walltime to ensure all output is saved
max_seconds = 60
/
&system
ibrav = 0
......
#!/bin/bash
#PBS -N spring-training-ex1
#PBS -A birthright-burst
#PBS -w group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -A birthright
#PBS -W group_list=cades-birthright
#PBS -q gpu
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=30:00
module purge
module load PE-intel
module load QE
mpirun pw.x -in ../data/in
cd $PBS_O_WORKDIR
echo "Hello World"
@@ -2,17 +2,16 @@
#PBS -N spring-training-ex2
#PBS -A birthright-burst
#PBS -w group_list=cades-birthright
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=100g
#PBS -l walltime=30:00
#PBS -a 0-1%2
#PBS -l mem=10g
#PBS -l walltime=00:10:00
module purge
module load PE-intel
module load QE
input_files=(in in2)
mpirun pw.x -in "${input_files[$SLURM_ARRAY_TASK_ID]}"
cd $PBS_O_WORKDIR
mpirun pw.x -in ../data/in
#!/bin/bash
#PBS -N spring-training-ex3
#PBS -A birthright-burst
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
#PBS -t 0-1%2
module purge
module load PE-intel
module load QE
cd $PBS_O_WORKDIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$PBS_ARRAY_ID]}"
#!/bin/bash
#PBS -N spring-training-ex1
#PBS -A birthright
#PBS -W group_list=cades-birthright
#PBS -q testing
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
cd $SLURM_SUBMIT_DIR
echo "Hello World"
#!/bin/bash
#SBATCH -J spring-training-ex1
#SBATCH -J spring-training-ex2
#SBATCH -A birthright
#SBATCH -p burst
#SBATCH -N 2
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=0
#SBATCH -t 00:30:00
export OMP_NUM_THREADS=1
#SBATCH --mem=10g
#SBATCH -t 00:10:00
module purge
module load PE-intel
module load QE
mpirun pw.x -in ./in
cd $SLURM_SUBMIT_DIR
mpirun pw.x -in ../data/in
#!/bin/bash
#SBATCH -J spring-training-ex2
#SBATCH -J spring-training-ex3
#SBATCH -A birthright
#SBATCH -p burst
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=0
#SBATCH -t 00:30:00
#SBATCH --mem=10g
#SBATCH -t 00:10:00
#SBATCH -a 0-1%2
export OMP_NUM_THREADS=1
module purge
module load PE-intel
module load QE
cd $SLURM_SUBMIT_DIR
input_files=(in in2)
mpirun pw.x -in "${input_files[$SLURM_ARRAY_TASK_ID]}"
mpirun pw.x -in "../data/${input_files[$SLURM_ARRAY_TASK_ID]}"