## What is Slurm?
Slurm is a job scheduler and resource management program: an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. It provides the combined functionality of Moab and Torque, and queues, accounts, reservations, limits, preemption, job priority, and many other facets of those scheduling systems are nearly identical in Slurm.
There are a few key differences to be aware of between Moab/Torque and Slurm:
- Terminology
- This is largely the same with a few key differences
- Moab Queues are referred to as Partitions in Slurm
- PBS parameters in Moab job scripts are analogous to SBATCH parameters in Slurm
</br>
- Scheduler Policy
- Resource requests are enforced through cgroups
- Nodes, cores/tasks, memory, walltime, account, and queue information must be specified in job scripts
- No default values are set for these resources (a minimal script that satisfies these requirements is sketched after this list)
- "Burst" job submission is simplified in Slurm
- One central burst queue is used, and the user account and QOS no longer need to be specified in your job script
</br>
- Commands
- Functions from multiple Moab/Torque commands are typically combined in Slurm commands
- `qsub -> sbatch`
- `qsub/pbsdsh -> srun`
- `qstat/showq -> squeue`
- `checknode/showbf -> sinfo`
- `checkjob/mschedctl -> scontrol`
</br>
- Command Examples
- Here are a few examples of equivalent commands between the two schedulers
- `qsub test.sh -> sbatch test.sh`
- `showq -u <uid> -> squeue -u <uid>`
- `checkjob <job_id> -> scontrol show job <job_id>`
- `showbf -f gpu -> sinfo -p gpu`
- `qsub -I -A cades-birthright -w group_list=birthright -q gpu -> srun -A birthright -p gpu --pty /bin/bash`
</br>
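To make the scheduler policy concrete, here is a minimal sketch of a batch script that specifies every required resource; the account, partition, and resource values are placeholders for illustration, not recommendations:
```
#!/bin/bash
#SBATCH -A <account_name>    # account (required, no default)
#SBATCH -p <partition_name>  # partition/queue (required)
#SBATCH -N 1                 # number of nodes (required)
#SBATCH -n 1                 # number of tasks/cores (required)
#SBATCH --mem=1g             # memory (required)
#SBATCH -t 00:10:00          # walltime (required)

srun hostname
```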
## Slurm Challenge 1
### Updating Job Scripts - Adapting PBS Scripts
These modifications must be made to existing PBS job scripts to make them compatible with Slurm (a sketch of an adapted script follows this list):
- `$PBS_O_WORKDIR -> $SLURM_SUBMIT_DIR`
- Environment variables such as PBS_O_WORKDIR will need to be replaced with Slurm equivalents, or defined manually
- `-A birthright-burst -> -A birthright`
- The account used to submit the job may or may not need to be updated. Valid Slurm account names can be found using this command:
- `sacctmgr show assoc where user=<uid> format=account`
- `-q gpu -> -q gpu`
- The queue that the job is submitted to may need to be updated. Valid queue names can be found with the sinfo command.
- `-l walltime=<time>`
- A maximum walltime request is required
- `-l mem=<number>[unit]`
- A memory request is required
- `-l nodes=1:ppn=1`
- Nodes and ppn requests are required
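Slurm's `sbatch` recognizes many `#PBS` directives, so an adapted script can keep them. Below is a sketch of what an adapted “Hello World” script might look like; the job name and resource values are placeholders, and the account and queue simply mirror those used elsewhere in this training:
```
#!/bin/bash
#PBS -N hello-world-adapted
#PBS -A birthright
#PBS -W group_list=cades-birthright
#PBS -q testing
#PBS -l nodes=1:ppn=1
#PBS -l mem=10g
#PBS -l walltime=00:10:00
# PBS_O_WORKDIR is not defined under Slurm; use the Slurm equivalent
cd $SLURM_SUBMIT_DIR
echo "Hello World"
```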
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example1/
```
- This directory contains ex1_job_script.pbs, an example “Hello World” PBS job script
3. Make a copy of the example script and name it ex1_job_script.sbatch
4. Using the previous slide as a reference, update the job script to run under Slurm
- Add a walltime request of 10 minutes to the script
- Add a memory request of 10 gigabytes to the script
- Change the queue name from gpu to testing
- The testing queue is a limited queue for short-running test jobs
5. After updating the script, try to submit it using this command:
```
sbatch ex1_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
The solution to this challenge is available in the solutions folder under ex1_job_script.sbatch
### Submit Job Scripts
1. Switch to your terminal that is logged in to or-slurm-login.ornl.gov
2. Navigate to:
```
/lustre/or-scratch/cades-birthright/<user_id>/cades-training-master/slurm/example1/
```
- This directory contains ex1_job_script.sbatch, an example “Hello World” Slurm job script
3. Using the previous slide as a reference, update the job script (the directive changes are sketched after these steps):
- Change the walltime request to 10 minutes
- Change the memory request to 10 gigabytes
- Change the queue name to testing
- The testing queue is a limited queue for short-running test jobs
4. After updating the script, try to submit it using this command:
```
sbatch ex1_job_script.sbatch
```
5. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
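The required edits amount to changing three `#SBATCH` directives in the script header; a sketch (the remaining lines stay as they are):
```
#SBATCH -p testing     # queue/partition changed to testing
#SBATCH --mem=10g      # memory request set to 10 gigabytes
#SBATCH -t 00:10:00    # walltime request set to 10 minutes
```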
## Slurm Challenge 2
### Updating Job Scripts - Rewriting PBS Scripts
The recommended method of adapting PBS job scripts to Slurm is to rewrite them using SBATCH parameters. The example below shows how to translate the most frequently used parameters (a quick way to check the result follows the example):
```
#!/bin/bash ............................. #!/bin/bash
#PBS -N hello-world-example ............. #SBATCH -J hello-world-example
#PBS -A birthright-burst ................ #SBATCH -A birthright
#PBS -W group_list=cades-birthright .....
#PBS -l qos=burst .......................
#PBS -q batch ........................... #SBATCH -p burst
#PBS -l nodes=1:ppn=32 .................. #SBATCH -N 1
                                          #SBATCH -n 32
                                          #SBATCH -c 1
#PBS -l mem=10g ......................... #SBATCH --mem=10g
#PBS -l walltime=00:10:00 ............... #SBATCH -t 00:10:00
echo "Hello World" ...................... echo "Hello World"
```
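After translating a script, one quick sanity check is to submit it and inspect what Slurm recorded for the job, using the commands introduced earlier; the script name and job ID here are placeholders:
```
sbatch hello-world.sbatch       # prints "Submitted batch job <job_id>"
squeue -u <uid>                 # confirm the job is queued or running
scontrol show job <job_id>      # verify the account, partition, memory, and time limit
```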
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example2/
```
* This directory contains ex2_job_script.pbs, an example PBS job script for running Quantum Espresso
3. Make a copy of the example script and name it ex2_job_script.sbatch
4. Using the previous two slides as a reference, translate the job script from PBS to SBATCH
5. After converting the job script, try to submit it using this command:
```
sbatch ex2_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
## Slurm Challenge 2
About this challenge:
- Quantum Espresso is a suite of electronic-structure calculation and materials modeling tools. The job script and data files used in the challenge are slightly modified for this training, but are meant to demonstrate how you could run these programs on the CADES condos for a real production run.
### Run a program across multiple cores and multiple nodes
1. Switch to your terminal that is logged in to or-slurm-login.ornl.gov
2. Navigate to:
```
/lustre/or-scratch/cades-birthright/<user_id>/cades-training-master/slurm/example2/
```
- This directory contains ex2_job_script.sbatch
3. Update `-A`, `-p`, and `--mail-user`, then submit the script using this command:
```
sbatch ex2_job_script.sbatch
```
4. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
5. About this challenge:
- The previous example demonstrated how to run a job on one core and one node. This example extends the previous one to run a program across multiple cores and multiple nodes. A few of the environment variables Slurm provides are also demonstrated (see the sketch below).
The solution to this challenge is available in the solutions folder under ex2_job_script.sbatch
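As a sketch of the kind of environment variables Slurm provides to a job (the exact set depends on how the job is launched; `SLURM_PROCID` is per task and only meaningful under `srun`):
```
#!/bin/bash
# Print a few of the environment variables Slurm defines for a job
echo "Job ID:           ${SLURM_JOB_ID}"
echo "Node list:        ${SLURM_JOB_NODELIST}"
echo "Total tasks:      ${SLURM_NTASKS}"
echo "Submit directory: ${SLURM_SUBMIT_DIR}"
echo "This node:        ${SLURMD_NODENAME}"
echo "Task rank:        ${SLURM_PROCID}"
```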
## Slurm Challenge 3
### Updating Job Scripts - Job Arrays
Jobs using job arrays can easily be adapted to work in Slurm. An example PBS job array script is given below:
```
#!/bin/bash
#PBS -N spring-training-ex3
#PBS -A birthright-burst
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
#PBS -t 0-1%2
module purge
module load PE-intel
module load QE
cd $PBS_O_WORKDIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$PBS_ARRAYID]}"
```
- Important Notes
- Slurm uses the parameter `#SBATCH -a` to specify job arrays, but it uses the same syntax as PBS for the array index range and slot limit
- The Slurm equivalent of `PBS_ARRAYID` is `SLURM_ARRAY_TASK_ID` (see the sketch below)
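A sketch of the corresponding Slurm lines; the rest of the script is unchanged from the PBS version above:
```
#SBATCH -a 0-1%2

input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$SLURM_ARRAY_TASK_ID]}"
```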
### Challenge
1. Switch to your terminal that is logged in to or-slurm-login01.ornl.gov
2. Navigate to:
```
/lustre/or-hydra/cades-birthright/<user_id>/cades-spring-training-master/slurm/example3/
```
* This directory contains ex3_job_script.pbs, an example PBS job script for running Quantum Espresso with a job array
3. Make a copy of the example script and name it ex3_job_script.sbatch
4. Using the previous slide, as well as the solution to Slurm Challenge 2, update the ex3_job_script.sbatch job script from PBS to SBATCH
5. After converting the job script, try to submit it using this command:
```
sbatch ex3_job_script.sbatch
```
6. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
### Job Arrays
1. Switch to your terminal that is logged in to or-slurm-login.ornl.gov
2. Navigate to:
```
/lustre/or-scratch/cades-birthright/<user_id>/cades-training-master/slurm/example3/
```
- This directory contains ex3_job_script.sbatch
3. Update `-A`, `-p`, and `--mail-user`, then submit the script using this command:
```
sbatch ex3_job_script.sbatch
```
4. If any errors occur when submitting, try to fix the job script and re-submit to test. Feel free to ask for help if you encounter an error you can’t get past.
5. Observe the output (a few commands for checking the array job are sketched below)
The solution to this challenge is available in the solutions folder under ex3_job_script.sbatch
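A few commands for checking on the array job and its output; the job ID is a placeholder:
```
squeue -u <uid>              # array tasks appear as <job_id>_0, <job_id>_1, ...
scontrol show job <job_id>   # detailed state for the array
ls ./*-output.txt            # one output file per array task (named with %A-%a)
```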
The example and solution scripts referenced in the challenges above are listed below.

ex1_job_script.sbatch (the example 1 “Hello World” Slurm job script):
```
#!/bin/bash
#SBATCH -J hello-world-example
#SBATCH -A birthright
#SBATCH -p burst
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=0
#SBATCH -t 00:30:00
echo "Hello World"
```
HelloWorld.sh (used by the example 2 and 3 scripts; prints the node name and task rank for each task):
```
#!/bin/bash
echo "Hello World! Node:${SLURMD_NODENAME} Core:${SLURM_PROCID}"
```
ex2_job_script.pbs (the example 2 PBS job script for Quantum Espresso):
```
#!/bin/bash
#PBS -N spring-training-ex2
#PBS -A birthright-burst
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
module purge
module load PE-intel
module load QE
cd $PBS_O_WORKDIR
mpirun pw.x -in ../data/in
```
ex2_job_script.sbatch (the example 2 Slurm job script; runs HelloWorld.sh across two nodes):
```
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J multithread-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%j-multithread-output.txt
#SBATCH -e ./%j-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
srun ./HelloWorld.sh
```
ex3_job_script.pbs (the example 3 PBS job array script for Quantum Espresso):
```
#!/bin/bash
#PBS -N spring-training-ex3
#PBS -A birthright-burst
#PBS -W group_list=cades-birthright
#PBS -q batch
#PBS -l qos=burst
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
#PBS -t 0-1%2
module purge
module load PE-intel
module load QE
cd $PBS_O_WORKDIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$PBS_ARRAYID]}"
```
ex3_job_script.sbatch (the example 3 Slurm job array script; runs HelloWorld.sh as a two-task array):
```
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p <partition_name>
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH -J array-test-job
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%A-%a-output.txt
#SBATCH -e ./%A-%a-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
#SBATCH -a 0-1%2
#SBATCH --exclusive
srun ./HelloWorld.sh
```
Challenge 1 solution (solutions folder, ex1_job_script.sbatch), Slurm version:
```
#!/bin/bash
#SBATCH -J hello-world-example
#SBATCH -A birthright
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -c 1
#SBATCH --mem=10g
#SBATCH -t 00:10:00
cd $SLURM_SUBMIT_DIR
echo "Hello World"
```
Challenge 1 solution, adapted PBS version of the same job:
```
#!/bin/bash
#PBS -N spring-training-ex1
#PBS -A birthright
#PBS -W group_list=cades-birthright
#PBS -q testing
#PBS -l nodes=1:ppn=32
#PBS -l mem=10g
#PBS -l walltime=00:10:00
cd $SLURM_SUBMIT_DIR
echo "Hello World"
```
Challenge 2 solution (solutions folder, ex2_job_script.sbatch), Quantum Espresso version:
```
#!/bin/bash
#SBATCH -J spring-training-ex2
#SBATCH -A birthright
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --mem=10g
#SBATCH -t 00:10:00
module purge
module load PE-intel
module load QE
cd $SLURM_SUBMIT_DIR
mpirun pw.x -in ../data/in
```
Challenge 2 solution, multi-node “Hello World” version:
```
#!/bin/bash
#SBATCH -J multithread-test-job
#SBATCH -A birthright
#SBATCH -p testing
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%j-multithread-output.txt
#SBATCH -e ./%j-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
srun ./HelloWorld.sh
```
Challenge 3 solution (solutions folder, ex3_job_script.sbatch), Quantum Espresso job-array version:
```
#!/bin/bash
#SBATCH -J spring-training-ex3
#SBATCH -A birthright
#SBATCH -p testing
#SBATCH -N 1
#SBATCH -n 32
#SBATCH --mem=10g
#SBATCH -t 00:10:00
#SBATCH -a 0-1%2
module purge
module load PE-intel
module load QE
cd $SLURM_SUBMIT_DIR
input_files=(in in2)
mpirun pw.x -in "../data/${input_files[$SLURM_ARRAY_TASK_ID]}"
```
Challenge 3 solution, “Hello World” job-array version:
```
#!/bin/bash
#SBATCH -J array-test-job
#SBATCH -A birthright
#SBATCH -p testing
#SBATCH -N 2
#SBATCH -n 4
#SBATCH -c 1
#SBATCH --ntasks-per-node=2
#SBATCH --mem=1g
#SBATCH -t 10:00
#SBATCH -o ./%A-%a-output.txt
#SBATCH -e ./%A-%a-multithread-error.txt
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=<your_email>
#SBATCH -a 0-1%2
#SBATCH --exclusive
srun ./HelloWorld.sh
```