This is the CNMS's newest computing resource.
## [Gaining Access](gaining_access)
Questions? [email](mailto:doakpw@ornl.gov).
If your usage is abusive, purposefully or not, and you do not respond promptly to queries, your jobs will be held or killed.
## Login
You will need your UCAMS password, or your XCAMS password if you are an XCAMS-only user. This is not an RSA token login.
```shell-session
bash $ ssh or-slurm-login01.ornl.gov
[you@or-slurm-login01 ~]$ module load env/cades-cnms
```
The [env/cades-cnms](env-cades-cnms) module sets some standard environment variables and puts the CNMS modules in your `$MODULEPATH`.
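To confirm the module did its job, something like the following should work; the output shown will differ on your account:
```shell-session
[you@or-slurm-login01 ~]$ echo $MODULEPATH   # the CNMS module directory should now appear
[you@or-slurm-login01 ~]$ module avail       # CNMS-provided codes should be listed
```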
# The facts
You may notice some variation from this, since the numbers change frequently.
For up-to-the-minute policies:
```shell-session
[you@or-slurm-login01 ~]$ sacctmgr list Qos | grep cnms
```
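For more detail on a single QOS, `sacctmgr` can show specific fields; the QOS name below is whatever the grep above returns, so `std` here is only a guess:
```shell-session
[you@or-slurm-login01 ~]$ sacctmgr show qos std format=Name,Priority,MaxWall,MaxTRESPU
```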
### Check current qos=std condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-batch
```
### Check current qos=std class=high_mem condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-high_mem
```
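To watch just your own jobs rather than the whole condo:
```shell-session
[you@or-slurm-login01 ~]$ squeue -u $USER -l   # long format shows time limits and job state
```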
## Environment
**32-36 core Haswell/Broadwell Based**
We have two guaranteed blocks of compute:
1. 1216 on -p batch (includes both hw32 and bw36)
2. 1152 on -p high_mem (these are all bw36)
* Unless stated otherwise, modules are optimized for hw32 but run just as well on bw36.
* Use of no feature code results in qos=std.
* high_mem and gpu nodes are now on separate partitions. Neglect the feature code and use the correct partition.
# CNMS CADES resources have moved to the slurm scheduler, read below!
Using the old PBS headnode will just waste your time.
**Slurm Cluster**
## Job Submission
### Partitions
There are now two **partitions** for cnms jobs.
* batch
* high_mem
To use high_mem, be sure to replace `-p batch` with `-p high_mem`.
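You can also override the partition at submission time without editing the script; `job.slurm` is a placeholder script name:
```shell-session
[you@or-slurm-login01 ~]$ sbatch -p high_mem job.slurm
```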
### Quality of service (QOS)
* std - generally this is what you want
* devel - short debug, build, and experimental runs
* burst - preemptable jobs that run on unused CONDO resources (you must request access from @epd)
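Slurm selects a QOS with `--qos`; whether these three names map one-to-one onto slurm QOS names on Cades is an assumption here, so check the `sacctmgr` listing above:
```shell
#SBATCH --qos=devel
# or std / burst, depending on your access
```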
### Basic job header -- [for CCSD](cades_ccsd)
This is the obligatory slurm header for a job.
```shell
#!/bin/bash -l
#SBATCH -J <YOUR_JOB_NAME>
# two nodes, 32 ranks per node, one core per rank
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 32
#SBATCH --cpus-per-task 1
# exclusive node access and 100 GB of memory per node
#SBATCH --exclusive
#SBATCH --mem=100g
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -t 00:30:00
```
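Submission and monitoring then look like this; `job.slurm` and the job id are placeholders:
```shell-session
[you@or-slurm-login01 ~]$ sbatch job.slurm
Submitted batch job 123456
[you@or-slurm-login01 ~]$ squeue -u $USER
```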
## [Sample Slurm Scripts](https://code.ornl.gov/CNMS/CNMS_Computing_Resources/blob/master/CADES) ##
### So far only the VASP example is updated!
There are examples for most of the installed codes in the repo.
```shell
[you@or-slurm-login01 ~]$ cd your-code-dir
```
Run your jobs from your Lustre directory.
If your directory is missing, ask @michael.galloway or another Cades admin in #general for it to be created.
The old Lustre file system *pfs1* will be decommissioned and all data cleared in the near future. You must migrate your old data soon.
**Use a batch job or an interactive job; do not use the login nodes.**
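A minimal sketch of such a migration job; both paths below are hypothetical, so substitute your real *pfs1* and destination directories:
```shell
#!/bin/bash -l
#SBATCH -J pfs1-migrate
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -N 1
#SBATCH -t 08:00:00
# -a preserves permissions and timestamps; drop --dry-run once the file list looks right
# both paths are hypothetical examples
rsync -a --dry-run /lustre/pfs1/cades-cnms/$USER/ /lustre/new-fs/cades-cnms/$USER/
```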
## Interactive Jobs
```shell
salloc -A cnms -p batch -N 1 -n 32 -c 1 --mem=100G -t 04:00:00 srun --pty bash -i
```
Unfortunately, there's more to it than this if you expect to launch an MPI job interactively.
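One pattern that generally works is to take the allocation first, then place the MPI ranks with an explicit `srun`; the executable name is a placeholder:
```shell-session
[you@or-slurm-login01 ~]$ salloc -A cnms -p batch -N 2 --ntasks-per-node=32 -t 02:00:00
salloc: Granted job allocation 123457
[you@or-slurm-login01 ~]$ srun -n 64 ./my_mpi_app   # runs on the 2 nodes just granted
```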
## CODES
These are the codes that have been installed so far. You can request additional codes.
Instructions for codes:
Please read these; you can waste a great deal of resources if you do not understand how to run even familiar codes optimally in this hardware environment. These are all being revised due to the slurm migration.
[**VASP**](VASP) -- Much greater care needs to be taken to get proper distribution of tasks with slurm; recompilation should eventually ease this (see the sketch after this list).
[**ESPRESSO**](ESPRESSO) -- Pending slurm instructions
[**LAMMPS**](LAMMPS) -- Pending slurm instructions
[**ABINIT**](ABINIT) -- Pending slurm instructions
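To illustrate the VASP task-distribution point above, the usual slurm approach is to be explicit with `srun`; `vasp_std` and the rank counts are placeholders, not a tested CNMS recipe:
```shell
# 2 nodes x 32 ranks, one core per rank, ranks pinned in blocks
srun -N 2 --ntasks-per-node=32 --cpus-per-task=1 --distribution=block:block ./vasp_std
```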

---
## Advanced
### Burst QOS
Burst QOS will work somewhat differently with slurm; see the Cades docs.
The default action when a burst job is preempted is to resubmit it. If your code cannot recover from a dirty halt, this method should not be used. In the near future it will be possible to alter this behavior.
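Slurm's usual knob for this is the requeue flag; whether the burst QOS here honors it is an assumption:
```shell
#SBATCH --no-requeue
# let a preempted job fail instead of resubmitting automatically
```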