CADES is the CNMS's newest computing resource.
## [Gaining Access](gaining_access)
Questions? [email](mailto:doakpw@ornl.gov).
If your usage is abusive, purposely or not, and you do not respond promptly to queries, your jobs will be held or killed.
## Login
You will need to use your UCAMS password, or, if you are an XCAMS-only user, that password. This is not an RSA token login.
```shell-session
bash $ ssh or-slurm-login01.ornl.gov
[you@or-slurm-login01 ~]$ module load env/cades-cnms
```
The [env/cades-cnms](env-cades-cnms) module sets some standard environment variables and puts the CNMS modules in your `$MODULEPATH`.
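To confirm the module loaded and see what it made available, a quick check (exact output will vary):
```shell-session
[you@or-slurm-login01 ~]$ echo $MODULEPATH
[you@or-slurm-login01 ~]$ module avail
```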
# The facts
You may notice some variation from this, as the numbers change frequently.
For up-to-the-minute policies:
```shell-session
[you@or-slurm-login01 ~]$ sacctmgr list Qos | grep cnms
```
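For a more readable view of the limits, `sacctmgr` can select specific columns. A sketch; the fields available depend on the Slurm version:
```shell-session
[you@or-slurm-login01 ~]$ sacctmgr show qos format=Name,Priority,MaxWall | grep cnms
```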
### Check the current qos=std batch condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-batch
```
### Check the current qos=std high_mem condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-high_mem
```
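To watch just your own jobs, whichever partition they are in, the standard Slurm query is:
```shell-session
[you@or-slurm-login01 ~]$ squeue -u $USER
```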
## Environment
**32-36 core Haswell/Broadwell based nodes**
We have two guaranteed blocks of compute (a live check is sketched after this list):
1. 1216 on `-p batch` (includes both hw32 and bw36)
2. 1152 on `-p high_mem` (these are all bw36)
* Unless stated otherwise, modules are optimized for hw32 but run just as well on bw36.
* Omitting the feature code gets you std.
* high_mem and gpu nodes are now in separate partitions. Skip the feature code and use the correct partition.
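The live check referenced above, using standard `sinfo` format flags (partition, node count, CPUs per node, memory per node); the exact node mix may differ from the numbers listed:
```shell-session
[you@or-slurm-login01 ~]$ sinfo -p batch,high_mem -o "%P %D %c %m"
```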
# CNMS CADES resources have moved to the Slurm scheduler, read below!
Using the old PBS headnode will just waste your time.
**Slurm Cluster**
## Job Submission
### partitions
There are now two **partitions** for CNMS jobs.
* batch
* high_mem
To use high_mem, be sure to replace `-p batch` with `-p high_mem`.
### Quality of service (QOS)
* std - generally this is what you want
* devel - short debug, build, and experimental runs
* burst - preemptable jobs that run on unused CONDO resources (you must request access from @epd)
If you need to run wide, relatively short jobs, are experiencing long waits for std, and can deal with them being occasionally preempted (i.e. killed), you can request access to qos **burst** via [XCAMS](https://xcams.ornl.gov/xcams/groups/cades-cnms-burst).
### Basic job header -- [for CCSD](cades_ccsd)
This is the obligatory Slurm header for a job.
``` shell
#!/bin/bash -l
#SBATCH -J test2               # job name
#SBATCH --nodes 2              # number of nodes
#SBATCH --ntasks-per-node 32   # MPI ranks per node (32 fits the hw32 nodes)
#SBATCH --cpus-per-task 1      # cores per rank; 1 means no threading
#SBATCH --exclusive            # do not share nodes with other jobs
#SBATCH --mem=100g             # memory per node
#SBATCH -p batch               # partition: batch or high_mem
#SBATCH -A cnms                # account
#SBATCH -t 00:30:00            # walltime hh:mm:ss
```
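Submit the script with `sbatch` and check its state with `squeue`; `your_job.sh` stands in for your script name and the job id shown is illustrative:
```shell-session
[you@or-slurm-login01 ~]$ sbatch your_job.sh
Submitted batch job 123456
[you@or-slurm-login01 ~]$ squeue -j 123456
```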
## [Sample Slurm Scripts](https://code.ornl.gov/CNMS/CNMS_Computing_Resources/blob/master/CADES) ##
### So far only the VASP example is updated!
There are examples for most of the installed codes in the repo.
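To get a local copy of the examples, cloning the repository is one option; the clone URL below is inferred from the link above:
```shell-session
[you@or-slurm-login01 ~]$ git clone https://code.ornl.gov/CNMS/CNMS_Computing_Resources.git
```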
```shell
[you@or-slurm-login01 ~]$ cd your-code-dir
```
Run your jobs from your directory on the Lustre file system.
If your directory is missing, ask @michael.galloway or another CADES admin in #general to create it.
The old Lustre file system *pfs1* will be decommissioned and all data cleared in the near future. You must migrate your old data soon.
**Use a batch job or an interactive job, do not use the login nodes.**
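A sketch of a migration done inside an interactive job (see Interactive Jobs below); the source and destination paths are placeholders you must replace with your own:
```shell-session
[you@or-slurm-login01 ~]$ salloc -A cnms -p batch -N 1 -n 1 -t 04:00:00 srun --pty bash -i
[you@node ~]$ rsync -av /old/pfs1/path/ /new/lustre/path/
```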
## Interactive Jobs
```shell
salloc -A cnms -p batch -N 1 -n 32 -c 1 --mem=100G -t 04:00:00 srun --pty bash -i
```
Unfortunately, there is more to it than this if you expect to launch an MPI job interactively.
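A minimal sketch of an interactive MPI launch, assuming your code was built against a Slurm-aware MPI; `your_mpi_app` is a placeholder. `salloc` without a trailing command drops you into a shell on the submit host, from which `srun` launches onto the allocation:
```shell-session
[you@or-slurm-login01 ~]$ salloc -A cnms -p batch -N 2 --ntasks-per-node=32 -t 02:00:00
[you@or-slurm-login01 ~]$ module load env/cades-cnms
[you@or-slurm-login01 ~]$ srun -n 64 ./your_mpi_app
```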
## CODES
These are the codes that have been installed so far. You can request additional codes.
Instructions for codes:
Please read these; you can waste a great deal of resources if you do not understand how to run even familiar codes optimally in this hardware environment.
These are all being revised due to the Slurm migration.
[**VASP**](VASP) -- Much greater care needs to be taken to get proper distribution of tasks with Slurm; recompilation should eventually ease this. A task-placement sketch follows this list.
[**ESPRESSO**](ESPRESSO) -- Pending Slurm instructions
[**LAMMPS**](LAMMPS) -- Pending Slurm instructions
[**ABINIT**](ABINIT) -- Pending Slurm instructions
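The task-placement sketch referenced in the VASP item: pass explicit counts and core binding to `srun`. This is a sketch only; the binary name `vasp_std` and the exact module to load are assumptions:
```shell
module load env/cades-cnms
srun -N 2 --ntasks-per-node=32 --cpus-per-task=1 --cpu-bind=cores vasp_std
```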
---
## Advanced
### Burst QOS
Burst QOS works somewhat differently with Slurm; see the CADES docs.
The default action when a burst job is preempted is to resubmit it. If your code cannot recover from a dirty halt, this method should not be used. In the near future it will be possible to alter this behavior.
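A sketch of what a burst job header might look like, assuming burst is exposed as a Slurm QOS named `burst` once your access is granted; the account/QOS pairing here is an assumption, so check the CADES docs for the authoritative form:
```shell
#!/bin/bash -l
#SBATCH -J burst_test   # job name
#SBATCH -p batch        # burst runs on unused condo resources
#SBATCH -A cnms         # account
#SBATCH --qos=burst     # preemptable quality of service
#SBATCH -t 01:00:00     # walltime
```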