This is the CNMS's newest computing resource.
## [Gaining Access](gaining_access)
Questions? [email](mailto:doakpw@ornl.gov).
If your usage is abusive, whether purposeful or not, and you do not respond promptly to queries, your jobs will be held or killed.
## Login
You will need to use your UCAMS password, or your XCAMS password if you are an XCAMS-only user. This is not an RSA token login.
```shell-session
bash $ ssh or-slurm-login01.ornl.gov
[you@or-slurm-login01 ~]$ module load env/cades-cnms
```
The [env/cades-cnms](env-cades-cnms) module sets some standard environment variables and adds the CNMS modules to your `$MODULEPATH`.
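You can confirm the module took effect with standard module commands; a quick check (assumes the usual environment-modules/Lmod tooling on the login node):

```shell
module load env/cades-cnms
echo "$MODULEPATH" | tr ':' '\n'   # the CNMS module directory should now appear
module avail 2>&1 | grep -i cnms   # list CNMS-provided modules
```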
# The facts
You may notice some variation from this, since the numbers change frequently.
For up-to-the-minute policies:
```shell-session
[you@or-slurm-login01 ~]$ sacctmgr list Qos | grep cnms
```
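For more detail on the cnms QOS limits, sacctmgr can print selected fields; a sketch using standard sacctmgr field names (`-n` drops the header, `-P` gives parsable output):

```shell-session
[you@or-slurm-login01 ~]$ sacctmgr -n -P show qos format=Name,Priority,MaxWall | grep cnms
```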
### Check current qos=std condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-batch
```
### Check current qos=std class=high_mem condo with
```shell-session
[you@or-slurm-login01 ~]$ squeue -q cnms-high_mem
```
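To see only your own jobs regardless of partition, the standard squeue user filter works:

```shell-session
[you@or-slurm-login01 ~]$ squeue -u $USER
```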
## Environment
**32-36 core Haswell/Broadwell Based**
We have two guaranteed blocks of compute:
1. 1216 on `-p batch` (includes both hw32 and bw36)
2. 1152 on `-p high_mem` (these are all bw36)
* Unless stated otherwise, modules are optimized for hw32 but run just as well on bw36
* Using no feature code results in std
* high_mem and gpu nodes are now on separate partitions; omit the feature code and use the correct partition
# CNMS CADES resources have moved to the slurm scheduler. Read below!

Using the old PBS headnode will just waste your time.

**Slurm Cluster**
## Job Submission
### partitions

There are now two **partitions** for cnms jobs.

* batch
* high_mem
To use high_mem, be sure to replace `-p batch` with `-p high_mem`.
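For example, the same job script (the script name here is a placeholder) can be sent to either partition at submission time:

```shell
sbatch -p batch    myjob.sh   # standard nodes
sbatch -p high_mem myjob.sh   # high-memory bw36 nodes
```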
### Quality of service (QOS)
* std - generally this is what you want
* devel - short debug, build and experimental runs
* burst - preemptable jobs that run on unused CONDO resources (you must request access from @epd)
If you need to run wide, relatively short jobs, are experiencing long waits for std, and can tolerate jobs being occasionally preempted (i.e. killed), you can request access to qos **burst** via [XCAMS](https://xcams.ornl.gov/xcams/groups/cades-cnms-burst).
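Once granted access, burst is selected in the job header; a minimal fragment (assuming the account and QOS names shown elsewhere on this page):

```shell
#SBATCH -p batch
#SBATCH --qos=burst   # preemptable: the job may be killed when condo owners need the nodes
#SBATCH -A cnms
```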
This is the obligatory slurm header for a job.

### Basic job header -- [for CCSD](cades_ccsd)
```shell
#!/bin/bash -l
#SBATCH -J test2
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 32
#SBATCH --cpus-per-task 1
#SBATCH --exclusive
#SBATCH --mem=100g
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -t 00:30:00
```
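With the header above, 2 nodes at 32 tasks per node gives 64 MPI ranks. A sketch of deriving launch counts in the script body from slurm's standard environment variables (the fallback defaults are only so the snippet also runs outside a job):

```shell
#!/bin/bash
# inside a job, slurm exports these; fall back to the header's values elsewhere
NODES=${SLURM_NNODES:-2}
TASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-32}
TOTAL_TASKS=$((NODES * TASKS_PER_NODE))
echo "total MPI ranks: $TOTAL_TASKS"
```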
## [Sample Slurm Scripts](https://code.ornl.gov/CNMS/CNMS_Computing_Resources/blob/master/CADES) ##
### So far only the VASP example is updated!
There are examples for most of the installed codes in the repo.
```shell
[you@or-condo-login02 ~]$ cd your-code-dir
```
If your directory is missing, ask @michael.galloway or another Cades admin in #general to create it.
The old lustre file system *pfs1* will be decommissioned and all data cleared in the near future. You must migrate your old data soon.
**Use a batch job or an interactive job; do not use the login nodes.**
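A sketch of such a migration job (all paths are placeholders; substitute your own directories):

```shell
#!/bin/bash
#SBATCH -J migrate
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -t 04:00:00

# copy data off the old file system from a compute node, not a login node
rsync -a /path/to/old/pfs1/dir/ /path/to/new/storage/dir/
```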
## Interactive Jobs
```shell
salloc -A cnms -p batch -N 1 -n 32 -c 1 --mem=100G -t 04:00:00 srun --pty bash -i
```
Unfortunately, there is more to it than this if you expect to launch an MPI job interactively.
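Inside the allocation, MPI ranks are typically launched with srun rather than mpirun; a minimal sketch (the application name is a placeholder):

```shell
srun -n 32 ./your_mpi_app
```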
## CODES
These are the codes that have been installed so far. You can request additional codes.
Instructions for codes:
Please read these; you can waste a great deal of resources if you do not understand how to run even familiar codes optimally in this hardware environment.
These are all being revised due to the slurm migration.

[**VASP**](VASP) -- Much greater care needs to be taken to get proper distribution of tasks with slurm; recompilation should eventually ease this.

[**ESPRESSO**](ESPRESSO) -- Pending slurm instructions

[**LAMMPS**](LAMMPS) -- Pending slurm instructions

[**ABINIT**](ABINIT) -- Pending slurm instructions
---
## Advanced
### Burst QOS
In theory there are two QOS levels usable on Cades.

Burst QOS will work somewhat differently with slurm; see the Cades docs.

The default action when a burst job is preempted is to resubmit it. If your code cannot recover from a dirty halt, this method should not be used. In the near future it will be possible to alter this behavior.
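With slurm, requeue behavior can already be controlled per job using standard sbatch options; a fragment:

```shell
#SBATCH --qos=burst
#SBATCH --no-requeue   # do not resubmit automatically if preempted
```

Use `#SBATCH --requeue` instead to opt in to automatic resubmission.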