CADES CONDO CLUSTER
This is the CNMS's newest computing resource.
If your usage is abusive, purposefully or not, and you do not respond promptly to queries, your jobs will be held or killed.
You will need to use your UCAMS password or, if you are an XCAMS-only user, your XCAMS password. Logging into CADES CONDO itself does not require an RSA token.
$ ssh or-slurm-login.ornl.gov
[you@or-slurm-login01 ~]$ module load env/cades-cnms
The env/cades-cnms gives you some standard environment variables and puts the CNMS modules in your
If you have a UCAMS account and an RSA token you should be getting to cades via the VPN or
While the VPN is up, your experience should be the same as onsite, plus some additional latency. The latency can make X11 forwarding slow, so be patient.
If you don't have an ORNL machine offsite to run the VPN on you'll need to come through
The best way to do this is by jumping through the login1 node.
$ ssh -X -J email@example.com firstname.lastname@example.org
To copy files:
$ scp -J email@example.com your-uid@or-slurm-login:/path/to/files ./your/local/path
We own ~2400 cores. Many users must share these, so think before you submit.
- Run test jobs to establish timing and parallelization parameters.
- Try to make your walltimes tight (not always 48:00:00).
- Do not flood the queue with jobs. If you have many small jobs, batch them. Ask how.
- We will actively thwart gaming the scheduling policies.
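One way to batch many small jobs is to run them as tasks inside a single job. The following is a minimal sketch under stated assumptions, not a site-endorsed recipe: `run_packed`, `LAUNCHER`, and the task-file format (one shell command per line) are hypothetical names chosen for illustration, and the `srun` options mentioned in the comments should be checked against the CADES docs before use.

```shell
#!/bin/bash
# Sketch: run many small, independent tasks inside ONE job instead of
# submitting each as its own job.
# LAUNCHER is an assumption. Inside a Slurm allocation it could be set to
# something like "srun --exclusive -N1 -n1" so each task becomes a job step;
# left empty (the default here), tasks simply run on the local node.
LAUNCHER=${LAUNCHER:-}

# run_packed TASKFILE MAXPAR
# Runs every non-empty line of TASKFILE as a shell command, keeping at most
# MAXPAR commands running at the same time.
run_packed() {
    local taskfile=$1 maxpar=$2 line
    while IFS= read -r line; do
        [ -z "$line" ] && continue
        # Throttle: pause until fewer than MAXPAR tasks are running.
        while [ "$(jobs -rp | wc -l)" -ge "$maxpar" ]; do
            sleep 1
        done
        $LAUNCHER bash -c "$line" < /dev/null &
    done < "$taskfile"
    wait   # block until every task has finished
}
```

In a job script you would call something like `run_packed tasks.txt 32` after the header, with `tasks.txt` listing one command per line. The whole set then occupies one slot in the queue rather than dozens.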
Subject to change at any time
Walltime Limit: 48 hours
Simultaneous Jobs: 6
Max processors * remaining seconds running at any time: 36495360, or 640 cores for ~15.8 hours.
You may notice some variation from this. Since we experience frequent changes in the number of users and intensity of use, policies are adjusted to maximize utilization and responsiveness.
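To make the processors * remaining seconds limit concrete, here is a minimal sketch of the arithmetic for a single job at submission time. The names `core_seconds`, `fits_limit`, and `LIMIT` are illustrative and not part of Slurm, and the limit value is simply the one quoted above, which may have changed.

```shell
#!/bin/bash
# Sketch only: core_seconds, fits_limit and LIMIT are illustrative names.
LIMIT=36495360   # value quoted above; policies may change

# core_seconds CORES HH:MM:SS -> prints CORES * walltime in seconds
core_seconds() {
    local cores=$1 h m s
    IFS=: read -r h m s <<< "$2"
    # 10# forces base 10 so fields like "08" are not parsed as octal
    echo $(( cores * (10#$h * 3600 + 10#$m * 60 + 10#$s) ))
}

# fits_limit CORES HH:MM:SS -> succeeds when the product is within LIMIT
fits_limit() {
    [ "$(core_seconds "$1" "$2")" -le "$LIMIT" ]
}

core_seconds 640 15:50:24    # prints 36495360
fits_limit 64 48:00:00 && echo "64 cores for 48 h fits"
```

The limit counts the remaining time of all of your running jobs together, so in practice it is the sum over jobs that matters. This sketch only covers one job.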
For up-to-the-minute policies:
[you@or-slurm-login01 ~]$ sacctmgr list Qos | grep cnms
Check current qos=std condo with
[you@or-slurm-login01 ~]$ squeue -q cnms-batch
Check current qos=std class=high_mem condo with
[you@or-slurm-login01 ~]$ squeue -q cnms-high_mem
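When the queue listing is long, a per-state summary can be easier to read. The sketch below assumes `squeue` is asked for one state per line with `-h -o %T` (no header, state only). `count_states` is an illustrative helper, not a Slurm command, and the sample input is made up for demonstration.

```shell
#!/bin/bash
# Sketch: summarize job states from squeue output.
# count_states is an illustrative helper, not a Slurm command.
# It reads one job state per line on stdin and prints "STATE COUNT" pairs.
count_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on a login node (not run here):
#   squeue -h -q cnms-batch -o %T | count_states
# Demonstration with made-up input:
printf 'RUNNING\nPENDING\nRUNNING\n' | count_states
# prints:
#   PENDING 1
#   RUNNING 2
```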
32-36 core Haswell/Broadwell Based
We have two guaranteed blocks of compute:
1. 1216 cores on -q batch (includes both hw32 and bw36)
2. 1152 cores on -q high_mem (these are all bw36)
- Unless stated otherwise, modules are optimized for hw32 but run just as well on bw36.
CNMS CADES resources have moved to the slurm scheduler. Read below!
Using the old PBS headnode will just waste your time.
** Slurm Cluster **
There are now two partitions for cnms jobs.
- batch
- high_mem
To use high_mem, be sure to replace the -p batch with -p high_mem.
Quality of service (QOS)
- std - generally this is what you want
- devel - short debug, build and experimental runs
- burst - preemptable jobs that run on unused CONDO resources (you must request access from @epd)
This is the obligatory slurm header for a job.
#!/bin/bash -l
#SBATCH -J test2
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 32
#SBATCH --cpus-per-task 1
#SBATCH --exclusive
#SBATCH --mem=100g
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -t 00:30:00
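If you submit similar jobs with different names, node counts, partitions, or walltimes, the header above can be generated rather than edited by hand. This is a minimal sketch: `slurm_header` is a hypothetical helper, and every fixed value in it is copied from the header above rather than being a recommendation.

```shell
#!/bin/bash
# Sketch: print the job header with a few fields parameterized.
# slurm_header is an illustrative helper; the fixed values are copied from
# the header above.
# Usage: slurm_header JOBNAME NODES PARTITION WALLTIME
slurm_header() {
    cat <<EOF
#!/bin/bash -l
#SBATCH -J $1
#SBATCH --nodes $2
#SBATCH --ntasks-per-node 32
#SBATCH --cpus-per-task 1
#SBATCH --exclusive
#SBATCH --mem=100g
#SBATCH -p $3
#SBATCH -A cnms
#SBATCH -t $4
EOF
}

# Example: a 2-node, 20-minute job on the high_mem partition.
slurm_header test2 2 high_mem 00:20:00
```

Redirect the output to a file and append your module load and srun lines to get a complete submission script.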
So far only the VASP example is updated!
There are examples for most of the installed codes in the repo.
[you@or-condo-login02 ~]$ cd your-code-dir
[you@or-condo-login02 ~]$ git clone firstname.lastname@example.org:CNMS/CNMS_Computing_Resources.git
You can contribute to the examples.
Run your jobs from
If your directory is missing, ask @michael.galloway or another CADES admin in #general to create it.
salloc -A cnms -p batch --nodes=1 --mem=80G --exclusive -t 00:30:00
Then wait. Try -p high_mem if the wait is too long.
Once the allocation starts, you can run jobs interactively by entering the commands from your submission script. If a command fails, you can correct it and try again.
module load PE-gnu/3.0
module load nwchem/6.6_p3
# run in the background with stdout and stderr both written to nwchem_out
srun --cpu-bind=cores nwchem input >nwchem_out 2>&1 &
tail -f nwchem_out
These are the codes that have been installed so far. You can request additional codes.
Instructions for codes: These are all being revised due to the slurm migration.
VASP -- Much greater care needs to be taken to get proper distribution of tasks with slurm; recompilation should eventually ease this.
ESPRESSO -- Pending slurm instructions
LAMMPS -- Pending slurm instructions
ABINIT -- Pending slurm instructions
Burst QOS will work somewhat differently with slurm; see the CADES docs.
The default action when a burst job is preempted is to resubmit the job. If your code cannot recover from a dirty halt, this method should not be used. In the near future it will be possible to alter this behavior.
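Because the job is resubmitted by default, a job script can check whether it is a rerun and whether a usable checkpoint exists before deciding how to start. The sketch below assumes Slurm sets `SLURM_RESTART_COUNT` for requeued jobs. `restart_mode` and the checkpoint file name are hypothetical, and how a given code actually resumes is code-specific.

```shell
#!/bin/bash
# Sketch: decide between a fresh start and a resume after a requeue.
# Assumption: Slurm exports SLURM_RESTART_COUNT (> 0) when a job has been
# requeued. restart_mode and the checkpoint path are hypothetical.

# restart_mode CHECKPOINT
# Prints "resume" when this is a requeued job AND a non-empty checkpoint
# file exists; prints "fresh" otherwise.
restart_mode() {
    local ckpt=$1
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ] && [ -s "$ckpt" ]; then
        echo resume
    else
        echo fresh
    fi
}

# Example: branch on the result inside a job script.
case "$(restart_mode ./checkpoint.dat)" in
    resume) echo "continuing from checkpoint" ;;
    fresh)  echo "starting from scratch" ;;
esac
```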
You can contribute here