CADES CONDO CLUSTER
This is the CNMS's newest computing resource.
Gaining Access
Questions? email.
If your usage is abusive, whether purposefully or not, and you do not respond promptly to queries, your jobs will be held or killed.
login
You will need to use your UCAMS password, or your XCAMS password if you are an XCAMS-only user. Logging into CADES CONDO itself is not an RSA token login.
Onsite login
$ ssh or-slurm-login.ornl.gov
[you@or-slurm-login01 ~]$ module load env/cades-cnms
The env/cades-cnms module sets some standard environment variables and puts the CNMS modules in your $MODULEPATH.
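Once it is loaded, you can confirm the CNMS module tree is visible; the exact modules listed will depend on what is currently installed:
[you@or-slurm-login01 ~]$ echo $MODULEPATH
[you@or-slurm-login01 ~]$ module avail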
Offsite login
If you have a UCAMS account and an RSA token, you should be reaching CADES via the VPN or login1.ornl.gov.
If you are using the VPN, your experience should be the same as onsite, plus some additional latency. This can make X11 forwarding require patience.
If you don't have an ORNL machine offsite to run the VPN on, you'll need to come through login1.ornl.gov.
The best way to do this is to jump through the login1 node:
$ ssh -X -J your-uid@login1.ornl.gov your-uid@or-slurm-login.ornl.gov
To copy files:
$ scp -J your-uid@login1.ornl.gov your-uid@or-slurm-login:/path/to/files ./your/local/path
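If you connect this way regularly, an ~/.ssh/config entry can save typing. This is only a sketch; the Host alias cades-condo is illustrative, and ProxyJump requires a reasonably recent OpenSSH client:
Host cades-condo
    HostName or-slurm-login.ornl.gov
    User your-uid
    ProxyJump your-uid@login1.ornl.gov
    ForwardX11 yes
With this in place, ssh cades-condo and scp cades-condo:/path/to/files ./your/local/path work without repeating the jump host.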
The facts
We own ~2400 cores. Many users must share these, so think before you submit.
- Run test jobs to establish timing and parallelization parameters.
- Try to make your walltimes tight (not always 48:00:00).
- Do not flood the queue with jobs; if you have many small jobs, batch them (ask how, or see the sketch after this list).
- We will actively thwart gaming the scheduling policies.
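One hedged way to batch small jobs is to pack them into a single allocation instead of submitting each one separately. This is only a sketch: the task count, executable, and input names are illustrative, and on newer Slurm versions srun --exact replaces --exclusive for per-step CPU allocation.
#!/bin/bash -l
#SBATCH -J packed-tasks
#SBATCH -A cnms
#SBATCH -p batch
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 32
#SBATCH -t 02:00:00
# Run 32 single-core tasks inside one allocation rather than as 32 separate jobs.
for i in $(seq 1 32); do
    srun -n 1 --exclusive ./my_small_task input_${i}.dat &
done
wait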
Policies
Subject to change at any time.
Walltime Limit: 48 hours
Simultaneous Jobs: 6
Max processors × remaining seconds running at any time: 36495360, i.e. 640 cores for about 15.8 hours (640 × 57024 s).
You may notice some variation from this. Since we experience frequent changes in the number of users and the intensity of use, policies are adjusted to maximize utilization and responsiveness.
For up-to-the-minute policies:
[you@or-slurm-login01 ~]$ sacctmgr list Qos | grep cnms
Check current qos=std condo with
[you@or-slurm-login01 ~]$ squeue -q cnms-batch
Check current qos=std class=high_mem condo with
[you@or-slurm-login01 ~]$ squeue -q cnms-high_mem
Environment
Nodes are 32-36 core, Haswell/Broadwell based.
We have two guaranteed blocks of compute:
1. 1216 cores on -p batch (includes both hw32 and bw36)
2. 1152 cores on -p high_mem (these are all bw36)
- Unless stated otherwise, modules are optimized for hw32 but run just as well on bw36.
CNMS CADES resources have moved to the Slurm scheduler; read below!
Using the old PBS headnode will just waste your time.
Slurm Cluster
Job Submission
queues
There are now two partitions for CNMS jobs:
- batch
- high_mem
To use high_mem, replace -p batch with -p high_mem in your job script.
Quality of service (QOS)
- std - generally this is what you want
- devel - short debug, build and experimental runs
- burst - preemptable jobs that run on unused CONDO resources (you must request access from @epd)
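To request a QOS explicitly, add a --qos line to your job header. A one-line sketch; check the sacctmgr output above for the exact QOS names exposed on the condo:
#SBATCH --qos=std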
This is the obligatory Slurm header for a CCSD job.
Basic job header:
#!/bin/bash -l
#SBATCH -J test2
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 32
#SBATCH --cpus-per-task 1
#SBATCH --exclusive
#SBATCH --mem=100g
#SBATCH -p batch
#SBATCH -A cnms
#SBATCH -t 00:30:00
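Assuming the header above is saved in a script (the name job.slurm below is illustrative), submit it and watch its state with:
[you@or-slurm-login01 ~]$ sbatch job.slurm
[you@or-slurm-login01 ~]$ squeue -u $USER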
Sample Slurm Scripts
So far only the VASP example is updated!
There are examples for most of the installed codes in the repo.
[you@or-slurm-login01 ~]$ cd your-code-dir
[you@or-slurm-login01 ~]$ git clone git@code.ornl.gov:CNMS/CNMS_Computing_Resources.git
You can contribute to the examples.
File System
Run your jobs from
/lustre/or-hydra/cades-cnms/you
If your directory is missing, ask @michael.galloway or another CADES admin in #general for it to be created.
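A typical pattern is to stage inputs in your Lustre directory and submit from there. A sketch only; the directory, input, and script names are illustrative:
$ cd /lustre/or-hydra/cades-cnms/$USER
$ mkdir my_run && cd my_run
$ cp ~/inputs/* .
$ sbatch job.slurm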
Interactive Jobs
salloc -A cnms -p batch --nodes=1 --mem=80G --exclusive -t 00:30:00
Then wait. Try -p high_mem if the wait is too long.
Once the allocation is granted, you can run jobs interactively by entering the commands from your submission script at the prompt. If something fails, you can correct it and try again.
examples
nwchem
module load PE-gnu/3.0
module load nwchem/6.6_p3
srun --cpu-bind=cores nwchem input >nwchem_out 2>&1 &
tail -f nwchem_out
CODES
These are the codes that have been installed so far. You can request additional codes.
Instructions for codes: These are all being revised due to the slurm migration.
VASP -- Much greater care needs to be taken to get proper distribution of tasks with Slurm; recompilation should eventually ease this. See the sketch after this list.
ESPRESSO -- Pending slurm instructions
LAMMPS -- Pending slurm instructions
ABINIT -- Pending slurm instructions
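As a hedged illustration of taking that care with srun: make the task layout explicit rather than relying on defaults. The binary name vasp_std is an assumption; use whatever the VASP module on the condo actually provides.
# Inside a job script that requested --nodes 2 and --ntasks-per-node 32:
srun -n 64 --ntasks-per-node=32 --cpu-bind=cores vasp_std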
Advanced
Burst QOS will work somewhat differently with Slurm; see the CADES docs.
The default action when a burst job is preempted is to resubmit it. If your code cannot recover from a dirty halt, this method should not be used. In the near future it will be possible to alter this behavior.
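Slurm does expose a knob for this already; whether the condo's burst QOS honors it is an assumption you should verify with the admins. A one-line sketch:
# Ask Slurm not to requeue this job automatically if it is preempted.
#SBATCH --no-requeue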
Benchmarking
You can contribute here