Commit 1da7715f authored by Ryan Prout
.. _getting_started:
.. _overview:

Getting Started
===============

**New to the ARM Cumulus System?**

Welcome! The information below introduces how we structure user
accounts, projects, and system allocations. It's all you need to know
about getting to work.

.. toctree::
   :maxdepth: 2
The following projects are allocated time on Cumulus:
* CLI120 - LASSO
* ATM118
.. _Account Application:

Applying For An Account
-----------------------

Cumulus access is granted to users through the online `OLCF application form <>`__.
Within the application, please specify **"Open CLI120"** as the project (or **"Open ATM118"** if that is the project you are applying to). It is important to reference **"Open"**
because Cumulus is in the NCCS Open Enclave.

**NOTE**: If you need access to both projects, two applications must be submitted (one for each project).
System Access
-------------

System access is granted after completing the steps outlined in the :ref:`Account Application` section. Once your application is approved,
you can learn more about the system on the :ref:`Cumulus_Overview` page, which covers how to log in and use Cumulus.

Please reach out to **** if you need assistance with your application.
.. _introduction:

.. toctree::
   :maxdepth: 2
Atmospheric Radiation Measurement (ARM) HPC Facility
====================================================
ARM data users who need more storage capacity and computational power can apply for direct access to ARM computing resources and data.
This public software development space enables users to work with large volumes of ARM data without having to download them.
The ARM computing clusters are available to ARM Facility science users who work with very high volumes of ARM data.
They have the capability to support model simulations, petascale data storage, and big-data analytics for successful ARM science research.
The Cumulus cluster is a mid-range Cray system with 4,032 processing cores and a 2 petabyte general parallel file server.
It is primarily used for “high-end modeling” and supports routine operations of the Large-Eddy Simulation (LES) ARM Symbiotic Simulation and Observation (LASSO) activity.
.. _Cumulus_Overview:

System Overview
===============
Cumulus is a Cray® XC40™ cluster with Intel Broadwell processors and Cray’s Aries interconnect in a network topology called Dragonfly.
Each node has (128) GB of DDR3 SDRAM and (2) sockets with (18) physical cores each.
The system has (2) external login nodes and (112) compute nodes.
To log into Cumulus, you will use your XCAMS/UCAMS username and password.
**NOTE**: If you do not have an account yet, you can follow the steps on this page: :ref:`Account Application`
Data Storage and Filesystems
----------------------------
Cumulus mounts the OLCF open enclave file systems. The available filesystems are summarized in the table below.
.. list-table::
   :widths: 25 25 25 25 25 25
   :header-rows: 1

   * - Filesystem
     - Mount Points
     - Backed Up?
     - Purged?
     - Quota?
     - Comments
   * - Home Directories (NFS)
     - /ccsopen/home/$USER
     - Yes
     - No
     - 50GB
     -
   * - Project Space (NFS)
     - /ccsopen/proj/$PROJECT_ID
     - Yes
     - No
     - 50GB
     -
   * - Parallel Scratch (GPFS)
     - /gpfs/wolf/proj-shared/$PROJECT_ID
     - No
     - Yes
     -
     -
Programming Environments
========================

Software Modules
----------------

The software environment is managed through the **Environment Modules** tool.
.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Command
     - Description
   * - module list
     - Lists modules currently loaded in a user’s environment
   * - module avail
     - Lists all available modules on a system in condensed format
   * - module avail -l
     - Lists all available modules on a system in long format
   * - module display
     - Shows environment changes that will be made by loading a given module
   * - module load <module-name>
     - Loads a module
   * - module help
     - Shows help for a module
   * - module swap
     - Swaps a currently loaded module for an unloaded module
After logging in, you can see the modules loaded by default with ``module list``.
The following compilers are available on Cumulus:
* Intel Composer XE (default)
* Portland Group Compiler Suite
* GNU Compiler Collection
* Cray Compiling Environment
**NOTE**: Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries
are added to each user’s environment through a programming environment module.
Users do not need to make any environment changes to use the default version of Intel and MPI.
Compiler Environments
---------------------
If a different compiler is required, it is important to use the correct environment for each compiler.
To aid users in pairing the correct compiler and environment, programming environment modules are provided.
The programming environment modules will load the correct pairing of compiler version, message passing libraries,
and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available:
* PrgEnv-intel (default)
* PrgEnv-pgi
* PrgEnv-gnu
* PrgEnv-cray
To change from the default Intel environment to the default GCC environment, use:

``module unload PrgEnv-intel`` and then ``module load PrgEnv-gnu``

Or, alternatively, use the swap command:

``module swap PrgEnv-intel PrgEnv-gnu``
Managing Compiler Versions
--------------------------
To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded,
and then swap to the correct compiler version. For example, the following will configure the
environment to use the GCC compilers, then load a non-default GCC compiler version:
``module swap PrgEnv-intel PrgEnv-gnu``
``module swap gcc gcc/4.6.1``
Compiler Commands
-----------------
As is the case with our other Cray systems, the C, C++, and Fortran compilers are invoked with the following commands:
* For the C compiler: ``cc``
* For the C++ compiler: ``CC``
* For the Fortran compiler: ``ftn``
These are actually compiler wrappers that automatically link in appropriate libraries (such as MPI and math libraries)
and build code that targets the compute-node processor architecture. These wrappers should be used regardless of the
underlying compiler (Intel, PGI, GNU, or Cray).
**NOTE**: You should not call the vendor compilers (e.g., pgf90, icpc, gcc) directly. Commands such as mpicc, mpiCC, and mpif90
are not available on Cray systems. You should use ``cc``, ``CC``, and ``ftn`` instead.
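As a quick illustration (the source file names are hypothetical), a build looks identical under every programming environment because the wrapper supplies the MPI headers and libraries:

.. code-block:: bash

   $ cc  hello_mpi.c   -o hello_mpi.x   # C: MPI is linked automatically by the wrapper
   $ ftn hello_mpi.f90 -o hello_mpi.x   # Fortran: same wrapper behavior, no mpif90 needed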
General Guidelines
------------------
We recommend the following general guidelines for using the programming environment modules:
* Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
* Do not swap or unload any of the Cray provided modules (those with names like ``xt-*``, ``xe-*``, ``xk-*``, or ``cray-*``).
Threaded Codes
--------------
When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.
For Intel, use the ``-openmp`` option:

.. code-block:: bash

   $ cc -openmp test.c -o test.x
   $ setenv OMP_NUM_THREADS 2

For PGI, add ``-mp`` to the build line:

.. code-block:: bash

   $ module swap PrgEnv-intel PrgEnv-pgi
   $ cc -mp test.c -o test.x
   $ setenv OMP_NUM_THREADS 2

For GNU, add ``-fopenmp`` to the build line:

.. code-block:: bash

   $ module swap PrgEnv-intel PrgEnv-gnu
   $ cc -fopenmp test.c -o test.x
   $ setenv OMP_NUM_THREADS 2

For Cray, no additional flags are required:

.. code-block:: bash

   $ module swap PrgEnv-intel PrgEnv-cray
   $ cc test.c -o test.x
   $ setenv OMP_NUM_THREADS 2
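The ``setenv`` lines above assume a csh-family shell; in bash the equivalent is ``export``. A hypothetical launch of the resulting threaded binary with (2) threads per task might look like:

.. code-block:: bash

   $ export OMP_NUM_THREADS=2    # bash equivalent of the setenv command above
   $ srun -n 4 -c 2 ./test.x     # 4 MPI tasks, 2 cores per task for the OpenMP threads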
Running Jobs
============
In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
A job on a commodity cluster typically comprises a few different components:
* A batch submission script.
* A binary executable.
* A set of input files for the executable.
* A set of output files created by the executable.
And the process for running a job, in general, is to:
#. Prepare executables and input files.
#. Write a batch script.
#. Submit the batch script to the batch scheduler.
#. Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs for execution on commodity clusters.
Login vs Compute Nodes
----------------------
When you log into an OLCF cluster, you are placed on a login node. Login node resources are shared by all users of the system.
Because of this, users should be mindful when performing tasks on a login node.
Login nodes should be used for basic tasks such as file editing, code compilation, data backup, and job submission.
Login nodes should not be used for memory- or compute-intensive tasks. Users should also limit the number of simultaneous tasks
performed on the login resources. For example, a user should not run (10) simultaneous tar processes on a login node.
**NOTE**: Compute-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.
Batch Scheduler
---------------
Cumulus utilizes the Slurm batch scheduler. The following sections look at Slurm interaction in more detail.
Writing Batch Scripts
~~~~~~~~~~~~~~~~~~~~~
Batch scripts, or job submission scripts, are the mechanism by which a user configures and submits a job for execution. A batch script is simply a shell script that also includes commands to be interpreted by the batch scheduling software (e.g. Slurm).
Batch scripts are submitted to the batch scheduler, where they are parsed for scheduling configuration options. The batch scheduler then places the script in the appropriate queue, where it is designated as a batch job. Once the batch job makes its way through the queue, the script will be executed on the primary compute node of the allocated resources.
Example Batch Script
~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH -A XXXYYY
   #SBATCH -J test
   #SBATCH -N 2
   #SBATCH -t 1:00:00

   cd $SLURM_SUBMIT_DIR
   date
   srun -n 8 ./a.out
**Interpreter Line**
1: This line is optional and can be used to specify a shell to interpret the script. In this example, the bash shell will be used.
**Slurm Options**
2: The job will be charged to the "XXXYYY" project.
3: The job will be named test.
4: The job will request (2) nodes.
5: The job will request (1) hour walltime.
**Shell Commands**
6: This line is left blank, so it will be ignored.
7: This command will change the current directory to the directory from where the script was submitted.
8: This command will run the date command.
9: This command will run (8) MPI instances of the executable a.out on the compute nodes allocated by the batch system.
Submitting a Batch Script
~~~~~~~~~~~~~~~~~~~~~~~~~
Batch scripts can be submitted for execution using the ``sbatch`` command. For example, the following will submit the batch script named ``test.slurm``:
``sbatch test.slurm``
If successfully submitted, a Slurm job ID will be returned. This ID can be used to track the job. It is also helpful in troubleshooting
a failed job; make a note of the job ID for each of your jobs in case you must contact the OLCF User Assistance Center for support.
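A successful submission prints the job ID (the number below is illustrative):

.. code-block:: bash

   $ sbatch test.slurm
   Submitted batch job 123456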
Common Batch Options for Slurm
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. list-table::
   :widths: 25 25 15
   :header-rows: 1

   * - Option
     - Use
     - Description
   * - ``-A``
     - #SBATCH -A <account>
     - Causes the job time to be charged to <account>.
   * - ``-N``
     - #SBATCH -N <value>
     - Number of compute nodes to allocate. Jobs cannot request partial nodes.
   * - ``-t``
     - #SBATCH -t <time>
     - Maximum wall-clock time. <time> is in the format HH:MM:SS.
   * - ``-p``
     - #SBATCH -p <partition_name>
     - Allocates resources on the specified partition.
   * - ``-o``
     - #SBATCH -o <filename>
     - Writes standard output to <filename>.
   * - ``-e``
     - #SBATCH -e <filename>
     - Writes standard error to <filename>.
   * - ``--mail-type``
     - #SBATCH --mail-type=FAIL
     - Sends email to the submitter when the job fails.
   * -
     - #SBATCH --mail-type=BEGIN
     - Sends email to the submitter when the job begins.
   * -
     - #SBATCH --mail-type=END
     - Sends email to the submitter when the job ends.
   * - ``--mail-user``
     - #SBATCH --mail-user=<address>
     - Specifies the email address to use for --mail-type options.
   * - ``-J``
     - #SBATCH -J <name>
     - Sets the job name to <name> instead of the name of the job script.
   * - ``--mem=0``
     - #SBATCH --mem=0
     - Requests all available memory on the node.
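Putting several of these options together, a hypothetical batch script (the project ID, job name, email address, and executable are placeholders) might begin:

.. code-block:: bash

   #!/bin/bash
   #SBATCH -A XXXYYY                    # project to charge
   #SBATCH -J my_job                    # job name
   #SBATCH -N 2                         # number of compute nodes
   #SBATCH -t 2:00:00                   # walltime in HH:MM:SS
   #SBATCH -o my_job.%j.out             # standard output; %j expands to the job ID
   #SBATCH -e my_job.%j.err             # standard error
   #SBATCH --mail-type=END
   #SBATCH --mail-user=user@example.com

   srun -n 8 ./a.out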
Interactive Jobs
~~~~~~~~~~~~~~~~
Batch scripts are useful when one has a pre-determined group of commands to execute, the results of which can be viewed at a later time.
However, it is often necessary to run tasks on compute resources interactively.
Users are not allowed to access cluster compute nodes directly from a login node. Instead, users must use an interactive
batch job to allocate and gain access to compute resources. This is done with the Slurm ``salloc`` command; other Slurm options are passed to ``salloc`` on the command line as well:
``$ salloc -A abc123 -p research -N 4 -t 1:00:00``
This request will:
.. list-table::
   :widths: 25 25
   :header-rows: 0

   * - ``salloc``
     - Start an interactive session
   * - ``-A``
     - Charge to the ``abc123`` project
   * - ``-p research``
     - Run in the ``research`` partition
   * - ``-N 4``
     - Request (4) nodes...
   * - ``-t 1:00:00``
     - ...for (1) hour
After running this command, the job will wait until enough compute nodes are available, just as any other batch job must.
However, once the job starts, the user will be given an interactive prompt on the primary compute node within the allocated resource pool.
Commands may then be executed directly (instead of through a batch script).
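For example, once the allocation is granted, commands entered at the interactive prompt run within the job (the job ID shown is illustrative):

.. code-block:: bash

   $ salloc -A abc123 -p research -N 4 -t 1:00:00
   salloc: Granted job allocation 123456
   $ srun -n 4 ./a.out    # runs directly on the allocated compute nodes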
A common use of interactive batch jobs is to aid in debugging efforts. Interactive access to compute resources allows you
to run a process to the point of failure; unlike a batch job, the process can then be restarted after brief changes are made without
losing the compute resource pool, speeding up the debugging effort.
.. _cumulus_1:

.. toctree::
   :maxdepth: 2