Commit 2d328e1c authored by Papatheodore, Thomas

updated README and example_map.sh script

parent 6cdf9254
README: +42 −9
@@ -6,19 +6,52 @@ For each job step launched with a job launcher, this program prints the hardware

To compile, you'll need to have HIP and MPI installed, and you'll need to use an OpenMP-capable compiler. Modify the Makefile accordingly.
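
As a rough illustration only (the authoritative flags live in the repo's Makefile), a hipcc + OpenMPI build might look something like the lines below; the source file name `hello_jobstep.cpp` and the OpenMPI `mpicxx --showme:*` wrapper queries are assumptions about the setup, not taken from this repo:

```
# Hypothetical hipcc + OpenMPI + OpenMP build -- see the repo's Makefile for the real recipe
MPI_COMPILE_FLAGS=$(mpicxx --showme:compile)   # OpenMPI compile flags
MPI_LINK_FLAGS=$(mpicxx --showme:link)         # OpenMPI link flags
hipcc -fopenmp ${MPI_COMPILE_FLAGS} -c hello_jobstep.cpp -o hello_jobstep.o
hipcc -fopenmp hello_jobstep.o -o hello_jobstep ${MPI_LINK_FLAGS}
```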

-### MPI + Compiler + HIP Combinations
### Included MPI + Compiler + HIP Combinations

-<b>CrayMPI + Cray Clang + ROCm  --> Makefile.crayMPI.crayClang</b>
-* Requires ROCm <= v3.8 due to incompatibilities with the latest Cray compilers
<b>hipcc + OpenMPI</b>

-<b>CrayMPI + hipcc + ROCm       --> Makefile.crayMPI.hipcc</b>

-<b>OpenMPI + hipcc + ROCm       --> Makefile.openMPI.hipcc</b>
<b>CC + CrayMPI</b>

## Usage

To run, simply launch the executable with your favorite job launcher. 

-> NOTE: `HIP_VISIBLE_DEVICES` must be set.

-> [OPTIONAL] An example mapping script is also included in this repo for an optional heavy-handed approach to process/thread mapping. It can be modified and called "in front of" `hello_jobstep` (or any other executable really). The script uses `numactl` to map hardware threads and GPUs to node-local MPI ranks. NOTE: You will need to use the `srun` argument `--ntasks-per-gpu` with this script.
> NOTE: If the output comes out garbled, you likely don't have `ROCR_VISIBLE_DEVICES` set. It can be set manually before running, or set implicitly with the `--gpus-per-node` flag or the `--ntasks-per-gpu` flag (although the latter is currently broken). It is always recommended to add a `| sort` at the end of the job step line for easier parsing (see the examples below).
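
For instance, a minimal job step might look like the following (a sketch only; the partition and account are just the values reused from the examples below, and a 4-GPU node is assumed):

```
$ srun -p mi100 -A stf016 -t 10 -N 1 -n 4 --gpus-per-node=4 ./hello_jobstep | sort
```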

> NOTE: A `gpu_map.sh` script is also included in the repo. It can be run just before the `hello_jobstep` executable to map GPUs to node-local MPI tasks in a round-robin fashion (see the sketch after the example below).

For example...

```
$ srun -p mi100 -A stf016 -t 10 -N 1 -n 6 --cpu-bind=cores ./gpu_map.sh ./hello_jobstep | sort
MPI   0 - OMP   0 - HWT 192 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   1 - OMP   0 - HWT 193 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
MPI   2 - OMP   0 - HWT 194 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
MPI   3 - OMP   0 - HWT 195 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
MPI   4 - OMP   0 - HWT 196 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   5 - OMP   0 - HWT 197 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
```
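
The `gpu_map.sh` script itself is not shown in this README, but a round-robin wrapper of this kind can be very small. The sketch below is only an assumption about how such a wrapper might look (it relies on Slurm exporting `SLURM_LOCALID` and assumes 4 GPUs per node); it is not the repo's script verbatim:

```
#!/bin/bash
# Hypothetical round-robin GPU mapping wrapper -- not the repo's gpu_map.sh verbatim
# Usage: srun ... ./this_wrapper.sh ./hello_jobstep [args...]
GPUS_PER_NODE=4                                        # assumed node topology
export ROCR_VISIBLE_DEVICES=$((SLURM_LOCALID % GPUS_PER_NODE))
exec "$@"                                              # launch the wrapped executable
```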

> [OPTIONAL] An example mapping script is also included in this repo for a more heavy-handed approach to process/thread mapping. It can be modified and called "in front of" `hello_jobstep` (or any other executable really). The script uses `numactl` to map hardware threads and GPUs to node-local MPI ranks. NOTE: You should NOT use `--cpu-bind` with this script.

For example...

```
$ srun -p mi100 -A stf016 -t 10 -N 1 -n 4 ./example_map.sh ./hello_jobstep | sort
MPI   0 - OMP   0 - HWT  64 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   0 - OMP   1 - HWT  65 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   0 - OMP   2 - HWT  66 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   0 - OMP   3 - HWT  67 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
MPI   1 - OMP   0 - HWT  68 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
MPI   1 - OMP   1 - HWT  69 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
MPI   1 - OMP   2 - HWT  70 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
MPI   1 - OMP   3 - HWT  71 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
MPI   2 - OMP   0 - HWT  72 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
MPI   2 - OMP   1 - HWT  73 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
MPI   2 - OMP   2 - HWT  74 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
MPI   2 - OMP   3 - HWT  75 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
MPI   3 - OMP   0 - HWT  76 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
MPI   3 - OMP   1 - HWT  77 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
MPI   3 - OMP   2 - HWT  78 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
MPI   3 - OMP   3 - HWT  79 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
```
example_map.sh: +6 −14
#!/bin/bash

#------------------------------------------------------
-# Set the executable name from the first command line
# Sets the executable name from the first command line 
# argument to this script
#
# NOTE: You'll need to read in more command line args 
@@ -9,14 +9,6 @@
#------------------------------------------------------
APP=$1

-#------------------------------------------------------
-# Set the number of node-local MPI ranks
-#
-# NOTE: The `--ntasks-per-node` flag to srun must be
-# used to set SLURM_NTASKS_PER_NODE.
-#------------------------------------------------------
-lrank=$(($SLURM_PROCID % $SLURM_NTASKS_PER_NODE))

#------------------------------------------------------
# OpenMP environment variables
#
@@ -38,24 +30,24 @@ export OMP_PLACES=cores
# NOTE: For more than 4 MPI ranks per node, 
# additional cases would need to be added.
#------------------------------------------------------
-case ${lrank} in
case ${SLURM_LOCALID} in
[0])
-export HIP_VISIBLE_DEVICES=0
export ROCR_VISIBLE_DEVICES=0
numactl --physcpubind=64,65,66,67 $APP
  ;;

[1])
-export HIP_VISIBLE_DEVICES=1
export ROCR_VISIBLE_DEVICES=1
numactl --physcpubind=68,69,70,71 $APP
  ;;

[2])
-export HIP_VISIBLE_DEVICES=2
export ROCR_VISIBLE_DEVICES=2
numactl --physcpubind=72,73,74,75 $APP
  ;;

[3])
-export HIP_VISIBLE_DEVICES=3
export ROCR_VISIBLE_DEVICES=3
numactl --physcpubind=76,77,78,79 $APP
  ;;