added fix.sh script and updated README (f1545b52) · Commits · olcf / hello_jobstep

README.md

+26 −6

Original line number	Diff line number	Diff line
		@@ -18,8 +18,6 @@ To compile, you'll need to have HIP and MPI installed, and you'll need to use an

		To run, simply launch the executable with your favorite job launcher. For example...

		> NOTE: Since there are 4 OpenMP threads per MPI rank, I've included `-c 8` to make sure each MPI rank has 4 physical CPU cores to spawn the 4 OpenMP threads on. The `-c` option counts hardware threads, not physical CPU cores (there are 2 hardware threads per physical core).

		```
		$ export OMP_NUM_THREADS=4
		$ srun -p mi100 -A stf016 -t 10 -N 2 -n 4 -c 8 --cpu-bind=cores --gpus-per-node=4 ./hello_jobstep | sort
		@@ -41,15 +39,18 @@ MPI 3 - OMP 2 - HWT 81 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 -
		MPI 3 - OMP 3 - HWT 80 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		```

		### Additional Notes
		> NOTE: Since there are 4 OpenMP threads per MPI rank, I've included `-c 8` to make sure each MPI rank has 4 physical CPU cores to spawn the 4 OpenMP threads on. The `-c` option counts hardware threads, not physical CPU cores (there are 2 hardware threads per physical core).

		> NOTE: If the output comes out garbled, you likely don't have `ROCR_VISIBLE_DEVICES` set. This can be set manually before running, or set implicitly with the `--gpus-per-node` flag or the `--ntasks-per-gpu` flag (although the latter is currently broken; see below for a workaround). It is always recommended to add `| sort` at the end of the job step line for easier parsing (see some examples below).

		> NOTE: If the output comes out garbled, you likely don't have `ROCR_VISIBLE_DEVICES` set. This can be set manually before running, or set implicitly with the `--gpus-per-node` flag or the `--ntasks-per-gpu` flag (although the latter is currently broken). It is always recommended to add `| sort` at the end of the job step line for easier parsing (see some examples below).
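
The `-c 8` value in the first note follows from simple arithmetic: one physical core per OpenMP thread, times the number of hardware threads per core. A minimal sketch of that calculation, assuming a hypothetical helper named `cpus_per_task` (it is not part of this repo):

```shell
#!/bin/bash
# Minimal sketch (hypothetical helper, not part of the repo):
# derive the srun -c value from the OpenMP thread count.

# Print the number of hardware threads to request per MPI rank so that
# each OpenMP thread gets its own physical core.
#   $1 = OpenMP threads per rank, $2 = hardware threads per physical core
cpus_per_task() {
    local omp_threads=$1 hwt_per_core=$2
    echo $(( omp_threads * hwt_per_core ))
}

# 4 OpenMP threads, 2 hardware threads per core, as in the note above.
cpus_per_task 4 2   # prints 8
```

The result can then be passed to the launcher, for example `srun -c "$(cpus_per_task "$OMP_NUM_THREADS" 2)" ...`.
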
		### [OPTIONAL] `gpu_map.sh`

		> NOTE: There is a `gpu_map.sh` script included in the repo also. This can be run just before the `hello_jobstep` executable to map GPUs to node-local MPI tasks in a round-robin fashion.
		There is a `gpu_map.sh` script included in the repo also. This can be run just before the `hello_jobstep` executable to map GPUs to node-local MPI tasks in a round-robin fashion.

		For example...

		```
		$ export OMP_NUM_THREADS=1
		$ srun -p mi100 -A stf016 -t 10 -N 1 -n 6 --cpu-bind=cores ./gpu_map.sh ./hello_jobstep | sort
		MPI 0 - OMP 0 - HWT 192 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 1 - OMP 0 - HWT 193 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		@@ -59,7 +60,11 @@ MPI 4 - OMP 0 - HWT 196 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 5 - OMP 0 - HWT 197 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		```
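
The round-robin pattern in the output above (GPU 0, 1, 2, 3, then 0, 1 again) can be sketched as a modulo over the node-local rank. This is an illustration only and not the contents of the repo's `gpu_map.sh`; the `pick_gpu` function and the `NUM_GPUS` variable are assumptions made for the example, while `SLURM_LOCALID` is the node-local task ID that Slurm exports.

```shell
#!/bin/bash
# Hypothetical sketch of round-robin GPU mapping; NOT the repo's gpu_map.sh.

# Print the GPU index for a node-local rank, cycling through the GPUs.
#   $1 = node-local rank, $2 = number of GPUs on the node
pick_gpu() {
    local local_rank=$1 num_gpus=$2
    echo $(( local_rank % num_gpus ))
}

# When called as a wrapper (srun ... ./wrapper.sh ./app), expose one GPU
# to this task and then replace the shell with the wrapped command.
if [ "$#" -gt 0 ]; then
    # NUM_GPUS is an assumed knob for this sketch; it defaults to 4 here.
    ROCR_VISIBLE_DEVICES=$(pick_gpu "${SLURM_LOCALID:-0}" "${NUM_GPUS:-4}")
    export ROCR_VISIBLE_DEVICES
    exec "$@"
fi
```

With 6 tasks and 4 GPUs, ranks 4 and 5 wrap around to GPUs 0 and 1, which is the sharing visible in the example output.
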

		> [OPTIONAL] An example mapping script is also included in this repo for an optional heavy-handed approach to process/thread mapping. It can be modified and called "in front of" `hello_jobstep` (or any other executable really). The script uses `numactl` to map hardware threads and GPUs to node-local MPI ranks. NOTE: You should NOT use `--cpu-bind` with this script.
		### [OPTIONAL] `example_map.sh`

		An example mapping script is also included in this repo for an optional heavy-handed approach to process/thread mapping. It can be modified and called "in front of" `hello_jobstep` (or any other executable really). The script uses `numactl` to map hardware threads and GPUs to node-local MPI ranks.

		> NOTE: You should NOT use `--cpu-bind` with this script. You also do not need to set `OMP_NUM_THREADS` since it is set in the script.

		For example...

		@@ -82,3 +87,18 @@ MPI 3 - OMP 1 - HWT 77 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 3 - OMP 2 - HWT 78 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 3 - OMP 3 - HWT 79 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		```
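
For readers unfamiliar with this style of wrapper, the general shape of a `numactl`-based mapping script might look like the sketch below. It is not the repo's `example_map.sh`: the `cpu_range` helper, the `THREADS_PER_RANK` and `NUM_GPUS` variables, and the contiguous core layout starting at 0 are all assumptions, and that layout does not reproduce the hardware thread numbering shown in the output above. `--physcpubind` is the standard `numactl` option for pinning a process to a list of CPUs.

```shell
#!/bin/bash
# Hypothetical sketch of a numactl mapping wrapper; NOT the repo's example_map.sh.

# Print a contiguous hardware-thread range "first-last" for a node-local rank.
#   $1 = node-local rank, $2 = hardware threads per rank
cpu_range() {
    local local_rank=$1 threads_per_rank=$2
    local first=$(( local_rank * threads_per_rank ))
    local last=$(( first + threads_per_rank - 1 ))
    echo "${first}-${last}"
}

# When called as a wrapper (srun ... ./wrapper.sh ./app), pin this task's
# threads and GPU, then replace the shell with the wrapped command.
if [ "$#" -gt 0 ]; then
    rank=${SLURM_LOCALID:-0}            # node-local task ID set by Slurm
    threads=${THREADS_PER_RANK:-4}      # assumed knob for this sketch
    export OMP_NUM_THREADS=$threads     # set here, so not needed on the command line
    export ROCR_VISIBLE_DEVICES=$(( rank % ${NUM_GPUS:-4} ))
    exec numactl --physcpubind="$(cpu_range "$rank" "$threads")" "$@"
fi
```

Because the wrapper does its own pinning, combining it with `--cpu-bind` would have the launcher and the script both trying to set CPU affinity for the same task.
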

		### [OPTIONAL] `fix.sh`

		As mentioned above, the `--ntasks-per-gpu` flag is currently broken. As a workaround, you can use the flag with this script run in front of the executable. It simply unsets `CUDA_VISIBLE_DEVICES`, which somehow interferes with the `ROCR_VISIBLE_DEVICES` environment variable that this flag sets. For example...

		```
		$ export OMP_NUM_THREADS=1
		$ srun -p mi100 -A stf016 -t 10 -N 1 -n 2 --cpu-bind=cores --gpus-per-node=4 --ntasks-per-gpu=1 ./fix.sh ./hello_jobstep | sort
		MPI 0 - OMP 0 - HWT 192 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 1 - OMP 0 - HWT 193 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 2 - OMP 0 - HWT 194 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 3 - OMP 0 - HWT 195 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		```

		> NOTE: This would have failed without the `fix.sh` script.
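
The effect of this kind of wrapper can be checked without a job launcher. The sketch below wraps the same idea in a function so it can be tried in any shell; the name `run_without_cuda_visible` is hypothetical, and the subshell is used only so that the calling shell's environment is left untouched.

```shell
#!/bin/bash
# Sketch of the wrapper pattern behind fix.sh, written as a function
# (hypothetical name) so it can be tried without srun.

# Run a command with CUDA_VISIBLE_DEVICES removed from its environment.
# The subshell keeps the caller's own environment unchanged.
run_without_cuda_visible() {
    ( unset CUDA_VISIBLE_DEVICES; exec "$@" )
}

# Simulate a launcher that has set the variable.
export CUDA_VISIBLE_DEVICES=0

# The child process no longer sees the variable.
run_without_cuda_visible sh -c 'echo "${CUDA_VISIBLE_DEVICES:-unset}"'   # prints "unset"
```
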

fix.sh

0 → 100755

+4 −0

		#!/bin/bash

		unset CUDA_VISIBLE_DEVICES
		# "$@" keeps each argument intact, even if it contains spaces
		exec "$@"