removed unnecessary scripts and updated README (49f8252d) · Commits · olcf / hello_jobstep

Makefile.CC-cmpi→Makefile

+0 −0

File moved.

Makefile.openMPI.hipcc

deleted100644 → 0

+0 −16

Original line number	Diff line number	Diff line
		COMP = hipcc
		FLAGS = --amdgpu-target=gfx906,gfx908 -fopenmp

		INCLUDES = -I$(OLCF_OPENMPI_ROOT)/include
		LIBRARIES = -L$(OLCF_OPENMPI_ROOT)/lib -lmpi

		hello_jobstep: hello_jobstep.o
		$(COMP) $(FLAGS) $(LIBRARIES) hello_jobstep.o -o hello_jobstep

		hello_jobstep.o: hello_jobstep.cpp
		$(COMP) $(FLAGS) $(INCLUDES) -c hello_jobstep.cpp

		.PHONY: clean

		clean:
		rm -f hello_jobstep *.o

README.md

+29 −91

Original line number	Diff line number	Diff line
		# hello_jobstep

		For each job step launched with a job launcher, this program prints the hardware thread IDs that each MPI rank and OpenMP thread runs on, and the GPU IDs that each rank/thread has access to.
		For each job step launched with srun, this program prints the hardware thread IDs that each MPI rank and OpenMP thread runs on, and the GPU IDs that each rank/thread has access to.

		## Compiling

		To compile, you'll need to have HIP and MPI installed, and you'll need to use an OpenMP-capable compiler. Modify the Makefile accordingly.

		### Included MPI + Compiler + HIP Combinations

		* hipcc + OpenMPI
		### Included Compiler + MPI + HIP Combinations

		* CC + CrayMPI

		> NOTE: When using Cray's MPI, you must set `export MV2_ENABLE_AFFINITY=0` to properly use Slurm's binding flags. Otherwise, the Cray MPI binding will take precedence and might give unexpected/undesired results.

		## Usage

		To run, simply launch the executable with your favorite job launcher. For example...
		To run, set the `OMP_NUM_THREADS` environment variable and launch the executable with `srun`. For example...

		```
		$ export OMP_NUM_THREADS=4
		$ srun -p mi100 -A stf016 -t 10 -N 2 -n 4 -c 8 --cpu-bind=cores --gpus-per-node=4 ./hello_jobstep \| sort
		MPI 0 - OMP 0 - HWT 195 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 0 - OMP 1 - HWT 66 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 0 - OMP 2 - HWT 65 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 0 - OMP 3 - HWT 64 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 1 - OMP 0 - HWT 199 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 1 - OMP 1 - HWT 70 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 1 - OMP 2 - HWT 69 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 1 - OMP 3 - HWT 68 - Node lyra16 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 2 - OMP 0 - HWT 195 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 2 - OMP 1 - HWT 66 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 2 - OMP 2 - HWT 65 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 2 - OMP 3 - HWT 64 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 3 - OMP 0 - HWT 211 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 3 - OMP 1 - HWT 208 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 3 - OMP 2 - HWT 81 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		MPI 3 - OMP 3 - HWT 80 - Node lyra17 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c3,c6,a3,83
		```

		> NOTE: Since there are 4 OpenMP threads per MPI rank, I've included `-c 8` to make sure each MPI rank has 4 physical CPU cores to spawn the 4 OpenMP threads on. The `-c` option counts hardware threads, not physical CPU cores (there are 2 hardware threads per physical core).

		> NOTE: RT_GPU_ID shows the "HIP runtime" numbering of the GPUs and GPU_ID shows the "node-level" numbering of the GPUs. The node-level numbering is what you would intuitively think of - the values 0, 1, 2, and 3 on a Lyra node - but as labeled by the HIP runtime, the GPUs visible to each MPI rank are numbered starting from 0. So if 2 MPI ranks have 2 GPUs available to them, MPI 0 might have GPUs 0 and 1 and MPI 1 might have GPUs 2 and 3. But the HIP runtime (as seen from `hipGetDevice`) will show GPU IDs 0 and 1 for both MPI ranks. This is not to say that both MPI ranks have access to the same 2 GPUs; just that the runtime labels the GPUs this way. In fact, the Bus ID for each GPU can be used to verify the MPI ranks do, in fact, have access to different GPUs.

		> NOTE: If the value of GPU_ID is reported as "N/A", the environment varible `ROCR_VISIBLE_DEVICES` is not set. The program will still run fine without it - it's really only there to try to capture the node-level GPU IDs rather than the runtime GPU IDs. But the Bus IDs can be used to verify different GPUs. If desired though, `ROCR_VISIBLE_DEVICES` can be set manually before running, or set implicitly with the `--gpus-per-node` flag or `--ntasks-per-gpu` flag (although the latter is currently broken - see below for work around). It is always recommended to add a `\| sort` at the end of the job step line for easier parsing (see some examples below).

		### [OPTIONAL] `gpu_map.sh`

		There is a `gpu_map.sh` script included in the repo also. This can be run just before the `hello_jobstep` executable to map GPUs to node-local MPI tasks in a round-robin fashion.

		For example...

		```
		$ export OMP_NUM_THREADS=1
		$ srun -p mi100 -A stf016 -t 10 -N 1 -n 6 --cpu-bind=cores ./gpu_map.sh ./hello_jobstep \| sort
		MPI 0 - OMP 0 - HWT 192 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 1 - OMP 0 - HWT 193 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 2 - OMP 0 - HWT 194 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 3 - OMP 0 - HWT 195 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 4 - OMP 0 - HWT 196 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 5 - OMP 0 - HWT 197 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		```

		### [OPTIONAL] `example_map.sh`

		An example mapping script is also included in this repo for an optional heavy-handed approach to process/thread mapping. It can be modifed and called "in front of" `hello_jobstep` (or any other executable really). The script uses `numactl` to map hardware threads and GPUs to node-local MPI ranks.

		> NOTE: You should NOT use `--cpu-bind` with this script. You also do not need to set `OMP_NUM_THREADS` since it is set in the script.

		For example...

		```
		$ srun -p mi100 -A stf016 -t 10 -N 1 -n 4 ./example_map.sh ./hello_jobstep \| sort
		MPI 0 - OMP 0 - HWT 64 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 0 - OMP 1 - HWT 65 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 0 - OMP 2 - HWT 66 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 0 - OMP 3 - HWT 67 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 1 - OMP 0 - HWT 68 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 1 - OMP 1 - HWT 69 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 1 - OMP 2 - HWT 70 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 1 - OMP 3 - HWT 71 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 2 - OMP 0 - HWT 72 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 2 - OMP 1 - HWT 73 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 2 - OMP 2 - HWT 74 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 2 - OMP 3 - HWT 75 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 3 - OMP 0 - HWT 76 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 3 - OMP 1 - HWT 77 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 3 - OMP 2 - HWT 78 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		MPI 3 - OMP 3 - HWT 79 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		```

		### [OPTIONAL] `fix.sh`

		As mentioned above, the `--ntasks-per-gpu` flag is currently broken. As a work around, you can use the flag with this script run in front of the executable. It simply unsets `CUDA_VISIBLE_DEVICES`, which somehow interferes with the `ROCM_VISIBLE_DEVICES` environment variable that this flag sets. For example...

		```
		$ export OMP_NUM_THREADS=1
		$ srun -p mi100 -A stf016 -t 10 -N 1 -n 2 --cpu-bind=cores --gpus-per-node=4 --ntasks-per-gpu=1 ./fix.sh ./hello_jobstep \| sort
		MPI 0 - OMP 0 - HWT 192 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c3
		MPI 1 - OMP 0 - HWT 193 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
		MPI 2 - OMP 0 - HWT 194 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID a3
		MPI 3 - OMP 0 - HWT 195 - Node lyra14 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID 83
		```

		> NOTE: This would have failed without the `fix.sh` script.
		$ srun -A stf016 -t 10 -N 2 -n 4 -c 4 --threads-per-core=1 --gpus-per-node=4 ./hello_jobstep \| sort
		MPI 000 - OMP 000 - HWT 000 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 000 - OMP 001 - HWT 001 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 000 - OMP 002 - HWT 002 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 000 - OMP 003 - HWT 003 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 001 - OMP 000 - HWT 016 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 001 - OMP 001 - HWT 017 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 001 - OMP 002 - HWT 018 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 001 - OMP 003 - HWT 019 - Node spock01 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 002 - OMP 000 - HWT 000 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 002 - OMP 001 - HWT 001 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 002 - OMP 002 - HWT 002 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 002 - OMP 003 - HWT 003 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 003 - OMP 000 - HWT 016 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 003 - OMP 001 - HWT 017 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 003 - OMP 002 - HWT 018 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		MPI 003 - OMP 003 - HWT 019 - Node spock13 - RT_GPU_ID 0,1,2,3 - GPU_ID 0,1,2,3 - Bus_ID c9,87,48,09
		```

		The different GPU IDs reported by the example program are:

		* `GPU_ID` is the node-level (or global) GPU ID read from `ROCR_VISIBLE_DEVICES`. If this environment variable is not set (either by the user or by Slurm), the value of `GPU_ID` will be set to `N/A`.
		* `RT_GPU_ID` is the HIP runtime GPU ID (as reported from, say `hipGetDevice`).
		* `Bus_ID` is the physical bus ID associated with the GPUs. Comparing the bus IDs is meant to definitively show that different GPUs are being used.

		> NOTE: Although the two GPU IDs (`GPU_ID` and `RT_GPU_ID`) are the same in the example above, they do not have to be. See the Spock Quick-Start Guide for such examples.

example_map.sh

deleted100755 → 0

+0 −54

Original line number	Diff line number	Diff line
		#!/bin/bash

		#------------------------------------------------------
		# Sets the executable name from the first command line
		# argument to this script
		#
		# NOTE: You'll need to read in more command line args
		# if your executable takes arguments
		#------------------------------------------------------
		APP=$1

		#------------------------------------------------------
		# OpenMP environment variables
		#
		# NOTE: If you change the number of OpenMP threads,
		# you will also need to change the --physcpubind
		# values below. The values given are hardware thread
		# IDs, so if you want 1 OpenMP thread per physical
		# core, look at the Lyra node diagram and make sure
		# to use only 1 hw thread per physical core for each
		# comma-separated value.
		#------------------------------------------------------
		export OMP_NUM_THREADS=4
		export OMP_PLACES=cores

		#------------------------------------------------------
		# Set hardware thread IDs and GPUs for each node-local
		# MPI rank.
		#
		# NOTE: For more than 4 MPI ranks per node,
		# additional cases would need to be added.
		#------------------------------------------------------
		case ${SLURM_LOCALID} in
		[0])
		export ROCR_VISIBLE_DEVICES=0
		numactl --physcpubind=64,65,66,67 $APP
		;;

		[1])
		export ROCR_VISIBLE_DEVICES=1
		numactl --physcpubind=68,69,70,71 $APP
		;;

		[2])
		export ROCR_VISIBLE_DEVICES=2
		numactl --physcpubind=72,73,74,75 $APP
		;;

		[3])
		export ROCR_VISIBLE_DEVICES=3
		numactl --physcpubind=76,77,78,79 $APP
		;;

		esac

fix.sh

deleted100755 → 0

+0 −4

Original line number	Diff line number	Diff line
		#!/bin/bash

		unset CUDA_VISIBLE_DEVICES
		exec $*