> NOTE: Since there are 4 OpenMP threads per MPI rank, I've included `-c 8` to make sure each MPI rank has 4 physical CPU cores to spawn those threads on. The `-c` option counts hardware threads, not physical CPU cores, and there are 2 hardware threads per physical core, so 4 cores require `-c 8`.
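
The arithmetic behind the `-c` value can be sketched as follows. This is only an illustration: the helper name `cpus_per_task` is hypothetical, and it assumes one OpenMP thread per physical core with the 2 hardware threads per core stated in the note above.

```python
def cpus_per_task(omp_threads, hw_threads_per_core=2):
    """Return a value for -c, which counts hardware threads rather than cores.

    Assumes one OpenMP thread per physical core (an assumption of this sketch).
    """
    return omp_threads * hw_threads_per_core


# 4 OpenMP threads per MPI rank, 2 hardware threads per physical core
print(cpus_per_task(4))  # 8
```
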
> NOTE: If the output comes out garbled, you likely don't have `ROCR_VISIBLE_DEVICES` set. It can be set manually before running, or implicitly with the `--gpus-per-node` flag or the `--ntasks-per-gpu` flag (although the latter is currently broken - see below for a workaround). It is always recommended to add `| sort` at the end of the job step line for easier parsing (see some examples below).
> NOTE: RT_GPU_ID shows the "HIP runtime" numbering of the GPUs and GPU_ID shows the "node-level" numbering of the GPUs. The node-level numbering is what you would intuitively expect - the values 0, 1, 2, and 3 on a Lyra node. The HIP runtime, however, numbers the GPUs visible to each MPI rank starting from 0. So if 2 MPI ranks each have 2 GPUs available to them, MPI rank 0 might have GPUs 0 and 1 and MPI rank 1 might have GPUs 2 and 3, but the HIP runtime (as seen from `hipGetDevice`) will show GPU IDs 0 and 1 for both MPI ranks. This does not mean that both MPI ranks have access to the same 2 GPUs; it is just how the runtime labels them. The Bus ID of each GPU can be used to verify that the MPI ranks do, in fact, have access to different GPUs.
> NOTE: If the value of GPU_ID is reported as "N/A", the environment variable `ROCR_VISIBLE_DEVICES` is not set. The program will still run fine without it - it's really only there to capture the node-level GPU IDs rather than the runtime GPU IDs, and the Bus IDs can still be used to verify that different GPUs are in use. If desired though, `ROCR_VISIBLE_DEVICES` can be set manually before running, or set implicitly with the `--gpus-per-node` flag or the `--ntasks-per-gpu` flag (although the latter is currently broken - see below for a workaround).
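
The relationship between RT_GPU_ID and GPU_ID described in the notes above can be illustrated with a small sketch. It is not the program's actual implementation: the helper `map_runtime_to_node_ids` is hypothetical, and it assumes `ROCR_VISIBLE_DEVICES` holds a comma-separated list of node-level GPU indices and that the runtime numbers those devices from 0 in the listed order.

```python
def map_runtime_to_node_ids(visible_devices, runtime_count=0):
    """Pair each runtime GPU index (RT_GPU_ID) with a node-level ID (GPU_ID).

    visible_devices: the value of ROCR_VISIBLE_DEVICES, or None if it is unset.
    runtime_count:   number of GPUs the runtime reports; only used when the
                     variable is unset (an assumption of this sketch).
    """
    if not visible_devices:
        # Without the variable the node-level ID cannot be recovered.
        return [(rt_id, "N/A") for rt_id in range(runtime_count)]
    node_ids = [entry.strip() for entry in visible_devices.split(",")]
    # The runtime index is simply the position in the visible-device list.
    return list(enumerate(node_ids))


# Two MPI ranks with two GPUs each: both see runtime IDs 0 and 1,
# but the node-level IDs differ.
print(map_runtime_to_node_ids("0,1"))  # [(0, '0'), (1, '1')]
print(map_runtime_to_node_ids("2,3"))  # [(0, '2'), (1, '3')]
print(map_runtime_to_node_ids(None, runtime_count=2))  # [(0, 'N/A'), (1, 'N/A')]
```

In a real job step the string would come from `os.environ.get("ROCR_VISIBLE_DEVICES")`, and the Bus IDs remain the reliable way to confirm which physical GPU each rank is using.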