Use MPI for torch distributed bootstrap
There appears to be a TCP connection issue on Ascent that prevents distributed initialization when certain hosts (e.g. h49n01) are involved. This seems to result from a timing issue in the torch C++ distributed backend (1.10).
This patch uses mpi4py to set up torch.distributed "manually". The setup relies on an MPI collective, which appears to resolve the timing problem.
Since MPI is now being initialized (which loads ibverbs in Spectrum MPI), we have to enable fork-safe mode (IBV_FORK_SAFE=1) to support dataloaders.
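
The bootstrap described above can be sketched roughly as follows. This is an illustration, not the patch itself: the port number and the choice of rendezvous scheme (MASTER_ADDR/MASTER_PORT environment variables) are assumptions.

```python
import os

# IBV_FORK_SAFE must be set before MPI (and hence ibverbs) is initialized,
# otherwise forked dataloader workers can hit fork-safety issues.
os.environ["IBV_FORK_SAFE"] = "1"

from mpi4py import MPI
import torch.distributed as dist

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
world_size = comm.Get_size()

# Rank 0 picks a rendezvous address; broadcasting it is an MPI collective,
# which also synchronizes the ranks before torch.distributed starts.
master_addr = MPI.Get_processor_name() if rank == 0 else None
master_addr = comm.bcast(master_addr, root=0)

os.environ["MASTER_ADDR"] = master_addr
os.environ["MASTER_PORT"] = "29500"  # any free port; illustrative value

# Backend choice (nccl vs gloo) depends on the job; nccl assumed here.
dist.init_process_group("nccl", rank=rank, world_size=world_size)
```

Note that IBV_FORK_SAFE is set before `from mpi4py import MPI`, since importing mpi4py typically triggers MPI_Init.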