Use MPI for torch distributed bootstrap

Glaser, Jens requested to merge try_fix_hang into clip_wciscc2024

There appears to be a TCP connection issue on Ascent that prevents torch.distributed initialization when certain hosts (h49n01) are involved. This seems to result from a timing issue in the PyTorch C++ distributed backend (1.10).

This patch uses mpi4py to set up torch.distributed "manually". The bootstrap relies on an MPI collective and appears to avoid the timing problem.
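A minimal sketch of what such an MPI-based bootstrap can look like. The helper name `bootstrap_env`, the port number, and the commented usage lines are illustrative assumptions, not the exact code in this patch; the idea is that rank 0 broadcasts its hostname via an MPI collective and every rank then fills in the environment variables that `torch.distributed.init_process_group` reads.

```python
import os
import socket

def bootstrap_env(rank, world_size, master_addr, port=29500):
    """Populate the environment torch.distributed expects (hypothetical helper).

    The port value is an arbitrary example, not taken from the patch.
    """
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }
    os.environ.update(env)
    return env

# Hypothetical usage on a cluster (requires mpi4py and torch):
# from mpi4py import MPI
# comm = MPI.COMM_WORLD
# # Rank 0 broadcasts its hostname; the collective also synchronizes the ranks.
# master = comm.bcast(socket.gethostname() if comm.rank == 0 else None, root=0)
# bootstrap_env(comm.rank, comm.size, master)
# import torch.distributed as dist
# dist.init_process_group("nccl", rank=comm.rank, world_size=comm.size)
```

Because every rank participates in the broadcast before `init_process_group` is called, no rank can race ahead of the TCP rendezvous, which is presumably why the collective sidesteps the timing issue.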

Since MPI is now being initialized (which loads ibverbs in Spectrum MPI), we have to enable fork-safe mode (IBV_FORK_SAFE=1) so that forking dataloader workers keeps working.
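The key constraint is ordering: the variable has to be in the environment before ibverbs is loaded, i.e. before mpi4py is imported. A sketch of how that might look in the training script (whether the patch sets it in Python or in the job script is an assumption here):

```python
import os

# Must be set BEFORE mpi4py (and hence ibverbs via Spectrum MPI) is imported,
# otherwise libibverbs ignores it and fork()-based dataloader workers can crash.
os.environ["IBV_FORK_SAFE"] = "1"

# Only import MPI after the variable is in place:
# from mpi4py import MPI
```

Setting it in the launch environment (e.g. `export IBV_FORK_SAFE=1` before the job's `python` invocation) would achieve the same thing.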

Edited by Glaser, Jens
