CUDA-aware MPI with PyTorch
A user reported the error below after installing PyTorch with this installation procedure. Have you been able to run CUDA-aware MPI examples on Summit with PyTorch?
```
Array to be scattered from rank 0 is
tensor([[ 0., 1., 2., 3., 4.],
[ 5., 6., 7., 8., 9.],
[10., 11., 12., 13., 14.],
[15., 16., 17., 18., 19.],
[20., 21., 22., 23., 24.],
[25., 26., 27., 28., 29.],
[30., 31., 32., 33., 34.],
[35., 36., 37., 38., 39.],
[40., 41., 42., 43., 44.],
[45., 46., 47., 48., 49.]], device='cuda:0')
Before Scatter: Rank 0 has
tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]], device='cuda:0')
Before Scatter: Rank 1 has
tensor([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]], device='cuda:0')
Traceback (most recent call last):
File "/ccs/home/vgv/examples/pytorch/cuda-aware.py", line 42, in <module>
dist.scatter(my_A, src=0, scatter_list=list(A_chunk))
File "/gpfs/alpine/stf007/world-shared/vgv/inbox/amalik/summit/pytorch-1.0-p3/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1144, in scatter
work = _default_pg.scatter(output_tensors, input_tensors, opts)
RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support
Traceback (most recent call last):
File "/ccs/home/vgv/examples/pytorch/cuda-aware.py", line 42, in <module>
dist.scatter(my_A, src=0, scatter_list=list(A_chunk))
File "/gpfs/alpine/stf007/world-shared/vgv/inbox/amalik/summit/pytorch-1.0-p3/anaconda3/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1144, in scatter
work = _default_pg.scatter(output_tensors, input_tensors, opts)
RuntimeError: CUDA tensor detected and the MPI used doesn't have CUDA-aware MPI support```
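
One thing that may help narrow this down is checking whether the MPI on the path reports that it was built with CUDA support. The sketch below is only a diagnostic under stated assumptions: it assumes an Open MPI-derived `ompi_info` is available, and the helper names `parse_cuda_support` and `mpi_reports_cuda_support` are made up for illustration. It says nothing about which MPI PyTorch itself was compiled against, and PyTorch performs its own check, so a `True` here does not guarantee the error goes away.

```python
import shutil
import subprocess

# Key that Open MPI prints in its parsable output, e.g.
# mca:mpi:base:param:mpi_built_with_cuda_support:value:true
CUDA_KEY = "mpi_built_with_cuda_support:value:"


def parse_cuda_support(ompi_info_output):
    """Return True/False if the output reports CUDA support, else None."""
    for line in ompi_info_output.splitlines():
        if CUDA_KEY in line:
            # The value is the last colon-separated field on the line.
            return line.strip().rsplit(":", 1)[-1].lower() == "true"
    return None  # key not present: cannot tell from this output


def mpi_reports_cuda_support():
    """Run ompi_info (assumed to be on PATH) and parse its answer."""
    if shutil.which("ompi_info") is None:
        return None  # not an Open MPI-style installation, or not on PATH
    result = subprocess.run(
        ["ompi_info", "--parsable", "--all"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return parse_cuda_support(result.stdout)


if __name__ == "__main__":
    print("MPI reports CUDA support:", mpi_reports_cuda_support())
```

Running this in the same environment (same modules loaded) as the failing job would show whether the MPI build itself is the limiting factor or whether the problem lies in how PyTorch was built or launched.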