
Always permute on the device

Created by: masterleinad

Running more profiling, it turned out that we benefit from CUDA-aware MPI when we have a lot of data (which was the case for the test case I had used so far). Otherwise, non-CUDA-aware MPI seems to be better, since the fewer copies in our code do not outweigh the longer runtime of the MPI calls.

While doing this, I discovered that a major difference in `doPostsAndWaits` comes from the permutation being executed on the device in the CUDA-aware MPI case. This pull request makes sure that we always perform the permutation on the device and only copy the data back to the host afterward.

Radius:

| # MPI processes | old | new |
| --- | --- | --- |
| 1 | 2.34e0 | 1.80e0 |
| 6 | 2.77e0 | 2.27e0 |
| 12 | 2.78e0 | 2.23e0 |
| 24 | 2.89e0 | 2.32e0 |
| 48 | 2.91e0 | 2.36e0 |
| 96 | 2.93e0 | 2.34e0 |
| 192 | 2.96e0 | 2.37e0 |
| 384 | 2.97e0 | 2.44e0 |
| 768 | 3.02e0 | 2.42e0 |

knn:

| # MPI processes | old | new |
| --- | --- | --- |
| 1 | 5.70e0 | 4.69e0 |
| 6 | 6.92e0 | 5.88e0 |
| 12 | 6.95e0 | 6.08e0 |
| 24 | 7.38e0 | 6.29e0 |
| 48 | 7.60e0 | 6.46e0 |
| 96 | 7.43e0 | 6.60e0 |
| 192 | 7.72e0 | 6.78e0 |
| 384 | 7.74e0 | 6.87e0 |
| 768 | 7.89e0 | 6.88e0 |
