Always permute on the device
Created by: masterleinad
Running more profiling, it turned out that we only benefit from CUDA-aware MPI when we have a lot of data (which was the case for the test case I had used so far). Otherwise, non-CUDA-aware MPI seems to be better: the copies we save in our code do not make up for the longer runtime of the MPI calls themselves.
While doing this, I discovered that a major part of the difference in `doPostsAndWaits` comes from the permutation being executed on the device for CUDA-aware MPI. This pull request makes sure that we always perform the permutation on the device and only copy the data back to the host afterward.
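
For illustration, a minimal Kokkos sketch of the idea (the function and view names here are hypothetical, not the actual implementation in this PR; rank-1 views are assumed): the permutation kernel runs in the source view's execution space, and only the already permuted buffer is copied to the host for the MPI calls.

```cpp
#include <Kokkos_Core.hpp>

// Hypothetical helper illustrating the approach: permute on the device
// first, then copy the result to the host once.
template <typename DeviceView, typename PermutationView>
auto permuteAndCopyToHost(DeviceView const &src,
                          PermutationView const &permutation)
{
  using ExecutionSpace = typename DeviceView::execution_space;

  // Apply the permutation on the device, where src already lives.
  DeviceView permuted(
      Kokkos::view_alloc(Kokkos::WithoutInitializing, "permuted"),
      src.extent(0));
  Kokkos::parallel_for(
      "permute", Kokkos::RangePolicy<ExecutionSpace>(0, src.extent(0)),
      KOKKOS_LAMBDA(int i) { permuted(i) = src(permutation(i)); });

  // Only the already permuted buffer crosses to the host; the
  // (non-CUDA-aware) MPI calls then operate on host memory.
  return Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace{}, permuted);
}
```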
Radius search:
Number of MPI processes | old time | new time |
---|---|---|
1 | 2.34e0 | 1.80e0 |
6 | 2.77e0 | 2.27e0 |
12 | 2.78e0 | 2.23e0 |
24 | 2.89e0 | 2.32e0 |
48 | 2.91e0 | 2.36e0 |
96 | 2.93e0 | 2.34e0 |
192 | 2.96e0 | 2.37e0 |
384 | 2.97e0 | 2.44e0 |
768 | 3.02e0 | 2.42e0 |
kNN search:
Number of MPI processes | old time | new time |
---|---|---|
1 | 5.70e0 | 4.69e0 |
6 | 6.92e0 | 5.88e0 |
12 | 6.95e0 | 6.08e0 |
24 | 7.38e0 | 6.29e0 |
48 | 7.60e0 | 6.46e0 |
96 | 7.43e0 | 6.60e0 |
192 | 7.72e0 | 6.78e0 |
384 | 7.74e0 | 6.87e0 |
768 | 7.89e0 | 6.88e0 |