Always permute on the device (!222) · Merge requests · Arndt, Daniel / ArborX

Created by: masterleinad

Further profiling showed that we only benefit from CUDA-aware MPI when we have a lot of data (which was the case for the test case I used so far). Otherwise, non-CUDA-aware MPI seems to be better, since the copies we save in our code are not worth the longer runtime of the MPI calls.

While doing this, I discovered that a major part of the difference in doPostsAndWaits comes from the permutation being executed on the device in the CUDA-aware MPI case. This pull request makes sure that we always perform the permutation on the device and only copy the data back to the host afterward.
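As a rough illustration of the reordering step mentioned above (not the actual ArborX code), the sketch below packs a send buffer by applying a permutation. It runs on the host, with std::vector standing in for device memory; the name applyPermutation and the convention that permute[i] is the destination index of element i are assumptions made for this example. In the change described here, the equivalent loop would run as a device kernel, and only the packed buffer would be copied back to the host when MPI is not CUDA-aware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, for illustration only.
// Scatters src into a new buffer so that element i ends up at position
// permute[i]. With a permutation that groups elements by destination rank,
// the result is a contiguous send buffer.
template <typename T>
std::vector<T> applyPermutation(std::vector<int> const &permute,
                                std::vector<T> const &src)
{
  assert(permute.size() == src.size());
  std::vector<T> dst(src.size());
  // On a GPU this loop would be a parallel kernel over i; each iteration
  // writes to a distinct location because permute is a permutation.
  for (std::size_t i = 0; i < src.size(); ++i)
    dst[permute[i]] = src[i];
  return dst;
}
```

Doing this packing where the data already lives means that a single contiguous buffer crosses the host/device boundary, instead of copying the unpermuted data first and reordering it on the host.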

Radius:

# MPI processes	old	new
1	2.34e0	1.80e0
6	2.77e0	2.27e0
12	2.78e0	2.23e0
24	2.89e0	2.32e0
48	2.91e0	2.36e0
96	2.93e0	2.34e0
192	2.96e0	2.37e0
384	2.97e0	2.44e0
768	3.02e0	2.42e0

kNN:

# MPI processes	old	new
1	5.70e0	4.69e0
6	6.92e0	5.88e0
12	6.95e0	6.08e0
24	7.38e0	6.29e0
48	7.60e0	6.46e0
96	7.43e0	6.60e0
192	7.72e0	6.78e0
384	7.74e0	6.87e0
768	7.89e0	6.88e0
