Parallelize sortAndDetermineBufferLayout
Created by: masterleinad
In combination with the CUDA-aware MPI pull request (#162), we should also be able to avoid copying permutation_indices to the CPU.