Sorting Morton indices does not scale for small problem sizes

Created by: aprokop

Run with the default bvh_driver parameters and the OpenMP backend.

OMP_NUM_THREADS=1

2.99e-01 sec 10.0% 98.6% 0.0% 96 ArborX:BVH:sort_morton_codes_and_init_leaves [region]
|-> 3.82e-02 sec 1.3% 100.0% 0.0% 96 Kokkos::Sort::BinCount [for]
|-> 1.20e-01 sec 4.0% 100.0% 0.0% 96 Kokkos::Sort::BinBinning [for]
|-> 5.39e-02 sec 1.8% 100.0% 0.0% 96 Kokkos::Sort::BinSort [for]

OMP_NUM_THREADS=2

4.26e-01 sec 14.2% 98.9% 0.0% 93 ArborX:BVH:sort_morton_codes_and_init_leaves [region]
|-> 1.13e-01 sec 3.8% 100.0% 0.0% 93 Kokkos::Sort::BinCount [for]
|-> 2.18e-01 sec 7.3% 100.0% 0.0% 93 Kokkos::Sort::BinBinning [for]
|-> 3.84e-02 sec 1.3% 100.0% 0.0% 93 Kokkos::Sort::BinSort [for]

Note: the number of calls is slightly different (96 vs 93).

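With two threads, BinCount is about 3x slower and BinBinning is about 1.8x slower than with one thread. BinSort is the only kernel that gets faster (about 1.4x), and the region as a whole is about 1.4x slower.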
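
For reference, here is a small sketch of how the slowdown factors can be computed from the two profiles above. The timings and call counts are copied from the output. The dictionary layout and the helper name `per_call_slowdown` are only illustrative, and normalizing by call count is my assumption about a fair way to compare runs with 96 and 93 calls.

```python
# (total seconds, number of calls) for 1 thread and 2 threads,
# copied from the profiler output above.
profiles = {
    "sort_morton_codes_and_init_leaves": ((2.99e-01, 96), (4.26e-01, 93)),
    "Kokkos::Sort::BinCount": ((3.82e-02, 96), (1.13e-01, 93)),
    "Kokkos::Sort::BinBinning": ((1.20e-01, 96), (2.18e-01, 93)),
    "Kokkos::Sort::BinSort": ((5.39e-02, 96), (3.84e-02, 93)),
}


def per_call_slowdown(one_thread, two_threads):
    """Ratio of per-call time with 2 threads to per-call time with 1 thread.

    A value above 1 means the 2-thread run is slower per call.
    """
    t1, calls1 = one_thread
    t2, calls2 = two_threads
    return (t2 / calls2) / (t1 / calls1)


ratios = {
    name: per_call_slowdown(one, two) for name, (one, two) in profiles.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

A ratio above 1 is a slowdown when going from one thread to two, and a ratio below 1 is a speedup.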