Initial Portable Performance of `AoSoA` and `parallel_for`
A preliminary analysis of the work in !7 (merged) has been performed to understand the performance of the initial parallel loop constructs. The driver code can be found at https://code.ornl.gov/CoPA/Cabana/blob/30062fd7b80547b460161b44cd87123c03212a7b/core/example/cuda_perf_test.cpp
The following system was used for this assessment:
- Memory - 62.9 GiB
- Processor - Intel Core i7-4930K CPU @ 3.40GHz x 12
And here is the output of the NVIDIA device query:
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 770"
CUDA Driver Version / Runtime Version 9.1 / 9.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4035 MBytes (4231200768 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1189 MHz (1.19 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
NOTE: When looking at the performance numbers below, keep in mind that the hardware threads were possibly competing with other tasks on the machine (e.g. my email) and that the GPU was also driving my display, so it had competition as well. Also, when running the CUDA kernels, the device is effectively "warmed up" by running the initialization kernel for the particles.
Serial, OpenMP, and CUDA results were assessed with 4 parallel loop constructs, all using native Kokkos capabilities:
- StructParallel - Each struct operation is executed on an independent thread and a thread-local loop over the arrays is performed. This is the standard parallelism for Array-of-Structs.
- ArrayParallel - An outer loop over structs is performed and within each struct, a parallel loop over the arrays is performed. This is the standard parallelism for Struct-of-Arrays.
- StructAndArrayParallel Left - A 2D loop over the entire `AoSoA` is performed in parallel. The left-most index (struct index) moves the fastest.
- StructAndArrayParallel Right - A 2D loop over the entire `AoSoA` is performed in parallel. The right-most index (array index) moves the fastest.
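For concreteness, the following is a minimal sketch (not the actual driver code) of how the 1D StructParallel and 2D StructAndArrayParallel Right constructs map onto native Kokkos policies; the extents, the stand-in `data` View, and the trivial loop body are placeholder assumptions.

```cpp
#include <Kokkos_Core.hpp>

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        // Placeholder extents; the study swept the inner array size.
        const int num_struct = 1000; // number of structs in the AoSoA
        const int array_size = 32;   // inner array size

        // Stand-in 2D array playing the role of a single AoSoA member.
        Kokkos::View<double**> data( "data", num_struct, array_size );

        // StructParallel: one thread per struct with a thread-local loop
        // over the inner arrays.
        Kokkos::parallel_for(
            "StructParallel", Kokkos::RangePolicy<>( 0, num_struct ),
            KOKKOS_LAMBDA( const int s ) {
                for ( int a = 0; a < array_size; ++a )
                    data( s, a ) += 1.0;
            } );

        // StructAndArrayParallel Right: a 2D policy over (struct, array)
        // indices with the right-most (array) index moving fastest.
        Kokkos::parallel_for(
            "StructAndArrayParallelRight",
            Kokkos::MDRangePolicy<Kokkos::Rank<
                2, Kokkos::Iterate::Right, Kokkos::Iterate::Right>>(
                { 0, 0 }, { num_struct, array_size } ),
            KOKKOS_LAMBDA( const int s, const int a ) { data( s, a ) += 1.0; } );
    }
    Kokkos::finalize();
    return 0;
}
```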
Each particle was given the following data:
using DataTypes =
Cabana::MemberDataTypes<double[3][3], // M1
double[3][3], // M2
double[3], // V1
double[3], // V2
double[3], // RESULT
double, // S1
double>; // S2
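To make the role of the inner array size concrete, the sketch below shows a conceptual layout (not Cabana's actual internal types) for an `AoSoA` with these members and inner array size `N`; the placement of the particle index relative to the other member dimensions is an assumption here and is really an implementation detail of the library.

```cpp
// Conceptual sketch only: each struct groups N particles, and every member is
// stored as a small array over those N particles. The dimension ordering shown
// (particle index trailing) is illustrative and may differ from Cabana's.
template <int N>
struct ConceptualSoA
{
    double m1[3][3][N];  // M1
    double m2[3][3][N];  // M2
    double v1[3][N];     // V1
    double v2[3][N];     // V2
    double result[3][N]; // RESULT
    double s1[N];        // S1
    double s2[N];        // S2
};
// An AoSoA of 1e7 particles is then a contiguous run of ceil(1e7 / N) such
// structs, which is why its storage can only grow in chunks of N particles.
```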
The following work kernel was selected: RESULT = (M1 * V1) / S1 + (M2 * V2) / S2 + dot(V1,V2) * V1 + (M1 * M2) * V2
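Spelled out component-wise, the kernel amounts to the following (a plain-array sketch with illustrative names; the driver operates on the `AoSoA` members rather than raw arrays):

```cpp
// Per-particle work kernel: RESULT = (M1*V1)/S1 + (M2*V2)/S2
//                                  + dot(V1,V2)*V1 + (M1*M2)*V2
void work_kernel( const double m1[3][3], const double m2[3][3],
                  const double v1[3], const double v2[3],
                  const double s1, const double s2, double result[3] )
{
    // dot(V1,V2)
    double d = 0.0;
    for ( int i = 0; i < 3; ++i )
        d += v1[i] * v2[i];

    for ( int i = 0; i < 3; ++i )
    {
        result[i] = d * v1[i]; // dot(V1,V2) * V1
        for ( int j = 0; j < 3; ++j )
        {
            result[i] += m1[i][j] * v1[j] / s1    // (M1 * V1) / S1
                         + m2[i][j] * v2[j] / s2; // (M2 * V2) / S2
            for ( int k = 0; k < 3; ++k )
                result[i] += m1[i][j] * m2[j][k] * v2[k]; // (M1 * M2) * V2
        }
    }
}
```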
To begin, an `AoSoA` of fixed size 1e7 was created and the inner array size of the data structure was varied for each of the parallel loop body constructs. The following results were obtained:
Serial performance indicates that, except for very large inner array sizes, the best performance is obtained with an Array-of-Structs approach. In this case small inner array sizes, including 1, were effective at achieving performance. Slightly increasing the inner array size to O(100) gave some small improvements, likely due to better cache behavior. The 2D parallelism offered by the StructAndArrayParallel constructs was not more effective and did not improve upon the Array-of-Structs approach.
Note that for the OpenMP results no vectorization was performed for the given hardware. Results are similar to the Serial results, with OpenMP working best in an Array-of-Structs context, as expected for conventional threaded CPU hardware. Somewhat better performance could be had in this case with an inner array size of O(64).
CUDA results are rather dramatic and show the benefit of the multidimensional data layout as we can now take advantage of the CUDA thread hierarchy. Here, it is clear that Struct-of-Arrays is the best natural layout for the NVIDIA GPU if 1-dimensional parallelism is to be exploited. With 2D parallelism, particularly with the right-most index moving the fastest, even more performance can be gained. Here, the best performing inner array size was 32, the size of a warp on an NVIDIA GPU.
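One way to see why a warp-sized inner array pays off (an illustrative raw-CUDA sketch, not the Kokkos path the study actually uses): if the array index maps to `threadIdx.x` and is contiguous in memory within each struct, the 32 threads of a warp touch 32 consecutive elements of a member, so global memory accesses coalesce. The kernel and its arguments below are hypothetical.

```cpp
// Illustrative only: update a scalar member stored with the array (particle)
// index contiguous within each struct. Consecutive threads in a warp then
// access consecutive addresses, giving coalesced loads and stores.
__global__ void scale_member( double* s1, const int num_struct,
                              const int array_size, const double alpha )
{
    const int s = blockIdx.x;  // struct index
    const int a = threadIdx.x; // inner array index
    if ( s < num_struct && a < array_size )
        s1[s * array_size + a] *= alpha;
}
```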
Next, algorithmic scaling was assessed using the best performing parameters by increasing the problem size:
Here we see that it takes a certain amount of work before the OpenMP and CUDA results begin to scale linearly. For larger problem sizes, the NVIDIA card in this system outperforms 12 OpenMP threads. The following table gives the speedup of the OpenMP and CUDA results relative to the serial result:
| Problem Size | OpenMP Speedup | CUDA Speedup |
| --- | --- | --- |
| 1e2 | 0.65 | 0.30 |
| 1e3 | 1.69 | 0.82 |
| 1e4 | 5.56 | 4.05 |
| 1e5 | 4.69 | 7.88 |
| 1e6 | 5.29 | 9.44 |
| 1e7 | 6.36 | 9.28 |
Finally, OpenMP thread scaling was assessed vs. a serial computation for a fixed problem size of 1e7:
| Threads | Time (ms) | Speedup |
| --- | --- | --- |
| 1 | 419 | 1.00 |
| 2 | 310 | 1.35 |
| 4 | 178 | 2.35 |
| 6 | 121 | 3.46 |
| 8 | 90 | 4.66 |
| 10 | 73 | 5.75 |
| 12 | 58 | 7.22 |
The major takeaways of this study are:
- OpenMP runs (at least on this architecture) see no benefit from 2D parallelism. Struct-level parallelism in 1D is sufficient and smaller inner array sizes work well.
- CUDA runs can greatly benefit from 2D parallelism (at least on this architecture). Array-level parallelism in 1D was significantly more effective than struct-level parallelism (as expected).
- 2D parallelism with CUDA not only gave the best performance but enabled small, warp-sized inner arrays. This is good news for dynamic memory management as our memory allocations for the `AoSoA` are only possible in chunks the size of the inner arrays. If we were forced to only use Struct-of-Arrays on the GPU we would be in bad shape for dynamic memory management.
- This study needs to be repeated on Power8/Power9, P100/V100, and Intel many-core architectures to have any real value.