Initial Portable Performance of `AoSoA` and `parallel_for`
A preliminary analysis of the work in !7 (merged) has been performed to understand the performance of the initial parallel loop constructs. The driver code can be found at https://code.ornl.gov/CoPA/Cabana/blob/30062fd7b80547b460161b44cd87123c03212a7b/core/example/cuda_perf_test.cpp
The following system was used for this assessment:
- Memory - 62.9 GiB
- Processor - Intel Core i7-4930K CPU @ 3.40GHz x 12
And here is the output of the NVIDIA device query:
/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 770"
CUDA Driver Version / Runtime Version 9.1 / 9.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4035 MBytes (4231200768 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 1189 MHz (1.19 GHz)
Memory Clock rate: 3505 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
NOTE: When looking at the performance numbers below, keep in mind that the hardware threads were possibly competing with other tasks on the machine (e.g. my email) and that the GPU was also driving my display, so it had competition as well. Also, when running the CUDA kernels, the device is effectively "warmed up" by running the initialization kernel for the particles.
Serial, OpenMP, and CUDA results were assessed with 4 parallel loop constructs, all using native Kokkos capabilities:
- StructParallel - Each struct operation is executed on an independent thread and a thread-local loop over the arrays is performed. This is the standard parallelism for Array-of-Structs.
- ArrayParallel - An outer loop over structs is performed and within each struct, a parallel loop over the arrays is performed. This is the standard parallelism for Struct-of-Arrays.
- StructAndArrayParallel Left - A 2D loop over the entire `AoSoA` is performed in parallel. The left-most index (struct index) moves the fastest.
- StructAndArrayParallel Right - A 2D loop over the entire `AoSoA` is performed in parallel. The right-most index (array index) moves the fastest.
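For concreteness, the following is a minimal sketch (not the actual driver code) of how the 1D StructParallel and 2D StructAndArrayParallel Right constructs map onto native Kokkos policies; the extents, the stand-in `data` View, and the trivial loop body are placeholder assumptions.

```cpp
#include <Kokkos_Core.hpp>

int main( int argc, char* argv[] )
{
    Kokkos::initialize( argc, argv );
    {
        // Placeholder extents; the study swept the inner array size.
        const int num_struct = 1000; // number of structs in the AoSoA
        const int array_size = 32;   // inner array size

        // Stand-in 2D array playing the role of a single AoSoA member.
        Kokkos::View<double**> data( "data", num_struct, array_size );

        // StructParallel: one thread per struct with a thread-local loop
        // over the inner arrays.
        Kokkos::parallel_for(
            "StructParallel", Kokkos::RangePolicy<>( 0, num_struct ),
            KOKKOS_LAMBDA( const int s ) {
                for ( int a = 0; a < array_size; ++a )
                    data( s, a ) += 1.0;
            } );

        // StructAndArrayParallel Right: a 2D policy over (struct, array)
        // indices with the right-most (array) index moving fastest.
        Kokkos::parallel_for(
            "StructAndArrayParallelRight",
            Kokkos::MDRangePolicy<Kokkos::Rank<
                2, Kokkos::Iterate::Right, Kokkos::Iterate::Right>>(
                { 0, 0 }, { num_struct, array_size } ),
            KOKKOS_LAMBDA( const int s, const int a ) { data( s, a ) += 1.0; } );
    }
    Kokkos::finalize();
    return 0;
}
```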
Each particle was given the following data:
using DataTypes =
Cabana::MemberDataTypes<double[3][3], // M1
double[3][3], // M2
double[3], // V1
double[3], // V2
double[3], // RESULT
double, // S1
double>; // S2
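To make the role of the inner array size concrete, the sketch below shows a conceptual layout (not Cabana's actual internal types) for an `AoSoA` with these members and inner array size `N`; the placement of the particle index relative to the other member dimensions is an assumption here and is really an implementation detail of the library.

```cpp
// Conceptual sketch only: each struct groups N particles, and every member is
// stored as a small array over those N particles. The dimension ordering shown
// (particle index trailing) is illustrative and may differ from Cabana's.
template <int N>
struct ConceptualSoA
{
    double m1[3][3][N];  // M1
    double m2[3][3][N];  // M2
    double v1[3][N];     // V1
    double v2[3][N];     // V2
    double result[3][N]; // RESULT
    double s1[N];        // S1
    double s2[N];        // S2
};
// An AoSoA of 1e7 particles is then a contiguous run of ceil(1e7 / N) such
// structs, which is why its storage can only grow in chunks of N particles.
```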
The following work kernel was selected: RESULT = (M1 * V1) / S1 + (M2 * V2) / S2 + dot(V1,V2) * V1 + (M1 * M2) * V2
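Spelled out component-wise, the kernel amounts to the following (a plain-array sketch with illustrative names; the driver operates on the `AoSoA` members rather than raw arrays):

```cpp
// Per-particle work kernel: RESULT = (M1*V1)/S1 + (M2*V2)/S2
//                                  + dot(V1,V2)*V1 + (M1*M2)*V2
void work_kernel( const double m1[3][3], const double m2[3][3],
                  const double v1[3], const double v2[3],
                  const double s1, const double s2, double result[3] )
{
    // dot(V1,V2)
    double d = 0.0;
    for ( int i = 0; i < 3; ++i )
        d += v1[i] * v2[i];

    for ( int i = 0; i < 3; ++i )
    {
        result[i] = d * v1[i]; // dot(V1,V2) * V1
        for ( int j = 0; j < 3; ++j )
        {
            result[i] += m1[i][j] * v1[j] / s1    // (M1 * V1) / S1
                         + m2[i][j] * v2[j] / s2; // (M2 * V2) / S2
            for ( int k = 0; k < 3; ++k )
                result[i] += m1[i][j] * m2[j][k] * v2[k]; // (M1 * M2) * V2
        }
    }
}
```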
To begin, an `AoSoA` of fixed size 1e7 was created and the inner array size of the data structure was varied for each of the parallel loop body constructs. The following results were obtained:
Serial performance indicates that, except for very large inner array sizes, the best performance is obtained with an Array-of-Structs approach. In this case small inner array sizes, including 1, were effective at achieving performance. Slightly increasing the inner array size to O(100) gave some small improvements, likely due to better cache behavior. The 2D parallelism offered by the StructAndArrayParallel constructs was not more effective and did not improve upon the Array-of-Structs approach.
Note that for the OpenMP results no vectorization was performed for the given hardware. Results are similar to the Serial results, with OpenMP working best in an Array-of-Structs context, as expected for conventional threaded CPU hardware. Somewhat better performance could be had in this case with an inner array size of O(64).
CUDA results are rather dramatic and show the benefit of the multidimensional data layout as we can now take advantage of the CUDA thread hierarchy. Here, it is clear that Struct-of-Arrays is the best natural layout for the NVIDIA GPU if 1-dimensional parallelism is to be exploited. With 2D parallelism, particularly with the right-most index moving the fastest, even more performance can be gained. Here, the best performing inner array size was 32, the size of a warp on an NVIDIA GPU.
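One way to see why a warp-sized inner array pays off (an illustrative raw-CUDA sketch, not the Kokkos path the study actually uses): if the array index maps to `threadIdx.x` and is contiguous in memory within each struct, the 32 threads of a warp touch 32 consecutive elements of a member, so global memory accesses coalesce. The kernel and its arguments below are hypothetical.

```cpp
// Illustrative only: update a scalar member stored with the array (particle)
// index contiguous within each struct. Consecutive threads in a warp then
// access consecutive addresses, giving coalesced loads and stores.
__global__ void scale_member( double* s1, const int num_struct,
                              const int array_size, const double alpha )
{
    const int s = blockIdx.x;  // struct index
    const int a = threadIdx.x; // inner array index
    if ( s < num_struct && a < array_size )
        s1[s * array_size + a] *= alpha;
}
```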
Next, algorithmic scaling was assessed using the best performing parameters by increasing the problem size:
Here we see that it takes a certain amount of work before the OpenMP and CUDA results begin to scale linearly. For larger problem sizes, the NVIDIA card in this system outperforms 12 OpenMP threads. The following table gives the speedup of the OpenMP and CUDA results relative to the serial result:
| Problem Size | OpenMP Speedup | CUDA Speedup |
| --- | --- | --- |
| 1e2 | 0.65 | 0.30 |
| 1e3 | 1.69 | 0.82 |
| 1e4 | 5.56 | 4.05 |
| 1e5 | 4.69 | 7.88 |
| 1e6 | 5.29 | 9.44 |
| 1e7 | 6.36 | 9.28 |
Finally, OpenMP thread scaling was assessed vs. a serial computation for a fixed problem size of 1e7:
| Threads | Time (ms) | Speedup |
| --- | --- | --- |
| 1 | 419 | 1.00 |
| 2 | 310 | 1.35 |
| 4 | 178 | 2.35 |
| 6 | 121 | 3.46 |
| 8 | 90 | 4.66 |
| 10 | 73 | 5.75 |
| 12 | 58 | 7.22 |
The major takeaways of this study are:
- OpenMP runs (at least on this architecture) see no benefit from 2D parallelism. Struct-level parallelism in 1D is sufficient and smaller inner array sizes work well.
- CUDA runs can greatly benefit from 2D parallelism (at least on this architecture). Array-level parallelism in 1D was significantly more effective than struct-level parallelism (as expected).
- 2D parallelism with CUDA not only gave the best performance but enabled small, warp-sized inner arrays. This is good news for dynamic memory management as our memory allocations for the `AoSoA` are only possible in chunks the size of the inner arrays. If we were forced to only use Struct-of-Arrays on the GPU we would be in bad shape for dynamic memory management.
- This study needs to be repeated on Power8/Power9, P100/V100, and Intel many-core architectures to have any real value.