Replace tuple-style cast with pointer/stride access
It was noted in #1 (closed) that the get operator will not fully vectorize due to the tuple-style cast hidden in getStructMember. The pointers are now cached up front in an attempt to alleviate this. This changes the manner of data access in the AoSoA for arrays from:
return getStructMember<I>(_raw_data[idx.s()]);
to:
return static_cast<struct_member_array_type<I> >( data<I>() + idx.s() * _strides[I] );
If this allows vectorization, then it may be well worth the extra arithmetic needed to index into the data block. Furthermore, if loop-invariant hoisting were done well by the compiler, idx.s() * _strides[I] would be hoisted above the inner loop over arrays as a further optimization.
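A minimal sketch of the pointer/stride access pattern, assuming a flat buffer of doubles in which every member has the same type. The names MemberView, member_view, axpy, demo_sum, kArraySize and kNumMembers are hypothetical and are not part of this merge request; the offset is hoisted by hand only to show what the compiler would ideally do on its own.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sizes, chosen only for this sketch.
constexpr std::size_t kArraySize = 8;   // length of each inner member array
constexpr std::size_t kNumMembers = 2;  // member 0 = x, member 1 = y

// A cached member pointer plus the distance, in elements, between the
// same member's arrays in consecutive structs.
struct MemberView {
    double* base;
    std::size_t stride;
};

// Struct s occupies elements [s * stride, (s + 1) * stride) of the flat
// buffer; member m starts m * kArraySize elements into each struct.
inline MemberView member_view(std::vector<double>& storage, std::size_t member) {
    return MemberView{storage.data() + member * kArraySize,
                      kNumMembers * kArraySize};
}

// y += a * x over every struct.
inline void axpy(MemberView x, MemberView y, double a, std::size_t num_structs) {
    for (std::size_t s = 0; s < num_structs; ++s) {
        // s * stride does not depend on i, so it is computed once per struct
        // here rather than relying on the compiler to hoist it.
        const double* xs = x.base + s * x.stride;
        double* ys = y.base + s * y.stride;
        for (std::size_t i = 0; i < kArraySize; ++i) {
            ys[i] += a * xs[i];
        }
    }
}

// Fills x with 1.0, leaves y at 0.0, applies y += 2 * x, and returns sum(y).
inline double demo_sum() {
    const std::size_t num_structs = 3;
    std::vector<double> storage(num_structs * kNumMembers * kArraySize, 0.0);
    MemberView x = member_view(storage, 0);
    MemberView y = member_view(storage, 1);
    for (std::size_t s = 0; s < num_structs; ++s) {
        for (std::size_t i = 0; i < kArraySize; ++i) {
            x.base[s * x.stride + i] = 1.0;
        }
    }
    axpy(x, y, 2.0, num_structs);
    double sum = 0.0;
    for (std::size_t s = 0; s < num_structs; ++s) {
        for (std::size_t i = 0; i < kArraySize; ++i) {
            sum += y.base[s * y.stride + i];
        }
    }
    return sum;
}
```

The inner loop sees only two plain pointers and a fixed trip count, which is the shape a vectorizer generally handles best. Whether a given compiler actually vectorizes it still has to be checked in its optimization report.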
Caching pointers and strides also has a nice side effect for future C interfaces: template-free pointer and stride access! A user can now get strides and (untyped void*) pointers directly through the following interface:
size_t PosX = 0;
double* pos_x = (double*) aosoa.pointer(PosX);
size_t pos_x_stride = aosoa.stride(PosX);
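As a rough illustration of how a template-free consumer might use such a pointer and stride, the sketch below uses a toy stand-in class. Only the method names pointer() and stride() mirror the interface above; ToyAoSoA, numStructs, fill_member, demo_fill, the all-double layout and the stride unit (elements rather than bytes) are assumptions made for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the AoSoA: three double members, inner arrays
// of length 4, stored struct after struct in one flat buffer.
class ToyAoSoA {
public:
    static constexpr std::size_t array_size = 4;
    static constexpr std::size_t num_members = 3;

    explicit ToyAoSoA(std::size_t num_structs)
        : _storage(num_structs * num_members * array_size, 0.0),
          _num_structs(num_structs) {}

    // Untyped pointer to the member's array in the first struct.
    void* pointer(std::size_t member) {
        return _storage.data() + member * array_size;
    }

    // Elements between the same member in consecutive structs (assumed unit).
    std::size_t stride(std::size_t /*member*/) const {
        return num_members * array_size;
    }

    std::size_t numStructs() const { return _num_structs; }

private:
    std::vector<double> _storage;
    std::size_t _num_structs;
};

// Template-free consumer with a C-compatible signature: it only needs a raw
// pointer, a stride, and the loop bounds.
inline void fill_member(void* data, std::size_t stride, std::size_t num_structs,
                        std::size_t array_size, double value) {
    double* p = static_cast<double*>(data);
    for (std::size_t s = 0; s < num_structs; ++s) {
        for (std::size_t i = 0; i < array_size; ++i) {
            p[s * stride + i] = value;
        }
    }
}

// Fills member 0 of a two-struct container with 1.5 and returns its sum.
inline double demo_fill() {
    ToyAoSoA aosoa(2);
    const std::size_t PosX = 0;
    fill_member(aosoa.pointer(PosX), aosoa.stride(PosX), aosoa.numStructs(),
                ToyAoSoA::array_size, 1.5);
    const double* pos_x = static_cast<const double*>(aosoa.pointer(PosX));
    double sum = 0.0;
    for (std::size_t s = 0; s < aosoa.numStructs(); ++s) {
        for (std::size_t i = 0; i < ToyAoSoA::array_size; ++i) {
            sum += pos_x[s * aosoa.stride(PosX) + i];
        }
    }
    return sum;
}
```

Because the caller casts the void* itself, it has to know the member's element type independently; the pointer and stride alone do not carry that information.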
Activity
assigned to @rfbird
mentioned in issue #1 (closed)
added 1 commit
- 5e5bc638 - storing raw member pointers and strides instead of single data block pointer
Well, that worked like a dream! Nice work Stuart.
Not only did it eliminate the indirect addressing, but it also shrank the estimated scalar loop cost from 422 to 153 and allowed the estimated speed-up to jump to a basically flat 8x. All internal loops with fixed bounds are also fully unrolled. Right now I see no reason to believe this won't run pretty quick! Details below:
Old:
remark #15309: vectorization support: normalized vectorization overhead 0.098
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15463: unmasked indexed (or scatter) stores: 10
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 422
remark #15477: vector cost: 162.500
remark #15478: estimated potential speedup: 2.590
New:
remark #15301: SIMD LOOP WAS VECTORIZED
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 153
remark #15477: vector cost: 19.000
remark #15478: estimated potential speedup: 7.990
added 1 commit
- 258c4850 - adding runtime rank and extent accessors to AoSoA and Slice
added 1 commit
- c74e9a95 - Adding runtime accessors to memory data to enable vectorization.
mentioned in commit 0faac5b6
assigned to @uy7