GPU parallelization scheme confusion
After looking at the ApplyA
function within https://code.ornl.gov/lmm/DG-SparseGrid/blob/master/Vlasov-Poisson-version2/TimeAdvance.m in @lmm 's code, I'm a touch concerned about how this gets parallelized on the GPU. My understanding is that we want each thread to do the same work, but if we parallelize over rows (DOF), then because the connectivity of each row is different (i.e., the number of nonzeros per row varies), each thread gets a different amount of work, which hurts load balance on a GPU. I may be a step behind here, and @elwasif and @atj and @e6d may have already come up with a GPU-friendly approach to parallelizing the sparse matrix-vector multiply for a matrix like ours. Perhaps someone can educate me - or is the answer as simple as parallelizing over all the nonzero elements within A, not just over rows?
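For what it's worth, here's a minimal sketch of the two schemes I have in mind, written in plain Python with each loop iteration standing in for one GPU thread. The function names (`csr_spmv_row_parallel`, `coo_spmv_element_parallel`) and the CSR/COO layouts are illustrative assumptions on my part, not anything taken from the repo:

```python
# Sketch: two ways to parallelize y = A*x for a sparse A.
# Each loop iteration below stands in for one GPU thread.

def csr_spmv_row_parallel(row_ptr, col_idx, vals, x):
    """Row-parallel (CSR): "thread" i computes row i.
    Work per thread = nnz in that row -> load imbalance
    when row connectivity varies."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):                      # one "thread" per row
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

def coo_spmv_element_parallel(rows, cols, vals, x, n):
    """Element-parallel (COO): "thread" k handles nonzero k.
    Uniform work per thread, but concurrent writes to y[rows[k]]
    would need an atomicAdd (or a segmented reduction) on a GPU."""
    y = [0.0] * n
    for k in range(len(vals)):              # one "thread" per nonzero
        y[rows[k]] += vals[k] * x[cols[k]]  # atomicAdd on a real GPU
    return y

# Toy example: a 3x3 matrix with a different nnz count in each row.
# A = [[2, 0, 1],
#      [0, 3, 0],
#      [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0, 6.0]
x = [1.0, 1.0, 1.0]

print(csr_spmv_row_parallel(row_ptr, col_idx, vals, x))      # [3.0, 3.0, 15.0]

rows = [0, 0, 1, 2, 2, 2]
print(coo_spmv_element_parallel(rows, col_idx, vals, x, 3))  # [3.0, 3.0, 15.0]
```

So "parallelize over all the elements" trades the load imbalance of the row scheme for atomics/reduction overhead in the element scheme, which is exactly the trade-off I'd like to understand for our matrix.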