Investigate moving to r2c transforms
The AccFFT paper claims that r2c transforms need about half the computation and half the communication of c2c transforms. Their tests agree with that and show roughly a 2x speedup for r2c.
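For anyone who wants the reasoning behind the "half" figure: the DFT of a real signal is Hermitian-symmetric, `X[N-k] = conj(X[k])`, so only `N//2 + 1` coefficients along one axis are independent. Below is a minimal stdlib-only sketch of that property. It uses a naive O(N^2) DFT rather than AccFFT or heFFTe, and the helper names (`dft`, `r2c_from_full`, `rebuild_full`) are made up for illustration.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT, for illustration only (not a real FFT library call)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def r2c_from_full(x):
    """Keep only the N//2 + 1 non-redundant coefficients of a real signal."""
    return dft(x)[: len(x) // 2 + 1]


def rebuild_full(half, n):
    """Recover the full c2c spectrum from the half spectrum.

    Uses the Hermitian symmetry X[n - k] = conj(X[k]) of real input.
    """
    full = list(half)
    for k in range(len(half), n):
        full.append(half[n - k].conjugate())
    return full


# Arbitrary real test data (hypothetical values, not from our code).
signal = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -0.75, 4.0]

full = dft(signal)             # what a c2c transform would store: N values
half = r2c_from_full(signal)   # what an r2c transform stores: N//2 + 1 values
rebuilt = rebuild_full(half, len(signal))

# Largest difference between the true spectrum and the one rebuilt from half.
max_err = max(abs(a - b) for a, b in zip(full, rebuilt))
print(len(full), len(half), max_err)
```

In 3D the same reduction applies along one axis only, so the stored and communicated spectrum shrinks from `N0*N1*N2` to `N0*N1*(N2//2 + 1)` complex values, which is presumably where the paper's factor of about two comes from.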
With an in-place r2c transform, I think most of the machinery in the code stays the same. @4pf, we originally used r2c with AccFFT, right? And then switched to c2c for compatibility with heFFTe, since c2c was all it supported at the time? Do we already have any performance numbers for r2c vs. c2c?