Improve copies to temporary buffer for FFTs
Two potential improvements:
- Use a single global memory allocation for the temporary view that holds the data for each batch of FFTs, instead of each FFTFieldSet having its own allocation.
- Move the loop over the batch members for the copy into the kernel, so that a single launch covers every batch member and more parallelism is available.
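
A minimal sketch of both ideas in plain C++, to make the proposal concrete. All names here (`shared_fft_buffer`, `copy_batch_flat`) are hypothetical and not taken from the code base, and the serial loop only stands in for what would be a single parallel dispatch over the flattened index range in the real implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical shared scratch buffer: one allocation reused by every
// field set, grown only when a larger batch is requested.
inline std::vector<double>& shared_fft_buffer(std::size_t required_size) {
    static std::vector<double> buffer;
    if (buffer.size() < required_size) {
        buffer.resize(required_size);
    }
    return buffer;
}

// Copy all batch members using one flattened index range of size
// n_batch * n_points. On a device, this single range would correspond to
// one kernel launch instead of one launch per batch member.
inline void copy_batch_flat(const std::vector<const double*>& fields,
                            std::size_t n_points, double* buffer) {
    const std::size_t n_batch = fields.size();
    for (std::size_t idx = 0; idx < n_batch * n_points; ++idx) {
        const std::size_t b = idx / n_points;  // batch member
        const std::size_t i = idx % n_points;  // point within that member
        buffer[b * n_points + i] = fields[b][i];
    }
}
```

With the batch index folded into the iteration space, the amount of parallel work per launch scales with the batch size rather than with a single field, which is the main motivation for the second bullet.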