Different kinds of engines and avoiding copies
Created by: germasch
So while trying to avoid stirring up specifics on API design, I think it would be useful to have a general overview of what the different engines do under the hood, and where copies / buffering can potentially be avoided.
I certainly haven't looked in detail at what the various engines do, so I may be wrong on some aspects, and I'll definitely be incomplete. I think this could also be useful for compiling a list of behavior that app developers need to know, in particular which calls for which engines might be blocking, or, e.g., MPI collective, and under which circumstances.
## BP3

I think I have a somewhat decent understanding of what happens with file-backed BP3 when using one file per process.
- `PutSync` will take the data and copy it into its internal buffer, which is essentially the content of the eventual file (or a chunk of it, if too large to buffer) kept in memory. It should be non-blocking. It might trigger a realloc to extend the buffer, and might trigger a flush when a max buffer size is set and reached.
- `PutDeferred` only stores the data pointer and associated info for later use in `PerformPuts`.
- `PerformPuts` essentially runs `PutSync` on all data pointers previously stored by `PutDeferred`, so the same copy and blocking behavior applies.
- `EndStep` may flush; otherwise it doesn't do much beyond `PerformPuts`.
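The difference between the two Put modes above can be condensed into a minimal sketch. This is plain C++ modeling the described behavior, not actual ADIOS2 code; all names (`ToyBP3Engine`, `BufferedBytes`, etc.) are made up for illustration: a sync put copies into the internal buffer immediately, a deferred put only records the pointer, and `PerformPuts` does the copies later.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of BP3-style buffering (illustrative only, not ADIOS2 code).
class ToyBP3Engine {
public:
    // "PutSync": copy user data into the internal buffer right away.
    void PutSync(const double* data, std::size_t count) { Append(data, count); }

    // "PutDeferred": only remember the pointer; the user must keep the
    // memory valid until PerformPuts (or EndStep).
    void PutDeferred(const double* data, std::size_t count) {
        deferred_.push_back({data, count});
    }

    // "PerformPuts": run the copy for every deferred entry.
    void PerformPuts() {
        for (const auto& d : deferred_) Append(d.data, d.count);
        deferred_.clear();
    }

    std::size_t BufferedBytes() const { return buffer_.size(); }

private:
    struct Entry { const double* data; std::size_t count; };
    void Append(const double* data, std::size_t count) {
        const auto* p = reinterpret_cast<const uint8_t*>(data);
        // insert may trigger a realloc, like the buffer extension noted above
        buffer_.insert(buffer_.end(), p, p + count * sizeof(double));
    }
    std::vector<uint8_t> buffer_;  // in-memory image of the eventual file
    std::vector<Entry> deferred_;  // pointers recorded by PutDeferred
};
```

The key consequence for applications: after `PutDeferred`, the buffer contents haven't changed yet, so the user buffer must stay alive and unmodified until `PerformPuts`.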
In the typical case, using either `PutSync` or `PutDeferred` will involve two copies to get the data into the file: (1) from the application memory into the BP3 buffer, (2) from the BP3 buffer to kernel space. (I'll stop there; if an actual disk is attached, the controller may DMA from the kernel buffer. For remote filesystems, which I suppose is the typical use case, the network adapter may be doing the DMAing, or RDMAing.) The Span interface makes it possible to avoid copy (1), with some caveats, but isn't otherwise used to extend the lifetime of the user-provided data.
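The Span idea can be sketched the same way (again a toy model, not ADIOS2's actual API): instead of copying, the engine grows its buffer and hands back an offset into it, and the application writes its data into that region directly, which is how copy (1) goes away.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sketch of a Span-style interface (illustrative, not the real ADIOS2 API).
class ToySpanEngine {
public:
    // Reserve room in the internal buffer and return its offset; the caller
    // fills that region in place, avoiding the copy from user memory.
    // Caveat (one of those mentioned above): a later reallocation may move
    // the buffer, so keep the offset, not a raw pointer.
    std::size_t ReserveSpan(std::size_t nbytes) {
        std::size_t offset = buffer_.size();
        buffer_.resize(buffer_.size() + nbytes);  // may realloc/move the buffer
        return offset;
    }

    // Resolve the offset to a writable pointer at time of use.
    double* At(std::size_t offset) {
        return reinterpret_cast<double*>(buffer_.data() + offset);
    }

    std::size_t BufferedBytes() const { return buffer_.size(); }

private:
    std::vector<uint8_t> buffer_;  // in-memory image of the eventual file
};
```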
TODO: What happens when MPI aggregation is involved? I suppose MPI blocking or even collective behavior may be involved.
## InsituMPI

(I'm just going by what I see looking through the code.)
- `PutSync` is only supported for single values, not for array data. Single values will be buffered locally. If a max buffer size is set and reached, it will throw an exception (not handled). Otherwise it will not block.
- `PutDeferred` will `MPI_Isend` the data if definitions are locked on both the sender and receiver side. Otherwise, the data pointer will be retained for later use in `PerformPuts`. As above, it may run out of buffer space, which is not handled. Otherwise it will not block.
- `PerformPuts` collectively communicates metadata info. It will then `MPI_Isend` the deferred variables that were not already sent in `PutDeferred` above. It will not wait for the `MPI_Isend`s to complete.
- `EndStep` will wait for the `MPI_Isend`s to complete, and wait for acknowledgement from readers (collective). It may potentially block for a long time if the reader isn't ready for the step yet.
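The control flow described in the list above can be condensed into a sketch. This is a toy model with plain C++ stand-ins for the MPI calls (`StartSend` is a placeholder for `MPI_Isend`, the body of `EndStep` stands in for `MPI_Waitall` plus the reader acknowledgement); it only models when sends start and when pointers are retained, not actual communication.

```cpp
#include <cstddef>
#include <vector>

// Toy control-flow model of the InsituMPI writer path (illustrative only).
struct ToyInsituWriter {
    bool definitionsLocked = false;       // locked on both sender and receiver?
    std::size_t inFlight = 0;             // stands in for pending MPI_Isend requests
    std::vector<const void*> deferred;    // pointers held back until PerformPuts

    void StartSend(const void*) { ++inFlight; }  // placeholder for MPI_Isend

    void PutDeferred(const void* data) {
        if (definitionsLocked)
            StartSend(data);              // send right away, no local copy
        else
            deferred.push_back(data);     // retain pointer for PerformPuts
    }

    void PerformPuts() {
        // (collective metadata exchange would happen here)
        for (const void* d : deferred) StartSend(d);
        deferred.clear();
        // note: does NOT wait for the sends to complete
    }

    void EndStep() {
        inFlight = 0;  // placeholder for MPI_Waitall + reader acknowledgement
    }
};
```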
The Span interface, if it were implemented, could allow avoiding a copy into the internal buffer. The way it looks to me, that copy could also be avoided by having `PutDeferred` directly `MPI_Isend()` the user buffer, but in this case `PerformPuts` would have to `MPI_Wait`, instead of waiting until `EndStep`.
## SST

(I'm just going by what I see looking through the code, and I'm only looking at BP3 marshaling.)
- `PutSync` copies into the BP3 buffer. It won't block. It doesn't check `resizeResult`, so either it is impossible to set `MaxBufferSize`, or there's a bug in that it should at least throw an exception, since it can't handle a flush.
- `PutDeferred` actually just calls `PutSync`, so the same behavior applies.
- `PerformPuts` does nothing (since nothing is ever deferred).
- `EndStep` provides the timestep to the receiver. It does not wait for the receiver to actually receive it; consequently, the internal buffer needs to be retained until a later time.
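The retention consequence of that last point can be sketched as follows. This is a toy model (all names invented): since `EndStep` hands the finished step to the reader without waiting, the writer can't reuse the buffer, and instead has to keep each step's buffer alive until the reader releases it.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model of SST-style step retention (illustrative only, not ADIOS2 code).
struct ToySSTWriter {
    std::vector<uint8_t> current;               // buffer being filled this step
    std::deque<std::vector<uint8_t>> retained;  // steps not yet released by readers

    void PutSync(const uint8_t* p, std::size_t n) {
        current.insert(current.end(), p, p + n);  // copy into the BP3-style buffer
    }

    void EndStep() {
        // Provide the step to the reader without waiting for it to be received:
        // the buffer moves to the retained queue instead of being freed/reused.
        retained.push_back(std::move(current));
        current.clear();
    }

    // Called later, asynchronously, once the reader is done with the oldest step.
    void ReaderReleasedStep() { retained.pop_front(); }
};
```

This decoupling is what lets the writer run ahead of the reader, at the cost of holding multiple step buffers in memory at once.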
The copy into the BP3 buffer could be avoided using the Span interface. It could also potentially be avoided by starting communication in `PutDeferred` and waiting for completion in `PerformPuts`; that, however, may not be desirable, as it looks to me like the goal is to not have sender and receiver go in lockstep, i.e., to prevent blocking in `PerformPuts` or `EndStep`.
## Other engines: TBD
A lot of the current behavior is implied by the use of the underlying BP3 marshaling in the various writer engines (SST has an alternative, but I haven't looked at that). Since that marshaling is built around a single buffer, it requires a copy to get the data into the right position, unless the Span interface is used. A side effect of maintaining its own buffer is that the engine essentially assumes ownership of the user-provided data (by copying it).
If people want to extend / preserve / correct this list, one option would be to put it into a wiki page. Though it looks like people like me then wouldn't be able to edit it ;(