Different kinds of engines and avoiding copies
Created by: germasch
So while trying to avoid stirring up specifics on API design, I think it would be useful to have a general overview of what the different engines do under the hood, and where copies / buffering can potentially be avoided.
I certainly haven't looked in detail at what the various engines do, so I may be wrong on some aspects, and I'll definitely be incomplete. I think this could also be useful for compiling a list of behaviors that app developers need to know about, in particular which calls for which engines might be blocking or, e.g., MPI collective, and under which circumstances.
I think I have a somewhat decent understanding of what happens with file-backed BP3, when using one file per proc.
`PutSync` will take the data and copy it into its internal buffer, which is essentially the content of the eventual file (or a chunk of it, if it is too large to buffer) kept in memory. Should be non-blocking. Might trigger a realloc of the buffer to extend it. Might trigger a flush when a max buffer size is set and reached.
`PutDeferred` only stores the data pointer and associated info for later use in `PerformPuts`.
`PerformPuts` calls `PutSync` on all data pointers previously stored by `PutDeferred`, so the same copy and blocking behavior applies.
`EndStep` may flush, otherwise doesn't do much beyond `PerformPuts`.
In the typical case, using either `PutSync` or `PutDeferred` will involve two copies to get the data into the file: (1) from the application memory into the BP3 buffer, (2) from the BP3 buffer to kernel space. (I'll stop here. If an actual disk is attached, the controller may DMA from the kernel buffer. For remote filesystems, which I suppose is the typical use case, the network adapter may be doing the DMAing, or RDMAing.) The Span interface allows avoiding copy (1), with some caveats, but isn't otherwise used to extend the lifetime of the user-provided data.
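As a rough illustration of the single-buffer behavior described above, here is a toy C++ model. It is not ADIOS2 code: `ToyBufferedWriter` and all of its members are hypothetical names, a second `std::vector` stands in for the file so that copy (2) is visible, and `EndStep` always flushes, which is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical toy model, not ADIOS2 code: one growable byte buffer stands in
// for the in-memory image of the output file.
class ToyBufferedWriter
{
public:
    // Copies the data into the internal buffer right away (copy 1).
    void PutSync(const void *data, std::size_t nbytes)
    {
        assert(data != nullptr || nbytes == 0);
        const std::size_t offset = m_Buffer.size();
        m_Buffer.resize(offset + nbytes); // may reallocate the buffer
        if (nbytes > 0)
        {
            std::memcpy(m_Buffer.data() + offset, data, nbytes);
        }
        ++m_CopiesIntoBuffer;
    }

    // Only records pointer and size. The caller has to keep the data alive
    // until PerformPuts or EndStep has run.
    void PutDeferred(const void *data, std::size_t nbytes)
    {
        m_Deferred.push_back({data, nbytes});
    }

    // Runs PutSync on everything recorded by PutDeferred.
    void PerformPuts()
    {
        for (const Pending &p : m_Deferred)
        {
            PutSync(p.data, p.nbytes);
        }
        m_Deferred.clear();
    }

    // Performs pending puts, then "flushes" (copy 2). In this toy model the
    // flush always happens and the "file" is just another vector.
    void EndStep()
    {
        PerformPuts();
        m_File.insert(m_File.end(), m_Buffer.begin(), m_Buffer.end());
        m_Buffer.clear();
    }

    std::size_t CopiesIntoBuffer() const { return m_CopiesIntoBuffer; }
    std::size_t PendingCount() const { return m_Deferred.size(); }
    const std::vector<char> &File() const { return m_File; }

private:
    struct Pending
    {
        const void *data;
        std::size_t nbytes;
    };

    std::vector<char> m_Buffer;
    std::vector<char> m_File;
    std::vector<Pending> m_Deferred;
    std::size_t m_CopiesIntoBuffer = 0;
};
```

In this model a deferred put costs exactly the same copy as a synchronous one; it only happens later. A side effect is that changes made to the user data between `PutDeferred` and `EndStep` end up in the file.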
TODO: What happens when MPI aggregation is involved? I suppose MPI blocking or even collective behavior may be involved.
(I'm just going by what I see looking through the code)
`PutSync` is only supported for single values, not for array data. Single values will be buffered locally. If a max buffer size is set and reached, it will throw an exception (not handled). Otherwise it will not block.
`PutDeferred` will `MPI_Isend` the data if definitions are locked on both the sender and the receiver side. Otherwise, the data pointer will be retained for later use in `PerformPuts`. As above, it may run out of buffer space, which is not handled. Otherwise it will not block.
`PerformPuts` collectively communicates metadata info. It will then `MPI_Isend` the `PutDeferred` variables that were not already sent in `PutDeferred` above. It will not wait for the `MPI_Isend`s to complete.
`EndStep` will wait for the `MPI_Isend`s to complete, and wait for acknowledgement from readers (collective). It may potentially block for a long time if the reader isn't ready for the step yet.
The Span interface, if it were implemented, could allow avoiding a copy into the internal buffer. The way it looks to me, that copy could also be avoided by having `PutDeferred` `MPI_Isend()` the user buffer, but in this case `PerformPuts` would have to `MPI_Wait`, instead of waiting until `EndStep`.
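The lifetime issue behind this can be sketched without MPI. With a nonblocking send, the send buffer must not be modified until the request has completed, so the point where the writer waits decides how long the application has to leave its data alone. The toy model below is an assumption-laden stand-in, not the engine's code: `ToyRequest` and `ToyStagingWriter` are hypothetical names, and the transfer is modelled as happening at completion time, which is the latest moment a real nonblocking send could still read the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical stand-in for a nonblocking send request. The "transfer" is
// performed in Complete(), i.e. as late as a real send would be allowed to
// read the source buffer.
struct ToyRequest
{
    const double *source;
    std::size_t count;
    std::vector<double> *destination;

    void Complete() const { destination->assign(source, source + count); }
};

class ToyStagingWriter
{
public:
    explicit ToyStagingWriter(bool zeroCopy) : m_ZeroCopy(zeroCopy) {}

    // Starts a "send". In zero-copy mode the request reads user memory
    // directly; otherwise the data is first copied into writer-owned storage.
    void PutDeferred(const double *data, std::size_t count,
                     std::vector<double> &receiver)
    {
        assert(data != nullptr || count == 0);
        if (m_ZeroCopy)
        {
            m_Requests.push_back({data, count, &receiver});
        }
        else
        {
            // std::deque keeps references to existing elements valid on
            // push/emplace at the end, so earlier requests stay valid.
            m_Staging.emplace_back(data, data + count);
            m_Requests.push_back({m_Staging.back().data(), count, &receiver});
        }
    }

    // Completes all outstanding "sends". Only after this returns may a
    // zero-copy caller safely modify its buffers.
    void WaitAll()
    {
        for (const ToyRequest &r : m_Requests)
        {
            r.Complete();
        }
        m_Requests.clear();
        m_Staging.clear();
    }

private:
    bool m_ZeroCopy;
    std::deque<std::vector<double>> m_Staging;
    std::vector<ToyRequest> m_Requests;
};
```

If the application overwrites its array between `PutDeferred` and `WaitAll`, the copying writer still delivers the original values, while the zero-copy writer delivers whatever is in the array at completion time. That is why, in the zero-copy variant, the wait would have to happen in the call after which the application is allowed to reuse its data.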
(I'm just going by what I see looking through the code, and I'm only looking at BP3 marshaling)
`PutSync`: copies into the BP3 buffer. Won't block. Doesn't check `resizeResult`, so either it is impossible to set `MaxBufferSize`, or there's a bug: it should at least throw an exception, because it can't handle a flush.
`PutDeferred`: actually just calls `PutSync`, so same behavior.
`PerformPuts`: does nothing (since nothing is ever deferred).
`EndStep`: provides the timestep to the receiver. Does not wait for the receiver to actually receive it; consequently, the internal buffer needs to be retained until a later time.
The copy into the BP3 buffer could be avoided using the Span interface. It could also potentially be avoided by starting communication in `PutDeferred` and waiting for completion in `PerformPuts`. That, however, may not be desirable, as it looks to me that the goal is to not have sender and receiver go in lockstep, i.e., to prevent the writer from blocking.
Other engines: TBD
A lot of the current behavior is implied by the use of the underlying BP3 marshaling in the various writer engines (SST has an alternative, but I haven't looked at that). Since that is built around a single buffer, it requires a copy to get the data into the right position, unless the Span interface is used. A side effect of maintaining its own buffer is that it essentially assumes ownership of the user-provided data (by copying it).
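The idea behind a span-style interface can be shown with another toy model: instead of copying finished data into the buffer, the writer reserves room at the right position and hands out a view that the application fills in place. This is a sketch under stated assumptions, not the ADIOS2 Span API: `ToySpan`, `ToySpanWriter`, and `PutSpan` are hypothetical names, and the buffer has a fixed capacity so that views are never invalidated by a reallocation (one of the caveats a real implementation has to deal with).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical view into the writer's buffer; not the ADIOS2 Span type.
struct ToySpan
{
    double *data;
    std::size_t size;
};

class ToySpanWriter
{
public:
    // Reserve the full capacity up front so the buffer never reallocates and
    // previously handed-out spans stay valid.
    explicit ToySpanWriter(std::size_t capacityInDoubles)
    {
        m_Buffer.reserve(capacityInDoubles);
    }

    // Copying path: the data already lives in application memory.
    void Put(const double *data, std::size_t count)
    {
        assert(m_Buffer.size() + count <= m_Buffer.capacity());
        m_Buffer.insert(m_Buffer.end(), data, data + count);
        ++m_Copies;
    }

    // Span path: reserve room at the current position and return a view that
    // the application writes into directly, so no copy is needed later.
    ToySpan PutSpan(std::size_t count)
    {
        const std::size_t offset = m_Buffer.size();
        assert(offset + count <= m_Buffer.capacity());
        m_Buffer.resize(offset + count);
        return {m_Buffer.data() + offset, count};
    }

    std::size_t Copies() const { return m_Copies; }
    const std::vector<double> &Buffer() const { return m_Buffer; }

private:
    std::vector<double> m_Buffer;
    std::size_t m_Copies = 0;
};
```

The application would call `PutSpan`, compute its results straight into `span.data`, and never hold a separate copy of that variable. The price is that the writer, not the application, decides where the data lives and for how long the view is valid.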
If people want to extend / preserve / correct this list, one option would be to put it into a wiki page. Though it looks like people like me then wouldn't be able to edit it ;(