Different kinds of engines and avoiding copies
Created by: germasch
So while trying to avoid stirring up specifics on API design, I think it would be useful to have a general overview of what the different engines do under the hood, and where copies / buffering can potentially be avoided.
I certainly haven't looked in detail at what the various engines do, so I may be wrong on some aspects, and I'll definitely be incomplete. I think this could also be useful for compiling a list of behavior that app developers need to know, in particular which calls for which engines might be blocking, or, e.g., MPI collective, and under which circumstances.
## BP3

I think I have a somewhat decent understanding of what happens with file-backed BP3 when using one file per process.
- `PutSync` will take the data and copy it into its internal buffer, which is essentially the content of the eventual file (or a chunk of it, if too large to buffer) kept in memory. It should be non-blocking. It might trigger a realloc to extend the buffer, and might trigger a flush when a max buffer size is set and reached.
- `PutDeferred` only stores the data pointer and associated info for later use in `PerformPuts`.
- `PerformPuts` essentially runs `PutSync` on all data pointers previously stored by `PutDeferred`, so the same copy and blocking behavior applies.
- `EndStep` may flush; otherwise it doesn't do much beyond `PerformPuts`.
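The difference between the two Put modes above can be condensed into a minimal sketch. This is plain C++ modeling the described behavior, not actual ADIOS2 code; all names (`ToyBP3Engine`, `BufferedBytes`, etc.) are made up for illustration: a sync put copies into the internal buffer immediately, a deferred put only records the pointer, and `PerformPuts` does the copies later.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of BP3-style buffering (illustrative only, not ADIOS2 code).
class ToyBP3Engine {
public:
    // "PutSync": copy user data into the internal buffer right away.
    void PutSync(const double* data, std::size_t count) { Append(data, count); }

    // "PutDeferred": only remember the pointer; the user must keep the
    // memory valid until PerformPuts (or EndStep).
    void PutDeferred(const double* data, std::size_t count) {
        deferred_.push_back({data, count});
    }

    // "PerformPuts": run the copy for every deferred entry.
    void PerformPuts() {
        for (const auto& d : deferred_) Append(d.data, d.count);
        deferred_.clear();
    }

    std::size_t BufferedBytes() const { return buffer_.size(); }

private:
    struct Entry { const double* data; std::size_t count; };
    void Append(const double* data, std::size_t count) {
        const auto* p = reinterpret_cast<const uint8_t*>(data);
        // insert may trigger a realloc, like the buffer extension noted above
        buffer_.insert(buffer_.end(), p, p + count * sizeof(double));
    }
    std::vector<uint8_t> buffer_;  // in-memory image of the eventual file
    std::vector<Entry> deferred_;  // pointers recorded by PutDeferred
};
```

The key consequence for applications: after `PutDeferred`, the buffer contents haven't changed yet, so the user buffer must stay alive and unmodified until `PerformPuts`.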
In the typical case, using either `PutSync` or `PutDeferred` will involve two copies to get the data into the file: (1) from the application memory into the BP3 buffer, (2) from the BP3 buffer to kernel space. (I'll stop there; if an actual disk is attached, the controller may DMA from the kernel buffer. For remote filesystems, which I suppose is the typical use case, the network adapter may be doing the DMAing, or RDMAing.) The Span interface makes it possible to avoid copy (1), with some caveats, but isn't otherwise used to extend the lifetime of the user-provided data.
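The Span idea can be sketched the same way (again a toy model, not ADIOS2's actual API): instead of copying, the engine grows its buffer and hands back an offset into it, and the application writes its data into that region directly, which is how copy (1) goes away.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sketch of a Span-style interface (illustrative, not the real ADIOS2 API).
class ToySpanEngine {
public:
    // Reserve room in the internal buffer and return its offset; the caller
    // fills that region in place, avoiding the copy from user memory.
    // Caveat (one of those mentioned above): a later reallocation may move
    // the buffer, so keep the offset, not a raw pointer.
    std::size_t ReserveSpan(std::size_t nbytes) {
        std::size_t offset = buffer_.size();
        buffer_.resize(buffer_.size() + nbytes);  // may realloc/move the buffer
        return offset;
    }

    // Resolve the offset to a writable pointer at time of use.
    double* At(std::size_t offset) {
        return reinterpret_cast<double*>(buffer_.data() + offset);
    }

    std::size_t BufferedBytes() const { return buffer_.size(); }

private:
    std::vector<uint8_t> buffer_;  // in-memory image of the eventual file
};
```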
TODO: What happens when MPI aggregation is involved? I suppose MPI blocking or even collective behavior may be involved.
## InsituMPI

(I'm just going by what I see looking through the code.)
- `PutSync` is only supported for single values, not for array data. Single values will be buffered locally. If a max buffer size is set and reached, it will throw an exception (not handled). Otherwise it will not block.
- `PutDeferred` will `MPI_Isend` the data if definitions are locked on both the sender and receiver side. Otherwise, the data pointer will be retained for later use in `PerformPuts`. As above, it may run out of buffer space, which is not handled. Otherwise it will not block.
- `PerformPuts` collectively communicates metadata info. It will then `MPI_Isend` the deferred variables that were not already sent in `PutDeferred` above. It will not wait for the `MPI_Isend`s to complete.
- `EndStep` will wait for the `MPI_Isend`s to complete, and wait for acknowledgement from readers (collective). It may potentially block for a long time if the reader isn't ready for the step yet.
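The control flow described in the list above can be condensed into a sketch. This is a toy model with plain C++ stand-ins for the MPI calls (`StartSend` is a placeholder for `MPI_Isend`, the body of `EndStep` stands in for `MPI_Waitall` plus the reader acknowledgement); it only models when sends start and when pointers are retained, not actual communication.

```cpp
#include <cstddef>
#include <vector>

// Toy control-flow model of the InsituMPI writer path (illustrative only).
struct ToyInsituWriter {
    bool definitionsLocked = false;       // locked on both sender and receiver?
    std::size_t inFlight = 0;             // stands in for pending MPI_Isend requests
    std::vector<const void*> deferred;    // pointers held back until PerformPuts

    void StartSend(const void*) { ++inFlight; }  // placeholder for MPI_Isend

    void PutDeferred(const void* data) {
        if (definitionsLocked)
            StartSend(data);              // send right away, no local copy
        else
            deferred.push_back(data);     // retain pointer for PerformPuts
    }

    void PerformPuts() {
        // (collective metadata exchange would happen here)
        for (const void* d : deferred) StartSend(d);
        deferred.clear();
        // note: does NOT wait for the sends to complete
    }

    void EndStep() {
        inFlight = 0;  // placeholder for MPI_Waitall + reader acknowledgement
    }
};
```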
The Span interface, if it were implemented, could allow avoiding a copy into the internal buffer. The way it looks to me, that copy could also be avoided by having `PutDeferred` directly `MPI_Isend()` the user buffer, but in this case `PerformPuts` would have to `MPI_Wait`, instead of waiting until `EndStep`.
## SST

(I'm just going by what I see looking through the code, and I'm only looking at BP3 marshaling.)
- `PutSync` copies into the BP3 buffer. It won't block. It doesn't check `resizeResult`, so either it is impossible to set `MaxBufferSize`, or there's a bug in that it should at least throw an exception, since it can't handle a flush.
- `PutDeferred` actually just calls `PutSync`, so the same behavior applies.
- `PerformPuts` does nothing (since nothing is ever deferred).
- `EndStep` provides the timestep to the receiver. It does not wait for the receiver to actually receive it; consequently, the internal buffer needs to be retained until a later time.
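The retention consequence of that last point can be sketched as follows. This is a toy model (all names invented): since `EndStep` hands the finished step to the reader without waiting, the writer can't reuse the buffer, and instead has to keep each step's buffer alive until the reader releases it.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model of SST-style step retention (illustrative only, not ADIOS2 code).
struct ToySSTWriter {
    std::vector<uint8_t> current;               // buffer being filled this step
    std::deque<std::vector<uint8_t>> retained;  // steps not yet released by readers

    void PutSync(const uint8_t* p, std::size_t n) {
        current.insert(current.end(), p, p + n);  // copy into the BP3-style buffer
    }

    void EndStep() {
        // Provide the step to the reader without waiting for it to be received:
        // the buffer moves to the retained queue instead of being freed/reused.
        retained.push_back(std::move(current));
        current.clear();
    }

    // Called later, asynchronously, once the reader is done with the oldest step.
    void ReaderReleasedStep() { retained.pop_front(); }
};
```

This decoupling is what lets the writer run ahead of the reader, at the cost of holding multiple step buffers in memory at once.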
The copy into the BP3 buffer could be avoided using the Span interface. It could also potentially be avoided by starting communication in `PutDeferred` and waiting for completion in `PerformPuts`; that, however, may not be desirable, as it looks to me like the goal is to not have sender and receiver go in lockstep, i.e., to prevent blocking in `PerformPuts` or `EndStep`.
## Other engines: TBD
A lot of the current behavior is implied by the use of the underlying BP3 marshaling in the various writer engines (SST has an alternative, but I haven't looked at that). Since that marshaling is built around a single buffer, it requires a copy to get the data into the right position, unless the Span interface is used. A side effect of maintaining its own buffer is that the engine essentially assumes ownership of the user-provided data (by copying it).
If people want to extend / preserve / correct this list, one option would be to put it into a wiki page. Though it looks like people like me then wouldn't be able to edit it ;(