data layout / BP-X

Created by: germasch

This is somewhat BP-X related, but also a more generic API issue, so I figured I'll put the details here.

Language-independent handling of row-major vs col-major data layout

Here's a proposal to increase flexibility in handling data layout while keeping the existing behavior / APIs unchanged.

Rationale

Currently, the ADIOS2 API is built around the assumption that the language of the API bindings determines the data layout that the application is using, in particular Fortran/Matlab/R: column-major; otherwise: row-major (see adiosSystem.cpp). That covers most common applications. In particular reading and writing data from the same language works nicely. However, mixing languages or data layouts is more complicated.

When Fortran writes, say, a 400x200 (column-major) array, it will be presented in the C/C++ API as a 200x400 row-major array. In order to present the data as a row-major array, there are only two options: reverse the shape, or transpose the data. Since transposition is potentially expensive, ADIOS2 makes the perfectly reasonable choice to instead reverse the dimensions.
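To make the "reverse the shape" option concrete, here is a minimal standalone sketch (plain C++, no ADIOS2 involved; the function names are made up for illustration). It checks that element (i, j) of a column-major (n0, n1) array lives at the same memory offset as element (j, i) of a row-major (n1, n0) array, which is why reversing dims needs no data movement.

```cpp
#include <cassert>
#include <cstddef>

// Offset of element (i, j) in a column-major array of shape (n0, n1):
// the first index varies fastest in memory.
std::size_t colMajorOffset(std::size_t i, std::size_t j, std::size_t n0, std::size_t n1)
{
    assert(i < n0 && j < n1);
    return i + n0 * j;
}

// Offset of element (i, j) in a row-major array of shape (m0, m1):
// the last index varies fastest in memory.
std::size_t rowMajorOffset(std::size_t i, std::size_t j, std::size_t m0, std::size_t m1)
{
    assert(i < m0 && j < m1);
    return i * m1 + j;
}

// True if a column-major (n0, n1) array and a row-major (n1, n0) array
// address exactly the same memory once the indices are swapped as well.
bool reversedDimsAliasSameMemory(std::size_t n0, std::size_t n1)
{
    for (std::size_t i = 0; i < n0; ++i)
    {
        for (std::size_t j = 0; j < n1; ++j)
        {
            if (colMajorOffset(i, j, n0, n1) != rowMajorOffset(j, i, n1, n0))
            {
                return false;
            }
        }
    }
    return true;
}
```

Calling reversedDimsAliasSameMemory(400, 200) covers the 400x200 Fortran array from the paragraph above: the same bytes can be read as a 200x400 row-major array, but the reader has to swap its indices too.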

However, the assumption that the language of the API determines the data layout used is not always valid. For example:

The main part of a legacy code remains written in Fortran, hence it's using column-major data layout for its arrays. However, some parts of the code are rewritten in C++, e.g., in order to enable use of CUDA / Kokkos (this is currently happening for XGC). At this point, if one writes data using the adios2 C++ API, adios2 assumes that the data layout is row-major, even though it actually is not. This can be worked around on the application side: the app reverses dims itself and pretends that the column-major data is row-major. However, one loses the information whether a dataset is actually row-major, or just pretending to be.
Essentially the same issue happens in a pure C++ code that uses data structures that aren't (necessarily) row-major, like, e.g., xtensor or Kokkos Views. Again, it can be worked around by reversing dims on the application side, but the info what the real layout is gets lost.
A general visualization app, say written in C++, sees a data set of size (200x400). It doesn't know whether that means there are really 200 points in x direction and 400 points in y direction, or whether it is looking at an actual 400x200 data set that had been written using the Fortran API. This is somewhat solved for the unique language->layout mapping case by #1573, though (a) the app has to go down to the block level to check the IsReverseDims flag and (b) the IsReverseDims flag doesn't directly tell the original data layout, but requires additional logic from the application. The viz app then has to reverse dims itself again to read data into a data structure using the original layout.

The visualization app still has no way of knowing whether the writing app had reversed the dims itself because it falls into either of the cases above.

How relevant is this?

XGC is going to a mixed Fortran / C++ model, where C++ is using col-major data structures. For now, writing the data is still handled in Fortran, so there is no actual issue yet.
My particle-in-cell code (PSC) falls into the category of a C++ code using column-major data layout (for historic reasons), so I'm currently passing fake (reversed) dims to ADIOS2 to make things work. (Actually, the same has to be done when using HDF5, but I think this is really a weakness of HDF5 which shouldn't be perpetuated.)
@berkgeveci supported the usefulness of knowing row-major vs col-major in #1553 (closed).

Proposed data format changes

BP3 / BP4 have global flags that indicate the host language, i.e., the assumed data layout. That's helpful, but per-dataset granularity would be better, so one could write both row-major and col-major datasets into the same file. The API changes I'm proposing don't necessarily require changes to the data format, which might well not be possible at this point for compatibility reasons, but I think it'd be good to consider this for future evolutions of the data formats. (One could use the existing attribute mechanism to store the information, but I think it's a bit iffy since there is no distinction between user-defined and system-defined attributes, so there could be conflicts if one were to do that.) The existing BP3 / BP4 would support the additional API described below, but they would not support mixing row-major and col-major datasets in a single file.

Proposed API

(I'm just making this up as I'm writing this. But I hope it'll give a useful basis for thinking about it.)

Each dataset (Variable) would have a Layout property associated with it. When creating/inquiring a variable for writing/reading, the Layout property would default to the layout implied by the current host language.

Add a Variable::SetLayout() function that would allow the application to override the default layout. E.g., in C++ I could say myVar.SetLayout(Layout::ColumnMajor) to tell adios2 that what I'm passing to it will be laid out in column-major order. For now, adios2 would just reverse the dims for me, writing the data while pretending it's the row-major layout that it's expecting. Ideally, adios2 would also record the original data layout, but that's data format dependent (see above).
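Here's a rough standalone sketch of the write side as I imagine it. Everything in it (Layout, SketchVariable, dimsWrittenToFile) is a hypothetical stand-in made up for illustration, not existing adios2 API, and it assumes a C++ host, i.e. a file that expects row-major dims.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; none of this exists in adios2 today.
enum class Layout { RowMajor, ColumnMajor };
using Dims = std::vector<std::size_t>;

struct SketchVariable
{
    Dims shape;                       // shape as the application sees it
    Layout layout = Layout::RowMajor; // default implied by a C++ host
    void SetLayout(Layout l) { layout = l; }
};

// Dims that would be recorded in a file that expects row-major data:
// column-major variables get their dims reversed, the data is untouched.
Dims dimsWrittenToFile(const SketchVariable &var)
{
    Dims dims = var.shape;
    if (var.layout == Layout::ColumnMajor)
    {
        std::reverse(dims.begin(), dims.end());
    }
    return dims;
}

// Small usage example: a 400x200 column-major array written from C++.
void exampleWrite()
{
    SketchVariable var;
    var.shape = {400, 200};
    assert((dimsWrittenToFile(var) == Dims{400, 200})); // default: unchanged
    var.SetLayout(Layout::ColumnMajor);
    assert((dimsWrittenToFile(var) == Dims{200, 400})); // reversed for the file
}
```

The point is that the application keeps passing its real shape (400x200), and the reversal that apps like PSC currently do by hand moves into the library.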

When reading, the same mechanism would apply. If I know the data I'll be reading is actually column-major, and my data structure expects the blob of data to be in column-major, I'll do a myVar.SetLayout(Layout::ColumnMajor) first. If the actual data is not already column-major, adios2 would reverse dims for me (basically just like what currently happens when file language and API host language don't match, but it'd be based on the layout I specify, not the language I'm using, though in the default case nothing would change).
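The read-side rule can be sketched the same way. Again, StoredDataset and shapeForReader are hypothetical names for illustration only, and the sketch assumes the file somehow knows the original layout of each dataset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names throughout; none of this exists in adios2 today.
enum class Layout { RowMajor, ColumnMajor };
using Dims = std::vector<std::size_t>;

struct StoredDataset
{
    Dims shape;            // shape expressed in the writer's own layout
    Layout originalLayout; // layout the writer actually used
};

// Shape presented to a reader that asks for `requested` layout:
// dims are reversed only when the requested layout differs from the
// stored one. The data itself is never transposed.
Dims shapeForReader(const StoredDataset &ds, Layout requested)
{
    Dims dims = ds.shape;
    if (ds.originalLayout != requested)
    {
        std::reverse(dims.begin(), dims.end());
    }
    return dims;
}

// Small usage example: the 400x200 array written by Fortran.
void exampleRead()
{
    const StoredDataset fromFortran{{400, 200}, Layout::ColumnMajor};
    // Default C++ reader (row-major): sees reversed dims, as today.
    assert((shapeForReader(fromFortran, Layout::RowMajor) == Dims{200, 400}));
    // Reader that called SetLayout(ColumnMajor): sees the original shape.
    assert((shapeForReader(fromFortran, Layout::ColumnMajor) == Dims{400, 200}));
}
```

In the default case (requested layout derived from the host language) this reproduces the current behavior, so existing applications wouldn't notice a difference.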

Finally, add a Variable::GetOriginalLayout() function, which would tell me, for a given variable, what the original layout was when it was written. So a dataset written by Fortran would return "ColumnMajor", while a dataset written in C++ would return "RowMajor". That way, an application that uses a data structure that supports both row-major and col-major could load the data back into the appropriate data structure without having to deal with either reversing dims or transposing data. Something like this:

// io / engine: the application's adios2::IO and adios2::Engine objects.
// GetOriginalLayout() and SetLayout() are the proposed additions (not existing API).
auto myVar = io.InquireVariable<double>(...);
auto layout = myVar.GetOriginalLayout();
myVar.SetLayout(layout);
auto shape = myVar.Shape();
xt::xarray<double, xt::layout_type::dynamic> var;
if (layout == Layout::RowMajor) {
  var = {shape, xt::layout_type::row_major};
} else {
  var = {shape, xt::layout_type::column_major};
}
engine.Get(myVar, var.data());
// no need to worry about data layout or reversed dims in the
// var we read from here on out, it'll just work :)
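For anyone without xtensor at hand, the "it'll just work" part can be illustrated with a tiny hand-rolled 2D view (a sketch only; View2D and makeView are made-up names, not part of adios2 or xtensor). Once the view is built with the original layout, the same logical index (i, j) reaches the right element regardless of how the blob is laid out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper types, standing in for a layout-aware container.
enum class Layout { RowMajor, ColumnMajor };

struct View2D
{
    const double *data;
    std::array<std::size_t, 2> shape;
    std::array<std::size_t, 2> strides; // in elements, not bytes

    double operator()(std::size_t i, std::size_t j) const
    {
        assert(i < shape[0] && j < shape[1]);
        return data[i * strides[0] + j * strides[1]];
    }
};

// Build a view over an existing blob; only the strides depend on the layout.
View2D makeView(const double *data, std::array<std::size_t, 2> shape, Layout layout)
{
    if (layout == Layout::RowMajor)
    {
        return View2D{data, shape, std::array<std::size_t, 2>{{shape[1], 1}}};
    }
    return View2D{data, shape, std::array<std::size_t, 2>{{1, shape[0]}}};
}
```

With a 3x2 blob holding 0..5, a column-major view finds element (0, 1) at offset 3, while a row-major view of the same blob finds it at offset 1; the application code indexing the view is identical in both cases.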