Skip to content

introduce DataType as separate type

Podhorszki, Norbert requested to merge github/fork/germasch/datatype into master

Created by: germasch

Alright, so this is going to be lengthy (both the code and this description). I can break this up into pieces that could be merged one at a time. (And actually, there are a couple more things that could/should be done on top of this, too).

I hadn't really planned to dive into ADIOS2, and I'll probably mostly disappear from here again soon, but I feel like I would like to see through what I started. This all began when I hit problems due to adios2 internally being templated for 11 integer types, while it really only could properly distinguish the 8 stdint types ({u,}int{8,16,32,64}_t. The PRs I provided so far, which moved most of adios2's internals to the stdint-based types have already been merged. Actually, in making those PRs, I had already used something akin to the below, which made it possible to trace where and how the various types are used and what the expected values are.

The primary objective of this PR is to introduce a class DataType as a replacement for where right now std::string is used. For most of this series, DataType actually is just a strongly typed wrapper around std::string, which serves to document where strings are used to designate an adios2 types vs just some other string vs, e.g., a FFS data type. It then also narrows down DataType to be only one out of a set of predefined strings (and doing that, it turns out while we tested again "unknown", there's actually no way a data type string would ever have been set to "unknown". Finally, it actually replaces DataType to be just a wrapper around an int, rather than around std::string. That last part could / should be continued to actually use an enum, and doing this smartly, one could use that same enum in the C interface and in bpls, avoiding some mapping code.

This PR touches a lot of code, though most of it in fairly trivial ways. So I feel I better justify it:

  • None of the bindings are affected. (Though it does open opportunities to improve them)
  • DataType is strongly typed (it converts only explicitly to/from a string where needed)
  • DataType enforces a predefined subset of types, so no typos or accidental new inventions are possible. (This had happened, though not in ways that actually caused problems, e.g., comparing a type string to "int64_t", which is not a value it ever could have had.
  • DataType gets us close to being able to rename the type strings from the kinda unpleasant primitive-based names ("unsigned long long int") to the much clearer "uint64_t".
  • adios2 uses enum classes for pretty much every other choice, only the data type is represented by std::string right now.
  • While performance is obviously going to be mostly dominated by network and file system, the code passes type strings around a lot, and compares them even more (all the ADIOS2_FOREACH_* loops). Some of them are actually in places which I think are somewhat performance sensitive, like in adiosMemory.h. Small string optimization means that for the most part dynamic memory allocation is avoided, at least, though I believe std::string("unsigned long long int") actually triggers a malloc on gcc. Comparing strings is definitely more complex than comparing integers, too.
  • As a somewhat arbitrary indication, I ran size libadios2.2.3.1.dylib on my mac:
master:
__TEXT	__DATA	__OBJC	others	dec	hex
16019456	98304	0	6811648	22929408	15de000

after this PR:
__TEXT	__DATA	__OBJC	others	dec	hex
15056896	98304	0	6815744	21970944	14f4000

That's actually a rather noticeable (1MB) decrease in __TEXT (actual machine code). I believe this mostly because I statically allocate the possible DataTypes, rather than creating the strings on demand all the time, not so much because of the string -> int change.

FWIW, after these changes I can tell that there are only a few places left that use the string representation of types. They almost all use the (successor of) GetType<T>(), so they are independent of the actual string chosen. The exception is ffs_marshal.c which as a C file cannot employ that mechanism, and hence would benefit from available enum (Though this PR doesn't go there yet). After that, the string representations are only used in debug messages and in the bindings. E.g. bpls output would be affected, and I'm not sure we'd want to go there, but that's another discussion.

Merge request reports