ITER restart writing error on Summit

Created by: jychoi-hpc

I am looking into the ITER restart writing error on Summit, which shows up in the case Seung-Hoe is trying to run.

I found a case over the weekend in which I was able to reproduce the error consistently.

One thing I found is that it happens with aggregation enabled, and the common error message looks like this:

[b03n12:54694] *** An error occurred in MPI_Isend
[b03n12:54694] *** reported by process [2154496201,1220]
[b03n12:54694] *** on communicator MPI COMMUNICATOR 13 SPLIT FROM 12
[b03n12:54694] *** MPI_ERR_COUNT: invalid count argument
[b03n12:54694] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,

which happened around the following code (MPIChain.cpp):

       helper::CheckMPIReturn(
            MPI_Isend(sendBuffer.m_Buffer.data(),
                      static_cast<int>(sendBuffer.m_Position), MPI_CHAR,
                      m_Rank - 1, 1, m_Comm, &requests[1]),
            ", aggregation Isend data at iteration " + std::to_string(step) +
                "\n");

I suspect the count argument passed to MPI_Isend, static_cast<int>(sendBuffer.m_Position), can overflow. Is there any logic to handle the case where the buffer position is larger than 2^31 - 1 (the maximum value of a signed 32-bit int)?
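To make the concern concrete, here is a minimal sketch of one possible guard, assuming the buffer position is a std::size_t and that the receiving side could be changed to post matching receives. When the position exceeds INT_MAX, the cast to int typically wraps to a negative value, which would be consistent with the MPI_ERR_COUNT above. The helper names FitsInMPICount and SplitForMPI are hypothetical and are not existing code in this repository.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: true when a byte count can be passed as the
// `int count` argument of MPI_Isend without a narrowing conversion.
inline bool FitsInMPICount(std::size_t nBytes)
{
    return nBytes <= static_cast<std::size_t>(INT_MAX);
}

// Hypothetical helper: split the byte range [0, nBytes) into
// (offset, count) pieces with every count <= maxChunk, so that each
// piece could be sent with its own MPI_Isend call.
inline std::vector<std::pair<std::size_t, int>>
SplitForMPI(std::size_t nBytes, int maxChunk = INT_MAX)
{
    assert(maxChunk > 0);
    std::vector<std::pair<std::size_t, int>> chunks;
    std::size_t offset = 0;
    while (offset < nBytes)
    {
        const std::size_t remaining = nBytes - offset;
        // Take the whole remainder if it fits, otherwise a full chunk.
        const int count = remaining < static_cast<std::size_t>(maxChunk)
                              ? static_cast<int>(remaining)
                              : maxChunk;
        chunks.emplace_back(offset, count);
        offset += static_cast<std::size_t>(count);
    }
    return chunks;
}
```

Each (offset, count) pair could then be sent as MPI_Isend(sendBuffer.m_Buffer.data() + offset, count, MPI_CHAR, ...), with one request per piece. Another option might be a derived datatype (for example via MPI_Type_contiguous) so that a single send covers more than INT_MAX bytes. I have not tested either approach against the aggregation code.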