ITER restart writing error on Summit
Created by: jychoi-hpc
I am looking at the restart writing error in the ITER case that Seung-Hoe is trying to run on Summit.
Over the weekend I found a case in which I can reproduce the error consistently.
One thing I found is that it happens with aggregation, and the common error message looks like this:
[b03n12:54694] *** An error occurred in MPI_Isend
[b03n12:54694] *** reported by process [2154496201,1220]
[b03n12:54694] *** on communicator MPI COMMUNICATOR 13 SPLIT FROM 12
[b03n12:54694] *** MPI_ERR_COUNT: invalid count argument
[b03n12:54694] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
The error is raised around the following code (MPIChain.cpp):
    helper::CheckMPIReturn(
        MPI_Isend(sendBuffer.m_Buffer.data(),
                  static_cast<int>(sendBuffer.m_Position), MPI_CHAR,
                  m_Rank - 1, 1, m_Comm, &requests[1]),
        ", aggregation Isend data at iteration " + std::to_string(step) +
            "\n");
I suspect the count argument to MPI_Isend, static_cast<int>(sendBuffer.m_Position), can overflow.
Is there any logic to handle the case where the buffer position is larger than 2^31 - 1 (INT_MAX, the limit of a signed 32-bit int)?
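
To make the concern concrete, here is a minimal sketch of one possible guard, assuming the position is a std::size_t and the count passed to MPI must fit in an int. The names FitsInMPICount and SplitForMPICount are hypothetical helpers written for this issue, not existing ADIOS2 functions, and the sketch leaves out the MPI calls themselves.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ADIOS2): true when a buffer position can
// be passed as an int count to MPI without being truncated by the cast.
bool FitsInMPICount(std::size_t position)
{
    return position <= static_cast<std::size_t>(INT_MAX);
}

// Hypothetical helper (not part of ADIOS2): split totalBytes into
// (offset, count) pieces so that every count fits in a signed int.
// maxChunk is assumed to be in the range [1, INT_MAX].
std::vector<std::pair<std::size_t, int>>
SplitForMPICount(std::size_t totalBytes,
                 std::size_t maxChunk = static_cast<std::size_t>(INT_MAX))
{
    assert(maxChunk > 0 && maxChunk <= static_cast<std::size_t>(INT_MAX));

    std::vector<std::pair<std::size_t, int>> chunks;
    std::size_t offset = 0;
    while (offset < totalBytes)
    {
        const std::size_t remaining = totalBytes - offset;
        const std::size_t n = remaining < maxChunk ? remaining : maxChunk;
        // Safe cast: n is at most maxChunk, which is at most INT_MAX.
        chunks.emplace_back(offset, static_cast<int>(n));
        offset += n;
    }
    return chunks;
}
```

Each (offset, count) pair could then go to its own MPI_Isend starting at sendBuffer.m_Buffer.data() + offset, provided the receiving rank posts matching receives in the same order. Other options would be wrapping the bytes in a larger derived datatype (for example via MPI_Type_contiguous) or, where the MPI library implements MPI 4.0, using the large-count call MPI_Isend_c. I have not checked which of these fits the aggregation logic in MPIChain.cpp.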