Crash with PGI compilers on POWER9
Created by: khuck
Describe the bug
When building and testing the gray-scott tutorial example with TAU measurement and the PGI compilers, the simulation crashes on exit. Valgrind shows ADIOS freeing the MPI communicator more than once: first in adios2::core::Engine::Close (through adios2::helper::Comm::Free), then again in the adios2::helper::Comm::~Comm destructor. The second free is fatal when PGI is the compiler and TAU is preloaded in the environment. Here's the valgrind output:
==146283== Invalid read of size 1
==146283== at 0x4723DD4: ompi_comm_invalid (communicator.h:334)
==146283== by 0x4723BDB: PMPI_Comm_free (pcomm_free.c:51)
==146283== by 0x744B2C7: adios2::helper::CommImplMPI::~CommImplMPI() (adiosCommMPI.cpp:191)
==146283== by 0x744B35F: adios2::helper::CommImplMPI::~CommImplMPI() (adiosCommMPI.cpp:194)
==146283== by 0x6B6AD2B: std::default_delete<adios2::helper::CommImpl>::operator()(adios2::helper::CommImpl*) const (unique_ptr.h:67)
==146283== by 0x6B6B5C3: std::unique_ptr<adios2::helper::CommImpl, std::default_delete<adios2::helper::CommImpl> >::~unique_ptr() (unique_ptr.h:184)
==146283== by 0x6B6D19F: adios2::helper::Comm::~Comm() (new_allocator.h:114)
==146283== by 0x695E887: adios2::core::Engine::~Engine() (Engine.cpp:31)
==146283== by 0x695E96F: adios2::core::Engine::~Engine() (Engine.cpp:31)
==146283== by 0x6D8EBB3: adios2::core::engine::BP4Writer::~BP4Writer() (BP4Writer.cpp:41)
==146283== by 0x6A4394B: void __gnu_cxx::new_allocator<adios2::core::engine::BP4Writer>::destroy<adios2::core::engine::BP4Writer>(adios2::core::engine::BP4Writer*) (new_allocator.h:124)
==146283== by 0x6A28ED7: std::enable_if<std::allocator_traits<std::allocator<adios2::core::engine::BP4Writer> >::__destroy_helper<adios2::core::engine::BP4Writer>::value, void>::type std::allocator_traits<std::allocator<adios2::core::engine::BP4Writer> >::_S_destroy<adios2::core::engine::BP4Writer>(std::allocator<adios2::core::engine::BP4Writer>&, adios2::core::engine::BP4Writer*) (alloc_traits.h:281)
==146283== Address 0xe3e2d98 is 232 bytes inside a block of size 344 free'd
==146283== at 0x408550C: free (vg_replace_malloc.c:530)
==146283== by 0x46DEC1F: ompi_comm_free (comm.c:1491)
==146283== by 0x4723C3F: PMPI_Comm_free (pcomm_free.c:62)
==146283== by 0x443136F: MPI_Comm_free (TauMpi.c:1285)
==146283== by 0x744B48B: adios2::helper::CommImplMPI::Free(std::string const&) (adiosCommMPI.cpp:201)
==146283== by 0x6B6D2EF: adios2::helper::Comm::Free(std::string const&) (adiosComm.cpp:56)
==146283== by 0x695F08F: adios2::core::Engine::Close(int) (Engine.cpp:73)
==146283== by 0x759F437: adios2::Engine::Close(int) (Engine.cpp:142)
==146283== by 0x10029467: Writer::close() (writer.cpp:125)
==146283== by 0x1000AFDB: main (main.cpp:140)
==146283== Block was alloc'd at
==146283== at 0x4083F40: malloc (vg_replace_malloc.c:299)
==146283== by 0x46DFCCB: opal_obj_new (opal_object.h:486)
==146283== by 0x46DC8BB: ompi_comm_set_nb (comm.c:160)
==146283== by 0x46DC79F: ompi_comm_set (comm.c:117)
==146283== by 0x46DDCFB: ompi_comm_dup_with_info (comm.c:1000)
==146283== by 0x46DDC2F: ompi_comm_dup (comm.c:971)
==146283== by 0x4722C17: PMPI_Comm_dup (pcomm_dup.c:63)
==146283== by 0x44312AB: MPI_Comm_dup (TauMpi.c:1268)
==146283== by 0x744B50F: adios2::helper::CommImplMPI::Duplicate(std::string const&) const (adiosCommMPI.cpp:208)
==146283== by 0x6B6D3A3: adios2::helper::Comm::Duplicate(std::string const&) const (adiosComm.cpp:60)
==146283== by 0x6A56747: adios2::core::IO::Open(std::string const&, adios2::Mode) (IO.cpp:692)
==146283== by 0x75D065B: adios2::IO::Open(std::string const&, adios2::Mode) (IO.cpp:110)
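The sequence in the trace (an explicit free during Close, followed by a second free in the destructor) can be illustrated with a minimal sketch. This is not ADIOS2 or MPI code, and it is not necessarily what the fix will look like: FakeRuntime, UnguardedComm, GuardedComm, NullHandle and InvalidFreesAfterCloseThenDestroy are all hypothetical names invented for illustration. The fake runtime only counts frees of handles that are no longer live, where a real MPI library would read freed memory.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the MPI runtime: it only tracks live handles.
class FakeRuntime
{
public:
    // Analogue of duplicating a communicator: hands out a new live handle.
    int Dup()
    {
        const int handle = m_Next++;
        m_Live.insert(handle);
        return handle;
    }

    // Analogue of freeing a communicator. Freeing a handle that is not
    // live is counted here instead of crashing.
    void Free(int handle)
    {
        if (m_Live.erase(handle) == 0)
        {
            ++m_InvalidFrees;
        }
    }

    int InvalidFrees() const { return m_InvalidFrees; }

private:
    std::set<int> m_Live;
    int m_Next = 0;
    int m_InvalidFrees = 0;
};

// Hypothetical sentinel meaning "this wrapper no longer owns a handle".
constexpr int NullHandle = -1;

// Frees in both Free() and the destructor without remembering the first free.
class UnguardedComm
{
public:
    explicit UnguardedComm(FakeRuntime &rt) : m_Runtime(rt), m_Handle(rt.Dup()) {}
    ~UnguardedComm() { m_Runtime.Free(m_Handle); }
    void Free() { m_Runtime.Free(m_Handle); }

private:
    FakeRuntime &m_Runtime;
    int m_Handle;
};

// Resets its handle after the first free, so any later free is a no-op.
class GuardedComm
{
public:
    explicit GuardedComm(FakeRuntime &rt) : m_Runtime(rt), m_Handle(rt.Dup()) {}
    ~GuardedComm() { Free(); }
    void Free()
    {
        if (m_Handle != NullHandle)
        {
            m_Runtime.Free(m_Handle);
            m_Handle = NullHandle;
        }
    }

private:
    FakeRuntime &m_Runtime;
    int m_Handle;
};

// Mimics the order seen in the trace: explicit free, then destruction.
template <class CommT>
int InvalidFreesAfterCloseThenDestroy()
{
    FakeRuntime runtime;
    {
        CommT comm(runtime);
        comm.Free(); // analogue of Engine::Close -> Comm::Free
    }                // analogue of ~Comm running at scope exit
    return runtime.InvalidFrees();
}
```

Calling InvalidFreesAfterCloseThenDestroy<UnguardedComm>() reports one invalid free, while the GuardedComm variant reports none. In real MPI code the analogous guard would compare the handle against MPI_COMM_NULL before calling MPI_Comm_free.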
To Reproduce
- Build ADIOS2 with PGI compiler and PGI build of MPI
- Build TAU with PGI compiler and PGI build of MPI
- Build the gray-scott tutorial example with PGI
- Run with
mpirun -np ... tau_exec -T pgi build/gray-scott ...
and the program will crash on exit.
Expected behavior
No crash.
Desktop (please complete the following information):
- OS/Platform: IBM POWER9 system at UO (similar to a Summit node)
- Build: PGI 19.4, the MPI that comes with PGI, ADIOS2 master, TAU current master
Additional context
I'll be submitting a PR to fix this bug in a few minutes...