HeatTransfer+BPfile+Summit segfault
Created by: philip-davis
Fresh clone of the master branch on summitdev. Modules:
Currently Loaded Modules:
1) hsi/5.0.2.p5 2) xalt/0.7.5 3) lsf-tools/1.0 4) DefApps 5) gcc/7.1.0 6) spectrum-mpi/10.1.0.4-20170915 7) cmake/3.9.2
No special CMake flags. Attaching my CMake Cache file below: CMakeCache.txt
This happens when the writer is initializing, and I can reproduce it with a single writer rank:
-bash-4.2$ mpirun -n 1 heatTransfer_write_adios2 heat_bpfile.xml heat 1 1 10 10 1 1
[summitdev-login1:01588] *** Process received signal ***
[summitdev-login1:01588] Signal: Segmentation fault (11)
[summitdev-login1:01588] Signal code: Address not mapped (1)
[summitdev-login1:01588] Failing at address: (nil)
[summitdev-login1:01588] [ 0] [0x3fff7f000478]
[summitdev-login1:01588] [ 1] heatTransfer_write_adios2[0x1002ee3c]
[summitdev-login1:01588] [ 2] heatTransfer_write_adios2[0x1002e014]
[summitdev-login1:01588] [ 3] /lib64/libc.so.6(+0x24700)[0x3fff7e0c4700]
[summitdev-login1:01588] [ 4] /lib64/libc.so.6(__libc_start_main+0xc4)[0x3fff7e0c48f4]
[summitdev-login1:01588] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node summitdev-login1 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
-bash-4.2$
Debugger output:
-bash-4.2$ mpirun -n 1 gdb heatTransfer_write_adios2
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-94.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "ppc64le-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /autofs/nccs-svm1_home1/pdavis/summit/ADIOS2.main/ADIOS2/build/bin/heatTransfer_write_adios2...done.
(gdb) break main.cpp:67
Breakpoint 1 at 0x1002dfe4: file /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp, line 67.
(gdb) run heat_bpfile.xml heat 1 1 10 10 1 1
Starting program: /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/build/bin/heatTransfer_write_adios2 heat_bpfile.xml heat 1 1 10 10 1 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/autofs/nccs-svm1_sw/summitdev/gcc/7.1.0/lib64/libstdc++.so.6.0.23-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load:/usr/bin/mono-gdb.py".
To enable execution of this file add
add-auto-load-safe-path /autofs/nccs-svm1_sw/summitdev/gcc/7.1.0/lib64/libstdc++.so.6.0.23-gdb.py
line to your configuration file "/ccs/home/pdavis/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/ccs/home/pdavis/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"
[New Thread 0x3fffb6bff1b0 (LWP 4678)]
[New Thread 0x3fffb63ff1b0 (LWP 4679)]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.4.ppc64le libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.ppc64le libibverbs-1.2.1mlnx1-OFED.3.4.0.1.4.34200.ppc64le libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34200.ppc64le libmlx5-1.2.1mlnx1-OFED.3.4.1.0.0.34200.ppc64le libnl-1.1.4-3.el7.ppc64le numactl-libs-2.0.9-6.el7_2.ppc64le opensm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34200.ppc64le
Breakpoint 1, main (argc=9, argv=0x3fffffffbf88)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp:67
67 IO io(settings, mpiHeatTransferComm);
(gdb) p ht.m_s
$1 = (const Settings &) @0x3fffffffb9b8: {configfile = {static npos =
<optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x10482668 "heat_bpfile.xml"}},
outputfile = {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x1047f968 "heat"}}, npx = 1,
npy = 1, ndx = 10, ndy = 10, steps = 1, iterations = 1, gndx = 10,
gndy = 10, posx = 0, posy = 0, offsx = 0, offsy = 0, rank = 0, nproc = 1,
rank_left = -1, rank_right = -1, rank_up = -1, rank_down = -1, async = false}
(gdb) next
69 ht.init(false);
(gdb) p ht.m_s
$2 = (const Settings &) @0x3fffffffb9b8: {configfile = {
static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x3fffffffbab8 "h&H\020"}},
outputfile = {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x104b1970 "\230]g\267\377?"}},
npx = 1, npy = 1, ndx = 0, ndy = 0, steps = 0, iterations = 0, gndx = 0,
gndy = 0, posx = 0, posy = 0, offsx = 0, offsy = 0, rank = 0, nproc = 0,
rank_left = 0, rank_right = 0, rank_up = 0, rank_down = 0, async = false}
(gdb) continue
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x000000001002ee48 in HeatTransfer::init (this=0x3fffffffbb18,
init_with_rank=false)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/HeatTransfer.cpp:68
68 sin(2 * y) + sin(y);
(gdb) p m_s
$3 = (const Settings &) @0x3fffffffb9b8: {configfile = {
static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x7ff8000000000000 <Address 0x7ff8000000000000 out of bounds>}},
outputfile = {static npos = <optimized out>,
_M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x3fffffffba40 "\200\273\377\377\377?"}}, npx = 0, npy = 1,
ndx = 268627516, ndy = 0, steps = 268927232, iterations = 0, gndx = 0,
gndy = 0, posx = 12, posy = 0, offsx = 0, offsy = 2146435072, rank = 0,
nproc = 2146435072, rank_left = 0, rank_right = 2146435072, rank_up = 0,
rank_down = 2146435072, async = 24}
Notice that the values of ht.m_s.ndx and ht.m_s.ndy are different after the IO constructor is called. They have changed again by the time the code enters HeatTransfer::init, and the array index goes out of bounds at line 66, leading to the segfault.
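For anyone reproducing this, one cheap way to turn the silent out-of-bounds write into a clear error is to validate the dimensions before the fill loop. The sketch below is an illustration only: the `Grid` alias, the `checkedFill` helper, and the choice of x = i, y = j are my own assumptions, not the example's actual code. Only the trigonometric expression is taken from lines 66-68 of HeatTransfer.cpp as shown in the debugger.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical grid type; the real example may store m_T1 differently.
using Grid = std::vector<std::vector<double>>;

// Hypothetical helper: fill the first ndx x ndy cells of T, but refuse to
// run if ndx/ndy do not fit the allocated storage (e.g. after corruption).
void checkedFill(Grid &T, unsigned int ndx, unsigned int ndy)
{
    if (T.size() < ndx)
        throw std::out_of_range("ndx exceeds the number of allocated rows");
    for (std::size_t i = 0; i < ndx; ++i)
    {
        if (T[i].size() < ndy)
            throw std::out_of_range("ndy exceeds the allocated row length");
    }

    for (std::size_t i = 0; i < ndx; ++i)
    {
        const double x = static_cast<double>(i); // assumption: x = i
        for (std::size_t j = 0; j < ndy; ++j)
        {
            const double y = static_cast<double>(j); // assumption: y = j
            // Debug-build double check of the bounds validated above.
            assert(i < T.size() && j < T[i].size());
            T[i][j] = std::cos(8 * x) + std::cos(6 * x) - std::cos(4 * x) +
                      std::cos(2 * x) - std::cos(x) + std::sin(8 * y) -
                      std::sin(6 * y) + std::sin(4 * y) - std::sin(2 * y) +
                      std::sin(y);
        }
    }
}
```

With a 10 x 10 grid, `checkedFill(T, 10, 10)` fills normally, while a corrupted value such as `ndx = 268627516` would throw `std::out_of_range` instead of writing past the array.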
Another debugger output, placing a watch on the address of ndx:
-bash-4.2$ mpirun -n 1 gdb heatTransfer_write_adios2
Reading symbols from /autofs/nccs-svm1_home1/pdavis/summit/ADIOS2.main/ADIOS2/build/bin/heatTransfer_write_adios2...done.
(gdb) break main.cpp:67
Breakpoint 1 at 0x1002dfe4: file /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp, line 67.
(gdb) run heat_bpfile.xml heat 1 1 10 10 1 1
Starting program: /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/build/bin/heatTransfer_write_adios2 heat_bpfile.xml heat 1 1 10 10 1 1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x3fffb6bff1b0 (LWP 9642)]
[New Thread 0x3fffb63ff1b0 (LWP 9643)]
Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7_3.4.ppc64le libibumad-1.3.10.2.MLNX20150406.966500d-0.1.34100.ppc64le libibverbs-1.2.1mlnx1-OFED.3.4.0.1.4.34200.ppc64le libmlx4-1.2.1mlnx1-OFED.3.4.0.0.4.34200.ppc64le libmlx5-1.2.1mlnx1-OFED.3.4.1.0.0.34200.ppc64le libnl-1.1.4-3.el7.ppc64le numactl-libs-2.0.9-6.el7_2.ppc64le opensm-libs-4.8.0.MLNX20161013.9b1a49b-0.1.34200.ppc64le
Breakpoint 1, main (argc=9, argv=0x3fffffffbf88)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp:67
67 IO io(settings, mpiHeatTransferComm);
(gdb) p &ht.m_s.ndx
$1 = (unsigned int *) 0x3fffffffb9d0
(gdb) watch *(unsigned int *) 0x3fffffffb9d0
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
(gdb) continue
Continuing.
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
Old value = 10
New value = 0
0x0000000010031fa8 in IO::IO (this=0x3fff0000000c, s=..., comm=0x1047f968)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/IO_adios2.cpp:22
22 IO::IO(const Settings &s, MPI_Comm comm)
(gdb)
Continuing.
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
Old value = 0
New value = 268627224
0x00003fffb7fcf22c in _dl_runtime_resolve () from /lib64/ld64.so.2
(gdb) bt
#0 0x00003fffb7fcf22c in _dl_runtime_resolve () from /lib64/ld64.so.2
#1 0x000000001002ed18 in HeatTransfer::init (this=0x3fffffffbb18,
init_with_rank=false)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/HeatTransfer.cpp:66
#2 0x000000001002e014 in main (argc=9, argv=0x3fffffffbf88)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp:69
(gdb) continue
Continuing.
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
Old value = 268627224
New value = 268627260
0x00003fffb726c948 in cos () from /lib64/libm.so.6
(gdb) bt
#0 0x00003fffb726c948 in cos () from /lib64/libm.so.6
#1 0x000000001002ed3c in HeatTransfer::init (this=0x3fffffffbb18,
init_with_rank=false)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/HeatTransfer.cpp:66
#2 0x000000001002e014 in main (argc=9, argv=0x3fffffffbf88)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/main.cpp:69
(gdb) f 1
#1 0x000000001002ed3c in HeatTransfer::init (this=0x3fffffffbb18,
init_with_rank=false)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/HeatTransfer.cpp:66
66 m_T1[i][j] = cos(8 * x) + cos(6 * x) - cos(4 * x) + cos(2 * x) -
(gdb) p m_s.ndx
$2 = 268627260
(gdb) next
Single stepping until exit from function cos,
which has no line number information.
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
Old value = 268627260
New value = 268627300
0x00003fffb726c948 in cos () from /lib64/libm.so.6
(gdb)
Single stepping until exit from function cos,
which has no line number information.
Hardware watchpoint 2: *(unsigned int *) 0x3fffffffb9d0
Old value = 268627300
New value = 268627328
0x00003fffb726c948 in cos () from /lib64/libm.so.6
(gdb)
Single stepping until exit from function cos,
which has no line number information.
HeatTransfer::init (this=0x3fffffffbb18, init_with_rank=false)
at /ccs/home/pdavis/summit/ADIOS2.main/ADIOS2/examples/heatTransfer/write/HeatTransfer.cpp:67
67 cos(x) + sin(8 * y) - sin(6 * y) + sin(4 * y) -
(gdb) quit
A debugging session is active.
Inferior 1 [process 9613] will be killed.
Quit anyway? (y or n) [answered Y; input not from terminal]
It looks like ht.m_s is being stomped on by stack frames, given that even a call to cos() is changing ht.m_s.ndx. I'm not very good at debugging the C++ runtime though, so I could easily be misinterpreting what I'm seeing.
I don't see this crash on an Ubuntu VM with GCC 5.4.
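To make the suspicion above concrete: gdb prints ht.m_s as a `const Settings &`, so the object holds a reference rather than its own copy, and any write to the referenced storage shows through. The sketch below is a minimal, well-defined illustration of that aliasing, not the actual ADIOS2 code. The two-field `Settings` struct and the `SolverByRef` / `SolverByValue` classes are hypothetical stand-ins, and the plain assignment to `s.ndx` stands in for whatever is overwriting the field in the real program, which I have not identified.

```cpp
#include <cassert>

// Hypothetical, reduced stand-in for the example's Settings.
struct Settings
{
    unsigned int ndx;
    unsigned int ndy;
};

// Holds a reference: m_s aliases the caller's object, as ht.m_s appears to.
class SolverByRef
{
public:
    explicit SolverByRef(const Settings &s) : m_s(s) {}
    unsigned int ndx() const { return m_s.ndx; }
    const Settings *settingsAddress() const { return &m_s; }

private:
    const Settings &m_s;
};

// Holds a copy: later writes to the caller's object are not visible.
class SolverByValue
{
public:
    explicit SolverByValue(const Settings &s) : m_s(s) {}
    unsigned int ndx() const { return m_s.ndx; }

private:
    const Settings m_s;
};

// Returns true if the reference member observes a later write to the
// original object while the by-value member keeps the original value.
bool referenceSeesLaterWrites()
{
    Settings s{10, 10};
    SolverByRef byRef(s);
    SolverByValue byValue(s);

    s.ndx = 0; // stand-in for the unexplained overwrite seen in gdb

    // The reference member points at the very same storage as s.
    assert(byRef.settingsAddress() == &s);
    return byRef.ndx() == 0 && byValue.ndx() == 10;
}
```

If the real cause does turn out to be the referenced storage being reused, keeping a copy of the settings would sidestep it, but that is a guess on my part until someone confirms where the reference actually points.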