SST Errors With Scaling
Created by: stevenwalton
I've been testing my Ascent code (an ECP project) on Rhea and have found some interesting results. When I tried to start "small" with 64 writer processors, I noticed that I wouldn't always get results back and that the SST readers would crash.
So I started smaller and worked my way up. I would submit only the writer job, and I had two different readers that I ran on a login node (logging into a specific node to avoid load balancing). Each reader uses only one processor, to minimize both the variables involved and the load on the login node.
With the writer running on 2, 4, 8, or 16 processors, I get data back. With 32 processors, my second reader (the one that handles less data) exits correctly, but the first reader hangs. With 64 processors, qsub reports that my program timed out and I get no data from the readers (which in that case were also submitted with the job). The readers are supposed to write their output after the final iteration they receive.
These readers use StepStatus to check whether they should move on to the next step:
while ( sstReader.BeginStep(adios2::StepMode::NextAvailable) == adios2::StepStatus::OK)
ABOUT RHEA: 512 nodes, each with [2x] Intel® Xeon® E5-2650 @ 2.0 GHz – 8 cores, 16 HT and 128 GB RAM.
Could this be connected to core count?
Side note: I am also writing BPFiles during these runs and have no problems with them, even at the 64-processor level.