Dask issue: job cancelled
Created by: dyang37
Recently I've been running into an issue where nodes get cancelled when performing multi-node computation with Dask. Here's the error log from a cancelled node:
2022-05-11 13:46:54,342 - distributed.nanny - INFO - Start Nanny at: 'tcp://172.18.32.195:46094'
2022-05-11 13:46:54,973 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ji_0ekoe', purging
2022-05-11 13:46:54,978 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ioqwzjv6', purging
2022-05-11 13:46:54,982 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-tfitaup9', purging
2022-05-11 13:46:54,987 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-_1gtjqme', purging
2022-05-11 13:46:55,033 - distributed.worker - INFO - Start worker at: tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO - Listening to: tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO - dashboard at: 172.18.32.195:43611
2022-05-11 13:46:55,033 - distributed.worker - INFO - Waiting to connect to: tcp://172.18.32.86:35988
2022-05-11 13:46:55,033 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,034 - distributed.worker - INFO - Threads: 1
2022-05-11 13:46:55,034 - distributed.worker - INFO - Memory: 59.60 GiB
2022-05-11 13:46:55,034 - distributed.worker - INFO - Local Directory: /scratch/brown/yang1467/dask-worker-space/worker-l1vcmet_
2022-05-11 13:46:55,034 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.worker - INFO - Registered to: tcp://172.18.32.86:35988
2022-05-11 13:46:55,054 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.core - INFO - Starting established connection
slurmstepd: error: *** JOB 15619420 ON brown-a145 CANCELLED AT 2022-05-11T13:55:09 ***
I can confirm that this is not due to a small death_timeout duration: I set death_timeout to 1200 seconds, while the node cancellation happens rather early (~5 minutes after I got the nodes).
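For reference, the timeout was raised via the dask-jobqueue configuration. A minimal sketch of what that setting looks like; every value other than death-timeout is a placeholder, not the actual job configuration:

```yaml
# ~/.config/dask/jobqueue.yaml -- sketch; only death-timeout reflects the issue above
jobqueue:
  slurm:
    cores: 1              # placeholder (worker log reports 1 thread)
    memory: 60GB          # placeholder (worker log reports ~59.6 GiB)
    walltime: '01:00:00'  # placeholder
    death-timeout: 1200   # seconds a worker waits for the scheduler before closing itself
```

The same value can be passed directly as SLURMCluster(..., death_timeout=1200).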
Furthermore, I observed that a large portion of the multi-node jobs get cancelled:
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-11>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-7>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-3>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-8>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-4>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-0>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-2>
{
index: 2
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:26:02
}
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-1>
{
index: 1
host: brown-a375.rcac.purdue.edu
pid: 57443
time: 14:26:41
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-9>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-5>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-6>
{
index: 6
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:37:04
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-10>
[0, 3, 4, 5, 7, 8, 9, 10, 11]
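The trailing list is presumably the indices of the cancelled futures. As a runnable stand-in, the same filtering can be sketched with the standard-library concurrent.futures API (distributed's Future exposes the same .cancelled() check; the task bodies here are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor
import time

with ThreadPoolExecutor(max_workers=1) as pool:
    # Occupy the single worker so the next submissions stay queued.
    blocker = pool.submit(time.sleep, 0.2)
    # Hypothetical stand-in for parallel_func: return a small status dict.
    futures = [pool.submit(lambda i=i: {"index": i}) for i in range(4)]
    # Cancel everything still queued behind the first submission.
    for f in futures[1:]:
        f.cancel()
    blocker.result()

# Collect the indices of the futures that ended up cancelled.
cancelled_indices = [i for i, f in enumerate(futures) if f.cancelled()]
```

With dask.distributed the loop would be over Client futures instead, but the status check is the same shape.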