Dask issue: job cancelled

Created by: dyang37

Recently I've been running into an issue where nodes get cancelled while performing multi-node computation with Dask. Here's the error log from a cancelled node:

2022-05-11 13:46:54,342 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.32.195:46094'
2022-05-11 13:46:54,973 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ji_0ekoe', purging
2022-05-11 13:46:54,978 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ioqwzjv6', purging
2022-05-11 13:46:54,982 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-tfitaup9', purging
2022-05-11 13:46:54,987 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-_1gtjqme', purging
2022-05-11 13:46:55,033 - distributed.worker - INFO -       Start worker at:  tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO -          Listening to:  tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO -          dashboard at:        172.18.32.195:43611
2022-05-11 13:46:55,033 - distributed.worker - INFO - Waiting to connect to:   tcp://172.18.32.86:35988
2022-05-11 13:46:55,033 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,034 - distributed.worker - INFO -               Threads:                          1
2022-05-11 13:46:55,034 - distributed.worker - INFO -                Memory:                  59.60 GiB
2022-05-11 13:46:55,034 - distributed.worker - INFO -       Local Directory: /scratch/brown/yang1467/dask-worker-space/worker-l1vcmet_
2022-05-11 13:46:55,034 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.worker - INFO -         Registered to:   tcp://172.18.32.86:35988
2022-05-11 13:46:55,054 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.core - INFO - Starting established connection
slurmstepd: error: *** JOB 15619420 ON brown-a145 CANCELLED AT 2022-05-11T13:55:09 ***

I can confirm that this is not due to a short death_timeout, as I set death_timeout to 1200 seconds, while the node cancellation happens fairly early (roughly 5 minutes after I got the nodes).
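For reference, here is a minimal sketch of how the workers are launched, assuming dask-jobqueue's SLURMCluster; the cores, memory, walltime, and job count below are placeholder values for illustration, not my exact settings, but death_timeout is set as described above.

from dask_jobqueue import SLURMCluster
from distributed import Client

# Placeholder resources; only death_timeout reflects the setting described above.
cluster = SLURMCluster(
    cores=1,              # one thread per worker, matching the worker log above
    memory="60GB",
    walltime="01:00:00",
    death_timeout=1200,   # seconds a worker waits for the scheduler before shutting down
)
cluster.scale(jobs=12)    # request the worker jobs (placeholder count)
client = Client(cluster)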

Furthermore, I observed that a large fraction of the multi-node tasks get cancelled (see the sketch after the output below for how this status dump is produced):

<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-11>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-7>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-3>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-8>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-4>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-0>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-2>
{
index: 2
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:26:02
}
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-1>
{
index: 1
host: brown-a375.rcac.purdue.edu
pid: 57443
time: 14:26:41
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-9>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-5>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-6>
{
index: 6
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:37:04
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-10>
[0, 3, 4, 5, 7, 8, 9, 10, 11]
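For context, here is a minimal sketch of how the status dump above is produced, assuming the submitted futures are collected in a list named futures (a name used here only for illustration): print each future, print the result dict for finished ones, and collect the indices of the cancelled ones.

cancelled_indices = []
for i, fut in enumerate(futures):
    print(fut)
    if fut.status == "finished":
        print(fut.result())          # the dict returned by parallel_func
    elif fut.status == "cancelled":
        cancelled_indices.append(i)
print(cancelled_indices)             # e.g. [0, 3, 4, 5, 7, 8, 9, 10, 11]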