Closed
Created May 11, 2022 by Yang, Diyu (@yang1467), Owner

Dask issue: job cancelled

Created by: dyang37

Recently I've been running into an issue where nodes get cancelled when performing multi-node computation with Dask. Here's the error log from a cancelled node:

2022-05-11 13:46:54,342 - distributed.nanny - INFO -         Start Nanny at: 'tcp://172.18.32.195:46094'
2022-05-11 13:46:54,973 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ji_0ekoe', purging
2022-05-11 13:46:54,978 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-ioqwzjv6', purging
2022-05-11 13:46:54,982 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-tfitaup9', purging
2022-05-11 13:46:54,987 - distributed.diskutils - INFO - Found stale lock file and directory '/scratch/brown/yang1467/dask-worker-space/worker-_1gtjqme', purging
2022-05-11 13:46:55,033 - distributed.worker - INFO -       Start worker at:  tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO -          Listening to:  tcp://172.18.32.195:44174
2022-05-11 13:46:55,033 - distributed.worker - INFO -          dashboard at:        172.18.32.195:43611
2022-05-11 13:46:55,033 - distributed.worker - INFO - Waiting to connect to:   tcp://172.18.32.86:35988
2022-05-11 13:46:55,033 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,034 - distributed.worker - INFO -               Threads:                          1
2022-05-11 13:46:55,034 - distributed.worker - INFO -                Memory:                  59.60 GiB
2022-05-11 13:46:55,034 - distributed.worker - INFO -       Local Directory: /scratch/brown/yang1467/dask-worker-space/worker-l1vcmet_
2022-05-11 13:46:55,034 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.worker - INFO -         Registered to:   tcp://172.18.32.86:35988
2022-05-11 13:46:55,054 - distributed.worker - INFO - -------------------------------------------------
2022-05-11 13:46:55,054 - distributed.core - INFO - Starting established connection
slurmstepd: error: *** JOB 15619420 ON brown-a145 CANCELLED AT 2022-05-11T13:55:09 ***

I can confirm that this is not due to a short death_timeout, since I set death_timeout to 1200 sec, while the node cancellation happens rather early (~5 min after I got the nodes).
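For reference, a minimal sketch of the kind of cluster setup involved, assuming dask_jobqueue's SLURMCluster (the resource values here are illustrative, not my exact configuration; only death_timeout=1200 reflects the run described above):

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

# Illustrative values for a multi-node SLURM run.
cluster = SLURMCluster(
    cores=1,               # one thread per worker, as in the log above
    memory="60GB",
    walltime="04:00:00",
    death_timeout=1200,    # 1200 s, so a short death_timeout is ruled out
)
cluster.scale(jobs=12)     # request multiple worker jobs / nodes
client = Client(cluster)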

Furthermore, I observed that a large chunk of the multi-node jobs gets cancelled:

<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-11>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-7>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-3>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-8>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-4>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-0>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-2>
{
index: 2
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:26:02
}
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-1>
{
index: 1
host: brown-a375.rcac.purdue.edu
pid: 57443
time: 14:26:41
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-9>
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-5>
<Future: finished, type: dict, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-6>
{
index: 6
host: brown-a390.rcac.purdue.edu
pid: 33870
time: 14:37:04
}
<Future: cancelled, key: parallel_func-497c8a35-73fe-455c-80f5-5fbe7a1d1f05-10>
[0, 3, 4, 5, 7, 8, 9, 10, 11]
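The last line is the list of indices whose futures ended up cancelled. A minimal sketch of how such a list can be collected for resubmission, assuming futures is the list returned by client.map over parallel_func (the function name in the keys above):

# Sketch: print each future and collect the indices of cancelled ones.
# Assumes `futures` is the list returned by client.map(parallel_func, ...).
cancelled_idx = []
for i, fut in enumerate(futures):
    print(fut)
    if fut.status == "cancelled":
        cancelled_idx.append(i)
    elif fut.status == "finished":
        print(fut.result())
print(cancelled_idx)  # e.g. [0, 3, 4, 5, 7, 8, 9, 10, 11]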