Describe the bug
Dagu version 2.6.5, 5 workers on k8s
A sub-DAG fails with:
distributed run lease expired: worker ... accepted the task claim but stopped reporting to the owner coordinator
Although the workflow itself is correctly shown as 'failed', the sub-DAG still shows a step as 'running'.
To Reproduce
Steps to reproduce the behavior:
- Create a DAG with sub-DAGs
This happens with a workflow that has sub-DAGs and used to work fine. Note that it is NOT
consistently reproducible!
Expected behavior
The workflow with sub-DAGs completes successfully.
Actual behavior
The workflow fails with several sub-DAGs showing 'distributed run lease expired' errors, and each affected sub-DAG shows a step that is still running and cannot be stopped.
Environment
- Dagu version: 2.6.5
- OS: Ubuntu 24.04.4
- Go version:
- Installation method: Dagu Helm chart
Additional context
Worker log
time=2026-05-11T00:00:14.868+10:00 level=INFO msg="Task received, starting execution" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-index=61 run-id=H1chb4BobdmGwLzvZnD3ZsJy1ibK3y8Xzv6Kp8f9CnFa
time=2026-05-11T00:00:15.866+10:00 level=INFO msg="Executing task" operation=OPERATION_START target=vault-backup run-id=H1chb4BobdmGwLzvZnD3ZsJy1ibK3y8Xzv6Kp8f9CnFa root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e parent-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e worker-id=dagu-worker-general-57cdb8d8f9-cx7wx
time=2026-05-11T00:00:15.867+10:00 level=INFO msg="Creating temporary DAG file from definition" dag=vault-backup size=880
time=2026-05-11T00:00:15.867+10:00 level=INFO msg="Created temporary DAG file" file=/tmp/dagu/worker-dags/vault-backup-2332223915.yaml
time=2026-05-11T00:00:56.010+10:00 level=INFO msg="Distributed task execution finished" operation=OPERATION_START target=/tmp/dagu/worker-dags/vault-backup-2332223915.yaml run-id=H1chb4BobdmGwLzvZnD3ZsJy1ibK3y8Xzv6Kp8f9CnFa
time=2026-05-11T00:00:56.010+10:00 level=INFO msg="Task execution completed successfully" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-index=61 run-id=H1chb4BobdmGwLzvZnD3ZsJy1ibK3y8Xzv6Kp8f9CnFa
time=2026-05-11T00:01:10.857+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:01:48.750+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:01:55.358+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=2
time=2026-05-11T00:01:59.966+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:03:05.673+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=2
time=2026-05-11T00:03:07.057+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:17:23.662+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:17:27.457+10:00 level=INFO msg="Task polled successfully" run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw target=get-run-btbw-job-script worker-selector=map[] coordinator-id=dagu-coordinator-658b985f87-q56wn@50055
time=2026-05-11T00:17:27.457+10:00 level=INFO msg="Task received" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-id=f4414b19-bee6-4cab-a8a3-da7ce079f5af poller-index=95 root-dag-run-name=nightly-backups root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e parent-dag-run-name=firefly-backup parent-dag-run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw
time=2026-05-11T00:17:27.457+10:00 level=INFO msg="Task received, starting execution" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-index=95 run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw
time=2026-05-11T00:17:28.959+10:00 level=INFO msg="Executing task" operation=OPERATION_START target=get-run-btbw-job-script run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e parent-dag-run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt worker-id=dagu-worker-general-57cdb8d8f9-cx7wx
time=2026-05-11T00:17:28.959+10:00 level=INFO msg="Creating temporary DAG file from definition" dag=get-run-btbw-job-script size=986
time=2026-05-11T00:17:28.960+10:00 level=INFO msg="Created temporary DAG file" file=/tmp/dagu/worker-dags/get-run-btbw-job-script-2129259842.yaml
time=2026-05-11T00:17:32.264+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=1
time=2026-05-11T00:17:32.346+10:00 level=INFO msg="Distributed task execution finished" operation=OPERATION_START target=/tmp/dagu/worker-dags/get-run-btbw-job-script-2129259842.yaml run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw
time=2026-05-11T00:17:32.346+10:00 level=INFO msg="Task execution completed successfully" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-index=95 run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw
time=2026-05-11T00:25:17.262+10:00 level=WARN msg="Owner coordinator unreachable; cancelling distributed run" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx attempt-key=7c9f9101c1bda17b:4d1242 owner-id=dagu-coordinator-658b985f87-q56wn@50055 host=dagu-coordinator.default.svc.k8s.cluster port=50055
time=2026-05-11T00:25:17.262+10:00 level=INFO msg="Cancelling task per coordinator directive" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx attempt-key=7c9f9101c1bda17b:4d1242
time=2026-05-11T00:25:17.275+10:00 level=ERROR msg="Distributed task execution failed" operation=OPERATION_RETRY target=/tmp/dagu/worker-dags/nightly-backups-1897050239.yaml run-id=019e1230-0f02-76e5-be7f-beefcc69a53e err="command failed: signal: killed"
time=2026-05-11T00:25:17.275+10:00 level=ERROR msg="Task execution failed" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx poller-index=91 run-id=019e1230-0f02-76e5-be7f-beefcc69a53e err="command failed: signal: killed"
time=2026-05-11T00:25:19.263+10:00 level=WARN msg="Owner coordinator unreachable; cancelling distributed run" worker-id=dagu-worker-general-57cdb8d8f9-cx7wx attempt-key=7c9f9101c1bda17b:4d1242 owner-id=dagu-coordinator-658b985f87-q56wn@50055 host=dagu-coordinator.default.svc.k8s.cluster port=50055
time=2026-05-11T00:25:36.755+10:00 level=INFO msg="CoordinatorCli connection recovered" previous-consecutive-failures=9
Coordinator log
time=2026-05-11T00:17:00.307+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:00.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:01.172+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:01.653+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:02.136+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:02.662+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:03.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:03.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:04.106+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:04.604+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:05.238+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:05.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:06.204+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:06.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:07.423+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:07.953+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:08.453+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:08.653+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:09.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:09.754+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:10.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:10.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:11.254+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:11.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:12.353+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:12.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:13.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:13.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:14.124+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:14.655+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:15.190+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:15.717+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:16.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:16.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:17.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:17.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:18.253+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:18.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:19.254+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4 attempt-key=eb911191a6cf3035:ce5929 err="dag-run ID not found: 8cgSWd7xL8PfMCwfg5M6PFQ7dyg4k5XhxQbzQJbUEry4"
time=2026-05-11T00:17:19.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:20.753+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:21.057+10:00 level=INFO msg="Handler Dispatch called" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR target=docmost-backup operation=OPERATION_START
time=2026-05-11T00:17:21.261+10:00 level=INFO msg="Handler Dispatch called" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt target=firefly-backup operation=OPERATION_START
time=2026-05-11T00:17:21.359+10:00 level=INFO msg="Handler Dispatch called" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW target=nextcloud-backup operation=OPERATION_START
time=2026-05-11T00:17:21.459+10:00 level=INFO msg="Handler Dispatch called" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf target=forgejo-backup operation=OPERATION_START
time=2026-05-11T00:17:21.854+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:22.354+10:00 level=WARN msg="Failed to resolve latest attempt while checking cancellation" run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=7c9f9101c1bda17b:4d1242 err="failed to parse status file: failed to read status file: context canceled"
time=2026-05-11T00:17:22.753+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt dag=firefly-backup root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=7adb045f1407f834:43ee08
time=2026-05-11T00:17:22.803+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:23.948+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:24.248+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf dag=forgejo-backup root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=306f595b0801f1c9:62c319
time=2026-05-11T00:17:24.392+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:24.547+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW dag=nextcloud-backup root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=4d45d03af5b6614d:9a62c1
time=2026-05-11T00:17:24.609+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:24.947+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR dag=docmost-backup root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=063173100a8bb779:4e7123
time=2026-05-11T00:17:25.039+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:25.040+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T00:17:25.449+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="dag-run ID not found: vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW"
time=2026-05-11T00:17:25.695+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:25.853+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR attempt-key=063173100a8bb779:4e7123 err="dag-run ID not found: HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR"
time=2026-05-11T00:17:26.152+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T00:17:26.249+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:26.748+10:00 level=INFO msg="Handler Dispatch called" run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw target=get-run-btbw-job-script operation=OPERATION_START
time=2026-05-11T00:17:26.851+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:27.148+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw dag=get-run-btbw-job-script root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=1d48a6ac462b40cb:79cbac
time=2026-05-11T00:17:27.323+10:00 level=INFO msg="Handler Dispatch called" run-id=7ZkVJi7XXgjjatqzqVv1crE6MocB3FcDwRETPviu1VzJ target=get-run-btbw-job-script operation=OPERATION_START
time=2026-05-11T00:17:27.324+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T00:17:27.326+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="dag-run ID not found: vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW"
time=2026-05-11T00:17:27.374+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:27.653+10:00 level=INFO msg="Handler Dispatch called" run-id=FeFZ163ki7SVkNNC9re6Ka6Pk7RyqSaiP24CL2UuNMvz target=get-run-btbw-job-script operation=OPERATION_START
time=2026-05-11T00:17:27.957+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=7ZkVJi7XXgjjatqzqVv1crE6MocB3FcDwRETPviu1VzJ dag=get-run-btbw-job-script root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=af142eff60064f0a:e809cd
time=2026-05-11T00:17:28.061+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR attempt-key=063173100a8bb779:4e7123 err="dag-run ID not found: HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR"
time=2026-05-11T00:17:28.271+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:28.354+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T00:17:28.354+10:00 level=WARN msg="Failed to resolve latest attempt while checking cancellation" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="failed to parse status file: failed to read status file: context canceled"
time=2026-05-11T00:17:28.354+10:00 level=WARN msg="Failed to resolve latest attempt while checking cancellation" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR attempt-key=063173100a8bb779:4e7123 err="failed to find execution: context canceled"
time=2026-05-11T00:17:28.850+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw attempt-key=1d48a6ac462b40cb:79cbac err="dag-run ID not found: 3wCTWETWAnBy4dezW6AWnk2TVXx5aPEnnWU8PQuszcuw"
time=2026-05-11T00:17:29.054+10:00 level=INFO msg="Created sub-DAG attempt for distributed execution" run-id=FeFZ163ki7SVkNNC9re6Ka6Pk7RyqSaiP24CL2UuNMvz dag=get-run-btbw-job-script root-dag-run-id=019e1230-0f02-76e5-be7f-beefcc69a53e attempt-key=1f06c887a9f2a461:150db2
time=2026-05-11T00:17:29.111+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR attempt-key=063173100a8bb779:4e7123 err="dag-run ID not found: HEPpGG98YrDfNrzesTypmEKGc7yqzYyozkXpT87xG2ZR"
time=2026-05-11T00:17:29.162+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T00:17:29.210+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt attempt-key=7adb045f1407f834:43ee08 err="dag-run ID not found: HnMzpApZiY3DxQnXaeXqPTfenz3iamq1oxwLcNdfCLGt"
time=2026-05-11T00:17:29.363+10:00 level=INFO msg="Handler Dispatch called" run-id=5Qe48C2YDbMVxfSzCwFiULiSq5P9aC8jg2zA7tSUnSK4 target=get-run-btbw-job-script operation=OPERATION_START
time=2026-05-11T00:17:29.553+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=7ZkVJi7XXgjjatqzqVv1crE6MocB3FcDwRETPviu1VzJ attempt-key=af142eff60064f0a:e809cd err="dag-run ID not found: 7ZkVJi7XXgjjatqzqVv1crE6MocB3FcDwRETPviu1VzJ"
time=2026-05-11T00:17:29.553+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T00:17:30.054+10:00 level=WARN msg="Failed to repair stale distributed run failure after heartbeat" run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="dag-run ID not found: vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW"
The same log keeps spamming errors for the sub-DAGs that still show a running step, even though the workflow has already failed:
time=2026-05-11T10:26:07.503+10:00 level=ERROR msg="Failed to confirm stale distributed run from lease reconciliation" dag=nexus-backup run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T10:26:07.753+10:00 level=ERROR msg="Failed to confirm stale distributed run from lease reconciliation" dag=forgejo-backup run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T10:26:07.953+10:00 level=ERROR msg="Failed to confirm stale distributed run from lease reconciliation" dag=nextcloud-backup run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="dag-run ID not found: vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW"
time=2026-05-11T10:26:08.153+10:00 level=ERROR msg="Failed to confirm stale indexed distributed run" dag=nexus-backup run-id=BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J attempt-key=3c10bb42469723e6:6a9694 err="dag-run ID not found: BB6bQgMArwQ9zw3tQLNa4NPeDnd6rZNAjC3oyK9Hk8J"
time=2026-05-11T10:26:08.353+10:00 level=ERROR msg="Failed to confirm stale indexed distributed run" dag=forgejo-backup run-id=GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf attempt-key=306f595b0801f1c9:62c319 err="dag-run ID not found: GAsrU4MD8r1ybb1Xfw3D3YB5WCfNDHB7CyjqVHDTCQf"
time=2026-05-11T10:26:08.454+10:00 level=ERROR msg="Failed to confirm stale indexed distributed run" dag=nextcloud-backup run-id=vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW attempt-key=4d45d03af5b6614d:9a62c1 err="dag-run ID not found: vxiLF6jFxzJb4u3D5Ep4jRxgNJ61vWYpnTgsz3fLMGW"
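For triage, it may help to see which run IDs the coordinator keeps complaining about. Below is a minimal sketch, assuming the logs keep the key=value layout shown above. `parse_line` and `summarise` are hypothetical helper names for illustration and are not part of Dagu.

```python
import re
from collections import Counter

# Matches key=value or key="quoted value" pairs, as in the log excerpts above.
PAIR = re.compile(r'([\w-]+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Return the key/value fields of one log line as a dict."""
    return {key: value.strip('"') for key, value in PAIR.findall(line)}


def summarise(lines):
    """Count WARN and ERROR lines per (message, run-id) pair."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") in ("WARN", "ERROR"):
            counts[(fields.get("msg"), fields.get("run-id"))] += 1
    return counts
```

Calling `summarise` on the lines of a saved coordinator log and printing `counts.most_common()` would list the most frequently repeated message and run ID pairs first, which makes it easier to match the stuck sub-DAGs against the runs reported as 'dag-run ID not found'.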