Description
Problem
When a node is cordoned (unschedulable=true), the Harmony scheduler stops picking up all new tasks, including the next stage for sectors already mid-pipeline on that machine.
Example: a sector finishes SDR on a cordoned node → TreeD won't be scheduled, even though the cache files are on this machine and no other node can do the work. The pipeline is stuck.
The only task with a workaround is Finalize, which has SchedulingOverrides for batch tasks.
Expected Behavior
Cordon should mean "stop accepting new work, but finish what you started." Specifically:
- New pipelines (new SDR tasks, new CC sectors via IAmBored) → blocked ✅ (works today)
- Running tasks → allowed to complete ✅ (works today)
- Next stage for in-progress pipelines on this node → should be allowed ❌ (broken today)
Current Workaround
Disable early-stage tasks individually (EnableSealSDR = false, etc.) while keeping later stages enabled. This allows in-progress pipelines to drain but is manual and error-prone.
Root Cause
In pollerTryAllWork(), when schedulable=false:
- All handlers without SchedulingOverrides are skipped
- followWorkInDB() is also skipped (the continue on line ~299)
- The SealPoller still creates tasks in DB, but the Harmony poller won't claim them
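To make the failure mode concrete, here is a minimal, simplified model of the skip logic described above. It is not the actual pollerTryAllWork() code: the handler type, the hasOverride field, and the claimable function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// handler is a hypothetical stand-in for a Harmony task handler.
type handler struct {
	name        string
	hasOverride bool // true if the task type defines SchedulingOverrides
}

// claimable returns the names of the handlers the poller would still
// consider on this node. It models the behaviour described in this issue,
// not the real implementation.
func claimable(schedulable bool, handlers []handler) []string {
	var out []string
	for _, h := range handlers {
		if !schedulable && !h.hasOverride {
			// Cordoned: the handler is skipped no matter where the
			// sector's data lives.
			continue
		}
		out = append(out, h.name)
	}
	return out
}

func main() {
	hs := []handler{{"SDR", false}, {"TreeD", false}, {"Finalize", true}}
	fmt.Println(claimable(true, hs))  // all three handlers are considered
	fmt.Println(claimable(false, hs)) // only Finalize survives the cordon
}
```

In this model TreeD is dropped on a cordoned node even when the SDR output for that sector is local, which is the stuck-pipeline case from the example.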
Possible Approach
Extend the SchedulingOverrides pattern: when cordoned, allow scheduling a task if the sector's data already resides on this machine (i.e., an earlier pipeline stage was completed here). This could be implemented by:
- Checking if any related pipeline task was previously completed by this node (via harmony_task_history)
- Or checking if the sector's storage paths are on this machine
- Or adding a pipeline-aware "drain mode" flag that's distinct from full cordon
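The first option could look roughly like the sketch below. Everything here is an assumption for illustration: pendingTask, localHistory, and mayClaim are hypothetical names, and the map stands in for whatever lookup against harmony_task_history (or storage paths) the real implementation would use.

```go
package main

import "fmt"

// pendingTask is a hypothetical view of a task waiting to be claimed.
type pendingTask struct {
	name        string
	sectorID    int64
	hasOverride bool // task type defines SchedulingOverrides
}

// localHistory is a hypothetical lookup of sector IDs for which this node
// already completed an earlier pipeline stage (for example, derived from
// harmony_task_history).
type localHistory map[int64]bool

// mayClaim decides whether this node may claim the task.
// Schedulable nodes and override-capable tasks behave as they do today;
// a cordoned node additionally accepts follow-up work for sectors it has
// already started, so in-progress pipelines can drain.
func mayClaim(schedulable bool, t pendingTask, done localHistory) bool {
	if schedulable || t.hasOverride {
		return true
	}
	return done[t.sectorID]
}

func main() {
	done := localHistory{42: true} // SDR for sector 42 finished on this node

	treeD := pendingTask{name: "TreeD", sectorID: 42}
	newSDR := pendingTask{name: "SDR", sectorID: 43}

	fmt.Println(mayClaim(false, treeD, done))  // true: next stage of a local pipeline
	fmt.Println(mayClaim(false, newSDR, done)) // false: new pipeline stays blocked
}
```

This keeps the existing cordon semantics for new pipelines while letting the next stage run where the cache files already are. A separate "drain mode" flag would use the same check but leave the meaning of a full cordon unchanged.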
Related: pipeline-aware scheduling / anti-starvation improvements.