Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions documentation/en/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,7 @@
* [Garbage Collection](garbage-collection.md)
* [Best Practices](best-practices.md)
* [Administration](administration/README.md)
* [Node Maintenance & Cordoning](administration/node-maintenance.md)
* [YugabyteDB Backup](administration/yugabyte-backup.md)
* [YugabyteDB troubleshooting](administration/yugabyte-troubleshooting.md)
* [Indexing & CheckIndex troubleshooting](administration/indexing-checkindex-troubleshooting.md)
Expand Down
1 change: 1 addition & 0 deletions documentation/en/administration/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ This section covers administrative tasks for maintaining and managing your Curio

## Guides

* [Node Maintenance & Cordoning](node-maintenance.md) - How to safely cordon nodes for maintenance without disrupting sealing pipelines
* [YugabyteDB Backup](yugabyte-backup.md) - How to backup and restore your database
* [YugabyteDB Troubleshooting](yugabyte-troubleshooting.md) - Common YB migration/runtime issues and how to resolve them
* [Indexing / CheckIndex Troubleshooting](indexing-checkindex-troubleshooting.md) - Diagnose deal indexing and CheckIndex pipeline errors
78 changes: 78 additions & 0 deletions documentation/en/administration/node-maintenance.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
---
description: >-
How to safely cordon Curio nodes for maintenance without disrupting in-progress sealing pipelines.
---

# Node Maintenance & Cordoning

## What cordoning does

Running `curio cordon` (or setting `unschedulable = true` in the WebUI) tells the Harmony scheduler to **stop scheduling new tasks** on a node. Tasks that are already running will finish, but no new work will be picked up.

This is intentionally simple: cordon → wait for running tasks to finish → do maintenance → restart → uncordon.

## How cordoning affects the sealing pipeline

Cordoning blocks **all** new task scheduling on the node, including pipeline continuation tasks like TreeD, TreeRC, SyntheticProofs, and Finalize. If a sector's data is **location-bound** to the cordoned node (the sector cache lives on local storage), those follow-up tasks cannot run on another node either.

**What this means in practice:**

- If you cordon a node that has sectors mid-pipeline (e.g., SDR completed but TreeRC not yet started), those sectors will be **paused** until the node is uncordoned.
- The sectors are not lost — they will resume once the node is uncordoned and the scheduler picks them up again.
- However, if the node stays cordoned for too long, sectors may **expire** (miss their precommit deadline), wasting the SDR work.

{% hint style="warning" %}
**Non-batched sealing operators:** SDR takes many hours. If you cordon a node and leave it cordoned past the sector's precommit deadline, that SDR work is lost. Plan maintenance windows accordingly.
{% endhint %}

## Recommended workflows

### Quick maintenance (restart, upgrade, config change)

For short interruptions where downtime is minutes, not hours:

1. **Check the pipeline** — in the WebUI, verify what stages are in progress on the node.
2. `curio cordon <node>` — stop new work from being scheduled.
3. **Wait** for currently-running tasks to complete (watch the WebUI or logs).
4. Do your maintenance (restart, upgrade, etc.).
5. `curio uncordon <node>` — resume scheduling.

Pipeline sectors that were paused (waiting for their next stage) will resume automatically after uncordon.

### Planned extended maintenance (hours)

If the node will be down for an extended period:

1. **Stop starting new sectors** first — pause deal intake or CC sector creation so no new SDR work begins on the node.
2. **Wait for in-progress pipelines to clear** — let sectors finish through Finalize before cordoning. Monitor via the pipeline view in the WebUI.
3. `curio cordon <node>` once pipelines are drained.
4. Do your maintenance.
5. `curio uncordon <node>` when ready.

### Decommissioning a node or long outage

If you know a node will be offline for a long time (longer than precommit deadlines):

1. **Attach the node's storage to another node** — use `curio attach` on a different machine to make the sector data accessible elsewhere.
2. Other nodes with access to the storage paths can then pick up the remaining pipeline tasks.
3. Cordon and shut down the original node.

See [Storage Configuration](../storage-configuration.md) for details on attaching storage.

## Key points to remember

| Aspect | Behavior |
|--------|----------|
| Running tasks | Finish normally on the cordoned node |
| New task scheduling | Blocked — no new work starts |
| Location-bound pipeline tasks | Paused until uncordon (data is on local storage, other nodes can't run them) |
| Sector data | Safe — nothing is deleted or moved by cordoning |
| Precommit deadlines | Still apply — sectors can expire if cordoned too long |

## Common mistakes

- **Cordoning during active SDR without a plan to uncordon quickly.** SDR is the longest pipeline stage. If you cordon right after SDR completes, the follow-up stages (TreeD, TreeRC, etc.) are blocked. Either wait for the full pipeline to clear first, or ensure you'll uncordon in time.

- **Forgetting to uncordon after maintenance.** The node will sit idle, and any paused pipelines will remain stuck. Set a reminder or check the WebUI after maintenance.

- **Cordoning when you should be migrating storage.** If the node is going away permanently, cordon alone won't help — you need to move or re-attach the storage paths so other nodes can access the sector data.
13 changes: 13 additions & 0 deletions documentation/en/troubleshooting/common-errors.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,19 @@ Code reference (for maintainers): `tasks/piece/task_park_piece.go`.

---

### Pipeline

#### Sectors stuck mid-pipeline after cordoning a node

What it means:
- Cordoning blocks all new task scheduling, including location-bound pipeline stages (TreeD, TreeRC, SyntheticProofs, Finalize). If sector data lives on the cordoned node's local storage, no other node can pick up those tasks.

What to do:
- Uncordon the node to let pipelines resume, or attach storage to another node.
- See: [Node Maintenance & Cordoning](../administration/node-maintenance.md) for recommended workflows.

---

### Batch sealing

#### `panic: SupraSealInit: supraseal build tag not enabled`
Expand Down
Loading