feat(webui): failed sectors dashboard with pipeline stage and failure details#995

Closed
Reiers wants to merge 3 commits into main from feat/pipeline-retry-dashboard
Reiers commented Feb 15, 2026

Motivation

When sectors fail in the sealing pipeline, operators currently have to SSH into the database and manually query sectors_sdr_pipeline to understand what went wrong, at what stage, and how long sectors have been stuck. This is one of the most common day-2 operational pain points — especially for operators running hundreds of sectors where failures are routine.

This was identified during a codebase audit as a high-impact operator experience improvement.

Changes

1. Backend: New RPC method PipelineFailedSectors

New file web/api/webrpc/pipeline_failed.go adds a WebRPC method that queries the 100 most recent failed sectors from sectors_sdr_pipeline, returning:

  • Sector identity (SP ID, sector number)
  • Failure details (reason code, full error message, failure timestamp)
  • Pipeline stage progression (all after_* boolean flags showing exactly how far the sector got)
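As a rough illustration of the backend shape described above, the following is a minimal sketch and not the code from the PR. The struct field names and the column names (`failed_reason`, `failed_reason_msg`, `after_tree_d`, and so on) are assumptions; only the table name, the `failed = true` filter, the `failed_at DESC` ordering, and the limit of 100 come from this PR's description and commit message.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// FailedSectorDetail groups the fields the description lists: identity,
// failure details, and stage flags. All field names here are assumptions.
type FailedSectorDetail struct {
	SpID         int64
	SectorNumber int64

	FailedAt        *time.Time // nil when the column is NULL
	FailedReason    string
	FailedReasonMsg string

	AfterSDR          bool
	AfterTreeD        bool
	AfterTreeC        bool
	AfterTreeR        bool
	AfterPrecommitMsg bool
	AfterPoRep        bool
	AfterFinalize     bool
	AfterMoveStorage  bool
	AfterCommitMsg    bool
}

// failedSectorColumns holds assumed column names, one per struct field above.
var failedSectorColumns = []string{
	"sp_id", "sector_number",
	"failed_at", "failed_reason", "failed_reason_msg",
	"after_sdr", "after_tree_d", "after_tree_c", "after_tree_r",
	"after_precommit_msg", "after_porep", "after_finalize",
	"after_move_storage", "after_commit_msg",
}

// FailedSectorsQuery builds the SELECT for the most recent failed sectors.
func FailedSectorsQuery(limit int) string {
	return fmt.Sprintf(
		"SELECT %s FROM sectors_sdr_pipeline WHERE failed = true ORDER BY failed_at DESC LIMIT %d",
		strings.Join(failedSectorColumns, ", "), limit)
}

func main() {
	fmt.Println(FailedSectorsQuery(100))
}
```

In a real handler the query text would be passed to the project's database layer and scanned into a `[]FailedSectorDetail`; that wiring is omitted here because it depends on internal APIs not shown in this PR.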

2. Frontend: New "Failed Sectors" page

New page at /pages/pipeline_failed/ with a Lit web component that:

  • Polls PipelineFailedSectors every 10 seconds
  • Shows a table with columns: Miner, Sector #, Failed At, Stage, Reason, Details, Age
  • Stage is computed from the after_* flags to show the last completed pipeline stage (SDR → TreeD → TreeC → TreeR → PrecommitMsg → PoRep → Finalize → MoveStorage → CommitMsg)
  • Color-coded rows: red tint for failures <1h old, orange for <24h, default for older
  • Truncated error messages with full text on hover (tooltip)
  • Shows "No failed sectors 🎉" when the pipeline is clean
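The display rules in the list above can be sketched as three small pure functions. The actual component is a Lit module written in JavaScript; this sketch uses Go only to keep one language across the examples in this thread. The function names, the `"None"` label, and the choice to stop at the first unset flag are assumptions, not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// stageOrder lists the stages in the order given in the PR description.
var stageOrder = []string{
	"SDR", "TreeD", "TreeC", "TreeR", "PrecommitMsg",
	"PoRep", "Finalize", "MoveStorage", "CommitMsg",
}

// LastCompletedStage walks the stages in order and returns the last one whose
// flag is set, stopping at the first unset flag (an assumption about how gaps
// should be treated). It returns "None" if SDR itself has not completed.
func LastCompletedStage(done map[string]bool) string {
	last := "None"
	for _, stage := range stageOrder {
		if !done[stage] {
			break
		}
		last = stage
	}
	return last
}

// RowTint maps time since failure to a row colour: red under one hour,
// orange under 24 hours, and no tint (empty string) for anything older.
func RowTint(sinceFailure time.Duration) string {
	switch {
	case sinceFailure < time.Hour:
		return "red"
	case sinceFailure < 24*time.Hour:
		return "orange"
	default:
		return ""
	}
}

// Truncate shortens msg to at most max runes and appends an ellipsis when
// anything was cut; the full text would still go into the tooltip.
func Truncate(msg string, max int) string {
	runes := []rune(msg)
	if len(runes) <= max {
		return msg
	}
	return string(runes[:max]) + "…"
}

func main() {
	flags := map[string]bool{"SDR": true, "TreeD": true, "TreeC": true}
	fmt.Println(LastCompletedStage(flags))
	fmt.Println(RowTint(90 * time.Minute))
	fmt.Println(Truncate("porep failed: context deadline exceeded", 12))
}
```

Keeping these as pure functions of the row data means the 10-second poll only has to replace the data and re-render; no per-row state needs to survive between refreshes.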

3. Nav menu entry

Added "Failed Sectors" with a warning triangle icon to the sidebar, positioned after "PoRep" for natural pipeline flow.

No new dependencies

Uses existing WebRPC infrastructure (auto-registered via go-jsonrpc reflection), Lit from CDN (same as all other pages), and Bootstrap 5 dark theme.

Reiers requested a review from a team as a code owner February 15, 2026 20:41
Reiers force-pushed the feat/pipeline-retry-dashboard branch from e6f3f30 to fb54ac3 February 15, 2026 20:46
LexLuthr left a comment

Please attach screenshots

Reiers commented Feb 16, 2026

> Please attach screenshots

Don't have any failed sectors right now, so it's working... maybe
[Image: Screenshot 2026-02-16 at 15.13.08]

How it looks in the sidebar menu:
[Image: Screenshot 2026-02-16 at 15.13.44]

LexLuthr commented

Can you please fail some sectors and get me some screenshots from that page? You can use devnet for testing and messing around. I use the same for easier UI testing and screenshots.

Add a new 'Failed Sectors' page to the Curio web UI that shows sectors
which have failed during the sealing pipeline, with retry details.

Backend (web/api/webrpc/pipeline_failed.go):
- New FailedSectorDetail struct with pipeline stage booleans and failure info
- New PipelineFailedSectors RPC method querying sectors_sdr_pipeline
  WHERE failed = true, ordered by failed_at DESC, limited to 100

Frontend (web/static/pages/pipeline_failed/):
- index.html: page shell using curio-ux wrapper
- pipeline-failed.mjs: Lit component with 10s auto-refresh showing:
  - Miner ID, sector number, failure timestamp, last completed stage
  - Failure reason and details (truncated with tooltip)
  - Sector age since creation
  - Color-coded rows: red for <1h, orange for <24h failures
  - Green success message when no sectors have failed

Navigation (web/static/ux/curio-ux.mjs):
- Added 'Failed Sectors' nav item with warning triangle icon after PoRep

Reiers force-pushed the feat/pipeline-retry-dashboard branch from fb54ac3 to 5672e8d February 18, 2026 10:47

The Failed Sectors tab was only showing sectors with failed=true, which is
set in very few cases (precommit-check, past-start-epoch, alloc-check).

Most real failures (CommitMsg failure, PoRep crash, etc.) cause the poller
to reset pipeline flags for retry, but the harmony_task gets cleaned up —
leaving a dangling task_id reference. These sectors are stuck in a retry
loop and never progress.

Now detects both:
- Terminal failures (failed=true) — shown as FAILED (red)
- Stuck sectors with missing tasks — shown as STUCK (amber)

The query checks for task_id references that point to tasks no longer in
harmony_task, matching the same logic the pipeline view uses for its
FAILED badges.
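
The two-way detection described in this commit message can be sketched as a small classifier. This is an in-memory illustration under stated assumptions, not the SQL from the commit: the `Sector` type, the `Classify` function, and the `OK` status are hypothetical, and the set of live task IDs stands in for a lookup against `harmony_task`.

```go
package main

import "fmt"

// Status is the badge shown for a sector. FAILED and STUCK come from the
// commit message; OK is added here only so the function is total.
type Status string

const (
	StatusOK     Status = "OK"
	StatusFailed Status = "FAILED"
	StatusStuck  Status = "STUCK"
)

// Sector is a hypothetical, reduced view of a sectors_sdr_pipeline row.
type Sector struct {
	Failed  bool    // terminal failure flag (failed = true)
	TaskIDs []int64 // non-NULL task_id references held by the row
}

// Classify returns FAILED for terminal failures, STUCK when any referenced
// task ID is absent from the live set (a dangling reference), and OK otherwise.
func Classify(s Sector, liveTasks map[int64]bool) Status {
	if s.Failed {
		return StatusFailed
	}
	for _, id := range s.TaskIDs {
		if !liveTasks[id] {
			return StatusStuck
		}
	}
	return StatusOK
}

func main() {
	live := map[int64]bool{101: true, 102: true}
	fmt.Println(Classify(Sector{Failed: true}, live))
	fmt.Println(Classify(Sector{TaskIDs: []int64{101}}, live))
	fmt.Println(Classify(Sector{TaskIDs: []int64{999}}, live))
}
```

In SQL the same check would typically be an anti-join from the pipeline row's task ID columns to `harmony_task`, so that no task list has to be loaded into memory.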

Replace the flat 100-row list with a server-side grouped summary:
- PipelineFailedSectors returns groups by (status, stage, reason) with counts
- PipelineFailedSectorDetails returns paginated sectors per group
- No hard cap on total sectors: the summary response is O(groups), not O(sectors)
- Click to expand group, lazy-loads first page, 'load more' for pagination
- Configurable page size (100/250/500)
- Scales to 10k+ sectors without payload bloat
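
The summary-plus-details split can be illustrated with an in-memory sketch. In the PR this work is done server-side in the database; the types, function names, and sort order below are assumptions chosen for the example, and only the grouping key (status, stage, reason) and the offset/limit paging idea come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// GroupKey is the (status, stage, reason) tuple the summary groups by.
type GroupKey struct {
	Status, Stage, Reason string
}

// Row is a hypothetical per-sector record with its precomputed group key.
type Row struct {
	SpID, SectorNumber int64
	Key                GroupKey
}

// Group is one line of the summary: a key and how many sectors share it.
type Group struct {
	Key   GroupKey
	Count int
}

// Summarize counts rows per key. The result size depends on the number of
// distinct keys, not on the number of sectors.
func Summarize(rows []Row) []Group {
	counts := map[GroupKey]int{}
	for _, r := range rows {
		counts[r.Key]++
	}
	out := make([]Group, 0, len(counts))
	for k, c := range counts {
		out = append(out, Group{Key: k, Count: c})
	}
	// Largest groups first; ties broken by key for a stable order (assumed).
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		a, b := out[i].Key, out[j].Key
		if a.Status != b.Status {
			return a.Status < b.Status
		}
		if a.Stage != b.Stage {
			return a.Stage < b.Stage
		}
		return a.Reason < b.Reason
	})
	return out
}

// Page returns one page of rows for a group and whether more rows remain,
// which is what a "load more" button needs to know.
func Page(rows []Row, key GroupKey, offset, limit int) ([]Row, bool) {
	var matched []Row
	for _, r := range rows {
		if r.Key == key {
			matched = append(matched, r)
		}
	}
	if offset < 0 {
		offset = 0
	}
	if limit <= 0 || offset >= len(matched) {
		return nil, false
	}
	end := offset + limit
	if end > len(matched) {
		end = len(matched)
	}
	return matched[offset:end], end < len(matched)
}

func main() {
	stuck := GroupKey{Status: "STUCK", Stage: "PoRep", Reason: "task missing"}
	failed := GroupKey{Status: "FAILED", Stage: "SDR", Reason: "alloc-check"}
	rows := []Row{{1000, 1, stuck}, {1000, 2, stuck}, {1000, 3, failed}}
	fmt.Println(Summarize(rows))
	page, more := Page(rows, stuck, 0, 1)
	fmt.Println(page, more)
}
```

The first call is cheap enough to poll; the second runs only when an operator expands a group, which is what keeps the payload small at 10k+ sectors.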
Reiers commented Feb 18, 2026

Closing this — after testing, the existing Alert Manager combined with the pipeline dashboard already surfaces stuck/failed sectors well enough. Adding a separate tab creates redundancy without enough operational value to justify the extra surface area. May revisit if the need comes back.

Reiers closed this Feb 18, 2026