feat(webui): failed sectors dashboard with pipeline stage and failure details #995
Closed
e6f3f30 to fb54ac3
LexLuthr approved these changes on Feb 16, 2026
LexLuthr (Contributor) left a comment:
Please attach screenshots.
Can you please fail some sectors and get me some screenshots from that page? You can use devnet for testing and messing around. I use the same for easier UI testing and screenshots.
Add a new 'Failed Sectors' page to the Curio web UI that shows sectors which have failed during the sealing pipeline, with retry details.

Backend (web/api/webrpc/pipeline_failed.go):
- New FailedSectorDetail struct with pipeline stage booleans and failure info
- New PipelineFailedSectors RPC method querying sectors_sdr_pipeline WHERE failed = true, ordered by failed_at DESC, limited to 100

Frontend (web/static/pages/pipeline_failed/):
- index.html: page shell using curio-ux wrapper
- pipeline-failed.mjs: Lit component with 10s auto-refresh showing:
  - Miner ID, sector number, failure timestamp, last completed stage
  - Failure reason and details (truncated with tooltip)
  - Sector age since creation
  - Color-coded rows: red for <1h, orange for <24h failures
  - Green success message when no sectors have failed

Navigation (web/static/ux/curio-ux.mjs):
- Added 'Failed Sectors' nav item with warning triangle icon after PoRep
fb54ac3 to 5672e8d
The Failed Sectors tab was only showing sectors with failed=true, which is set in very few cases (precommit-check, past-start-epoch, alloc-check). Most real failures (CommitMsg failure, PoRep crash, etc.) cause the poller to reset pipeline flags for retry, but the harmony_task gets cleaned up, leaving a dangling task_id reference. These sectors are stuck in a retry loop and never progress.

Now detects both:
- Terminal failures (failed=true): shown as FAILED (red)
- Stuck sectors with missing tasks: shown as STUCK (amber)

The query checks for task_id references that point to tasks no longer in harmony_task, matching the same logic the pipeline view uses for its FAILED badges.
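The FAILED versus STUCK distinction can be sketched in memory as follows. The PR performs this check in SQL against harmony_task; the types and names here (sectorRow, classify, the liveTasks set) are hypothetical and only illustrate the rule that a dangling task reference marks a sector as stuck.

```go
package main

import "fmt"

// sectorRow is a hypothetical, simplified view of one pipeline row.
type sectorRow struct {
	Failed  bool    // the terminal failed flag
	TaskIDs []int64 // assumed: the non-null task_id references held by the row
}

// classify returns "FAILED" for terminal failures, "STUCK" when any task
// reference points at a task that no longer exists, and "OK" otherwise.
func classify(row sectorRow, liveTasks map[int64]bool) string {
	if row.Failed {
		return "FAILED"
	}
	for _, id := range row.TaskIDs {
		if !liveTasks[id] {
			return "STUCK" // dangling reference: the task was cleaned up
		}
	}
	return "OK"
}

func main() {
	live := map[int64]bool{10: true, 11: true}
	fmt.Println(classify(sectorRow{Failed: true}, live))           // FAILED
	fmt.Println(classify(sectorRow{TaskIDs: []int64{10, 99}}, live)) // STUCK
	fmt.Println(classify(sectorRow{TaskIDs: []int64{10, 11}}, live)) // OK
}
```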
Replace flat 100-row list with server-side grouped summary:
- PipelineFailedSectors returns groups by (status, stage, reason) with counts
- PipelineFailedSectorDetails returns paginated sectors per group
- No hard cap on total sectors: summary query is O(groups) not O(sectors)
- Click to expand group, lazy-loads first page, 'load more' for pagination
- Configurable page size (100/250/500)
- Scales to 10k+ sectors without payload bloat
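A minimal in-memory sketch of the summary-plus-pagination shape described in this commit. The PR does the grouping server-side in the database, so the response size scales with the number of groups; the version below scans a slice only to show the two result shapes. All type and function names (groupKey, summarize, page) are assumptions, not the PR's API.

```go
package main

import (
	"fmt"
	"sort"
)

// groupKey mirrors the (status, stage, reason) grouping from the commit message.
type groupKey struct{ Status, Stage, Reason string }

// sector is a hypothetical, simplified sector record.
type sector struct {
	Miner, Number int64
	Key           groupKey
}

type groupSummary struct {
	Key   groupKey
	Count int
}

// summarize returns one entry per group, largest groups first.
func summarize(sectors []sector) []groupSummary {
	counts := map[groupKey]int{}
	for _, s := range sectors {
		counts[s.Key]++
	}
	out := make([]groupSummary, 0, len(counts))
	for k, c := range counts {
		out = append(out, groupSummary{k, c})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		// Tie-break on the key so the order is deterministic.
		a, b := out[i].Key, out[j].Key
		if a.Status != b.Status {
			return a.Status < b.Status
		}
		if a.Stage != b.Stage {
			return a.Stage < b.Stage
		}
		return a.Reason < b.Reason
	})
	return out
}

// page returns up to limit sectors of one group, skipping the first offset.
func page(sectors []sector, key groupKey, offset, limit int) []sector {
	var out []sector
	skipped := 0
	for _, s := range sectors {
		if s.Key != key {
			continue
		}
		if skipped < offset {
			skipped++
			continue
		}
		if len(out) == limit {
			break
		}
		out = append(out, s)
	}
	return out
}

func main() {
	stuck := groupKey{"STUCK", "PoRep", "task missing"}
	failed := groupKey{"FAILED", "SDR", "alloc-check"}
	all := []sector{{1000, 1, stuck}, {1000, 2, failed}, {1000, 3, stuck}, {1000, 4, stuck}}
	for _, g := range summarize(all) {
		fmt.Println(g.Key.Status, g.Key.Stage, g.Count)
	}
	// "Load more": second page of the STUCK group with a page size of 2.
	fmt.Println(len(page(all, stuck, 2, 2)))
}
```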
Closing this: after testing, the existing Alert Manager combined with the pipeline dashboard already surfaces stuck/failed sectors well enough. Adding a separate tab creates redundancy without enough operational value to justify the extra surface area. May revisit if the need comes back.


Motivation
When sectors fail in the sealing pipeline, operators currently have to SSH into the database and manually query sectors_sdr_pipeline to understand what went wrong, at what stage, and how long sectors have been stuck. This is one of the most common day-2 operational pain points, especially for operators running hundreds of sectors where failures are routine.

This was identified during a codebase audit as a high-impact operator experience improvement.
Changes
1. Backend: New RPC method PipelineFailedSectors
New file web/api/webrpc/pipeline_failed.go adds a WebRPC method that queries the 100 most recent failed sectors from sectors_sdr_pipeline, returning failure info along with pipeline stage progress (after_* boolean flags showing exactly how far the sector got).

2. Frontend: New "Failed Sectors" page
New page at /pages/pipeline_failed/ with a Lit web component that:
- Calls PipelineFailedSectors every 10 seconds
- Uses the after_* flags to show the last completed pipeline stage (SDR → TreeD → TreeC → TreeR → PrecommitMsg → PoRep → Finalize → MoveStorage → CommitMsg)

3. Nav menu entry
Added "Failed Sectors" with a warning triangle icon to the sidebar, positioned after "PoRep" for natural pipeline flow.
No new dependencies
Uses existing WebRPC infrastructure (auto-registered via go-jsonrpc reflection), Lit from CDN (same as all other pages), and Bootstrap 5 dark theme.
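
Deriving the last completed stage from the after_* flags, as described under the frontend changes, might look like the sketch below. The stage order is the one listed in this description; the struct and function names are hypothetical, and stopping at the first unset flag (rather than taking the latest set flag) is an assumption about the intended behavior. The PR implements this in the Lit component, not in Go.

```go
package main

import "fmt"

// stageFlags is a hypothetical stand-in for the after_* booleans of one sector.
type stageFlags struct {
	SDR, TreeD, TreeC, TreeR, PrecommitMsg, PoRep, Finalize, MoveStorage, CommitMsg bool
}

// lastCompletedStage walks the stages in pipeline order and returns the last
// one completed before the first unset flag, or "none" if SDR never finished.
func lastCompletedStage(f stageFlags) string {
	ordered := []struct {
		name string
		done bool
	}{
		{"SDR", f.SDR},
		{"TreeD", f.TreeD},
		{"TreeC", f.TreeC},
		{"TreeR", f.TreeR},
		{"PrecommitMsg", f.PrecommitMsg},
		{"PoRep", f.PoRep},
		{"Finalize", f.Finalize},
		{"MoveStorage", f.MoveStorage},
		{"CommitMsg", f.CommitMsg},
	}
	last := "none"
	for _, s := range ordered {
		if !s.done {
			break // assumed: later flags are ignored once a stage is incomplete
		}
		last = s.name
	}
	return last
}

func main() {
	// TreeR never completed, so the last completed stage is TreeC.
	fmt.Println(lastCompletedStage(stageFlags{SDR: true, TreeD: true, TreeC: true}))
	fmt.Println(lastCompletedStage(stageFlags{}))
}
```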