Setup
- Rancher version: v2.14.1
- Rancher UI Extensions: none (stock dashboard, release-2.14)
- Browser type & version: Chrome 140 (also reproduced on Safari 17.5)
Describe the bug
In Rancher Dashboard v2.14.1, list pages that go through fetchPageResources (Cluster Explorer → Workloads → Deployments, and other pages backed by paginated Steve list calls) can stay blank for up to ~71 seconds after a transient backend error, even after the backend has fully recovered.
The store action fetchPageResources in shell/plugins/steve/subscribe.js awaits backOff.recurse(...). The recurse method (newly added in v2.14.x in shell/utils/back-off.ts, alongside the existing non-blocking execute) runs an internal retry loop with await this.sleep(this.calcDelay(i)) between attempts. Because the caller is awaiting that promise, the awaited fetch path is blocked for the full retry budget rather than yielding to a background reconciliation.
With the default calcDelay(i) = i === 0 ? 1 : Math.pow(i, 2) * 250 and retries=10, the cumulative blocking time can reach approximately:
1 + 250 + 1000 + 2250 + 4000 + 6250 + 9000 + 12250 + 16000 + 20250 ≈ 71 s
Additionally, the ws.resource.changes handler was changed from fire-and-forget to await-style in the same release, which feeds the same blocking chain.
In v2.13 the equivalent code path used backOff.execute() (non-blocking, setTimeout based), so the same backend symptom did not block the initial render to the same degree.
To Reproduce
- Open Rancher v2.14.1 Dashboard against any cluster.
- Cause a transient backend error on a Steve list endpoint for ~10–20 seconds (any 5xx,
context canceled, or unknown-revision response). Easiest synthetic test: route Steve list responses through a local proxy that returns 503 for ~15 seconds.
- While the backend is failing, navigate to Cluster Explorer → Workloads → Deployments (or any list page that uses
fetchPageResources).
- Stop the proxy / let the backend recover.
Result
The Deployments page stays blank — no skeleton, no stale snapshot, no error toast — for tens of seconds after the backend has already started returning 200 again. Observed worst case in production: ~100 seconds blank with 7+ retries visible in the network panel.
The DevTools network panel shows successive GET /v1/apps.deployments?... calls separated by the calcDelay schedule above, all initiated from the same awaited fetchPageResources promise.
Expected Result
Either the page renders quickly with a stale snapshot + a non-blocking "retrying" indicator, or fetchPageResources resolves within a user-tolerable time on transient errors (consistent with v2.13 behavior).
Screenshots
N/A — reproducible from the network panel; no UI artifacts to capture beyond the blank page.
Additional context
We hit this in production on v2.14.1 against a large downstream cluster where cattle-cluster-agent (in steady-state operation, not restarting) was returning begin tx: context canceled on Steve listbyoptions calls — a separate steve / sqlcache writer-lock issue on large clusters with ui-sql-cache=true. The trigger we observe is cluster size: smaller downstream clusters in the same Rancher do not reproduce. The dashboard request for apps/v1, Kind=Deployment was returned with context canceled by Steve, the frontend then entered recurse, and the Deployments page stayed blank for ~100 seconds before finally rendering.
Code references (rancher/dashboard release-2.14):
shell/plugins/steve/subscribe.js — fetchPageResources action awaits backOff.recurse(...)
shell/utils/back-off.ts — recurse method, calcDelay formula, default retries=10
shell/plugins/steve/subscribe.js — ws.resource.changes handler is now awaited (changed from fire-and-forget in v2.13)
Setup
Describe the bug
In Rancher Dashboard v2.14.1, list pages that go through
fetchPageResources(Cluster Explorer → Workloads → Deployments, and other pages backed by paginated Steve list calls) can stay blank for up to ~71 seconds after a transient backend error, even after the backend has fully recovered.The store action
fetchPageResourcesinshell/plugins/steve/subscribe.jsawaitsbackOff.recurse(...). Therecursemethod (newly added in v2.14.x inshell/utils/back-off.ts, alongside the existing non-blockingexecute) runs an internal retry loop withawait this.sleep(this.calcDelay(i))between attempts. Because the caller is awaiting that promise, the awaited fetch path is blocked for the full retry budget rather than yielding to a background reconciliation.With the default
calcDelay(i) = i === 0 ? 1 : Math.pow(i, 2) * 250andretries=10, the cumulative blocking time can reach approximately:Additionally, the
ws.resource.changeshandler was changed from fire-and-forget toawait-style in the same release, which feeds the same blocking chain.In v2.13 the equivalent code path used
backOff.execute()(non-blocking,setTimeoutbased), so the same backend symptom did not block the initial render to the same degree.To Reproduce
context canceled, or unknown-revision response). Easiest synthetic test: route Steve list responses through a local proxy that returns 503 for ~15 seconds.fetchPageResources).Result
The Deployments page stays blank — no skeleton, no stale snapshot, no error toast — for tens of seconds after the backend has already started returning 200 again. Observed worst case in production: ~100 seconds blank with 7+ retries visible in the network panel.
The DevTools network panel shows successive
GET /v1/apps.deployments?...calls separated by thecalcDelayschedule above, all initiated from the same awaitedfetchPageResourcespromise.Expected Result
Either the page renders quickly with a stale snapshot + a non-blocking "retrying" indicator, or
fetchPageResourcesresolves within a user-tolerable time on transient errors (consistent with v2.13 behavior).Screenshots
N/A — reproducible from the network panel; no UI artifacts to capture beyond the blank page.
Additional context
We hit this in production on v2.14.1 against a large downstream cluster where
cattle-cluster-agent(in steady-state operation, not restarting) was returningbegin tx: context canceledon Stevelistbyoptionscalls — a separate steve / sqlcache writer-lock issue on large clusters withui-sql-cache=true. The trigger we observe is cluster size: smaller downstream clusters in the same Rancher do not reproduce. The dashboard request forapps/v1, Kind=Deploymentwas returned withcontext canceledby Steve, the frontend then enteredrecurse, and the Deployments page stayed blank for ~100 seconds before finally rendering.Code references (rancher/dashboard release-2.14):
shell/plugins/steve/subscribe.js—fetchPageResourcesactionawaitsbackOff.recurse(...)shell/utils/back-off.ts—recursemethod,calcDelayformula, defaultretries=10shell/plugins/steve/subscribe.js—ws.resource.changeshandler is now awaited (changed from fire-and-forget in v2.13)