[v2.14.1] fetchPageResources awaiting backOff.recurse blocks page render up to ~71s on transient backend errors #17819

**Setup**
- Rancher version: v2.14.1
- Rancher UI Extensions: none (stock dashboard, release-2.14)
- Browser type & version: Chrome 140 (also reproduced on Safari 17.5)

**Describe the bug**

In Rancher Dashboard v2.14.1, list pages that go through `fetchPageResources` (Cluster Explorer → Workloads → Deployments, and other pages backed by paginated Steve list calls) can stay blank for up to ~71 seconds after a transient backend error, even after the backend has fully recovered.

The store action `fetchPageResources` in [`shell/plugins/steve/subscribe.js`](https://github.com/rancher/dashboard/blob/release-2.14/shell/plugins/steve/subscribe.js) `await`s `backOff.recurse(...)`. The `recurse` method (newly added in v2.14.x in [`shell/utils/back-off.ts`](https://github.com/rancher/dashboard/blob/release-2.14/shell/utils/back-off.ts), alongside the existing non-blocking `execute`) runs an internal retry loop with `await this.sleep(this.calcDelay(i))` between attempts. Because the caller awaits that promise, the page's fetch does not resolve until a retry succeeds or the retry budget is exhausted, instead of returning early and letting a background retry reconcile the data later.

With the default `calcDelay(i) = i === 0 ? 1 : Math.pow(i, 2) * 250` (milliseconds) and `retries=10`, the sleeps alone can add up to approximately the following, before counting the time each request itself takes:

```
1 + 250 + 1000 + 2250 + 4000 + 6250 + 9000 + 12250 + 16000 + 20250 = 71251 ms ≈ 71 s
```
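
The arithmetic above can be checked with a short script. This is a minimal sketch that copies the `calcDelay` formula quoted in this issue; the helper `cumulativeDelayMs` is a hypothetical name, and the assumption that one sleep happens per attempt for `i = 0 .. retries - 1` is inferred from the ten terms in the sum, not taken from `back-off.ts`.

```typescript
// Delay schedule as quoted in this issue (values in milliseconds).
const calcDelay = (i: number): number => (i === 0 ? 1 : Math.pow(i, 2) * 250);

// Hypothetical helper: sums the sleeps for attempts i = 0 .. retries - 1.
// Assumes one sleep per attempt, which matches the ten terms listed above.
function cumulativeDelayMs(retries: number): number {
  let total = 0;
  for (let i = 0; i < retries; i++) {
    total += calcDelay(i);
  }
  return total;
}

console.log(cumulativeDelayMs(10)); // 71251 (ms), i.e. about 71 s of sleeping
```

Note that this counts sleep time only; each request's own latency or timeout comes on top of it.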

Additionally, the `ws.resource.changes` handler was changed from fire-and-forget to `await`-style in the same release, which feeds the same blocking chain.

In v2.13 the equivalent code path used `backOff.execute()` (non-blocking, `setTimeout` based), so the same backend symptom did not block the initial render to the same degree.
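
To illustrate the difference between the two styles, here is a simplified sketch. It is not the dashboard code: `retryBlocking` and `retryInBackground` are hypothetical stand-ins for the awaited `recurse` pattern and the `setTimeout`-based `execute` pattern described above, and sleeping before the first attempt is an assumption.

```typescript
type Task<T> = () => Promise<T>;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Blocking style (hypothetical stand-in for an awaited `recurse`):
// the caller's `await` covers every sleep and every attempt.
async function retryBlocking<T>(
  task: Task<T>,
  retries: number,
  delayMs: (i: number) => number
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < retries; i++) {
    await sleep(delayMs(i));
    try {
      return await task();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}

// Non-blocking style (hypothetical stand-in for a `setTimeout`-based `execute`):
// returns immediately; attempts are scheduled and the result arrives via callback.
function retryInBackground<T>(
  task: Task<T>,
  retries: number,
  delayMs: (i: number) => number,
  onDone: (value: T) => void,
  attempt = 0
): void {
  if (attempt >= retries) {
    return; // budget exhausted; a real implementation would surface an error
  }
  setTimeout(() => {
    task().then(onDone, () =>
      retryInBackground(task, retries, delayMs, onDone, attempt + 1)
    );
  }, delayMs(attempt));
}
```

With the first helper, a page that does `await retryBlocking(fetchList, 10, calcDelay)` cannot render until the loop finishes. With the second, the page can render right away and update when `onDone` fires.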

**To Reproduce**

1. Open Rancher v2.14.1 Dashboard against any cluster.
2. Cause a transient backend error on a Steve list endpoint for ~10–20 seconds (any 5xx, `context canceled`, or unknown-revision response). Easiest synthetic test: route Steve list responses through a local proxy that returns 503 for ~15 seconds.
3. While the backend is failing, navigate to Cluster Explorer → Workloads → Deployments (or any list page that uses `fetchPageResources`).
4. Stop the proxy / let the backend recover.
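
The synthetic test in step 2 could be set up with a small proxy along these lines. This is a rough sketch using Node's built-in `http` module, not a tested reproduction harness: `shouldFail` and `createFlakyProxy` are hypothetical names, the upstream is assumed to be reachable over plain HTTP (a typical Rancher install serves HTTPS, which would need the `https` module and certificate handling), and websocket upgrades are not forwarded.

```typescript
import * as http from "node:http";

// Hypothetical rule: fail Steve list calls (paths under /v1/) until failUntilMs.
function shouldFail(url: string, nowMs: number, failUntilMs: number): boolean {
  return url.startsWith("/v1/") && nowMs < failUntilMs;
}

// Hypothetical proxy: returns 503 during the failure window, otherwise
// forwards the request to the upstream host and streams the response back.
function createFlakyProxy(
  upstreamHost: string,
  upstreamPort: number,
  failUntilMs: number
): http.Server {
  return http.createServer((req, res) => {
    if (shouldFail(req.url ?? "", Date.now(), failUntilMs)) {
      res.writeHead(503, { "Content-Type": "text/plain" });
      res.end("synthetic 503");
      return;
    }
    const upstream = http.request(
      {
        host: upstreamHost,
        port: upstreamPort,
        method: req.method,
        path: req.url,
        headers: req.headers,
      },
      (upstreamRes) => {
        res.writeHead(upstreamRes.statusCode ?? 502, upstreamRes.headers);
        upstreamRes.pipe(res);
      }
    );
    upstream.on("error", () => {
      if (!res.headersSent) {
        res.writeHead(502);
      }
      res.end();
    });
    req.pipe(upstream);
  });
}

// Example usage (host and ports are placeholders):
// createFlakyProxy("127.0.0.1", 8080, Date.now() + 15_000).listen(8081);
```

Pointing the browser at the proxy port and opening the Deployments list inside the 15-second window should trigger the retry schedule described above.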

**Result**

The Deployments page stays blank — no skeleton, no stale snapshot, no error toast — for tens of seconds after the backend has already started returning 200 again. Observed worst case in production: ~100 seconds blank with 7+ retries visible in the network panel.

The DevTools network panel shows successive `GET /v1/apps.deployments?...` calls separated by the `calcDelay` schedule above, all initiated from the same awaited `fetchPageResources` promise.

**Expected Result**

Either the page renders quickly with a stale snapshot + a non-blocking "retrying" indicator, or `fetchPageResources` resolves within a user-tolerable time on transient errors (consistent with v2.13 behavior).
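
As one possible shape for the first option, the sketch below races the fresh fetch against a render budget and falls back to a stale snapshot while the fetch keeps going. Everything here is hypothetical (`withRenderBudget`, `ListResult`, the `onFresh` callback); it only illustrates the expected behavior and is not a proposed patch to `fetchPageResources`.

```typescript
type ListResult<T> = { items: T[]; stale: boolean };

// Resolves with fresh items if they arrive within budgetMs. Otherwise resolves
// with the snapshot (stale: true) and reports fresh items later via onFresh.
function withRenderBudget<T>(
  fresh: Promise<T[]>,
  snapshot: T[],
  budgetMs: number,
  onFresh: (items: T[]) => void
): Promise<ListResult<T>> {
  return new Promise((resolve) => {
    let settled = false;

    const timer = setTimeout(() => {
      settled = true;
      resolve({ items: snapshot, stale: true }); // render now, keep retrying
    }, budgetMs);

    fresh.then(
      (items) => {
        if (settled) {
          onFresh(items); // page already rendered stale data; push the update
          return;
        }
        settled = true;
        clearTimeout(timer);
        resolve({ items, stale: false });
      },
      () => {
        // Fetch (including its retries) failed: fall back to the snapshot.
        if (!settled) {
          settled = true;
          clearTimeout(timer);
          resolve({ items: snapshot, stale: true });
        }
      }
    );
  });
}
```

The `stale` flag is what a list page could use to show a non-blocking "retrying" indicator, and `fresh` could be the promise returned by whatever retry helper performs the request.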

**Screenshots**

N/A — reproducible from the network panel; no UI artifacts to capture beyond the blank page.

**Additional context**

We hit this in production on v2.14.1 against a large downstream cluster where `cattle-cluster-agent` (in steady-state operation, not restarting) was returning `begin tx: context canceled` on Steve `listbyoptions` calls. That backend error is a separate Steve / sqlcache writer-lock issue on large clusters with `ui-sql-cache=true`. The trigger we observe is cluster size: smaller downstream clusters on the same Rancher instance do not reproduce it. Steve answered the dashboard request for `apps/v1, Kind=Deployment` with `context canceled`, the frontend then entered `recurse`, and the Deployments page stayed blank for ~100 seconds before finally rendering.

Code references (rancher/dashboard release-2.14):

- `shell/plugins/steve/subscribe.js` — `fetchPageResources` action `await`s `backOff.recurse(...)`
- `shell/utils/back-off.ts` — `recurse` method, `calcDelay` formula, default `retries=10`
- `shell/plugins/steve/subscribe.js` — `ws.resource.changes` handler is now awaited (changed from fire-and-forget in v2.13)
