
[v2.14.1] fetchPageResources awaiting backOff.recurse blocks page render up to ~71s on transient backend errors #17819

@vvlisn

Setup

  • Rancher version: v2.14.1
  • Rancher UI Extensions: none (stock dashboard, release-2.14)
  • Browser type & version: Chrome 140 (also reproduced on Safari 17.5)

Describe the bug

In Rancher Dashboard v2.14.1, list pages that go through fetchPageResources (Cluster Explorer → Workloads → Deployments, and other pages backed by paginated Steve list calls) can stay blank for up to ~71 seconds after a transient backend error, even after the backend has fully recovered.

The store action fetchPageResources in shell/plugins/steve/subscribe.js awaits backOff.recurse(...). The recurse method (newly added in v2.14.x in shell/utils/back-off.ts, alongside the existing non-blocking execute) runs an internal retry loop with await this.sleep(this.calcDelay(i)) between attempts. Because the caller is awaiting that promise, the awaited fetch path is blocked for the full retry budget rather than yielding to a background reconciliation.

With the default calcDelay(i) = i === 0 ? 1 : Math.pow(i, 2) * 250 and retries=10, the cumulative blocking time can reach approximately:

1 + 250 + 1000 + 2250 + 4000 + 6250 + 9000 + 12250 + 16000 + 20250 = 71251 ms ≈ 71 s
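
The arithmetic above can be checked with a few lines of TypeScript. This is a standalone sketch that only reuses the formula quoted in this issue. The helper names `delaySchedule` and `totalDelayMs` are made up for illustration and are not part of `back-off.ts`.

```typescript
// Delay formula as quoted in this issue, in milliseconds.
function calcDelay(i: number): number {
  return i === 0 ? 1 : Math.pow(i, 2) * 250;
}

// Hypothetical helper: delays for attempts i = 0 .. retries - 1.
function delaySchedule(retries: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < retries; i++) {
    delays.push(calcDelay(i));
  }
  return delays;
}

// Hypothetical helper: total time spent sleeping if every attempt fails.
function totalDelayMs(retries: number): number {
  let total = 0;
  for (const d of delaySchedule(retries)) {
    total += d;
  }
  return total;
}

console.log(delaySchedule(10)); // 1, 250, 1000, ... 20250
console.log(totalDelayMs(10)); // 71251 ms, about 71 s
```

This counts sleep time only. Time spent waiting on each failing request comes on top of it.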

Additionally, the ws.resource.changes handler was changed from fire-and-forget to await-style in the same release, which feeds the same blocking chain.

In v2.13 the equivalent code path used backOff.execute() (non-blocking, setTimeout based), so the same backend symptom did not block the initial render to the same degree.
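
To make the difference between the two styles concrete, here is a simplified sketch. It is not the actual `recurse` or `execute` code from `back-off.ts`. The names `retryBlocking`, `retryInBackground`, and `Attempt` are hypothetical, and the exact placement of the sleep relative to each attempt is an assumption.

```typescript
// Hypothetical attempt type: resolves true on success, false on failure.
type Attempt = () => Promise<boolean>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Blocking style (roughly what awaiting recurse amounts to, by assumption):
// a caller that awaits this promise waits through every sleep and attempt.
async function retryBlocking(
  attempt: Attempt,
  retries: number,
  delayMs: (i: number) => number
): Promise<boolean> {
  for (let i = 0; i < retries; i++) {
    await sleep(delayMs(i));
    if (await attempt()) {
      return true;
    }
  }
  return false;
}

// Non-blocking style (setTimeout based): returns immediately and reports
// the outcome later through a callback, so the caller can render first.
function retryInBackground(
  attempt: Attempt,
  retries: number,
  delayMs: (i: number) => number,
  onDone: (ok: boolean) => void
): void {
  const run = (i: number): void => {
    if (i >= retries) {
      onDone(false);
      return;
    }
    setTimeout(async () => {
      if (await attempt()) {
        onDone(true);
      } else {
        run(i + 1);
      }
    }, delayMs(i));
  };
  run(0);
}
```

With the blocking variant, anything placed after `await retryBlocking(...)` in the caller, including the code that lets the page render, cannot run until the loop ends. With the background variant the caller continues straight away and the list is reconciled when `onDone` fires.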

To Reproduce

  1. Open Rancher v2.14.1 Dashboard against any cluster.
  2. Cause a transient backend error on a Steve list endpoint for ~10–20 seconds (any 5xx, context canceled, or unknown-revision response). Easiest synthetic test: route Steve list responses through a local proxy that returns 503 for ~15 seconds.
  3. While the backend is failing, navigate to Cluster Explorer → Workloads → Deployments (or any list page that uses fetchPageResources).
  4. Stop the proxy / let the backend recover.

Result

The Deployments page stays blank — no skeleton, no stale snapshot, no error toast — for tens of seconds after the backend has already started returning 200 again. Observed worst case in production: ~100 seconds blank with 7+ retries visible in the network panel.

The DevTools network panel shows successive GET /v1/apps.deployments?... calls separated by the calcDelay schedule above, all initiated from the same awaited fetchPageResources promise.
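
The gap between backend recovery and page render can be estimated from that schedule. The sketch below is a rough model, not a measurement. It assumes the loop sleeps `calcDelay(i)` before attempt `i`, that each request takes zero time, and that the first attempt made after recovery succeeds. `attemptTimes` and `firstSuccessAt` are hypothetical names.

```typescript
// Delay formula as quoted in this issue, in milliseconds.
function calcDelay(i: number): number {
  return i === 0 ? 1 : Math.pow(i, 2) * 250;
}

// Time (ms since navigation) at which each attempt starts, under the
// assumptions stated above.
function attemptTimes(retries: number): number[] {
  const times: number[] = [];
  let t = 0;
  for (let i = 0; i < retries; i++) {
    t += calcDelay(i);
    times.push(t);
  }
  return times;
}

// First attempt made at or after the backend recovers, or null if the
// retry budget runs out first.
function firstSuccessAt(outageMs: number, retries: number = 10): number | null {
  for (const t of attemptTimes(retries)) {
    if (t >= outageMs) {
      return t;
    }
  }
  return null;
}

const outageMs = 15000; // the ~15 s synthetic outage from the repro steps
const renderedAt = firstSuccessAt(outageMs);
if (renderedAt !== null) {
  console.log(renderedAt - outageMs); // 7751 ms of blank page after recovery
}
```

Under this model a 15 s outage leaves the page blank for roughly 7.8 s after recovery, because the attempt at 13751 ms fails and the next one is not made until 22751 ms. Real request latency, in particular requests that hang until they are canceled, pushes every later attempt further out.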

Expected Result

Either the page renders quickly with a stale snapshot + a non-blocking "retrying" indicator, or fetchPageResources resolves within a user-tolerable time on transient errors (consistent with v2.13 behavior).
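
One possible shape for the "resolves within a user-tolerable time" option is to cap how long the caller waits and fall back to a stale value while the retries continue. This is only an illustrative sketch, not a proposed patch. `withTimeBudget` and the names in the usage comment are hypothetical and do not exist in the dashboard code.

```typescript
// Resolves with the work's value if it settles within budgetMs, otherwise
// resolves with the fallback. The underlying work is not canceled, so it
// can still update the view when it eventually completes.
function withTimeBudget<T>(
  work: Promise<T>,
  budgetMs: number,
  fallback: T
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => resolve(fallback), budgetMs);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      }
    );
  });
}

// Usage idea (all names hypothetical):
//   const rows = await withTimeBudget(fetchWithRetries(), 3000, staleSnapshot);
//   render(rows); // shows stale rows after 3 s at most
```

The budget value and what counts as an acceptable fallback (stale snapshot, empty list plus a "retrying" indicator) are design decisions for the maintainers.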

Screenshots

N/A — reproducible from the network panel; no UI artifacts to capture beyond the blank page.

Additional context

We hit this in production on v2.14.1 against a large downstream cluster where cattle-cluster-agent (in steady-state operation, not restarting) was returning begin tx: context canceled on Steve listbyoptions calls — a separate steve / sqlcache writer-lock issue on large clusters with ui-sql-cache=true. The trigger we observe is cluster size: smaller downstream clusters in the same Rancher do not reproduce. The dashboard request for apps/v1, Kind=Deployment was returned with context canceled by Steve, the frontend then entered recurse, and the Deployments page stayed blank for ~100 seconds before finally rendering.

Code references (rancher/dashboard release-2.14):

  • shell/plugins/steve/subscribe.js: fetchPageResources action awaits backOff.recurse(...)
  • shell/utils/back-off.ts: recurse method, calcDelay formula, default retries=10
  • shell/plugins/steve/subscribe.js: ws.resource.changes handler is now awaited (changed from fire-and-forget in v2.13)
