Skip to content

QVAC-18935 test[skiplog]: retry on RequestRejectedByPolicyError in logging tests#2065

Merged
Victor-Rodzko merged 3 commits into
mainfrom
test/QVAC-18935-logging-flake-retry-policy-rejection
May 14, 2026
Merged

QVAC-18935 test[skiplog]: retry on RequestRejectedByPolicyError in logging tests#2065
Victor-Rodzko merged 3 commits into
mainfrom
test/QVAC-18935-logging-flake-retry-policy-rejection

Conversation

@Victor-Rodzko
Copy link
Copy Markdown
Contributor

@Victor-Rodzko Victor-Rodzko commented May 14, 2026

🎯 What problem does this PR solve?

  • 5 logging-* e2e tests regressed deterministically on Linux / Windows / iOS after QVAC-18183 feat[api]: cancel capability + per-handler cancel scope + structured logging #2036 (M3a) installed the oneAtATimePerModel registry policy: logging-all-addons-silent, logging-long-message, logging-streaming-stress, logging-timestamp-accuracy, logging-namespace-filter.
  • All five failed in ~100 ms with RequestRejectedByPolicyError (52420), citing the same "zombie" requestId from the just-finished logging-persist-across-reload completion.
  • Repro: run 25859103224; same five failures reproduce 5/5 locally on main at the same SHA.

📝 How does it solve it?

  • Test-side mitigation only — the underlying SDK race (client final resolves before server await using ctx disposal) needs a separate SDK-side fix, tracked in QVAC-18935.
  • callWhenAddonIdle now treats RequestRejectedByPolicyError as the same transient condition as the legacy llama.cpp busy throw. Both mean "slot still occupied, retry shortly" — same 30 s deadline, same 250 ms backoff.
  • runReload joins its own triggering completion before returning, so collectLogs's Promise.race cannot hand the next test a still-running request.

🧪 How was it tested?

  • Local repro: 5/5 failures on main without the fix, same RequestRejectedByPolicyError payload as CI run 25859103224.
  • With fix: npm run typecheck clean in packages/sdk/tests-qvac; on this branch the 5 tests retry through the disposal window and pass.
  • Defense-in-depth: timeout-bounded retry surfaces as AddonBusyTimeoutError (already wired to the dep-eviction cascade) if the slot truly never frees, so a real SDK regression stays loud rather than silently retrying forever.

Refs: QVAC-18935

@Victor-Rodzko Victor-Rodzko requested review from a team as code owners May 14, 2026 13:37
After PR #2036 (M3a) installed the `oneAtATimePerModel` policy on the
completion registry, back-to-back `completion({ modelId })` calls hit a
small disposal window where the prior request's slot is still alive in
`RequestRegistry`. Five `logging-*` e2e tests started failing in ~100 ms
with `RequestRejectedByPolicyError` (52420), each immediately after
`logging-persist-across-reload`.

Test-side mitigation only:
- `callWhenAddonIdle` retries on both the legacy llama.cpp busy throw
  and the new typed policy rejection.
- `runReload` joins its own triggering completion before returning so
  `collectLogs`'s `Promise.race` cannot hand the next test a
  still-running slot.

The underlying SDK race (client `final` resolves before server
`await using ctx` disposal) needs a separate SDK-side fix; see
QVAC-18935.

Refs: QVAC-18935
@Victor-Rodzko Victor-Rodzko force-pushed the test/QVAC-18935-logging-flake-retry-policy-rejection branch from 92ae29e to 07b7076 Compare May 14, 2026 13:43
@Victor-Rodzko Victor-Rodzko added verify tier1 test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] verified Authorize secrets / label-gate in PR workflows labels May 14, 2026
@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

NamelsKing
NamelsKing previously approved these changes May 14, 2026
Server errors cross the RPC boundary as `RPCError`, so `instanceof
RequestRejectedByPolicyError` is always false on the client. CI on
PR #2065 confirmed this: `addon-logging-during-inference` failed in
204 ms (< 250 ms retry interval), meaning the first attempt's error
was not recognised as transient.

Match by `code === SDK_SERVER_ERROR_CODES.REQUEST_REJECTED_BY_POLICY`
(52420) — the documented client-side check pattern.

Refs: QVAC-18935
@Victor-Rodzko Victor-Rodzko dismissed stale reviews from NamelsKing and simon-iribarren via a0334b6 May 14, 2026 14:19
@simon-iribarren
Copy link
Copy Markdown
Contributor

Non-blocking, approve stands — but since you mentioned still seeing some Linux flake, two thoughts that might help:

1. The runReload race shape lives in three other handlers too. collectLogs does Promise.race([collectPromise, triggerPromise.then(() => sleep(postTriggerWaitMs))]), so the collect-vs-trigger race that motivated this fix also exists in runAddonLogging, runDuringInference, and runConcurrent — all three pass a trigger that wraps callWhenAddonIdle/runCompletion, and the collector can win the race while the server-side await using ctx is still disposing. runReload is the only call site exposed deterministically (because it unloads/reloads then immediately hands a fresh model to the next test), but if Linux is intermittently flaking on the other four tests too, those three are the next suspects.

Rather than open-code the let triggerPromise = Promise.resolve(); … await triggerPromise.catch(() => {}) join four times, you could lift the join into collectLogs itself so every caller gets it for free:

async function collectLogs(opts: CollectLogsOptions): Promise<CollectLogsResult> {
  // ... existing setup ...
  let triggerPromise: Promise<void> = Promise.resolve();
  const triggerWrapper = (async () => {
    await sleep(STREAM_OPEN_DELAY_MS + preTriggerExtraWaitMs);
    if (trigger) {
      triggerStartMs = Date.now();
      triggerPromise = trigger();
      await triggerPromise;
    }
  })();

  await Promise.race([collectPromise, triggerWrapper.then(() => sleep(postTriggerWaitMs))]);
  await triggerPromise.catch(() => {});

  return { logs, passed: logs.length > 0, output: /* … */ };
}

Closes the race for all four handlers, and runReload drops back to its pre-PR shape.

2. Tiny readability nit. The await triggerPromise.catch(() => {}) swallow is intentional (result is already decided by the race; callWhenAddonIdle already converts true never-free conditions into AddonBusyTimeoutError), but the SDK convention is "no .catch(() => {})". A one-line comment above it (e.g. // drain: result already decided by race; AddonBusyTimeoutError still surfaces real regressions) would head off future readers. If you fold (1), this goes away naturally.

Either way the fix is correctly scoped and the approval stands.

@Victor-Rodzko Victor-Rodzko added test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] and removed test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] labels May 14, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

QVAC E2E — android⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts: Device Farm logs

The test job did not produce a results artifact. Check the run for job-level failures.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

QVAC E2E — ios⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run

The test job did not produce a results artifact. Check the run for job-level failures.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

QVAC E2E — windows — ✅ all tests passed (91/91, 403s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts: reports

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

QVAC E2E — linux — ✅ all tests passed (91/91, 343s)

Config: suite=smoke · filter=(none) · exclude=(none)
View run · Artifacts: reports

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

QVAC E2E — macos⚠️ no results

Config: suite=smoke · filter=(none) · exclude=(none)
View run

The test job did not produce a results artifact. Check the run for job-level failures.

@kinsta
Copy link
Copy Markdown

kinsta Bot commented May 14, 2026

Preview deployments for qvac-docs-staging ⚡️

Status Branch preview Commit preview
🔁 Deploying... N/A N/A

Commit: a62b507c72b0f4db9c68b50b56504f39777424b8

Deployment ID: 599ed843-3213-4e89-b3fe-6a40f55b0157

Static site name: qvac-docs-staging-fazwv

@Victor-Rodzko Victor-Rodzko merged commit 8ca7bd3 into main May 14, 2026
25 checks passed
@Victor-Rodzko Victor-Rodzko deleted the test/QVAC-18935-logging-flake-retry-policy-rejection branch May 14, 2026 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] tier1 verified Authorize secrets / label-gate in PR workflows verify

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants