QVAC-18183 feat[api]: cancel capability + per-handler cancel scope + structured logging #2036

Merged

NamelsKing merged 5 commits into tetherto:main from simon-iribarren:feat/qvac-18183-cancel-capability-per-handler-scope-logging on May 14, 2026
Conversation

@simon-iribarren (Contributor) commented May 13, 2026

Wider context. This PR lands the framework primitives for QVAC-18183 — Per-model concurrency policy and uniform handler/worker authoring. The previous epic milestone — QVAC-18182 — Typed cancel outcomes + KvCacheSession — landed in #2007. Subsequent handler-migration PRs continue against the same QVAC-18183 ticket.

🎯 What problem does this PR solve?

  • "What does cancel do for this handler" was implicit. Every cancel path through RequestRegistry worked after QVAC-18182 feat[api]: typed cancel outcomes on the wire + atomic KV-cache via KvCacheSession #2007, but the addon-side truth — does this handler accept a requestId, does it cancel at the model level, or is it un-cancellable? — lived in handler bodies and reviewer memory. Per-handler migration needs a declarative answer so it's a manifest change rather than a code spelunk.
  • No typed admission control on the SDK. The llama.cpp addon already serialises concurrent runs internally — exclusiveRunQueue queues them, and a second _runInternal(...) that finds _hasActiveResponse === true throws Error('Cannot set new job: a job is already set or being processed') (RUN_BUSY_ERROR_MESSAGE in packages/llm-llamacpp/index.js). The pre-existing failure surface was an opaque Error deep in the addon. This PR moves that admission rule up to the SDK as a typed RequestRejectedByPolicyError (52420) that fires before the addon is even called, so clients can instanceof-narrow and produce a stable "model busy" message instead of pattern-matching on a free-form string.
  • Per-request observability was free-form. "What happened on requestId X" required tailing free-form log lines and inferring causality across files. Bare has no APM agent inside the worker, so logs are the only cross-handler observability surface, and they need a fixed shape to be useful.

📝 How does it solve it?

Lands the three framework primitives so subsequent handler-migration PRs against QVAC-18183 have a single, declarative contract to slot into. Framework-only — no handler is migrated onto registry.begin(...) here beyond the completion handler that landed in #2007.

Deliverable 1 — PluginHandlerDefinition.cancel: { scope, hard? }

  • New types PluginHandlerCancel / PluginHandlerCancelScope on schemas/plugin.ts with runtime-schema validation on pluginHandlerDefinitionRuntimeSchema. Plugin manifests that omit cancel still load (it's optional); manifests with cancel.scope outside "request" | "model" | "none" are rejected at registration time.
  • Declared on every built-in plugin manifest. The truth-table assignment is pinned by test/unit/plugin-cancel-capability.test.ts:
| Plugin | Handler | cancel |
| --- | --- | --- |
| llamacpp-completion | completionStream, translate | { scope: "model", hard: true } |
| llamacpp-completion | finetune | { scope: "none" } |
| llamacpp-embedding | embed | { scope: "model", hard: true } |
| whispercpp-transcription | transcribe, transcribeStream | { scope: "model", hard: true } |
| parakeet-transcription | transcribe, transcribeStream | { scope: "model", hard: true } |
| nmtcpp-translation | translate | { scope: "none" } |
| onnx-tts | textToSpeech, textToSpeechStream | { scope: "none" } |
| onnx-ocr | ocrStream | { scope: "none" } |
| sdcpp-generation | diffusionStream | { scope: "model", hard: true } |
| sdcpp-generation | upscaleStream | { scope: "none" } |

scope: "none" is the soft-cancel contract — registry aborts the signal, the handler's stream stops yielding, result is discarded; the C++ work runs to completion in the background. No correctness issue because the result is no longer observed.

Deliverable 2 — RequestRegistry.policy({ kind, oneAtATimePerModel })

  • New policy(opts) method on RequestRegistry registers a ConcurrencyPolicy per RequestKind. The policy check runs before controller / scope allocation in begin(...), so a rejected begin leaves zero registry state behind.
  • Rejection throws RequestRejectedByPolicyError (code 52420) carrying { requestId, kind, modelId, reason }. Re-exported from @qvac/sdk for instanceof narrowing on the client side.
  • The worker singleton (request-registry-singleton.ts) installs { kind: "completion", oneAtATimePerModel: true } on first access — promoting the llama.cpp addon's internal _hasActiveResponse guard from an opaque deep-stack Error to a typed framework-level rejection.
  • The slot is held until await using unwinds, not when cancel(...) fires. A cancelled-but-still-draining request still owns its addon's KV-cache / decode loop, so admission stays held until dispose. Pinned by test/unit/runtime/request-registry.test.ts ("policy: cancel without dispose does NOT release admission") so a future contributor doesn't "fix" the semantic the wrong way.

Other kinds (embeddings, transcribe, …) are intentionally not policied here; subsequent handler-migration PRs on QVAC-18183 will pick admission rules per kind based on their addons' own concurrency models.
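
A rough sketch of the admission semantics above — names and data structures here are hypothetical, not the actual RequestRegistry internals: the slot is keyed per (kind, modelId), taken before any controller/scope allocation, and released only when the holder is disposed.

```ts
// Hypothetical sketch of oneAtATimePerModel admission — not the actual RequestRegistry code.
class AdmissionTable {
  private held = new Set<string>(); // key: `${kind}:${modelId}`

  // Called at the top of begin(...), before any controller/scope is allocated,
  // so a rejection leaves zero registry state behind.
  acquire(kind: string, modelId: string | undefined): () => void {
    if (modelId === undefined) return () => {}; // requests without a modelId are not policied
    const key = `${kind}:${modelId}`;
    if (this.held.has(key)) {
      // The real code throws RequestRejectedByPolicyError (52420)
      // carrying { requestId, kind, modelId, reason }.
      throw new Error(`busy: ${key}`);
    }
    this.held.add(key);
    // Release only on dispose (await using unwind) — cancel alone keeps the slot held,
    // because a cancelled-but-draining request still owns the addon's KV-cache / decode loop.
    return () => this.held.delete(key);
  }
}
```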

Deliverable 3 — Structured [request-lifecycle] logging

Three lines per request with a fixed shape:

[request-lifecycle] begin  requestId=<id> kind=<kind> modelId=<id|"-"> state=running
[request-lifecycle] cancel requestId=<id> kind=<kind> modelId=<id|"-"> state=cancelling
[request-lifecycle] end    requestId=<id> kind=<kind> modelId=<id|"-"> state=<terminal> durationMs=<n>
  • Levels are info everywhere except an end with state=failed, which fires at warn so ops alerting can predicate on level>=warn for this prefix without parsing the message body.
  • durationMs= (wall-clock from begin to dispose) lands on the end line so log shippers can compute per-request latency without diffing timestamps across two lines.
  • New withRequestContext(logger, ctx) helper in server/bare/runtime/with-request-context.ts wraps any Logger so every emit ships the same requestId=… kind=… modelId=… prefix. It passes through setLevel / getLevel / addTransport / setConsoleOutput so the wrapped instance is observably indistinguishable from the underlying logger except for the prefix.
  • withRequestContext(...) is threaded through completion(...) and into createKvCacheSession(modelId, { logger }), so KV-cache turn lifecycle (beginTurn, commitTurn, rollback, prime verification, file delete) inherits the request's correlation tuple. A single grep "requestId=abc-123" returns the full per-request story chronologically.
  • cancelEntry guards against duplicate cancel emits: a cancel({ requestId }) invoked twice (or once on an already-aborted entry) logs once.
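
A simplified sketch of the prefix wrapping — the real Logger interface and helper live in the SDK (server/bare/runtime/with-request-context.ts); the shapes below are illustrative:

```ts
// Simplified sketch — the real Logger interface and prefix format come from the SDK.
interface MiniLogger {
  info(msg: string, ...args: unknown[]): void;
  warn(msg: string, ...args: unknown[]): void;
}

interface RequestCtxLike {
  requestId: string;
  kind: string;
  modelId?: string;
}

// Hypothetical wrapper mirroring withRequestContext: every emit carries the same
// requestId/kind/modelId correlation tuple, so one grep reconstructs the request's story.
function withMiniRequestContext(logger: MiniLogger, ctx: RequestCtxLike): MiniLogger {
  const prefix =
    `[request-lifecycle ${ctx.kind} requestId=${ctx.requestId}` +
    (ctx.modelId !== undefined ? ` modelId=${ctx.modelId}` : "") +
    "]";
  return {
    info: (msg, ...args) => logger.info(`${prefix} ${msg}`, ...args),
    warn: (msg, ...args) => logger.warn(`${prefix} ${msg}`, ...args),
  };
}
```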

Cursor rules updated in this PR

  • .cursor/rules/sdk/request-lifecycle-primitives.mdc — cancel-capability declaration table for plugin authors, new "don't reach into addon cancel() directly" anti-pattern, concurrency-policy contract, structured-logging shape (now with durationMs + level-split), error-codes table now carries RequestRejectedByPolicyError (52420).
  • .cursor/rules/sdk/docs/request-lifecycle-system.mdc — migration-roadmap table updated; three new FAQ entries explain why each primitive was chosen (the rationale that's hard to recover from code alone); implementation files table covers the new modules.
  • .cursor/rules/sdk/error-handling.mdc — 52420 row added.

🧪 How was it tested?

Pre-merge gates (run from packages/sdk/):

bun run lint        # eslint + tsc, clean
bun run test:unit   # all files, all asserts pass

Unit tests added in this PR:

  • test/unit/plugin-cancel-capability.test.ts (7 tests): runtime-schema acceptance of the new optional cancel field; rejection of invalid scope values; rejection of non-boolean hard. A bareTest-gated section pins the per-plugin truth table by dynamic-importing every built-in manifest and asserting definition.cancel matches the expected { scope, hard } cell. Bare-only because the manifests transitively load native addons (@qvac/llm-llamacpp etc.) that don't resolve under Bun's resolver.

  • test/unit/runtime/request-registry.test.ts (24 tests): added 8 policy tests — oneAtATimePerModel rejects a second begin on the same (kind, modelId); scopes admission per (kind, modelId) not globally; ignores requests without modelId; disposing the holder releases admission; cancel without dispose does NOT release admission (pins the actual cancel/dispose semantic); replacing a policy by calling policy(...) again applies the new rule; kinds without a registered policy are unconstrained. Rejection-path assertions verify the typed RequestRejectedByPolicyError shape (requestId, kind, modelId, reason).

  • test/unit/runtime/request-lifecycle-logging.test.ts (7 tests): stub Logger captures info + warn calls; tests assert the begin/cancel/end line shape including state=running|cancelling|<terminal> + durationMs=<n> segment on end; single-cancel-emit guarantee when cancel({ requestId }) fires twice; cancelAll fan-out emits one cancel line per in-flight entry; cancel-before-begin race-close path emits a begin line with state=cancelling; failed end emits at warn level, not info (ops-alerting predicate).

  • test/unit/runtime/with-request-context.test.ts (5 tests): prefix shape across every log level (error / warn / info / debug / trace); modelId segment dropped when undefined; extra args after the leading message preserved; zero-argument emits still ship the prefix on its own; setLevel / getLevel / addTransport / setConsoleOutput write through to the underlying logger reference unchanged.

Why no e2e in this PR. Changes here are purely additive at the framework level:

  • The cancel field on PluginHandlerDefinition is optional and unread by any code path in this PR — subsequent handler-migration PRs on QVAC-18183 will start reading it.
  • The default oneAtATimePerModel policy for completion does not change observable behaviour for well-behaved single-completion clients — it changes the failure surface for concurrent completions from Error('Cannot set new job: a job is already set or being processed') (raised inside the llama.cpp addon) to a typed RequestRejectedByPolicyError (raised at registry.begin(...)). Clients that already handle the addon's error continue to work; clients that want typed handling get a stable instanceof check.
  • The structured logs are best-effort observability; they don't affect any control flow.

The existing e2e suite from #2007 (packages/sdk/tests-qvac/tests/cancellation-tests.ts) covers the begin/cancel/end paths and will continue to pass — CI will exercise it on this PR. Handler-migration e2e arrives with the subsequent handler-migration PRs on QVAC-18183.

🔌 API Changes

Three additive surfaces. Existing consumers compile and run unchanged.

1. PluginHandlerDefinition.cancel (new optional field)

import { definePlugin, defineHandler } from "@qvac/sdk";

definePlugin({
  manifestVersion: 1,
  handlers: {
    myStream: defineHandler({
      requestSchema,
      responseSchema,
      streaming: true,
      cancel: { scope: "model", hard: true }, // <-- new
      handler: async function* (request, ctx) { /* ... */ },
    }),
  },
});

scope values:

  • "request" — addon accepts a requestId token; cancel routes per-request.
  • "model" — addon cancels "the thing currently running on this model" (requires admission control like oneAtATimePerModel to be useful).
  • "none" — addon has no cancel surface; SDK falls back to soft-cancel (signal aborts, stream stops yielding, C++ runs to completion in background).

hard: true documents that the addon-side cancel interrupts compute. hard: false (or omitted) means cancel is best-effort.

Existing plugin manifests that omit cancel continue to load; the field is optional.

2. RequestRegistry.policy(...) (new method) + RequestRejectedByPolicyError

Two audiences, two import paths:

  • Client-side consumers narrow on the error type. RequestRejectedByPolicyError is on the public @qvac/sdk root (packages/sdk/index.ts):
import { RequestRejectedByPolicyError } from "@qvac/sdk";

try {
  for await (const e of sdk.completion({ ... })) { /* ... */ }
} catch (err) {
  if (err instanceof RequestRejectedByPolicyError) {
    showBusy({ modelId: err.modelId, reason: err.reason });
  } else {
    throw err;
  }
}
  • Handler authors inside the SDK package reach the registry through the worker-internal barrel @/server/bare/runtime — @qvac/sdk does not re-export these symbols; they live behind the bare-runtime boundary:
import { getRequestRegistry } from "@/server/bare/runtime";

const registry = getRequestRegistry();
registry.policy({ kind: "embeddings", oneAtATimePerModel: true });

await using ctx = registry.begin({ requestId, kind: "embeddings", modelId });
// ...

The worker singleton auto-installs the completion policy on first access; no caller code change required for that case. Other kinds remain unconstrained until a future PR opts them in.

3. withRequestContext(logger, ctx) (new helper)

Worker-internal — only handler authors writing code inside the SDK package use it directly. Imported from the same bare-runtime barrel as the registry:

import { getRequestRegistry, withRequestContext } from "@/server/bare/runtime";
import { getServerLogger } from "@/logging";

const registry = getRequestRegistry();
await using ctx = registry.begin({ requestId, kind: "completion", modelId });
const logger = withRequestContext(getServerLogger(), ctx);

logger.info("starting work"); // -> "[request-lifecycle completion requestId=... modelId=...] starting work"

Handler authors don't need to construct one themselves once handler migrations land on QVAC-18183 (the framework will hand them a pre-wired logger via ctx).

@simon-iribarren simon-iribarren requested review from a team as code owners May 13, 2026 19:59
@simon-iribarren simon-iribarren force-pushed the feat/qvac-18183-cancel-capability-per-handler-scope-logging branch 2 times, most recently from 42e7df4 to 80d15a2 on May 14, 2026 08:16
…structured logging

Lands the three M3a framework primitives so subsequent handler migration
sub-PRs (M3b/M3c) have a single, declarative contract to slot into:

1. `PluginHandlerDefinition.cancel: { scope: "request" | "model" | "none"; hard?: boolean }`
   - Added to `schemas/plugin.ts` (`PluginHandlerCancel`, `PluginHandlerCancelScope`)
     + runtime schema validation on `pluginHandlerDefinitionRuntimeSchema`.
   - Declared on every built-in plugin manifest (llamacpp-completion,
     llamacpp-embedding, whispercpp/parakeet-transcription, nmtcpp-translation,
     onnx-tts/ocr, sdcpp-generation). The truth-table assignment is pinned by
     `test/unit/plugin-cancel-capability.test.ts`.

2. `RequestRegistry.policy({ kind, oneAtATimePerModel })`:
   - Admission control runs before scope/controller allocation in `begin(...)`.
     Rejecting a request raises `RequestRejectedByPolicyError` (52420) carrying
     `requestId`, `kind`, `modelId`, `reason` — re-exported from `@qvac/sdk` for
     `instanceof` checks.
   - The worker singleton installs `{ kind: "completion", oneAtATimePerModel: true }`
     on first access, matching the llama.cpp addon's single-decode-loop reality.

3. Structured `[request-lifecycle]` emits at begin/cancel/end:
   - Fixed log shape `requestId=<id> kind=<kind> modelId=<id|"-"> state=<state>` so
     `grep "requestId=abc"` returns the full per-request story chronologically.
   - `withRequestContext(logger, ctx)` extends the same prefix to handler-level
     emits; threaded through `completion(...)` and into `KvCacheSession` so
     KV-cache turn lifecycle shares the request's correlation tuple.
   - Single-cancel-emit guard suppresses duplicate cancel lines when
     `cancel({ requestId })` is invoked twice.

Verification (from `packages/sdk/`):
- `bun run lint` (eslint + tsc): clean.
- `bun run test:unit`: 49 files / all asserts pass, including the 4 M3a test
  files (`plugin-cancel-capability` 7/7, `request-registry` 41/41,
  `request-lifecycle-logging` 6/6, `with-request-context` 5/5).

Cursor rules updated alongside the code:
- `request-lifecycle-primitives.mdc`: cancel-capability declaration table,
  concurrency-policy contract, structured-logging shape, error-codes table now
  carries 52420.
- `docs/request-lifecycle-system.mdc`: migration-roadmap table reflects M3a
  shipped; three new FAQ entries explain *why* each primitive was chosen;
  implementation files table covers the new modules.
- `error-handling.mdc`: 52420 row added.

This PR is framework-only — no handler is migrated onto `registry.begin(...)` here
beyond the completion handler that landed in M2. Handler migrations follow in
M3b (inference handlers), M3c (non-inference / addon handlers), and M3d (CLI
cancel bridge + cancelHandler retirement).
@simon-iribarren simon-iribarren force-pushed the feat/qvac-18183-cancel-capability-per-handler-scope-logging branch from 80d15a2 to 31605bd on May 14, 2026 08:21
@simon-iribarren simon-iribarren added test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] and removed test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] labels May 14, 2026
@opaninakuffo (Contributor) commented:

The framework looks solid: policy runs before any controller/scope allocation, RequestRejectedByPolicyError is wired with a dedicated code, lifecycle logs are keyed and failed bumps to warn as advertised, and the unit-test matrix matches the PR narrative.

@opaninakuffo (Contributor) commented:

nit: the PR “API Changes” snippet imports getRequestRegistry / withRequestContext from @qvac/sdk, but packages/sdk/index.ts only adds RequestRejectedByPolicyError on the public root — those two symbols live on the bare runtime module path, not the root export map. Worth retuning the example so copy-paste matches reality (or explicitly exporting them if that was the intent).

@simon-iribarren simon-iribarren added the verified Authorize secrets / label-gate in PR workflows label May 14, 2026
@tetherto tetherto deleted a comment from github-actions Bot May 14, 2026
@simon-iribarren (Contributor, Author) commented:

/review

github-actions bot commented May 14, 2026

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

@simon-iribarren simon-iribarren removed test-e2e-smoke Triggers smoke e2e test suite [Currently SDK-only] verified Authorize secrets / label-gate in PR workflows labels May 14, 2026
@NamelsKing NamelsKing added the verified Authorize secrets / label-gate in PR workflows label May 14, 2026
@NamelsKing (Contributor) commented:

/review

@NamelsKing NamelsKing merged commit 9cc6018 into tetherto:main May 14, 2026
31 checks passed
Victor-Rodzko added a commit that referenced this pull request May 14, 2026
After PR #2036 (M3a) installed the `oneAtATimePerModel` policy on the
completion registry, back-to-back `completion({ modelId })` calls hit a
small disposal window where the prior request's slot is still alive in
`RequestRegistry`. Five `logging-*` e2e tests started failing in ~100 ms
with `RequestRejectedByPolicyError` (52420), each immediately after
`logging-persist-across-reload`.

Test-side mitigation only:
- `callWhenAddonIdle` retries on both the legacy llama.cpp busy throw
  and the new typed policy rejection.
- `runReload` joins its own triggering completion before returning so
  `collectLogs`'s `Promise.race` cannot hand the next test a
  still-running slot.

The underlying SDK race (client `final` resolves before server
`await using ctx` disposal) needs a separate SDK-side fix; see
QVAC-18935.

Refs: QVAC-18935
Victor-Rodzko added a commit that referenced this pull request May 14, 2026
…gging tests (#2065)

* test[skiplog]: retry on RequestRejectedByPolicyError in logging tests

After PR #2036 (M3a) installed the `oneAtATimePerModel` policy on the
completion registry, back-to-back `completion({ modelId })` calls hit a
small disposal window where the prior request's slot is still alive in
`RequestRegistry`. Five `logging-*` e2e tests started failing in ~100 ms
with `RequestRejectedByPolicyError` (52420), each immediately after
`logging-persist-across-reload`.

Test-side mitigation only:
- `callWhenAddonIdle` retries on both the legacy llama.cpp busy throw
  and the new typed policy rejection.
- `runReload` joins its own triggering completion before returning so
  `collectLogs`'s `Promise.race` cannot hand the next test a
  still-running slot.

The underlying SDK race (client `final` resolves before server
`await using ctx` disposal) needs a separate SDK-side fix; see
QVAC-18935.

Refs: QVAC-18935

* fixup: match RequestRejectedByPolicyError by code, not instanceof

Server errors cross the RPC boundary as `RPCError`, so `instanceof
RequestRejectedByPolicyError` is always false on the client. CI on
PR #2065 confirmed this: `addon-logging-during-inference` failed in
204 ms (< 250 ms retry interval), meaning the first attempt's error
was not recognised as transient.

Match by `code === SDK_SERVER_ERROR_CODES.REQUEST_REJECTED_BY_POLICY`
(52420) — the documented client-side check pattern.

Refs: QVAC-18935
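
A minimal sketch of that code-based check — the 52420 value comes from the PR; the exact field layout of the error object that arrives over RPC is assumed:

```ts
// Hedged sketch: match the policy rejection by its documented code rather than instanceof,
// since server errors cross the RPC boundary as RPCError and lose their prototype.
const REQUEST_REJECTED_BY_POLICY = 52420;

function isPolicyRejection(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { code?: unknown }).code === REQUEST_REJECTED_BY_POLICY
  );
}
```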