Skip to content

feat(producer,cli): render-reliability telemetry counters for capture hardening#1850

Merged
jrusso1020 merged 2 commits into
mainfrom
fix/render-reliability-telemetry
Jul 2, 2026
Merged

feat(producer,cli): render-reliability telemetry counters for capture hardening#1850
jrusso1020 merged 2 commits into
mainfrom
fix/render-reliability-telemetry

Conversation

@jrusso1020

Copy link
Copy Markdown
Collaborator

What & why

Follow-up to the render-reliability batch (#1841 / #1842 / #1843, all merged). During review, Via and Magi flagged on #1842 that the P2-5 capture hardening (transient-tab-death retry, OOM classification) is log-only — so its effect is invisible on PostHog dashboard 1783183 until an incident. This threads two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring), so the hardening is measurable.

Changes

  • Transient-retry burn. CaptureAttemptSummary.reason gains a "transient-retry" variant, distinct from the worker-halving "retry"; executeDiskCaptureWithAdaptiveRetry tags the same-worker-count retry it fires after a transient tab death. executeRenderJob counts them into RenderCaptureObservability.transientRetries on successful renders — answers "are we burning the retry budget but recovering?"
  • OOM classification. The top-level capture catch sets RenderCaptureObservability.memoryExhaustionDetected whenever describeMemoryExhaustion classifies the failure (Set maximum size exceeded & friends) — answers "is OOM the dominant failure tail?"
  • Both surface via renderObservabilityTelemetryPayload → new capture_transient_retries and capture_memory_exhaustion_detected props on the render telemetry events.

Tests

  • Producer: executeDiskCaptureWithAdaptiveRetry transient-retry integration test now asserts the retry attempt is tagged transient-retry.
  • CLI: new renderObservability.test.ts locks the payload mapping (present when set, undefined otherwise).

Full local gate green: build, tsc --noEmit (producer + cli), producer 79 + cli mapper 2 tests, oxlint, oxfmt, fallow audit --base origin/main (the only findings are pre-existing test-boilerplate clones, gate-excluded).

Scope / follow-ups

Deliberately scoped to the two counters the #1842 reviewers emphasized. Still log-only and left for a further follow-up (different subsystems): the encoder-frame-0-exit signal (streaming stage), and the P1-3 pre-flight-rejection / P1-4 cli_env_check counters (CLI render/doctor commands).

Measured on PostHog dashboard 1783183 (render failure taxonomy / first-render success).

🤖 Generated with Claude Code

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 LGTM — closes the #1842 telemetry-gap heads-up

Follow-up to my heads-up on #1842. Both counters land in the right spots, tests lock the mapping, and no new PostHog wiring is introduced — the capture_* prop naming is the correct pipeline-consistent framing (my original chaos.transient_retry framing was PostHog-dashboard-shaped; the CLI-telemetry props are the right layer to instrument at).

capture_transient_retries — placement ✅

Counted as captureAttempts.filter(a => a.reason === "transient-retry").length at the top of the success path in executeRenderJob (renderOrchestrator.ts:1926). Crucially, this counts every attempt tagged transient-retry, so it catches:

  • transient-retry that succeeded on that attempt (the primary "burned budget but recovered" question)
  • transient-retry that made partial progress and then the worker-halving retry finished the frames (multi-attempt recovery)

The pendingTransientRetry flag flow — set in the catch after transientRetriesUsed++, consumed on the next iteration's attempts.push — is clean and doesn't mis-tag the worker-halving path. The test on renderOrchestrator.test.ts:226-228 locks the attempt sequence to ["initial", "transient-retry"], which is the right invariant to defend.

capture_memory_exhaustion_detected — placement ✅

Set at renderOrchestrator.ts:2023 when describeMemoryExhaustion returns non-null, which is the same classifier that composes errorMessage. So the flag alignment is exact — no drift between "what shows in the error string" and "what the counter says." Side-effect only, no swallowed exceptions, no double-count with the retry counter.

CLI mapping test covers both the set-and-mapped and the unset-so-undefined cases — the right shape to lock, since a payload key that's always present would over-count.

🟠 should-fix (non-blocking) — failure-path coverage on transient-retry burn

Deliberate scope per the PR body ("counts them on successful renders"), so this is a heads-up not a blocker: captureAttempts is populated by executeDiskCaptureWithAdaptiveRetry regardless of outcome and is in-scope at the catch block on renderOrchestrator.ts:1998-2028. The current wiring answers "we burned budget and recovered" but the sibling question "we burned budget and still failed" — which is the more actionable ops signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES — isn't emitted. Mirroring the counter into the catch block (guarded by transientRetryCount > 0) would be a ~3-line follow-up. Fine to ship as-is if the framing is deliberately "recovery-shaped only" for now.

CI

Build / Typecheck / Test / Lint / Fallow audit / Preview parity all green on 785face. Regression shards + Windows tests still in-flight at review time — normal for this repo, not a blocker.

🟠 pr-envelope (fyi, non-blocking)

Commit trailer Co-Authored-By: Claude Opus 4.8 and PR body 🤖 Generated with [Claude Code]. Consistent with the other render-reliability follow-ups in this batch; Miguel isn't tagged. Flagging per pattern only.


Review by Via

jrusso1020 added a commit that referenced this pull request Jul 2, 2026
Addresses Via's should-fix on #1850: the transient-retry counter previously
fired only on the success path ("burned budget and recovered"). Mirror it into
the failure path so "burned budget and STILL failed" is also emitted — the more
actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES.

Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in
scope in the catch (contrary to the review note). Hoisted it to function scope
and extracted a shared `recordTransientRetryObservability` helper called from
both the success path and the catch, so the two sites can't drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jrusso1020

Copy link
Copy Markdown
Collaborator Author

Thanks @vanceingalls — folded in the failure-path counter in d120568c1.

One correction on the scope note: captureAttempts was declared const inside the try (renderOrchestrator.ts, the try opens ~40 lines above the capture-stage call), so it was not actually in scope at the catch block — referencing it there wouldn't compile (the catch-path vars like captureCompositionWidth work because they're let-declared at function scope). So I:

  • Hoisted captureAttempts to function scope.
  • Extracted a shared recordTransientRetryObservability() helper (count → updateCaptureObservability) called from both the success path and the catch, so the "recovered" and "burned-and-still-failed" sites can't drift.

Now capture_transient_retries is emitted on both outcomes — the failed-render case being the actionable one for tuning MAX_TRANSIENT_CAPTURE_RETRIES you called out. Gate green (typecheck, producer 79 tests, lint/fmt/fallow). The success-path test already locks the ["initial","transient-retry"] tagging that both sites filter on.

Left the AI-trailer/envelope as-is per the batch convention (verified standard in merged history; Miguel approved the sibling PRs with it intact).

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean follow-up to #1842. I verified the new transient-retry attempt reason is assigned only to the same-worker retry after transient browser death (packages/producer/src/services/renderOrchestrator.ts:674, packages/producer/src/services/renderOrchestrator.ts:803), and the shared recorder now feeds both recovered renders and terminal failures (packages/producer/src/services/renderOrchestrator.ts:1040, packages/producer/src/services/renderOrchestrator.ts:1931, packages/producer/src/services/renderOrchestrator.ts:2019). The CLI mapping is end-to-end through capture_transient_retries / capture_memory_exhaustion_detected (packages/cli/src/telemetry/renderObservability.ts:41, packages/cli/src/telemetry/events.ts:75), with focused mapping tests (packages/cli/src/telemetry/renderObservability.test.ts:26) and the producer attempt-tag invariant pinned (packages/producer/src/services/renderOrchestrator.test.ts:223).

Nit only: packages/producer/src/services/render/observability.ts:49 still says the retry counter is for renders that ultimately succeeded; after d120568c it also records failed renders. Comment-only, not a blocker.

CI note: the initial required Test job failed on an unrelated @hyperframes/aws-lambda hook timeout; I reran it and the rerun passed. Remaining long-running Windows/regression checks were still pending at review time, with no failures.

Verdict: APPROVE
Reasoning: The telemetry is wired at the correct capture boundary, reaches the existing CLI telemetry pipeline without new PostHog plumbing, and the final commit covers Via's failed-render retry-burn gap without changing capture behavior.

— Magi

jrusso1020 and others added 2 commits July 1, 2026 20:34
…pture hardening

Follow-up to #1842 / #1843 review feedback (Via + Magi): the P2-5 capture
hardening was log-only, so its effect is invisible on PostHog dashboard 1783183
until an incident. Thread two counters through the existing observability →
CLI-telemetry pipeline (no new PostHog wiring):

- **transient-retry burn** — `CaptureAttemptSummary.reason` gains a
  `"transient-retry"` variant (distinct from the worker-halving `"retry"`), so
  `executeDiskCaptureWithAdaptiveRetry` tags same-worker retries after a
  transient tab death. `executeRenderJob` counts them into
  `RenderCaptureObservability.transientRetries` on successful renders — answers
  "are we burning the retry budget but recovering?".
- **OOM classification** — the top-level catch sets
  `RenderCaptureObservability.memoryExhaustionDetected` whenever
  `describeMemoryExhaustion` classifies the failure — answers "is OOM the
  dominant failure tail?".

Both surface via `renderObservabilityTelemetryPayload` → `capture_transient_retries`
and `capture_memory_exhaustion_detected` PostHog props on the render events.

Tests: producer asserts the transient retry attempt is tagged `transient-retry`;
new CLI test locks the payload mapping (present when set, undefined otherwise).

Scope: the encoder-frame-0-exit signal (also log-only) and the P1-3 pre-flight-
rejection / P1-4 cli_env_check counters remain a further follow-up — they live in
different subsystems (streaming stage, CLI render/doctor commands).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Via's should-fix on #1850: the transient-retry counter previously
fired only on the success path ("burned budget and recovered"). Mirror it into
the failure path so "burned budget and STILL failed" is also emitted — the more
actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES.

Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in
scope in the catch (contrary to the review note). Hoisted it to function scope
and extracted a shared `recordTransientRetryObservability` helper called from
both the success path and the catch, so the two sites can't drift.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jrusso1020 jrusso1020 force-pushed the fix/render-reliability-telemetry branch from d120568 to 65655ad Compare July 2, 2026 03:35
@jrusso1020 jrusso1020 merged commit 733f88c into main Jul 2, 2026
48 checks passed
@jrusso1020 jrusso1020 deleted the fix/render-reliability-telemetry branch July 2, 2026 03:57
jrusso1020 added a commit that referenced this pull request Jul 2, 2026
…timeouts (#1855)

The CLI Test job (bun run --filter '!@hyperframes/producer' test) intermittently failed unrelated PRs (#1843, #1850) with `Test timed out in 5000ms` / `Hook timed out in 10000ms`. Root cause: multiple CLI tests cold-import a heavy command module graph via dynamic import() (render.js, auth/status.js, telemetry/system.js), which under the full parallel monorepo run contends for CPU and blows vitest's 5s/10s defaults on constrained runners. Not a product bug.

Fix at the right altitude: set testTimeout 20s + hookTimeout 30s once in packages/cli/vitest.config.ts instead of per-test/per-hook bandaids, and remove the now-redundant explicit 30s beforeAll timeouts added in #1843.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants