feat(producer,cli): render-reliability telemetry counters for capture hardening#1850
Conversation
vanceingalls
left a comment
There was a problem hiding this comment.
🟢 LGTM — closes the #1842 telemetry-gap heads-up
Follow-up to my heads-up on #1842. Both counters land in the right spots, tests lock the mapping, and no new PostHog wiring is introduced — the capture_* prop naming is the correct pipeline-consistent framing (my original chaos.transient_retry framing was PostHog-dashboard-shaped; the CLI-telemetry props are the right layer to instrument at).
capture_transient_retries — placement ✅
Counted as captureAttempts.filter(a => a.reason === "transient-retry").length at the top of the success path in executeRenderJob (renderOrchestrator.ts:1926). Crucially, this counts every attempt tagged transient-retry, so it catches:
- transient-retry that succeeded on that attempt (the primary "burned budget but recovered" question)
- transient-retry that made partial progress and then the worker-halving retry finished the frames (multi-attempt recovery)
The pendingTransientRetry flag flow — set in the catch after transientRetriesUsed++, consumed on the next iteration's attempts.push — is clean and doesn't mis-tag the worker-halving path. The test on renderOrchestrator.test.ts:226-228 locks the attempt sequence to ["initial", "transient-retry"], which is the right invariant to defend.
capture_memory_exhaustion_detected — placement ✅
Set at renderOrchestrator.ts:2023 when describeMemoryExhaustion returns non-null, which is the same classifier that composes errorMessage. So the flag alignment is exact — no drift between "what shows in the error string" and "what the counter says." Side-effect only, no swallowed exceptions, no double-count with the retry counter.
CLI mapping test covers both the set-and-mapped and the unset-so-undefined cases — the right shape to lock, since a payload key that's always present would over-count.
🟠 should-fix (non-blocking) — failure-path coverage on transient-retry burn
Deliberate scope per the PR body ("counts them on successful renders"), so this is a heads-up not a blocker: captureAttempts is populated by executeDiskCaptureWithAdaptiveRetry regardless of outcome and is in-scope at the catch block on renderOrchestrator.ts:1998-2028. The current wiring answers "we burned budget and recovered" but the sibling question "we burned budget and still failed" — which is the more actionable ops signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES — isn't emitted. Mirroring the counter into the catch block (guarded by transientRetryCount > 0) would be a ~3-line follow-up. Fine to ship as-is if the framing is deliberately "recovery-shaped only" for now.
CI
Build / Typecheck / Test / Lint / Fallow audit / Preview parity all green on 785face. Regression shards + Windows tests still in-flight at review time — normal for this repo, not a blocker.
🟠 pr-envelope (fyi, non-blocking)
Commit trailer Co-Authored-By: Claude Opus 4.8 and PR body 🤖 Generated with [Claude Code]. Consistent with the other render-reliability follow-ups in this batch; Miguel isn't tagged. Flagging per pattern only.
Review by Via
Addresses Via's should-fix on #1850: the transient-retry counter previously fired only on the success path ("burned budget and recovered"). Mirror it into the failure path so "burned budget and STILL failed" is also emitted — the more actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES. Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in scope in the catch (contrary to the review note). Hoisted it to function scope and extracted a shared `recordTransientRetryObservability` helper called from both the success path and the catch, so the two sites can't drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks @vanceingalls — folded in the failure-path counter in One correction on the scope note:
Now Left the AI-trailer/envelope as-is per the batch convention (verified standard in merged history; Miguel approved the sibling PRs with it intact). |
miguel-heygen
left a comment
There was a problem hiding this comment.
Clean follow-up to #1842. I verified the new transient-retry attempt reason is assigned only to the same-worker retry after transient browser death (packages/producer/src/services/renderOrchestrator.ts:674, packages/producer/src/services/renderOrchestrator.ts:803), and the shared recorder now feeds both recovered renders and terminal failures (packages/producer/src/services/renderOrchestrator.ts:1040, packages/producer/src/services/renderOrchestrator.ts:1931, packages/producer/src/services/renderOrchestrator.ts:2019). The CLI mapping is end-to-end through capture_transient_retries / capture_memory_exhaustion_detected (packages/cli/src/telemetry/renderObservability.ts:41, packages/cli/src/telemetry/events.ts:75), with focused mapping tests (packages/cli/src/telemetry/renderObservability.test.ts:26) and the producer attempt-tag invariant pinned (packages/producer/src/services/renderOrchestrator.test.ts:223).
Nit only: packages/producer/src/services/render/observability.ts:49 still says the retry counter is for renders that ultimately succeeded; after d120568c it also records failed renders. Comment-only, not a blocker.
CI note: the initial required Test job failed on an unrelated @hyperframes/aws-lambda hook timeout; I reran it and the rerun passed. Remaining long-running Windows/regression checks were still pending at review time, with no failures.
Verdict: APPROVE
Reasoning: The telemetry is wired at the correct capture boundary, reaches the existing CLI telemetry pipeline without new PostHog plumbing, and the final commit covers Via's failed-render retry-burn gap without changing capture behavior.
— Magi
…pture hardening Follow-up to #1842 / #1843 review feedback (Via + Magi): the P2-5 capture hardening was log-only, so its effect is invisible on PostHog dashboard 1783183 until an incident. Thread two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring): - **transient-retry burn** — `CaptureAttemptSummary.reason` gains a `"transient-retry"` variant (distinct from the worker-halving `"retry"`), so `executeDiskCaptureWithAdaptiveRetry` tags same-worker retries after a transient tab death. `executeRenderJob` counts them into `RenderCaptureObservability.transientRetries` on successful renders — answers "are we burning the retry budget but recovering?". - **OOM classification** — the top-level catch sets `RenderCaptureObservability.memoryExhaustionDetected` whenever `describeMemoryExhaustion` classifies the failure — answers "is OOM the dominant failure tail?". Both surface via `renderObservabilityTelemetryPayload` → `capture_transient_retries` and `capture_memory_exhaustion_detected` PostHog props on the render events. Tests: producer asserts the transient retry attempt is tagged `transient-retry`; new CLI test locks the payload mapping (present when set, undefined otherwise). Scope: the encoder-frame-0-exit signal (also log-only) and the P1-3 pre-flight- rejection / P1-4 cli_env_check counters remain a further follow-up — they live in different subsystems (streaming stage, CLI render/doctor commands). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses Via's should-fix on #1850: the transient-retry counter previously fired only on the success path ("burned budget and recovered"). Mirror it into the failure path so "burned budget and STILL failed" is also emitted — the more actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES. Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in scope in the catch (contrary to the review note). Hoisted it to function scope and extracted a shared `recordTransientRetryObservability` helper called from both the success path and the catch, so the two sites can't drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d120568 to
65655ad
Compare
…timeouts (#1855) The CLI Test job (bun run --filter '!@hyperframes/producer' test) intermittently failed unrelated PRs (#1843, #1850) with `Test timed out in 5000ms` / `Hook timed out in 10000ms`. Root cause: multiple CLI tests cold-import a heavy command module graph via dynamic import() (render.js, auth/status.js, telemetry/system.js), which under the full parallel monorepo run contends for CPU and blows vitest's 5s/10s defaults on constrained runners. Not a product bug. Fix at the right altitude: set testTimeout 20s + hookTimeout 30s once in packages/cli/vitest.config.ts instead of per-test/per-hook bandaids, and remove the now-redundant explicit 30s beforeAll timeouts added in #1843. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What & why
Follow-up to the render-reliability batch (#1841 / #1842 / #1843, all merged). During review, Via and Magi flagged on #1842 that the P2-5 capture hardening (transient-tab-death retry, OOM classification) is log-only — so its effect is invisible on PostHog dashboard 1783183 until an incident. This threads two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring), so the hardening is measurable.
Changes
CaptureAttemptSummary.reasongains a"transient-retry"variant, distinct from the worker-halving"retry";executeDiskCaptureWithAdaptiveRetrytags the same-worker-count retry it fires after a transient tab death.executeRenderJobcounts them intoRenderCaptureObservability.transientRetrieson successful renders — answers "are we burning the retry budget but recovering?"RenderCaptureObservability.memoryExhaustionDetectedwheneverdescribeMemoryExhaustionclassifies the failure (Set maximum size exceeded& friends) — answers "is OOM the dominant failure tail?"renderObservabilityTelemetryPayload→ newcapture_transient_retriesandcapture_memory_exhaustion_detectedprops on the render telemetry events.Tests
executeDiskCaptureWithAdaptiveRetrytransient-retry integration test now asserts the retry attempt is taggedtransient-retry.renderObservability.test.tslocks the payload mapping (present when set,undefinedotherwise).Full local gate green: build,
tsc --noEmit(producer + cli), producer 79 + cli mapper 2 tests, oxlint, oxfmt,fallow audit --base origin/main(the only findings are pre-existing test-boilerplate clones, gate-excluded).Scope / follow-ups
Deliberately scoped to the two counters the #1842 reviewers emphasized. Still log-only and left for a further follow-up (different subsystems): the encoder-frame-0-exit signal (streaming stage), and the P1-3 pre-flight-rejection / P1-4
cli_env_checkcounters (CLI render/doctor commands).Measured on PostHog dashboard 1783183 (render failure taxonomy / first-render success).
🤖 Generated with Claude Code