feat(producer,cli): render-reliability telemetry counters for capture hardening by jrusso1020 · Pull Request #1850 · heygen-com/hyperframes

jrusso1020 · 2026-07-02T02:46:11Z

What & why

Follow-up to the render-reliability batch (#1841 / #1842 / #1843, all merged). During review, Via and Magi flagged on #1842 that the P2-5 capture hardening (transient-tab-death retry, OOM classification) is log-only — so its effect is invisible on PostHog dashboard 1783183 until an incident. This threads two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring), so the hardening is measurable.

Changes

Transient-retry burn. CaptureAttemptSummary.reason gains a "transient-retry" variant, distinct from the worker-halving "retry"; executeDiskCaptureWithAdaptiveRetry tags the same-worker-count retry it fires after a transient tab death. executeRenderJob counts them into RenderCaptureObservability.transientRetries on successful renders — answers "are we burning the retry budget but recovering?"
OOM classification. The top-level capture catch sets RenderCaptureObservability.memoryExhaustionDetected whenever describeMemoryExhaustion classifies the failure (Set maximum size exceeded & friends) — answers "is OOM the dominant failure tail?"
Both surface via renderObservabilityTelemetryPayload → new capture_transient_retries and capture_memory_exhaustion_detected props on the render telemetry events.

Tests

Producer: executeDiskCaptureWithAdaptiveRetry transient-retry integration test now asserts the retry attempt is tagged transient-retry.
CLI: new renderObservability.test.ts locks the payload mapping (present when set, undefined otherwise).

Full local gate green: build, tsc --noEmit (producer + cli), producer 79 + cli mapper 2 tests, oxlint, oxfmt, fallow audit --base origin/main (the only findings are pre-existing test-boilerplate clones, gate-excluded).

Scope / follow-ups

Deliberately scoped to the two counters the #1842 reviewers emphasized. Still log-only and left for a further follow-up (different subsystems): the encoder-frame-0-exit signal (streaming stage), and the P1-3 pre-flight-rejection / P1-4 cli_env_check counters (CLI render/doctor commands).

Measured on PostHog dashboard 1783183 (render failure taxonomy / first-render success).

🤖 Generated with Claude Code

vanceingalls

🟢 LGTM — closes the #1842 telemetry-gap heads-up

Follow-up to my heads-up on #1842. Both counters land in the right spots, tests lock the mapping, and no new PostHog wiring is introduced — the capture_* prop naming is the correct pipeline-consistent framing (my original chaos.transient_retry framing was PostHog-dashboard-shaped; the CLI-telemetry props are the right layer to instrument at).

capture_transient_retries — placement ✅

Counted as captureAttempts.filter(a => a.reason === "transient-retry").length at the top of the success path in executeRenderJob (renderOrchestrator.ts:1926). Crucially, this counts every attempt tagged transient-retry, so it catches:

transient-retry that succeeded on that attempt (the primary "burned budget but recovered" question)
transient-retry that made partial progress and then the worker-halving retry finished the frames (multi-attempt recovery)

The pendingTransientRetry flag flow — set in the catch after transientRetriesUsed++, consumed on the next iteration's attempts.push — is clean and doesn't mis-tag the worker-halving path. The test on renderOrchestrator.test.ts:226-228 locks the attempt sequence to ["initial", "transient-retry"], which is the right invariant to defend.

capture_memory_exhaustion_detected — placement ✅

Set at renderOrchestrator.ts:2023 when describeMemoryExhaustion returns non-null, which is the same classifier that composes errorMessage. So the flag alignment is exact — no drift between "what shows in the error string" and "what the counter says." Side-effect only, no swallowed exceptions, no double-count with the retry counter.

CLI mapping test covers both the set-and-mapped and the unset-so-undefined cases — the right shape to lock, since a payload key that's always present would over-count.

🟠 should-fix (non-blocking) — failure-path coverage on transient-retry burn

Deliberate scope per the PR body ("counts them on successful renders"), so this is a heads-up not a blocker: captureAttempts is populated by executeDiskCaptureWithAdaptiveRetry regardless of outcome and is in-scope at the catch block on renderOrchestrator.ts:1998-2028. The current wiring answers "we burned budget and recovered" but the sibling question "we burned budget and still failed" — which is the more actionable ops signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES — isn't emitted. Mirroring the counter into the catch block (guarded by transientRetryCount > 0) would be a ~3-line follow-up. Fine to ship as-is if the framing is deliberately "recovery-shaped only" for now.

CI

Build / Typecheck / Test / Lint / Fallow audit / Preview parity all green on 785face. Regression shards + Windows tests still in-flight at review time — normal for this repo, not a blocker.

🟠 pr-envelope (fyi, non-blocking)

Commit trailer Co-Authored-By: Claude Opus 4.8 and PR body 🤖 Generated with [Claude Code]. Consistent with the other render-reliability follow-ups in this batch; Miguel isn't tagged. Flagging per pattern only.

Review by Via

Addresses Via's should-fix on #1850: the transient-retry counter previously fired only on the success path ("burned budget and recovered"). Mirror it into the failure path so "burned budget and STILL failed" is also emitted — the more actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES. Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in scope in the catch (contrary to the review note). Hoisted it to function scope and extracted a shared `recordTransientRetryObservability` helper called from both the success path and the catch, so the two sites can't drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jrusso1020 · 2026-07-02T02:57:09Z

Thanks @vanceingalls — folded in the failure-path counter in d120568c1.

One correction on the scope note: captureAttempts was declared const inside the try (renderOrchestrator.ts, the try opens ~40 lines above the capture-stage call), so it was not actually in scope at the catch block — referencing it there wouldn't compile (the catch-path vars like captureCompositionWidth work because they're let-declared at function scope). So I:

Hoisted captureAttempts to function scope.
Extracted a shared recordTransientRetryObservability() helper (count → updateCaptureObservability) called from both the success path and the catch, so the "recovered" and "burned-and-still-failed" sites can't drift.

Now capture_transient_retries is emitted on both outcomes — the failed-render case being the actionable one for tuning MAX_TRANSIENT_CAPTURE_RETRIES you called out. Gate green (typecheck, producer 79 tests, lint/fmt/fallow). The success-path test already locks the ["initial","transient-retry"] tagging that both sites filter on.

Left the AI-trailer/envelope as-is per the batch convention (verified standard in merged history; Miguel approved the sibling PRs with it intact).

miguel-heygen

Clean follow-up to #1842. I verified the new transient-retry attempt reason is assigned only to the same-worker retry after transient browser death (packages/producer/src/services/renderOrchestrator.ts:674, packages/producer/src/services/renderOrchestrator.ts:803), and the shared recorder now feeds both recovered renders and terminal failures (packages/producer/src/services/renderOrchestrator.ts:1040, packages/producer/src/services/renderOrchestrator.ts:1931, packages/producer/src/services/renderOrchestrator.ts:2019). The CLI mapping is end-to-end through capture_transient_retries / capture_memory_exhaustion_detected (packages/cli/src/telemetry/renderObservability.ts:41, packages/cli/src/telemetry/events.ts:75), with focused mapping tests (packages/cli/src/telemetry/renderObservability.test.ts:26) and the producer attempt-tag invariant pinned (packages/producer/src/services/renderOrchestrator.test.ts:223).

Nit only: packages/producer/src/services/render/observability.ts:49 still says the retry counter is for renders that ultimately succeeded; after d120568c it also records failed renders. Comment-only, not a blocker.

CI note: the initial required Test job failed on an unrelated @hyperframes/aws-lambda hook timeout; I reran it and the rerun passed. Remaining long-running Windows/regression checks were still pending at review time, with no failures.

Verdict: APPROVE
Reasoning: The telemetry is wired at the correct capture boundary, reaches the existing CLI telemetry pipeline without new PostHog plumbing, and the final commit covers Via's failed-render retry-burn gap without changing capture behavior.

— Magi

…pture hardening Follow-up to #1842 / #1843 review feedback (Via + Magi): the P2-5 capture hardening was log-only, so its effect is invisible on PostHog dashboard 1783183 until an incident. Thread two counters through the existing observability → CLI-telemetry pipeline (no new PostHog wiring): - **transient-retry burn** — `CaptureAttemptSummary.reason` gains a `"transient-retry"` variant (distinct from the worker-halving `"retry"`), so `executeDiskCaptureWithAdaptiveRetry` tags same-worker retries after a transient tab death. `executeRenderJob` counts them into `RenderCaptureObservability.transientRetries` on successful renders — answers "are we burning the retry budget but recovering?". - **OOM classification** — the top-level catch sets `RenderCaptureObservability.memoryExhaustionDetected` whenever `describeMemoryExhaustion` classifies the failure — answers "is OOM the dominant failure tail?". Both surface via `renderObservabilityTelemetryPayload` → `capture_transient_retries` and `capture_memory_exhaustion_detected` PostHog props on the render events. Tests: producer asserts the transient retry attempt is tagged `transient-retry`; new CLI test locks the payload mapping (present when set, undefined otherwise). Scope: the encoder-frame-0-exit signal (also log-only) and the P1-3 pre-flight- rejection / P1-4 cli_env_check counters remain a further follow-up — they live in different subsystems (streaming stage, CLI render/doctor commands). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses Via's should-fix on #1850: the transient-retry counter previously fired only on the success path ("burned budget and recovered"). Mirror it into the failure path so "burned budget and STILL failed" is also emitted — the more actionable signal for tuning MAX_TRANSIENT_CAPTURE_RETRIES. Note: `captureAttempts` was declared `const` INSIDE the try, so it was NOT in scope in the catch (contrary to the review note). Hoisted it to function scope and extracted a shared `recordTransientRetryObservability` helper called from both the success path and the catch, so the two sites can't drift. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…timeouts (#1855) The CLI Test job (bun run --filter '!@hyperframes/producer' test) intermittently failed unrelated PRs (#1843, #1850) with `Test timed out in 5000ms` / `Hook timed out in 10000ms`. Root cause: multiple CLI tests cold-import a heavy command module graph via dynamic import() (render.js, auth/status.js, telemetry/system.js), which under the full parallel monorepo run contends for CPU and blows vitest's 5s/10s defaults on constrained runners. Not a product bug. Fix at the right altitude: set testTimeout 20s + hookTimeout 30s once in packages/cli/vitest.config.ts instead of per-test/per-hook bandaids, and remove the now-redundant explicit 30s beforeAll timeouts added in #1843. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vanceingalls reviewed Jul 2, 2026

View reviewed changes

miguel-heygen approved these changes Jul 2, 2026

View reviewed changes

jrusso1020 and others added 2 commits July 1, 2026 20:34

jrusso1020 force-pushed the fix/render-reliability-telemetry branch from d120568 to 65655ad Compare July 2, 2026 03:35

jrusso1020 merged commit 733f88c into main Jul 2, 2026
48 checks passed

jrusso1020 deleted the fix/render-reliability-telemetry branch July 2, 2026 03:57

This was referenced Jul 2, 2026

test(cli): de-flake cold-import tests under CI contention via vitest timeouts #1855

Merged

feat(cli): emit render_preflight_rejected telemetry for P1-3 pre-flight saves #1856

Merged

miguel-heygen mentioned this pull request Jul 2, 2026

chore: release v0.7.26 #1863

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(producer,cli): render-reliability telemetry counters for capture hardening#1850

feat(producer,cli): render-reliability telemetry counters for capture hardening#1850
jrusso1020 merged 2 commits into
mainfrom
fix/render-reliability-telemetry

jrusso1020 commented Jul 2, 2026

Uh oh!

vanceingalls left a comment

Uh oh!

jrusso1020 commented Jul 2, 2026

Uh oh!

miguel-heygen left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

jrusso1020 commented Jul 2, 2026

What & why

Changes

Tests

Scope / follow-ups

Uh oh!

vanceingalls left a comment

Choose a reason for hiding this comment

🟢 LGTM — closes the #1842 telemetry-gap heads-up

capture_transient_retries — placement ✅

capture_memory_exhaustion_detected — placement ✅

🟠 should-fix (non-blocking) — failure-path coverage on transient-retry burn

CI

🟠 pr-envelope (fyi, non-blocking)

Uh oh!

jrusso1020 commented Jul 2, 2026

Uh oh!

miguel-heygen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants