fix(observability): retry session recording upload #1381

Open
toubatbrian wants to merge 2 commits into main from claude/quirky-galileo-W2Ccn

@toubatbrian
Contributor

Summary

Port of livekit/agents#5627 (fix(observability): retry session recording upload) to agents-js.

When the LiveKit Cloud /observability/recordings/v0 endpoint rejects a session-recording upload with a retryable error, the server returns a google.rpc.Status body containing a RetryInfo detail with a retry_delay. Before this change, agents-js treated all non-2xx responses as fatal and dropped the recording. After this change, the client honors RetryInfo and retries the upload up to 3 times.

What changed

agents/src/telemetry/traces.ts, uploadSessionReport():

  • The header bytes, chat-history JSON buffer, and audio file bytes are now read once before the retry loop.
  • A buildFormData() factory creates a fresh FormData per attempt (the form-data package consumes its inputs on submit, so a new instance is required for each retry).
  • submitOnce() wraps formData.submit(...) in a promise that always reads the response body to completion and returns { statusCode, statusMessage, body }. The body is needed both for error reporting and for RetryInfo parsing.
  • A retry loop runs up to maxRetries = 3 (4 attempts total). On a non-2xx response it calls parseRetryDelayMs(body). If a delay is returned, it logs a warning and awaits a setTimeout of that many milliseconds before the next attempt. If no RetryInfo is present, or attempts are exhausted, it throws with the status and body for diagnostics.
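The control flow described in the list above can be sketched roughly as follows. This is an illustrative outline rather than the code in this PR: `uploadWithRetry`, `UploadResponse`, and the injected `submitOnce` and `parseRetryDelayMs` parameters are hypothetical shapes, chosen so the loop can run without a network connection or the form-data package.

```typescript
// Hypothetical shapes for illustration; the PR's actual signatures may differ.
interface UploadResponse {
  statusCode: number;
  statusMessage: string;
  body: Buffer;
}

type SubmitOnce = () => Promise<UploadResponse>;
type ParseRetryDelayMs = (body: Buffer) => number | null;

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Makes one initial attempt plus up to `maxRetries` retries.
async function uploadWithRetry(
  submitOnce: SubmitOnce,
  parseRetryDelayMs: ParseRetryDelayMs,
  maxRetries = 3,
): Promise<UploadResponse> {
  for (let attempt = 0; ; attempt++) {
    const res = await submitOnce();
    if (res.statusCode >= 200 && res.statusCode < 300) return res;

    // Only retry when the server supplied a delay and retries remain.
    const retryDelayMs = parseRetryDelayMs(res.body);
    if (retryDelayMs === null || attempt >= maxRetries) {
      throw new Error(
        `upload failed: ${res.statusCode} ${res.statusMessage}: ${res.body.toString('utf8')}`,
      );
    }
    console.warn(
      `upload rejected (${res.statusCode}), retrying in ${(retryDelayMs / 1000).toFixed(1)}s`,
    );
    await sleep(retryDelayMs);
  }
}
```

Passing the two functions in as parameters keeps the loop testable: a fake `submitOnce` that returns 503 twice and then 200 exercises the retry path without any real delay.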

Implementation nuances vs. Python

The Python PR uses google.rpc.error_details_pb2.RetryInfo and google.rpc.status_pb2.Status from googleapis-common-protos. There is no equivalent runtime dependency on the JS side, and pulling one in for a single ~30-byte message is overkill. Instead, this PR adds a tiny inline protobuf wire-format reader (~80 lines) that walks Status → details (Any) → RetryInfo → retry_delay (Duration) and returns the delay.
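For readers less familiar with the protobuf wire format, a reader of this kind might look roughly like the sketch below. It is not the PR's implementation: the helper names (`readVarint`, `readFields`, `ascii`) are made up for illustration. The field numbers come from the public proto definitions (Status.details = 3, Any.type_url = 1, Any.value = 2, RetryInfo.retry_delay = 1, Duration.seconds = 1, Duration.nanos = 2). The sketch assumes non-negative durations, treats a RetryInfo without a retry_delay as a zero delay, matches the type URL by its last path segment, and returns null for any body it cannot decode.

```typescript
// Illustrative sketch only; helper names are hypothetical, not the PR's code.
const RETRY_INFO_NAME = 'google.rpc.RetryInfo';

interface Field {
  fieldNumber: number;
  wireType: number;
  varint: number; // meaningful when wireType === 0
  bytes: Uint8Array; // meaningful when wireType === 2
}

// Decode a base-128 varint at `offset`; returns [value, nextOffset].
// Multiplication is used instead of bit shifts so values above 2^31 survive.
function readVarint(buf: Uint8Array, offset: number): [number, number] {
  let result = 0;
  let scale = 1;
  for (let pos = offset; pos < buf.length; pos++) {
    const byte = buf[pos]!;
    result += (byte & 0x7f) * scale;
    scale *= 128;
    if ((byte & 0x80) === 0) return [result, pos + 1];
  }
  throw new Error('truncated varint');
}

// Split one encoded message into its top-level fields.
function readFields(buf: Uint8Array): Field[] {
  const fields: Field[] = [];
  let offset = 0;
  while (offset < buf.length) {
    const [tag, afterTag] = readVarint(buf, offset);
    const field: Field = {
      fieldNumber: Math.floor(tag / 8),
      wireType: tag % 8,
      varint: 0,
      bytes: new Uint8Array(0),
    };
    offset = afterTag;
    if (field.wireType === 0) {
      [field.varint, offset] = readVarint(buf, offset);
    } else if (field.wireType === 2) {
      const [len, start] = readVarint(buf, offset);
      if (start + len > buf.length) throw new Error('truncated field');
      field.bytes = buf.subarray(start, start + len);
      offset = start + len;
    } else if (field.wireType === 1) {
      offset += 8; // fixed64
    } else if (field.wireType === 5) {
      offset += 4; // fixed32
    } else {
      throw new Error(`unsupported wire type ${field.wireType}`);
    }
    fields.push(field);
  }
  return fields;
}

function ascii(bytes: Uint8Array): string {
  let s = '';
  for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]!);
  return s;
}

// Walk Status -> details (Any) -> RetryInfo -> retry_delay (Duration).
function parseRetryDelayMs(body: Uint8Array): number | null {
  try {
    for (const detail of readFields(body)) {
      if (detail.fieldNumber !== 3 || detail.wireType !== 2) continue; // Status.details
      let typeUrl = '';
      let value: Uint8Array | null = null;
      for (const f of readFields(detail.bytes)) {
        if (f.wireType !== 2) continue;
        if (f.fieldNumber === 1) typeUrl = ascii(f.bytes); // Any.type_url
        if (f.fieldNumber === 2) value = f.bytes; // Any.value
      }
      const name = typeUrl.slice(typeUrl.lastIndexOf('/') + 1);
      if (value === null || name !== RETRY_INFO_NAME) continue;
      let seconds = 0;
      let nanos = 0;
      for (const r of readFields(value)) {
        if (r.fieldNumber !== 1 || r.wireType !== 2) continue; // RetryInfo.retry_delay
        for (const d of readFields(r.bytes)) {
          if (d.wireType !== 0) continue;
          if (d.fieldNumber === 1) seconds = d.varint; // Duration.seconds
          if (d.fieldNumber === 2) nanos = d.varint; // Duration.nanos
        }
      }
      // Assumption: a RetryInfo with no retry_delay yields a zero delay.
      return seconds * 1000 + Math.round(nanos / 1e6);
    }
    return null;
  } catch {
    return null; // Body was not a decodable Status message.
  }
}
```

Returning null on any decode error means an HTML or JSON error page is simply treated as "no retry guidance" instead of crashing the upload path.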

Other notable adaptations:

  • Time units. Per the JS convention in CLAUDE.md, the parser returns milliseconds (number) rather than the Python helper's seconds (float). The retry loop sleeps via setTimeout(..., retryDelayMs). The log message still shows seconds ((retryDelayMs / 1000).toFixed(1)) to match the Python log format.
  • Type-URL match. The detail is recognized by the canonical type URL type.googleapis.com/google.rpc.RetryInfo. The Python implementation calls Any.Unpack(retry_info), which checks the same type URL under the hood.
  • Single read of the response body. The previous implementation already drained the body for error diagnostics; the new version always concatenates to a Buffer so it can be passed to the parser unchanged.
  • Audio read failure. If fs.readFile(audioRecordingPath) throws, audioBytes stays as Buffer.alloc(0) (same as before) and the audio part is omitted from the form on every retry.
  • int32 nanos. The Duration parser reads nanos as a varint and casts to Number, matching the protobuf encoding for non-negative int32 values. RetryInfo durations are always non-negative in practice.

Files

  • agents/src/telemetry/traces.ts — refactor uploadSessionReport + add parseRetryDelayMs and helpers.
  • .changeset/retry-recording-upload.md — patch changeset for @livekit/agents.

Test plan

  • pnpm --filter @livekit/agents exec tsc --noEmit passes.
  • pnpm --filter @livekit/agents build passes.
  • pnpm exec vitest run src/telemetry/traces.test.ts passes (5/5).
  • pnpm exec prettier --check passes for changed files.
  • Manual verification: trigger a 429/503 with a RetryInfo detail in a staging environment and confirm the upload retries with the server-provided delay.

cc @toubatbrian @livekit/agent-devs


This PR was opened automatically by a Claude Code routine. See the source PR comment on livekit/agents#5627 for context.

https://claude.ai/code/session_01LdAzmRxahVaMMGXAgStdrT


Generated by Claude Code

Port livekit/agents#5627 to JS. The session recording upload now retries
up to 3 times when the server response includes a google.rpc.RetryInfo
detail, sleeping for the server-provided delay between attempts. The
multipart form is rebuilt on each retry; header / chat history / audio
bytes are read once and reused.

A small protobuf wire-format reader is added to parse RetryInfo without
pulling in googleapis-common-protos.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

@changeset-bot

changeset-bot Bot commented May 4, 2026

🦋 Changeset detected

Latest commit: cb6b677

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 28 packages
Name Type
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-assemblyai Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-cerebras Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-liveavatar Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-minimax Patch
@livekit/agents-plugin-mistral Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-phonic Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-runway Patch
@livekit/agents-plugin-sarvam Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugins-test Patch
@livekit/agents-plugin-trugen Patch
@livekit/agents-plugin-xai Patch

Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb3dd0c4ec

Comment thread agents/src/telemetry/traces.ts Outdated
offset += skipField(any, offset, wireType);
}
}
if (typeUrl !== RETRY_INFO_TYPE_URL || value === null) return null;

[P2] Match RetryInfo Any type by suffix, not exact URL

google.protobuf.Any type URLs are only required to end with the fully-qualified message name, so checking for exact equality with type.googleapis.com/google.rpc.RetryInfo can miss valid RetryInfo details that use a different prefix. In those responses, retries are skipped and uploads fail immediately despite the server explicitly returning retry guidance. Please match typeUrl by its last path segment (or suffix) to mirror Any.Unpack semantics.

Contributor Author

Good catch — fixed in cb6b677. The check now compares the segment after the last / in Any.type_url against the fully-qualified message name (google.rpc.RetryInfo), matching the Python Any.Unpack semantics.
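The comparison described in this reply could look something like the sketch below. The function name `isRetryInfoTypeUrl` and the constant are hypothetical, not taken from commit cb6b677. One assumption of the sketch is that a URL containing no "/" is compared as a whole string.

```typescript
// Hypothetical helper illustrating suffix matching on Any.type_url.
const RETRY_INFO_FULL_NAME = 'google.rpc.RetryInfo';

function isRetryInfoTypeUrl(typeUrl: string): boolean {
  // Take everything after the last "/"; with no "/" this is the whole string.
  const name = typeUrl.slice(typeUrl.lastIndexOf('/') + 1);
  return name === RETRY_INFO_FULL_NAME;
}
```

Comparing the full last segment, not just a string suffix, avoids accepting names that merely end in the same characters.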


Generated by Claude Code

Address review feedback on #1381: google.protobuf.Any.type_url only
requires the segment after the last "/" to equal the fully-qualified
message name, so an exact match against
"type.googleapis.com/google.rpc.RetryInfo" would skip retries when the
server uses a different prefix. Match by suffix to mirror Python's
Any.Unpack semantics.