Summary
When an OpenAI audio model (gpt-4o-audio-preview, gpt-4o-mini-audio-preview) returns an audio response, the choices[].message.audio object is silently dropped. ChatMessage has no audio field, and StreamDelta has no audio field, so both the audio data and transcript are lost from logged span output. Users cannot see audio transcripts or distinguish audio responses from empty responses in Braintrust traces.
What is missing
OpenAI Chat Completions with modalities: ["text", "audio"] returns an audio object on the assistant message:
Non-streaming response:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "audio": {
        "id": "audio_abc123",
        "data": "<base64-encoded-audio>",
        "expires_at": 1729268602,
        "transcript": "Sure, here is a poem about the ocean..."
      }
    }
  }]
}
Streaming delta:
{"choices": [{"delta": {"audio": {"id": "audio_abc123"}}}]}
{"choices": [{"delta": {"audio": {"data": "base64chunk1..."}}}]}
{"choices": [{"delta": {"audio": {"transcript": "Sure, here"}}}]}
{"choices": [{"delta": {"audio": {"transcript": " is a poem"}}}]}
{"choices": [{"delta": {"audio": {"expires_at": 1729268602}}}]}
Currently in the SDK:
- ChatMessage (src/stream.rs) has only role, content, and tool_calls; there is no audio field. Serde ignores unknown fields by default, so the audio object is silently dropped whenever a non-streaming response is deserialized into ChatMessage or the output type is constructed.
- StreamDelta (src/stream.rs) has only role, content, and tool_calls; there is no audio field. Incremental delta.audio chunks are never accumulated.
- aggregate() builds ChatMessage from accumulated content and tool_calls only. Even if StreamDelta were extended with an audio field, there is no accumulation logic for the transcript string or the chunked base64 data.
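The missing accumulation step could look roughly like the sketch below. It is a minimal, std-only illustration, not the SDK's code: `AudioDelta`, `AudioOutput`, and `aggregate_audio` are hypothetical names, and in the real SDK the delta type would presumably derive serde's `Deserialize` and hang off `StreamDelta` as an optional field. The only behavior assumed from the OpenAI examples above is that `transcript` and `data` arrive as fragments to concatenate, while `id` and `expires_at` arrive once.

```rust
/// Hypothetical mirror of one `delta.audio` chunk. Every field is optional
/// because each streamed chunk carries only a subset of them.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AudioDelta {
    pub id: Option<String>,
    pub data: Option<String>,       // base64 fragment
    pub transcript: Option<String>, // transcript fragment
    pub expires_at: Option<i64>,
}

/// Hypothetical aggregated audio object, shaped like `message.audio`.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AudioOutput {
    pub id: Option<String>,
    pub data: String,
    pub transcript: String,
    pub expires_at: Option<i64>,
}

impl AudioOutput {
    /// Fold one streamed chunk into the running aggregate.
    pub fn push(&mut self, delta: &AudioDelta) {
        if let Some(id) = &delta.id {
            self.id = Some(id.clone());
        }
        if let Some(chunk) = &delta.data {
            self.data.push_str(chunk); // fragments are concatenated in arrival order
        }
        if let Some(part) = &delta.transcript {
            self.transcript.push_str(part);
        }
        if let Some(ts) = delta.expires_at {
            self.expires_at = Some(ts);
        }
    }
}

/// Aggregate the `audio` field of each delta (None when a delta has no audio).
/// Returns None when no delta carried audio at all, so text-only streams
/// are unaffected.
pub fn aggregate_audio(deltas: &[Option<AudioDelta>]) -> Option<AudioOutput> {
    let mut out: Option<AudioOutput> = None;
    for delta in deltas.iter().flatten() {
        out.get_or_insert_with(AudioOutput::default).push(delta);
    }
    out
}
```

Returning `Option<AudioOutput>` keeps the aggregated message free of an empty audio object when the stream contained only text, which preserves the ability to tell audio responses from non-audio ones.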
This means for audio model responses:
- The transcript (the only human-readable content in an audio-only response) is lost from the span output
- The audio id and expires_at are lost
- The encoded audio data is dropped (expected, since it's large binary data)
- Spans for audio model calls appear as empty output with no indication of content
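One way to keep the useful fields while still leaving the payload out of the span is sketched below. This is an assumption about a possible design, not existing SDK behavior: `MessageAudio`, `LoggedAudio`, and `display_text` are hypothetical names, and recording the base64 length in place of the payload is just one option for signalling that audio was present.

```rust
/// Hypothetical in-memory form of `choices[].message.audio`.
#[derive(Debug, Clone, PartialEq)]
pub struct MessageAudio {
    pub id: String,
    pub data: String, // base64 payload, potentially very large
    pub expires_at: i64,
    pub transcript: String,
}

/// Hypothetical shape written to span output: everything except the payload,
/// plus the payload's base64 length so the span shows audio was returned.
#[derive(Debug, Clone, PartialEq)]
pub struct LoggedAudio {
    pub id: String,
    pub expires_at: i64,
    pub transcript: String,
    pub data_base64_len: usize,
}

impl From<&MessageAudio> for LoggedAudio {
    fn from(audio: &MessageAudio) -> Self {
        LoggedAudio {
            id: audio.id.clone(),
            expires_at: audio.expires_at,
            transcript: audio.transcript.clone(),
            data_base64_len: audio.data.len(),
        }
    }
}

/// Text to show for a message: prefer `content`, and fall back to the audio
/// transcript when `content` is null (the audio-only case).
pub fn display_text<'a>(
    content: Option<&'a str>,
    audio: Option<&'a LoggedAudio>,
) -> Option<&'a str> {
    content.or_else(|| audio.map(|a| a.transcript.as_str()))
}
```

With a shape like this, an audio-only response would no longer be indistinguishable from an empty one: the span carries the transcript, the id, the expiry, and a size hint for the omitted data.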
Braintrust docs status
unclear. Braintrust's OpenAI integration page documents Chat Completions tracing, including streaming, but does not explicitly mention support for audio output models. The page describes wrapOpenAI automatic tracing for chat completions; audio output is not mentioned. The Braintrust proxy supports routing audio model calls (the proxy accepts any valid OpenAI API call), but SDK-level instrumentation of the audio response field is not documented.
Upstream sources
- OpenAI audio output guide: https://platform.openai.com/docs/guides/audio (documents modalities: ["text", "audio"], the audio field on assistant messages, and streaming audio chunks)
- OpenAI Chat Completions message object reference: https://platform.openai.com/docs/api-reference/chat/object (choices[].message.audio field with id, data, expires_at, transcript)
- OpenAI audio models: gpt-4o-audio-preview, gpt-4o-mini-audio-preview (publicly documented and available through the API; the -preview suffix indicates they are preview models rather than GA)
- OpenAI Python SDK: the ChatCompletionAudio type defines id, data, expires_at, transcript
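Relationship to existing issues
- #50 ("[bot] Streaming aggregator and output types drop the structured output refusal field") is about the refusal string on the message delta. This issue is about the audio object: a different feature (multimodal audio generation) with a different field shape (nested object with chunked binary data and transcript).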
Local files inspected
- src/stream.rs:312-323: ChatMessage struct has role, content, tool_calls; no audio field
- src/stream.rs:697-705: StreamDelta struct has role, content, tool_calls; no audio field
- src/stream.rs:840-1009: aggregate() only accumulates delta.content (string) and delta.tool_calls; no audio transcript accumulation
- src/stream.rs:1116-1140: value_has_content() checks choices array and usage for TTFT detection; audio-only responses (where content is null) would not trigger TTFT recording
- Full codebase grep for audio, transcript, modality, gpt-4o-audio: zero results
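The TTFT gap noted for value_has_content() could be closed with a check along the lines of the sketch below. It is a simplified stand-in, not the function's real logic: `DeltaView`, `delta_has_output`, and `first_output_index` are hypothetical names, the real function works on the response value and also looks at usage, and treating the id-only first chunk as "no output yet" is an assumption.

```rust
/// Hypothetical, simplified view of one streaming delta.
#[derive(Debug, Default)]
pub struct DeltaView<'a> {
    pub content: Option<&'a str>,
    pub audio_transcript: Option<&'a str>,
    pub audio_data: Option<&'a str>,
}

/// True when the delta carries any model output, text or audio.
/// Empty strings do not count, so a placeholder chunk does not stop the clock.
pub fn delta_has_output(delta: &DeltaView<'_>) -> bool {
    fn non_empty(s: Option<&str>) -> bool {
        s.map_or(false, |v| !v.is_empty())
    }
    non_empty(delta.content)
        || non_empty(delta.audio_transcript)
        || non_empty(delta.audio_data)
}

/// Index of the first chunk that should trigger TTFT recording, if any.
pub fn first_output_index(deltas: &[DeltaView<'_>]) -> Option<usize> {
    deltas.iter().position(|d| delta_has_output(d))
}
```

Whether the first audio data chunk or the first transcript chunk is the better TTFT marker is a design choice; this sketch accepts whichever arrives first.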