[bot] ChatMessage and StreamDelta drop OpenAI audio output (gpt-4o-audio models) #61

## Summary

When an OpenAI audio model (`gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview`) returns an audio response, the `choices[].message.audio` object is silently dropped. Neither `ChatMessage` nor `StreamDelta` has an `audio` field, so both the audio data and the transcript are lost from the logged span output. In Braintrust traces, users cannot see audio transcripts or tell an audio response apart from an empty one.

## What is missing

OpenAI Chat Completions with `modalities: ["text", "audio"]` returns an `audio` object on the assistant message:

**Non-streaming response:**
```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "audio": {
        "id": "audio_abc123",
        "data": "<base64-encoded-audio>",
        "expires_at": 1729268602,
        "transcript": "Sure, here is a poem about the ocean..."
      }
    }
  }]
}
```

**Streaming delta:**
```json
{"choices": [{"delta": {"audio": {"id": "audio_abc123"}}}]}
{"choices": [{"delta": {"audio": {"data": "base64chunk1..."}}}]}
{"choices": [{"delta": {"audio": {"transcript": "Sure, here"}}}]}
{"choices": [{"delta": {"audio": {"transcript": " is a poem"}}}]}
{"choices": [{"delta": {"audio": {"expires_at": 1729268602}}}]}
```

Currently in the SDK:

1. **`ChatMessage`** (`src/stream.rs`) has only `role`, `content`, and `tool_calls`, with no `audio` field. Serde ignores unknown fields by default during deserialization, so the `audio` object is silently dropped whenever a non-streaming response is deserialized into `ChatMessage`, whether for manual aggregation or when the output type is constructed.

2. **`StreamDelta`** (`src/stream.rs`) has only `role`, `content`, and `tool_calls` — no `audio` field. Incremental `delta.audio` chunks are never accumulated.

3. **`aggregate()`** builds `ChatMessage` from accumulated `content` only. Even if `StreamDelta` were extended with an `audio` field, there is no accumulation logic for the `transcript` string or the chunked base64 `data`.
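The missing accumulation step could look roughly like the sketch below. It is an illustration under stated assumptions, not the SDK's code: `AudioDelta`, `AudioOutput`, `apply`, and `replay_example` are hypothetical names, the types use only the standard library (a real version would presumably derive Serde traits), and the choice to record only the length of the base64 `data` rather than the payload itself is one possible design.

```rust
/// Hypothetical mirror of one `delta.audio` chunk. Every field is optional
/// because each streamed chunk carries only part of the object.
#[derive(Debug, Default, Clone)]
pub struct AudioDelta {
    pub id: Option<String>,
    pub data: Option<String>,       // base64 fragment
    pub transcript: Option<String>, // transcript fragment
    pub expires_at: Option<i64>,
}

/// Hypothetical aggregated form of `message.audio`.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct AudioOutput {
    pub id: Option<String>,
    pub expires_at: Option<i64>,
    pub transcript: String,
    /// Total length of the base64 fragments seen; the payload is not kept.
    pub data_len: usize,
}

impl AudioOutput {
    /// Merge one streamed chunk into the running aggregate.
    pub fn apply(&mut self, delta: &AudioDelta) {
        if let Some(id) = &delta.id {
            self.id = Some(id.clone());
        }
        if let Some(ts) = delta.expires_at {
            self.expires_at = Some(ts);
        }
        if let Some(fragment) = &delta.transcript {
            // Transcript fragments arrive in order and are concatenated.
            self.transcript.push_str(fragment);
        }
        if let Some(chunk) = &delta.data {
            self.data_len += chunk.len();
        }
    }
}

/// Replays the five example streaming chunks shown earlier in this issue.
pub fn replay_example() -> AudioOutput {
    let chunks = vec![
        AudioDelta { id: Some("audio_abc123".into()), ..Default::default() },
        AudioDelta { data: Some("base64chunk1...".into()), ..Default::default() },
        AudioDelta { transcript: Some("Sure, here".into()), ..Default::default() },
        AudioDelta { transcript: Some(" is a poem".into()), ..Default::default() },
        AudioDelta { expires_at: Some(1729268602), ..Default::default() },
    ];
    let mut out = AudioOutput::default();
    for chunk in &chunks {
        out.apply(chunk);
    }
    out
}
```

Keeping `transcript` as a plain `String` and the scalar fields as last-write-wins `Option`s matches how the example chunks arrive; whether the real `aggregate()` should also retain `data` is left open here.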

This means for audio model responses:
- The `transcript` (the only human-readable content in an audio-only response) is lost from the span output
- The audio `id` and `expires_at` are lost
- The encoded audio `data` is dropped (expected, since it is a large base64-encoded payload)
- Spans for audio model calls appear as empty output with no indication of content
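One way to address the points above for non-streaming responses is to log a trimmed copy of the audio object and use the transcript when `content` is null. The sketch below assumes hypothetical names throughout (`AudioPayload`, `LoggedAudio`, `OutputKind`, `classify`); none of them exist in the SDK, and the field types are a guess based on the JSON example in this issue.

```rust
/// Hypothetical full `message.audio` object as returned by the API.
#[derive(Debug, Clone)]
pub struct AudioPayload {
    pub id: String,
    pub data: String, // base64-encoded audio, potentially very large
    pub expires_at: i64,
    pub transcript: String,
}

/// Hypothetical logged form: everything except the base64 `data`.
#[derive(Debug, Clone, PartialEq)]
pub struct LoggedAudio {
    pub id: String,
    pub expires_at: i64,
    pub transcript: String,
}

impl From<AudioPayload> for LoggedAudio {
    fn from(a: AudioPayload) -> Self {
        // `a.data` is intentionally discarded here.
        LoggedAudio { id: a.id, expires_at: a.expires_at, transcript: a.transcript }
    }
}

/// What a span could show as the human-readable output of a message.
#[derive(Debug, PartialEq)]
pub enum OutputKind<'a> {
    Text(&'a str),
    AudioTranscript(&'a str),
    Empty,
}

/// Prefer non-empty text content, fall back to the audio transcript,
/// and report `Empty` only when neither is present.
pub fn classify<'a>(
    content: Option<&'a str>,
    audio: Option<&'a LoggedAudio>,
) -> OutputKind<'a> {
    match (content, audio) {
        (Some(text), _) if !text.is_empty() => OutputKind::Text(text),
        (_, Some(a)) => OutputKind::AudioTranscript(a.transcript.as_str()),
        _ => OutputKind::Empty,
    }
}
```

With this shape, an audio-only response is classified as `AudioTranscript` rather than `Empty`, which is the distinction the span output currently cannot make.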

## Braintrust docs status

**unclear** — Braintrust's OpenAI integration page documents Chat Completions tracing including streaming, but does not explicitly mention audio output model support. The Braintrust proxy supports routing audio model calls (the proxy accepts any valid OpenAI API call), but SDK-level instrumentation of the `audio` response field is not documented.

- https://www.braintrust.dev/docs/integrations/ai-providers/openai — documents `wrapOpenAI` automatic tracing for chat completions; audio output not mentioned
- https://www.braintrust.dev/docs/instrument/trace-llm-calls — lists ChatCompletion support; no audio-specific entry

## Upstream sources

- OpenAI audio output guide: https://platform.openai.com/docs/guides/audio — documents `modalities: ["text", "audio"]`, the `audio` field on assistant messages, and streaming audio chunks
- OpenAI Chat Completions message object reference: https://platform.openai.com/docs/api-reference/chat/object — `choices[].message.audio` field with `id`, `data`, `expires_at`, `transcript`
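- OpenAI audio models: `gpt-4o-audio-preview`, `gpt-4o-mini-audio-preview` (publicly documented and available through the Chat Completions API, although the model names still carry the `-preview` suffix)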
- OpenAI Python SDK `ChatCompletionAudio` type defines `id`, `data`, `expires_at`, `transcript`

## Relationship to existing issues

- **Distinct from #50** (structured output `refusal` field): #50 is about the `refusal` string on the message delta. This issue is about the `audio` object — a different feature (multimodal audio generation) with a different field shape (nested object with chunked binary data and transcript).
- **Distinct from #31** (OpenAI Responses API): #31 covers the entirely different Responses API streaming event format. This issue is about Chat Completions (the API already partially supported) with audio modality enabled.
- **Distinct from #48** (multi-choice collapse): #48 is about choice index handling. This is about a missing field within a single choice's message.

## Local files inspected

- `src/stream.rs:312-323` — `ChatMessage` struct has `role`, `content`, `tool_calls`; no `audio` field
- `src/stream.rs:697-705` — `StreamDelta` struct has `role`, `content`, `tool_calls`; no `audio` field
- `src/stream.rs:840-1009` — `aggregate()` only accumulates `delta.content` (string) and `delta.tool_calls`; no audio transcript accumulation
- `src/stream.rs:1116-1140` — `value_has_content()` checks `choices` array and `usage` for TTFT detection; audio-only responses (where `content` is null) would not trigger TTFT recording
- Full codebase grep for `audio`, `transcript`, `modality`, `gpt-4o-audio` — zero results
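The TTFT observation above suggests that any content check would also need to look at `delta.audio`. The sketch below is a simplified illustration only: `DeltaView`, `delta_has_output`, and `first_output_index` are hypothetical, they do not mirror the real `value_has_content()` (whose body is not reproduced in this issue), and treating an `id`-only audio chunk as "no output yet" is an assumption.

```rust
/// Hypothetical, simplified view of one streamed `delta`; not the SDK's type.
#[derive(Debug, Default, Clone, Copy)]
pub struct DeltaView<'a> {
    pub content: Option<&'a str>,
    pub has_tool_calls: bool,
    pub audio_transcript: Option<&'a str>,
    pub audio_data: Option<&'a str>,
}

/// True when the optional string is present and not empty.
fn non_empty(s: Option<&str>) -> bool {
    s.map_or(false, |v| !v.is_empty())
}

/// Counts text, tool calls, audio transcript, or audio data as output.
pub fn delta_has_output(delta: &DeltaView<'_>) -> bool {
    non_empty(delta.content)
        || delta.has_tool_calls
        || non_empty(delta.audio_transcript)
        || non_empty(delta.audio_data)
}

/// Index of the first chunk that carries output, i.e. where a
/// time-to-first-token timestamp could be recorded.
pub fn first_output_index(deltas: &[DeltaView<'_>]) -> Option<usize> {
    deltas.iter().position(|d| delta_has_output(d))
}
```

For an audio-only stream shaped like the example in this issue, the first chunk (carrying only `id`) would be skipped and the first `data` or `transcript` chunk would mark first output.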
