[bot] BraintrustStream cannot aggregate OpenAI Responses API streaming events #66

## Summary

`BraintrustStream` and `wrap_stream_with_span` only handle OpenAI Chat Completions streaming chunks (`choices[].delta`). The OpenAI Responses API (GA March 2025) uses a completely different streaming event format: typed server-sent events such as `response.output_text.delta`, `response.function_call_arguments.delta`, and `response.completed`. All Responses API streaming events are silently discarded: each event parses as an empty `StreamChunk` (every field takes its default, `None` or `[]`), so no output text, tool call arguments, model name, usage metrics, or TTFT are captured.

This is distinct from #44 (async-openai client wrapper), which covers adding a wrapper around the `async-openai` library. This issue is specifically about the `BraintrustStream` aggregation path, which is a provider-format-agnostic surface that users call directly.

## What is missing

The OpenAI Responses API streaming emits typed events, each with a `type` field. Key examples:

```json
// Text delta (not choices[].delta.content)
{"type": "response.output_text.delta", "item_id": "msg_abc", "output_index": 0, "content_index": 0, "delta": "Hello, how can I help?"}

// Tool call arguments delta (not choices[].delta.tool_calls)
{"type": "response.function_call_arguments.delta", "item_id": "fc_xyz", "output_index": 1, "call_id": "call_001", "delta": "{\"location\": \"NYC\"}"}

// Reasoning summary delta (for reasoning models such as o3 and o4-mini)
{"type": "response.reasoning_summary_text.delta", "item_id": "rs_abc", "output_index": 0, "summary_index": 0, "delta": "Let me think through this..."}

// Final event — usage is nested under "response", not at root
{
  "type": "response.completed",
  "response": {
    "id": "resp_001",
    "model": "gpt-4o-2024-11-20",
    "output": [{"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Hello, how can I help?"}]}],
    "usage": {
      "input_tokens": 50,
      "output_tokens": 25,
      "total_tokens": 75,
      "output_tokens_details": {"reasoning_tokens": 0}
    }
  }
}
```

Key structural differences from Chat Completions streaming:

1. **No `choices` field**: Text content is at `delta` (a string), not `choices[].delta.content`
2. **No root-level `model`**: The model name is only present in the final `response.completed` event, nested under `response.model`
3. **No root-level `usage`**: Token counts are in `response.completed.response.usage`, not at the top-level `usage` key
4. **Typed events**: Each event has a `type` discriminant; content and tool calls are separate event types
5. **New tool types**: Built-in tools (`response.file_search_call_*`, `response.code_interpreter_call_*`, `response.mcp_call_*`) have no equivalent in Chat Completions streaming
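
The differences above suggest one possible shape for a Responses-aware aggregation path. The following is a minimal, std-only sketch, not the SDK's implementation: `ResponsesEvent`, `Aggregated`, and `aggregate_responses_events` are hypothetical names, and the events are assumed to have been decoded from JSON already. Keying tool call arguments by `item_id` is also an assumption made for illustration.

```rust
/// Hypothetical typed view of the Responses API events discussed above.
/// These names are illustrative and are not part of the SDK.
#[derive(Debug, Clone, PartialEq)]
enum ResponsesEvent {
    OutputTextDelta { delta: String },
    FunctionCallArgumentsDelta { item_id: String, delta: String },
    ReasoningSummaryTextDelta { delta: String },
    Completed { model: String, input_tokens: u64, output_tokens: u64 },
    /// Any event type this sketch does not model.
    Other,
}

/// Hypothetical aggregation result for one streamed response.
#[derive(Debug, Default, PartialEq)]
struct Aggregated {
    text: String,
    reasoning_summary: String,
    /// (item_id, concatenated JSON argument string), in first-seen order.
    tool_calls: Vec<(String, String)>,
    model: Option<String>,
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
}

/// Folds a sequence of events into a single `Aggregated` value.
fn aggregate_responses_events(events: &[ResponsesEvent]) -> Aggregated {
    let mut out = Aggregated::default();
    for event in events {
        match event {
            ResponsesEvent::OutputTextDelta { delta } => out.text.push_str(delta),
            ResponsesEvent::ReasoningSummaryTextDelta { delta } => {
                out.reasoning_summary.push_str(delta)
            }
            ResponsesEvent::FunctionCallArgumentsDelta { item_id, delta } => {
                // Append to an existing call with the same item_id, or start a new one.
                let pos = out.tool_calls.iter().position(|(id, _)| id == item_id);
                match pos {
                    Some(i) => out.tool_calls[i].1.push_str(delta),
                    None => out.tool_calls.push((item_id.clone(), delta.clone())),
                }
            }
            ResponsesEvent::Completed { model, input_tokens, output_tokens } => {
                // Model and usage only arrive with the final event.
                out.model = Some(model.clone());
                out.input_tokens = Some(*input_tokens);
                out.output_tokens = Some(*output_tokens);
            }
            ResponsesEvent::Other => {}
        }
    }
    out
}
```

Because the model name and usage only arrive with the final event, `model`, `input_tokens`, and `output_tokens` stay `None` in this sketch if the stream ends before `response.completed`.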

### Failure mode in current SDK

`StreamChunk` (`src/stream.rs:687-694`) is defined with all `#[serde(default)]` fields:

```rust
struct StreamChunk {
    #[serde(default)]
    model: Option<String>,
    #[serde(default)]
    choices: Vec<StreamChoice>,
    #[serde(default)]
    usage: Option<StreamUsage>,
}
```

Because serde ignores unknown fields by default (no `#[serde(deny_unknown_fields)]`), `serde_json::from_value` on any Responses API event **succeeds** — but produces a `StreamChunk` with `model: None`, `choices: []`, and `usage: None`. The `Err(_) => continue` fallback at line 856 is never hit. Every chunk processes without error but all content and metrics are silently dropped:

- **Text output** from `response.output_text.delta` events is lost (no `choices` field)
- **Tool call arguments** from `response.function_call_arguments.delta` are lost
- **Reasoning summary** from `response.reasoning_summary_text.delta` is lost
- **Model name** from `response.completed.response.model` is lost (nested under `response`, not root)
- **Usage metrics** (`input_tokens`, `output_tokens`, `output_tokens_details.reasoning_tokens`) from `response.completed.response.usage` are never extracted (nested under `response`, not at root `usage` key)
- **Finish reason** is not captured
- **TTFT metric** is not recorded (`value_has_content()` at `src/stream.rs:1117-1119` checks for non-empty `choices`, always empty for Responses API events)
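
One way to avoid the silent-drop behaviour listed above would be to decide which wire format an event belongs to before deserializing it, and to use a format-specific content check for TTFT. The sketch below is a std-only illustration under that assumption. `StreamFormat`, `detect_format`, and `is_first_token_candidate` are hypothetical names, not SDK functions. The tool call event is spelled `response.function_call_arguments.delta` here, which is how it appears in the OpenAI event reference as far as I can tell.

```rust
/// Hypothetical wire formats a stream aggregator might need to distinguish.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StreamFormat {
    ChatCompletions,
    Responses,
    Unknown,
}

/// Guesses the format from the event's optional top-level `type` string and
/// from whether a top-level `choices` key is present.
fn detect_format(type_field: Option<&str>, has_choices: bool) -> StreamFormat {
    match type_field {
        // Responses API events carry a `type` such as "response.output_text.delta".
        Some(t) if t.starts_with("response.") => StreamFormat::Responses,
        // Chat Completions chunks have no such `type` but do have `choices`.
        _ if has_choices => StreamFormat::ChatCompletions,
        _ => StreamFormat::Unknown,
    }
}

/// Returns true when a Responses API event carries model output and could
/// therefore mark time-to-first-token in this sketch.
fn is_first_token_candidate(event_type: &str, delta: &str) -> bool {
    let is_delta_event = matches!(
        event_type,
        "response.output_text.delta"
            | "response.function_call_arguments.delta"
            | "response.reasoning_summary_text.delta"
    );
    is_delta_event && !delta.is_empty()
}
```

Returning `Unknown` rather than falling back to the Chat Completions path would let a caller log or count unrecognized events instead of aggregating an empty chunk.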

## Braintrust docs status

**unclear** — Braintrust documents OpenAI instrumentation (`wrapOpenAI` in TypeScript, `wrap_openai` in Python) but does not explicitly mention Responses API support for the Rust SDK. The [OpenAI integration page](https://www.braintrust.dev/docs/integrations/ai-providers/openai) focuses on Chat Completions and does not address the Responses API. Rust is not listed as a supported language for automatic LLM call tracing on the [Trace LLM calls page](https://www.braintrust.dev/docs/instrument/trace-llm-calls).

## Upstream sources

- OpenAI Responses API overview (streaming designed-in, typed events): https://developers.openai.com/api/docs/guides/streaming-responses
- OpenAI Responses API vs Chat Completions migration guide (output format, `choices` → `output` array): https://developers.openai.com/api/docs/guides/migrate-to-responses
- OpenAI Node SDK type definitions (streaming event types `ResponseOutputTextDeltaEvent`, `ResponseCompletedEvent`, `ResponseFunctionCallArgumentsDeltaEvent`): https://github.com/openai/openai-node/blob/master/src/resources/responses/responses.ts
- async-openai 0.35.0 changelog (Responses API streaming added): https://crates.io/crates/async-openai

## Relationship to existing issues

- **Distinct from #44** (async-openai client wrapper): #44 covers adding a `wrap_openai`-style wrapper around the `async-openai` library. This issue covers the `BraintrustStream` / `wrap_stream_with_span` streaming format aggregation gap, which affects any user calling the Responses API and feeding the resulting stream to `wrap_stream_with_span` directly — independent of client library.
- **Analogous to #62** (Anthropic SSE streaming), **#60** (Gemini streaming), **#64** (Bedrock ConverseStream streaming), and **#65** (Cohere v2 streaming): Those issues cover the same class of failure (a provider streaming format incompatible with `BraintrustStream`) for other providers. This issue is the same class of problem for the OpenAI Responses API, which comes from the same upstream vendor whose Chat Completions format is already supported.

## Local files inspected

- `src/stream.rs:687-694` — `StreamChunk` struct has only `model`, `choices`, `usage` with `#[serde(default)]`; Responses API events have a `type` field and `delta` string, none of which match
- `src/stream.rs:840-857` — `aggregate()` calls `serde_json::from_value`; Responses API events silently deserialize to empty `StreamChunk` objects without hitting the `Err(_) => continue` fallback
- `src/stream.rs:1117-1119` — `value_has_content()` checks `choices` array; always empty for Responses API events, so TTFT is never recorded
- `src/extractors.rs` — `extract_openai_usage()` calls `value.get("usage")` at line 5; `response.completed` wraps usage under `response.usage` not at root, so extraction would return `UsageMetrics::default()` even if the final event were parsed
- `src/lib.rs` — public API exports; no Responses API references
- Full codebase grep for `response.output_text`, `response.completed`, `ResponseOutputText`, `output_index`, `summary_index` — zero results
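
The usage lookup gap noted for `src/extractors.rs` can be illustrated with a small std-only sketch. This is not the SDK's extractor: `Json` is a tiny stand-in for a real JSON value type, and `find_usage` and `token_counts` are hypothetical helpers. The sketch assumes a fix would check the root `usage` key first and then fall back to `response.usage`, and that it would accept both the Chat Completions field names (`prompt_tokens`, `completion_tokens`) and the Responses API names (`input_tokens`, `output_tokens`).

```rust
/// Tiny stand-in for a JSON value, just enough for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Num(u64),
    Obj(Vec<(String, Json)>),
}

impl Json {
    /// Looks up a key on an object; returns None for non-objects.
    fn get(&self, key: &str) -> Option<&Json> {
        match self {
            Json::Obj(fields) => fields
                .iter()
                .find(|(k, _)| k.as_str() == key)
                .map(|(_, v)| v),
            Json::Num(_) => None,
        }
    }

    /// Returns the number if this value is numeric.
    fn as_u64(&self) -> Option<u64> {
        match self {
            Json::Num(n) => Some(*n),
            Json::Obj(_) => None,
        }
    }
}

/// Looks for `usage` at the root first (Chat Completions), then under
/// `response` (the `response.completed` event).
fn find_usage(event: &Json) -> Option<&Json> {
    event
        .get("usage")
        .or_else(|| event.get("response").and_then(|r| r.get("usage")))
}

/// Returns (prompt or input tokens, completion or output tokens).
fn token_counts(usage: &Json) -> (Option<u64>, Option<u64>) {
    let pick = |a: &str, b: &str| usage.get(a).or_else(|| usage.get(b)).and_then(Json::as_u64);
    (
        pick("prompt_tokens", "input_tokens"),
        pick("completion_tokens", "output_tokens"),
    )
}
```

With a `response.completed` event shaped like the example in this issue, a root-only lookup finds nothing, while `find_usage` reaches the nested object.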
