[bot] BraintrustStream cannot aggregate OpenAI Responses API streaming events #66

@braintrust-bot

Summary

BraintrustStream and wrap_stream_with_span only handle OpenAI Chat Completions streaming chunks (choices[].delta). The OpenAI Responses API (GA March 2025) uses a completely different streaming event format: typed server-sent events such as response.output_text.delta, response.function_call_arguments.delta, and response.completed. All Responses API streaming events are silently discarded: each event parses as an empty StreamChunk (every field takes its default, None or []), so no output text, tool call arguments, model name, usage metrics, or TTFT are captured.

This is distinct from #44 (async-openai client wrapper), which covers adding a wrapper around the async-openai library. This issue is specifically about the BraintrustStream aggregation path, which is a provider-format-agnostic surface that users call directly.

What is missing

The OpenAI Responses API streaming emits typed events, each with a type field. Key examples:

// Text delta (not choices[].delta.content)
{"type": "response.output_text.delta", "item_id": "msg_abc", "output_index": 0, "content_index": 0, "delta": "Hello, how can I help?"}

// Tool call arguments delta (not choices[].delta.tool_calls)
{"type": "response.function_call_arguments.delta", "item_id": "fc_xyz", "output_index": 1, "call_id": "call_001", "delta": "{\"location\": \"NYC\"}"}

// Reasoning summary delta (for o1/o3/o4 models)
{"type": "response.reasoning_summary_text.delta", "item_id": "rs_abc", "output_index": 0, "summary_index": 0, "delta": "Let me think through this..."}

// Final event — usage is nested under "response", not at root
{
  "type": "response.completed",
  "response": {
    "id": "resp_001",
    "model": "gpt-4o-2024-11-20",
    "output": [{"type": "message", "role": "assistant", "content": [{"type": "output_text", "text": "Hello, how can I help?"}]}],
    "usage": {
      "input_tokens": 50,
      "output_tokens": 25,
      "total_tokens": 75,
      "output_tokens_details": {"reasoning_tokens": 0}
    }
  }
}

Key structural differences from Chat Completions streaming:

  1. No choices field: Text content is at delta (a string), not choices[].delta.content
  2. No root-level model: The model name is only present in the final response.completed event, nested under response.model
  3. No root-level usage: Token counts are in response.completed.response.usage, not at the top-level usage key
  4. Typed events: Each event has a type discriminant; content and tool calls are separate event types
  5. New tool types: Built-in tools (response.file_search_call_*, response.code_interpreter_call_*, response.mcp_call_*) have no equivalent in Chat Completions streaming
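
The differences above suggest aggregating by event type rather than by choices[]. The following is a minimal sketch of that idea, not the SDK's implementation: ResponsesEvent and ResponsesAggregate are hypothetical names, the events are assumed to be already decoded from JSON, and only the event types quoted in this issue are covered.

```rust
use std::time::Instant;

/// Hypothetical, already-decoded subset of Responses API streaming events.
/// These names are illustrative and do not exist in the SDK.
#[derive(Debug, Clone, PartialEq)]
enum ResponsesEvent {
    /// `response.output_text.delta`
    OutputTextDelta { delta: String },
    /// `response.function_call_arguments.delta`
    FunctionCallArgumentsDelta { item_id: String, delta: String },
    /// `response.reasoning_summary_text.delta`
    ReasoningSummaryDelta { delta: String },
    /// `response.completed`, carrying the nested model and usage
    Completed { model: String, input_tokens: u64, output_tokens: u64 },
    /// Any event type this sketch does not handle
    Other,
}

/// Hypothetical accumulator for one streamed response.
#[derive(Debug, Default)]
struct ResponsesAggregate {
    text: String,
    reasoning_summary: String,
    /// (item_id, accumulated JSON argument string), in first-seen order
    tool_calls: Vec<(String, String)>,
    model: Option<String>,
    input_tokens: Option<u64>,
    output_tokens: Option<u64>,
    /// Set when the first non-empty delta arrives (the basis for TTFT)
    first_content_at: Option<Instant>,
}

impl ResponsesAggregate {
    fn push(&mut self, event: ResponsesEvent) {
        match event {
            ResponsesEvent::OutputTextDelta { delta } => {
                self.mark_content(&delta);
                self.text.push_str(&delta);
            }
            ResponsesEvent::FunctionCallArgumentsDelta { item_id, delta } => {
                self.mark_content(&delta);
                // Append to the matching tool call, or start a new one.
                match self.tool_calls.iter_mut().find(|(id, _)| *id == item_id) {
                    Some((_, args)) => args.push_str(&delta),
                    None => self.tool_calls.push((item_id, delta)),
                }
            }
            ResponsesEvent::ReasoningSummaryDelta { delta } => {
                self.mark_content(&delta);
                self.reasoning_summary.push_str(&delta);
            }
            ResponsesEvent::Completed { model, input_tokens, output_tokens } => {
                // Model and usage only arrive with the final event.
                self.model = Some(model);
                self.input_tokens = Some(input_tokens);
                self.output_tokens = Some(output_tokens);
            }
            ResponsesEvent::Other => {}
        }
    }

    fn mark_content(&mut self, delta: &str) {
        if !delta.is_empty() && self.first_content_at.is_none() {
            self.first_content_at = Some(Instant::now());
        }
    }
}
```

Feeding two OutputTextDelta events followed by a Completed event into push would leave text holding the concatenated deltas and model, input_tokens, and output_tokens populated from the final event.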

Failure mode in current SDK

StreamChunk (src/stream.rs:687-694) is defined with #[serde(default)] on every field:

struct StreamChunk {
    #[serde(default)]
    model: Option<String>,
    #[serde(default)]
    choices: Vec<StreamChoice>,
    #[serde(default)]
    usage: Option<StreamUsage>,
}

Because serde ignores unknown fields by default (no #[serde(deny_unknown_fields)]), serde_json::from_value on any Responses API event succeeds — but produces a StreamChunk with model: None, choices: [], and usage: None. The Err(_) => continue fallback at line 856 is never hit. Every chunk processes without error but all content and metrics are silently dropped:

  • Text output from response.output_text.delta events is lost (no choices field)
  • Tool call arguments from response.function_call_arguments.delta are lost
  • Reasoning summary from response.reasoning_summary_text.delta is lost
  • Model name from response.completed.response.model is lost (nested under response, not root)
  • Usage metrics (input_tokens, output_tokens, output_tokens_details.reasoning_tokens) from response.completed.response.usage are never extracted (nested under response, not at root usage key)
  • Finish reason is not captured
  • TTFT metric is not recorded (value_has_content() at src/stream.rs:1117-1119 checks for non-empty choices, always empty for Responses API events)

Braintrust docs status

Unclear. Braintrust documents OpenAI instrumentation (wrapOpenAI in TypeScript, wrap_openai in Python) but does not explicitly mention Responses API support for the Rust SDK. The OpenAI integration page focuses on Chat Completions and does not address the Responses API. Rust is not listed as a supported language for automatic LLM call tracing on the Trace LLM calls page.

Local files inspected

  • src/stream.rs:687-694: StreamChunk struct has only model, choices, usage with #[serde(default)]; Responses API events have a type field and delta string, none of which match
  • src/stream.rs:840-857: aggregate() calls serde_json::from_value; Responses API events silently deserialize to empty StreamChunk objects without hitting the Err(_) => continue fallback
  • src/stream.rs:1117-1119: value_has_content() checks choices array; always empty for Responses API events, so TTFT is never recorded
  • src/extractors.rs: extract_openai_usage() calls value.get("usage") at line 5; response.completed wraps usage under response.usage not at root, so extraction would return UsageMetrics::default() even if the final event were parsed
  • src/lib.rs: public API exports; no Responses API references
  • Full codebase grep for response.output_text, response.completed, ResponseOutputText, output_index, summary_index: zero results
