
feat(llm): add real-time token streaming through provider chain #2872

Open
rizgan wants to merge 7 commits into nearai:staging from rizgan:feat/llm-provider-streaming

Conversation


@rizgan rizgan commented Apr 22, 2026

Add streaming capability to the LLM provider system so clients receive tokens progressively rather than waiting for the full response.

New provider

OpenAiCompatStreamingProvider (src/llm/openai_compat_stream.rs):

  • Wraps any OpenAI-compatible backend (OpenRouter, Groq, etc.)
  • Sends POST with stream: true and stream_options.include_usage: true (see the request-body sketch after this list)
  • Parses SSE `data:` lines, accumulating delta content and tool_calls
  • Implements both complete_stream and complete_with_tools_stream
  • Falls back gracefully to the inner provider for non-streaming paths
  • Used by the registry factory for all openai_compat backends
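For orientation, a minimal sketch of the request body those two flags produce (the serde_json::json! construction is illustrative; `model` and `messages` stand in for values the real provider resolves, and the actual code in openai_compat_stream.rs may assemble the body differently):

```rust
// Illustrative request body for an OpenAI-compatible streaming call.
let body = serde_json::json!({
    "model": model,        // e.g. "openai/gpt-4o-mini" via OpenRouter
    "messages": messages,  // OpenAI-format chat messages
    "stream": true,        // ask the backend for SSE token deltas
    // Ask the backend to report token usage in the final stream chunk.
    "stream_options": { "include_usage": true },
});
```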

Trait changes (src/llm/provider.rs)

Two new default methods on LlmProvider:

  • complete_stream(request, on_chunk) -- default: single-chunk fallback
  • complete_with_tools_stream(request, on_chunk) -- same fallback

Default impls preserve backward compatibility for all existing providers.
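A rough sketch of what the defaults amount to (signatures simplified from the snippets quoted in review below; the tools variant follows the same pattern):

```rust
// Sketch only: simplified excerpt, not the full trait in src/llm/provider.rs.
#[async_trait::async_trait]
pub trait LlmProvider: Send + Sync {
    async fn complete(
        &self,
        request: CompletionRequest,
    ) -> Result<CompletionResponse, LlmError>;

    /// Default: run the non-streaming call, then deliver the whole
    /// response to the callback as a single chunk.
    async fn complete_stream(
        &self,
        request: CompletionRequest,
        on_chunk: &mut (dyn FnMut(String) + Send),
    ) -> Result<CompletionResponse, LlmError> {
        let resp = self.complete(request).await?;
        on_chunk(resp.content.clone());
        Ok(resp)
    }
}
```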

Provider chain delegation

All wrapper providers now forward streaming calls instead of falling back to the non-streaming default (a delegation sketch follows the list):

  • RetryProvider -- delegates directly (streams are not retried)
  • SmartRoutingProvider -- delegates to primary
  • FailoverProvider -- delegates to last-used provider
  • CircuitBreakerProvider -- with full check_allowed/record_success/failure
  • CachedProvider -- bypasses cache (streaming takes priority)
  • RecordingLlm -- bypasses recording (chunks are not replayable)
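A delegation override in these wrappers is essentially a forwarded call; here is a sketch, using a hypothetical `inner` field (the real wrappers differ in which provider they select, as listed above):

```rust
// Sketch: forward the streaming call instead of using the single-chunk default.
#[async_trait::async_trait]
impl LlmProvider for RetryProvider {
    async fn complete_stream(
        &self,
        request: CompletionRequest,
        on_chunk: &mut (dyn FnMut(String) + Send),
    ) -> Result<CompletionResponse, LlmError> {
        // Streams are not retried: a partially delivered stream cannot be
        // replayed, so delegate directly to the wrapped provider.
        self.inner.complete_stream(request, on_chunk).await
    }
}
```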

Wiring (src/llm/reasoning.rs, src/agent/dispatcher.rs)

ReasoningContext gains an optional chunk_sender field. ChatDelegate::call_llm() spawns a forwarder task that reads from the channel and calls channels.send_status(StreamChunk), which the web gateway broadcasts as SSE stream_chunk events to connected clients.

Summary

Change Type

  • Bug fix
  • New feature
  • Refactor
  • Documentation
  • CI/Infrastructure
  • Security
  • Dependencies

Linked Issue

Validation

  • cargo fmt --all -- --check
  • cargo clippy --all --benches --tests --examples --all-features -- -D warnings
  • cargo build
  • Relevant tests pass:
  • cargo test --features integration if database-backed or integration behavior changed
  • Manual testing:
  • If a coding agent was used and supports it, review-pr or pr-shepherd --fix was run before requesting review

Security Impact

Database Impact

Blast Radius

Rollback Plan

Review Follow-Through


Copilot AI review requested due to automatic review settings April 22, 2026 23:01
@github-actions github-actions Bot added labels on Apr 22, 2026: scope: agent (Agent core: agent loop, router, scheduler), scope: llm (LLM integration), size: XL (500+ changed lines), risk: medium (Business logic, config, or moderate-risk modules), contributor: new (First-time contributor)
Author

rizgan commented Apr 22, 2026

This PR fixes a silent streaming regression where all LLM responses were delivered
as a single chunk instead of token-by-token, even when the provider supports SSE streaming.

Root cause: The provider chain wrappers (RetryProvider → SmartRoutingProvider →
FailoverProvider → CircuitBreakerProvider → CachedProvider → RecordingLlm) did not
override complete_stream / complete_with_tools_stream, so every call fell through
to the default single-chunk fallback in the trait.

What this PR does:

  • Adds complete_stream and complete_with_tools_stream default methods to LlmProvider
    (backward-compatible fallback: delivers the full response as one chunk)
  • Adds OpenAiCompatStreamingProvider — a streaming-capable wrapper for OpenAI-compatible
    backends (OpenRouter, Groq, etc.) that sends stream: true and parses SSE deltas
  • Wires all six wrapper providers to delegate streaming calls instead of silently falling back
  • Adds chunk_sender to ReasoningContext and updates both LLM call sites in
    respond_with_tools() to use streaming when a sender is present
  • ChatDelegate::call_llm() spawns a forwarder task that converts received chunks into
    StatusUpdate::StreamChunk events, which the web gateway broadcasts as SSE to clients

Tested with OpenRouter (openai/gpt-4o-mini) via a Telegram bot — tokens now appear
progressively in the chat.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces real-time token streaming for LLM providers, specifically targeting OpenAI-compatible endpoints. It adds complete_stream and complete_with_tools_stream methods to the LlmProvider trait and implements them across various provider wrappers, including circuit breakers, failover, and retry mechanisms. A new OpenAiCompatStreamingProvider is added to handle the SSE delta protocol. The ReasoningContext now supports a chunk_sender to facilitate token delivery to the client. Feedback focuses on improving system stability by using a bounded channel for token streaming to prevent memory exhaustion and adding a request timeout to the HTTP client to avoid hanging tasks.

Comment thread src/agent/dispatcher.rs Outdated
Comment on lines +631 to +632
let (chunk_tx, mut chunk_rx) =
tokio::sync::mpsc::unbounded_channel::<String>();
Contributor


medium

Using an unbounded_channel for token streaming can lead to unbounded memory growth if the LLM produces tokens faster than the channel consumer (the status update task) can process them. This is particularly risky if send_status involves network I/O or database operations that might stall. Consider using a bounded_channel with a reasonable capacity (e.g., 100) to provide backpressure.

Suggested change
let (chunk_tx, mut chunk_rx) =
tokio::sync::mpsc::unbounded_channel::<String>();
let (chunk_tx, mut chunk_rx) =
tokio::sync::mpsc::channel::<String>(100);

Comment thread src/llm/openai_compat_stream.rs Outdated
Comment on lines +56 to +58
let client = reqwest::Client::builder()
.connect_timeout(Duration::from_secs(30))
.build()
Contributor


medium

The reqwest::Client is configured with a connect_timeout but lacks a total request timeout or a read timeout. While the comment mentions avoiding cutting off long streams, an entirely absent timeout can lead to tasks hanging indefinitely if the upstream server maintains an open connection but stops sending data. Consider adding a timeout or using a read_timeout to ensure the stream eventually terminates if stalled.

Suggested change
let client = reqwest::Client::builder()
.connect_timeout(Duration::from_secs(30))
.build()
let client = reqwest::Client::builder()
.connect_timeout(Duration::from_secs(30))
.timeout(Duration::from_secs(600))
.build()

Contributor

Copilot AI left a comment


Pull request overview

Adds real-time token streaming to the LLM provider abstraction and wires it through the provider chain so clients can receive incremental output via SSE-style chunk forwarding.

Changes:

  • Extends LlmProvider with streaming methods (defaulting to single-chunk fallback for backward compatibility).
  • Delegates streaming through wrapper providers (retry/failover/circuit breaker/cache/recording/smart routing).
  • Introduces an OpenAI-compatible streaming wrapper and wires chunk forwarding through ReasoningContext → dispatcher channel updates.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 8 comments.

Summary per file:

  • src/llm/provider.rs -- Adds default complete_stream / complete_with_tools_stream methods to the provider trait.
  • src/llm/openai_compat_stream.rs -- New OpenAI-compatible SSE streaming implementation that accumulates deltas + tool_calls.
  • src/llm/mod.rs -- Wraps registry OpenAI-compatible providers with the new streaming provider.
  • src/llm/reasoning.rs -- Emits streaming chunks to an optional sender during LLM calls.
  • src/agent/dispatcher.rs -- Spawns a forwarder task that relays chunks to channel status updates.
  • src/llm/smart_routing.rs -- Forwards streaming calls to primary provider.
  • src/llm/retry.rs -- Forwards streaming calls to inner provider (no retries for streams).
  • src/llm/failover.rs -- Forwards streaming calls to last-used provider.
  • src/llm/circuit_breaker.rs -- Adds streaming support with circuit breaker accounting.
  • src/llm/response_cache.rs -- Bypasses cache for streaming calls.
  • src/llm/recording.rs -- Bypasses recording for streaming calls.


Comment thread src/agent/dispatcher.rs
Comment on lines +629 to +648
// Wire up real-time token streaming to the channel layer.
{
    let (chunk_tx, mut chunk_rx) =
        tokio::sync::mpsc::unbounded_channel::<String>();
    let channels = Arc::clone(&self.agent.channels);
    let channel_name = self.message.channel.clone();
    let metadata = self.message.metadata.clone();
    tokio::spawn(async move {
        while let Some(chunk) = chunk_rx.recv().await {
            let _ = channels
                .send_status(
                    &channel_name,
                    crate::channels::StatusUpdate::StreamChunk(chunk),
                    &metadata,
                )
                .await;
        }
    });
    reason_ctx.chunk_sender = Some(chunk_tx);
}

Copilot AI Apr 22, 2026


The streaming forwarder uses an unbounded mpsc channel and awaits send_status() for every chunk. If the provider streams faster than the channel layer can deliver, the unbounded queue can grow without bound and increase memory usage. Prefer a bounded channel with try_send (dropping/coalescing when full) or another backpressure strategy suitable for token streaming.
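A minimal sketch of that suggestion, assuming the same variable names as the snippet above (the follow-up commit in this PR uses a capacity of 256 with try_send):

```rust
// Bounded buffer between the LLM stream and the status forwarder.
let (chunk_tx, mut chunk_rx) = tokio::sync::mpsc::channel::<String>(256);

// Producer side (inside the streaming callback): lossy on overflow.
if chunk_tx.try_send(chunk).is_err() {
    // Buffer is full or the receiver is gone; drop this chunk rather than
    // block the stream or grow memory without bound.
}
```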

Comment on lines +84 to +247
/// POST `body` (with `"stream": true` already set) to the completions
/// endpoint, parse the SSE delta stream, and return the accumulated result.
async fn stream_request(
&self,
body: serde_json::Value,
on_chunk: &mut (dyn FnMut(String) + Send),
) -> Result<OaiStreamResult, LlmError> {
let url = self.completions_url();

let mut builder = self
.client
.post(&url)
.header("Authorization", format!("Bearer {}", self.api_key))
.header("Content-Type", "application/json")
.header("Accept", "text/event-stream");

for (k, v) in &self.extra_headers {
builder = builder.header(k.as_str(), v.as_str());
}

let response = builder.json(&body).send().await.map_err(|e| {
LlmError::RequestFailed {
provider: "openai_compat".to_string(),
reason: e.to_string(),
}
})?;

let status = response.status();
if !status.is_success() {
let code = status.as_u16();
let retry_after = Some(crate::llm::retry::parse_retry_after(
response.headers().get("retry-after"),
));
let text = response.text().await.unwrap_or_default();
let truncated = crate::agent::truncate_for_preview(&text, 512);
return Err(match code {
401 | 403 => LlmError::AuthFailed {
provider: "openai_compat".to_string(),
},
429 => LlmError::RateLimited {
provider: "openai_compat".to_string(),
retry_after,
},
_ => LlmError::RequestFailed {
provider: "openai_compat".to_string(),
reason: format!("HTTP {}: {}", status, truncated),
},
});
}

let mut result = OaiStreamResult::default();
// BTreeMap keyed by tool_call index — OpenAI streams tool_call arguments
// as incremental string deltas that must be concatenated in order.
let mut tool_acc: BTreeMap<u32, PartialTool> = BTreeMap::new();

let stream = response
.bytes_stream()
.map(|chunk| chunk.map_err(|e| e.to_string()));
let mut event_stream = stream.eventsource();

while let Some(event) = event_stream.next().await {
let event = event.map_err(|e| LlmError::RequestFailed {
provider: "openai_compat".to_string(),
reason: format!("SSE stream error: {}", e),
})?;

let data = event.data.trim();
if data == "[DONE]" {
break;
}
if data.is_empty() {
continue;
}

let parsed: serde_json::Value = match serde_json::from_str(data) {
Ok(v) => v,
Err(_) => continue,
};

if let Some(choices) = parsed.get("choices").and_then(|c| c.as_array())
&& let Some(choice) = choices.first()
{
if let Some(fr) = choice.get("finish_reason").and_then(|v| v.as_str()) {
result.finish_reason = match fr {
"stop" => FinishReason::Stop,
"length" => FinishReason::Length,
"tool_calls" => FinishReason::ToolUse,
"content_filter" => FinishReason::ContentFilter,
_ => result.finish_reason,
};
}

if let Some(delta) = choice.get("delta") {
if let Some(content) = delta.get("content").and_then(|c| c.as_str())
&& !content.is_empty()
{
result.content.push_str(content);
on_chunk(content.to_string());
}

if let Some(tcs) = delta.get("tool_calls").and_then(|tc| tc.as_array()) {
for tc in tcs {
let idx = tc
.get("index")
.and_then(|v| v.as_u64())
.unwrap_or(0) as u32;
let entry = tool_acc.entry(idx).or_default();
if let Some(id) = tc.get("id").and_then(|v| v.as_str())
&& !id.is_empty()
{
entry.id = id.to_string();
}
if let Some(func) = tc.get("function") {
if let Some(name) =
func.get("name").and_then(|v| v.as_str())
&& !name.is_empty()
{
entry.name = name.to_string();
}
if let Some(args) =
func.get("arguments").and_then(|v| v.as_str())
{
entry.arguments.push_str(args);
}
}
}
}
}
}

// Usage is typically in the last chunk when stream_options.include_usage is set.
if let Some(usage) = parsed.get("usage") {
result.input_tokens = saturate_u32(
usage
.get("prompt_tokens")
.and_then(|v| v.as_u64())
.unwrap_or(0),
);
result.output_tokens = saturate_u32(
usage
.get("completion_tokens")
.and_then(|v| v.as_u64())
.unwrap_or(0),
);
}
}

result.tool_calls = tool_acc
.into_values()
.filter(|p| !p.name.is_empty())
.map(|p| {
let arguments = serde_json::from_str::<serde_json::Value>(&p.arguments)
.unwrap_or_else(|_| serde_json::Value::Object(Default::default()));
ToolCall {
id: p.id,
name: p.name,
arguments,
reasoning: None,
}
})
.collect();

Ok(result)
}

Copilot AI Apr 22, 2026


OpenAiCompatStreamingProvider::stream_request contains substantial SSE parsing logic (delta accumulation, finish_reason mapping, tool_call reconstruction, usage extraction) but there are no unit tests covering common and edge-case streams (multiple tool_calls, missing usage, malformed JSON events, etc.). Other streaming parsers in this repo include targeted tests (e.g. src/llm/openai_codex_provider.rs). Adding focused tests here would help prevent regressions across OpenAI-compatible backends.
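One way to get that coverage is to factor the per-event handling out of stream_request into a pure helper and test it directly; a sketch, where apply_sse_event is a hypothetical extraction rather than an existing function in this PR:

```rust
#[cfg(test)]
mod stream_parsing_tests {
    use super::*;

    #[test]
    fn tool_call_arguments_concatenate_across_deltas() {
        let mut result = OaiStreamResult::default();
        let mut tool_acc: BTreeMap<u32, PartialTool> = BTreeMap::new();

        // Two SSE `data:` payloads streaming one tool call's arguments in pieces.
        let events = [
            r#"{"choices":[{"delta":{"tool_calls":[{"index":0,"id":"call_1","function":{"name":"get_weather","arguments":"{\"city\":"}}]}}]}"#,
            r#"{"choices":[{"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"Oslo\"}"}}]}}]}"#,
        ];
        for raw in events {
            let parsed: serde_json::Value = serde_json::from_str(raw).unwrap();
            // Hypothetical helper extracted from the loop body of stream_request.
            apply_sse_event(&mut result, &mut tool_acc, &parsed, &mut |_chunk: String| {});
        }

        let tool = &tool_acc[&0];
        assert_eq!(tool.name, "get_weather");
        assert_eq!(tool.arguments, r#"{"city":"Oslo"}"#);
    }
}
```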

Comment thread src/llm/openai_compat_stream.rs Outdated
Comment on lines +55 to +60
// Use only a connect-timeout so that long streams are not cut off.
let client = reqwest::Client::builder()
.connect_timeout(Duration::from_secs(30))
.build()
.expect("failed to build reqwest::Client for openai_compat streaming");
Self {

Copilot AI Apr 22, 2026


OpenAiCompatStreamingProvider::new() uses .expect(...) on reqwest::Client::builder().build(), which can panic the process at runtime (e.g., TLS backend misconfig). Consider returning Result<Self, LlmError> (or building the client outside and passing it in) so provider construction failures are surfaced as normal errors instead of crashing.

Comment thread src/llm/mod.rs Outdated
Comment on lines +291 to +334
config.model.clone(),
extra_headers_vec,
unsupported,
);

Copilot AI Apr 22, 2026


The streaming wrapper is constructed with config.base_url.clone(), but the rig-core client uses normalize_openai_base_url(&config.base_url) to append /v1 for bare host URLs. Because OpenAiCompatStreamingProvider::completions_url() assumes the base URL already includes the API version prefix, streaming requests can hit the wrong path (e.g. http://localhost:8080/chat/completions instead of /v1/chat/completions). Pass the normalized base URL into the streaming provider (or normalize inside completions_url()) so streaming and non-streaming calls target the same endpoint.

Comment thread src/llm/mod.rs Outdated
Comment on lines +258 to +334
config.model.clone(),
extra_headers_vec,
unsupported,
);

Copilot AI Apr 22, 2026


extra_headers are validated and invalid names/values are skipped when building the rig-core OpenAI-compatible client, but the streaming wrapper rebuilds headers directly from config.extra_headers without validation. This can make streaming fail in cases where non-streaming works (or reintroduce headers that were intentionally skipped). Reuse the already-validated HeaderMap (convert it to pairs) or apply the same validation/skipping logic before storing extra_headers for streaming.

Comment on lines +389 to +393
let model = req
.take_model_override()
.unwrap_or_else(|| self.model_name.clone());
let messages = messages_to_json(&req.messages);


Copilot AI Apr 22, 2026


complete_stream bypasses the inner RigAdapter and serializes req.messages directly, but it does not run sanitize_tool_messages like RigAdapter::complete() does (src/llm/rig_adapter.rs:1031). This can reintroduce OpenAI 400s due to orphaned tool_result messages. Sanitize (and apply any other message normalization you rely on) before calling messages_to_json().

Comment on lines +435 to +438
let model = req
.take_model_override()
.unwrap_or_else(|| self.model_name.clone());
let messages = messages_to_json(&req.messages);

Copilot AI Apr 22, 2026


complete_with_tools_stream bypasses the inner RigAdapter but does not call sanitize_tool_messages on req.messages (which RigAdapter::complete_with_tools() does at src/llm/rig_adapter.rs:1088). This can cause upstream request failures if the history contains orphaned tool_result messages. Sanitize messages before converting them to OpenAI JSON.

Suggested change
let model = req
.take_model_override()
.unwrap_or_else(|| self.model_name.clone());
let messages = messages_to_json(&req.messages);
fn sanitize_openai_tool_messages(messages: serde_json::Value) -> serde_json::Value {
let Some(items) = messages.as_array() else {
return messages;
};
let mut pending_tool_call_ids = HashSet::new();
let mut sanitized = Vec::with_capacity(items.len());
for message in items {
let Some(obj) = message.as_object() else {
sanitized.push(message.clone());
continue;
};
match obj.get("role").and_then(|role| role.as_str()) {
Some("assistant") => {
if let Some(tool_calls) = obj.get("tool_calls").and_then(|tc| tc.as_array())
{
for tool_call in tool_calls {
if let Some(id) = tool_call
.as_object()
.and_then(|tc| tc.get("id"))
.and_then(|id| id.as_str())
{
pending_tool_call_ids.insert(id.to_string());
}
}
}
sanitized.push(message.clone());
}
Some("tool") => {
let tool_call_id = obj.get("tool_call_id").and_then(|id| id.as_str());
if let Some(tool_call_id) = tool_call_id
&& pending_tool_call_ids.remove(tool_call_id)
{
sanitized.push(message.clone());
}
}
_ => sanitized.push(message.clone()),
}
}
serde_json::Value::Array(sanitized)
}
let model = req
.take_model_override()
.unwrap_or_else(|| self.model_name.clone());
let messages = sanitize_openai_tool_messages(messages_to_json(&req.messages));

Comment on lines +231 to +244
result.tool_calls = tool_acc
.into_values()
.filter(|p| !p.name.is_empty())
.map(|p| {
let arguments = serde_json::from_str::<serde_json::Value>(&p.arguments)
.unwrap_or_else(|_| serde_json::Value::Object(Default::default()));
ToolCall {
id: p.id,
name: p.name,
arguments,
reasoning: None,
}
})
.collect();

Copilot AI Apr 22, 2026


When a streamed tool_call arguments string fails JSON parsing, the code currently falls back to an empty object ({}). This can silently change semantics by invoking a tool with missing parameters instead of preserving the raw payload or surfacing an invalid-response error. Other providers preserve the raw string on parse failure (e.g. src/llm/openai_codex_provider.rs:671-674). Consider returning LlmError::InvalidResponse or storing arguments as a serde_json::Value::String when parsing fails.
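A sketch of the preservation approach the comment describes (keep the raw streamed arguments rather than substituting an empty object):

```rust
// If the accumulated arguments are not valid JSON, keep the raw text so the
// failure is visible downstream instead of silently calling the tool with {}.
let arguments = serde_json::from_str::<serde_json::Value>(&p.arguments)
    .unwrap_or_else(|_| serde_json::Value::String(p.arguments.clone()));
```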

- dispatcher+reasoning: replace unbounded channel with bounded(256) + try_send to drop chunks on overflow instead of growing memory without bound

- mod.rs: normalize base_url via normalize_openai_base_url before passing to streaming provider so the streaming path hits the same endpoint as rig-core

- mod.rs: reuse validated HeaderMap to build extra_headers for streaming provider (skips invalid names/values instead of passing raw)

- openai_compat_stream: OpenAiCompatStreamingProvider::new now returns Result instead of panicking via expect() on reqwest build failure

- openai_compat_stream: add total request timeout (600s) in addition to connect_timeout so hung upstream cannot leak tasks

- openai_compat_stream: call sanitize_tool_messages before messages_to_json in both streaming methods (matches RigAdapter behavior)

- openai_compat_stream: preserve raw tool_call argument string (as JSON string) on parse failure instead of silently defaulting to {} + log warning
Copilot AI review requested due to automatic review settings April 23, 2026 09:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.



Comment on lines +88 to +92
async fn stream_request(
&self,
body: serde_json::Value,
on_chunk: &mut (dyn FnMut(String) + Send),
) -> Result<OaiStreamResult, LlmError> {

Copilot AI Apr 23, 2026


stream_request/SSE parsing and tool_call accumulation is substantial new logic but has no unit coverage. Consider adding focused tests like codex_chatgpt.rs’s SSE parser tests (e.g., content deltas, tool_call argument concatenation, usage-in-last-chunk, and [DONE] termination), plus an error-mapping test for 5xx/400/413 paths.

Comment on lines +129 to +133
_ => LlmError::RequestFailed {
provider: "openai_compat".to_string(),
reason: format!("HTTP {}: {}", status, truncated),
},
});

Copilot AI Apr 23, 2026


Error responses currently fall through to LlmError::RequestFailed for all non-401/403/429 statuses, including HTTP 5xx. Per .claude/rules/error-handling.md and the LlmError::BadGateway docs, upstream 5xx bodies must not be carried in a user-facing error; map 500..=599 to LlmError::BadGateway { status, retry_after } and only log a truncated body preview at debug! for operators.
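Roughly what that mapping could look like, written as a small helper; the BadGateway field names come from this comment and the retry_after type is assumed, so treat this as a sketch rather than the repo's exact error handling:

```rust
// Sketch: classify an upstream HTTP error without leaking the body to users.
fn classify_stream_error(
    code: u16,
    provider: &str,
    retry_after: Option<std::time::Duration>,
    body_preview: &str, // already truncated
) -> LlmError {
    match code {
        401 | 403 => LlmError::AuthFailed { provider: provider.to_string() },
        429 => LlmError::RateLimited { provider: provider.to_string(), retry_after },
        500..=599 => {
            // Operators get a truncated preview at debug level; the
            // user-facing error carries only the status.
            tracing::debug!(status = code, preview = %body_preview, "upstream 5xx during streaming");
            LlmError::BadGateway { status: code, retry_after }
        }
        _ => LlmError::RequestFailed {
            provider: provider.to_string(),
            reason: format!("HTTP {}: {}", code, body_preview),
        },
    }
}
```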

Comment on lines +114 to +118
if !status.is_success() {
let code = status.as_u16();
let retry_after = Some(crate::llm::retry::parse_retry_after(
response.headers().get("retry-after"),
));

Copilot AI Apr 23, 2026


The streaming wrapper doesn't currently translate context-length / payload-too-large failures into LlmError::ContextLengthExceeded, so the dispatcher auto-compaction path won't trigger for streaming calls. Handle HTTP 400/413 by inspecting the error payload (similar patterns as rig_adapter::map_rig_error / nearai_chat) and returning ContextLengthExceeded { used, limit } where possible.

Comment thread src/llm/openai_compat_stream.rs Outdated
let retry_after = Some(crate::llm::retry::parse_retry_after(
response.headers().get("retry-after"),
));
let text = response.text().await.unwrap_or_default();

Copilot AI Apr 23, 2026


response.text().await.unwrap_or_default() is a silent-failure pattern forbidden by .claude/rules/error-handling.md (it collapses IO errors into an empty body). Prefer unwrap_or_else with an explicit marker string, or map the body read failure into LlmError::RequestFailed so debugging isn't silently degraded.

Suggested change
let text = response.text().await.unwrap_or_default();
let text = response
.text()
.await
.unwrap_or_else(|e| format!("<failed to read error body: {}>", e));

Comment thread src/llm/openai_compat_stream.rs Outdated
// Match RigAdapter behavior: rewrite orphaned tool_result messages as
// user messages so OpenAI-compatible endpoints do not reject the
// request with 400 "messages with role 'tool' must be a response to
// a preceeding message with 'tool_calls'".

Copilot AI Apr 23, 2026


Typo in the error message example: "preceeding" should be "preceding".

Suggested change
// a preceeding message with 'tool_calls'".
// a preceding message with 'tool_calls'".

rizgan added 2 commits April 23, 2026 12:32
- unwrap_or_else on error body read to preserve failure context
- map HTTP 5xx to LlmError::BadGateway (no body leak, operator debug log)
- map HTTP 413 and context-length 400 to LlmError::ContextLengthExceeded
- fix typo: preceeding -> preceding
Copilot AI review requested due to automatic review settings April 23, 2026 09:49
@github-actions github-actions Bot added the scope: workspace (Persistent memory / workspace) label on Apr 23, 2026
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.



Comment thread src/llm/reasoning.rs Outdated
/// When set, each text token/chunk from streaming LLM calls is sent to this
/// channel so callers can forward it to the client in real time.
///
/// Bounded to apply backpressure; chunks that cannot be queued are dropped.

Copilot AI Apr 23, 2026


The doc comment says the channel is "bounded to apply backpressure" but the implementation uses try_send and explicitly drops chunks on overflow, which is lossy rather than backpressure. Consider rewording to reflect the actual behavior (bounded to cap memory; drops when full), or switch to send().await if true backpressure is desired.

Suggested change
/// Bounded to apply backpressure; chunks that cannot be queued are dropped.
/// Typically backed by a bounded channel to cap memory usage; chunks that
/// cannot be queued may be dropped when the buffer is full.

Comment thread src/workspace/mod.rs Outdated
Comment on lines +1710 to +1721
// Inject current date/time so the model can answer time-related questions
// without guessing or intercepting them at the bot layer.
let now_str = match tz {
Some(t) => {
let dt = crate::timezone::now_in_tz(t);
format!("{} ({})", dt.format("%Y-%m-%d %H:%M"), t)
}
None => {
format!("{} (UTC)", Utc::now().format("%Y-%m-%d %H:%M"))
}
};
parts.push(format!("## Current Time\n\n{}", now_str));

Copilot AI Apr 23, 2026


This change injects current time into the system prompt, but the PR description focuses on LLM streaming/provider-chain changes and doesn't mention prompt-time injection. If this is intentional, please update the PR description (and/or split into a separate PR) so reviewers can assess the behavioral impact on prompting and caching independently.

Comment on lines +106 to +110
let response = builder.json(&body).send().await.map_err(|e| {
LlmError::RequestFailed {
provider: "openai_compat".to_string(),
reason: e.to_string(),
}

Copilot AI Apr 23, 2026


The streaming provider hardcodes provider: "openai_compat" in generated LlmErrors. Since this wrapper is used for multiple registry backends (OpenRouter, Groq, etc.), error attribution/logging becomes ambiguous. Consider passing provider_id (or a human-readable label) into OpenAiCompatStreamingProvider and using it consistently in LlmError construction and logs.
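A sketch of the attribution fix (which the later commit message describes as adding a provider_id field): store the backend identifier on the struct and use it when building errors:

```rust
// Sketch: carry the registry backend id instead of a hard-coded label.
pub struct OpenAiCompatStreamingProvider {
    /// e.g. "openrouter" or "groq"; used for error attribution and logs.
    provider_id: String,
    // ...remaining fields as in the real struct (client, api_key, model_name, ...)
}

// Error construction then becomes, for example:
// LlmError::RequestFailed { provider: self.provider_id.clone(), reason: e.to_string() }
```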

Comment on lines +293 to +298
ToolCall {
id: p.id,
name: p.name,
arguments,
reasoning: None,
}

Copilot AI Apr 23, 2026


On streamed tool calls, ToolCall.id is taken directly from the upstream tool_calls[*].id and may be empty or non-compliant with downstream constraints (some providers require a non-empty [A-Za-z0-9]{9} ID). Consider normalizing/generating a compliant ID when p.id is empty/invalid (similar to rig_adapter::normalized_tool_call_id / generate_tool_call_id) to avoid tool-execution failures.

Comment thread src/llm/openai_compat_stream.rs Outdated
tracing::warn!(
tool = %p.name,
error = %e,
raw = %p.arguments,

Copilot AI Apr 23, 2026


Logging the full raw streamed tool-call arguments at warn level can leak user data and secrets into logs. Prefer omitting the raw payload (or logging only a truncated/sanitized preview) while still preserving the raw string in the returned ToolCall.arguments for downstream error reporting.

Suggested change
raw = %p.arguments,
raw_len = p.arguments.len(),

Comment thread src/llm/openai_compat_stream.rs Outdated
let mut parts =
vec![serde_json::json!({"type": "text", "text": msg.content})];
for p in &msg.content_parts {
parts.push(serde_json::to_value(p).unwrap_or_default());

Copilot AI Apr 23, 2026


serde_json::to_value(p).unwrap_or_default() silently drops serialization errors and inserts a default null part, which can produce malformed requests that are hard to diagnose. Prefer propagating/returning an error (or at least logging and skipping the part) so failures are explicit.

Suggested change
parts.push(serde_json::to_value(p).unwrap_or_default());
match serde_json::to_value(p) {
Ok(value) => parts.push(value),
Err(err) => eprintln!(
"failed to serialize chat message content part for role '{}': {}",
role, err
),
}

rizgan added 2 commits April 23, 2026 13:10
- Fix doc comment: 'backpressure' -> accurate 'may be dropped' wording
- Add provider_id field to OpenAiCompatStreamingProvider for correct
  error attribution across OpenRouter, Groq, etc. backends
- Replace raw = %p.arguments with raw_len to avoid leaking user data
  in warn! logs
- Normalize streamed tool_call IDs via normalize_tool_call_id_for_streaming
  (same [a-zA-Z0-9]{9} constraint as rig_adapter)
- Replace to_value(p).unwrap_or_default() with explicit warn + skip
Copilot AI review requested due to automatic review settings April 23, 2026 10:17
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.



Comment on lines +119 to +123
let code = status.as_u16();
let retry_after = Some(crate::llm::retry::parse_retry_after(
response.headers().get("retry-after"),
));
let text = response

Copilot AI Apr 23, 2026


retry_after is always wrapped in Some(parse_retry_after(...)), which means missing Retry-After headers become Some(60s) and defeats the exponential backoff path for BadGateway retries (this is the same regression called out in retry.rs docs/tests). Preserve header absence by using response.headers().get("retry-after").map(parse_retry_after_value) for 5xx (and only default missing headers for 429 if desired).
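A sketch of the suggested shape (parse_retry_after_value is the name used in the comment; whether it exists with exactly that signature is an assumption):

```rust
// Only report Some(...) when the header is actually present, so a missing
// Retry-After keeps the exponential-backoff path for 5xx retries.
let retry_after = response
    .headers()
    .get("retry-after")
    .map(crate::llm::retry::parse_retry_after_value);
```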

Comment thread src/llm/failover.rs
Comment on lines +332 to +340
async fn complete_stream(
&self,
request: CompletionRequest,
on_chunk: &mut (dyn FnMut(String) + Send),
) -> Result<CompletionResponse, LlmError> {
self.providers[self.last_used.load(Ordering::Relaxed)]
.complete_stream(request, on_chunk)
.await
}

Copilot AI Apr 23, 2026


complete_stream / complete_with_tools_stream delegate to last_used, but they don’t call bind_provider_to_current_task(...) like the non-streaming paths. Under concurrency this can cause effective_model_name() (and therefore cost attribution/metrics) to report the wrong provider if another request updates last_used before the caller reads it. Capture the index up front and bind it for the current task before/after the streaming call.

Comment thread src/llm/provider.rs
on_chunk: &mut (dyn FnMut(String) + Send),
) -> Result<CompletionResponse, LlmError> {
let resp = self.complete(request).await?;
on_chunk(resp.content.clone());

Copilot AI Apr 23, 2026


The default complete_stream() fallback always invokes on_chunk even when resp.content is empty, while complete_with_tools_stream() explicitly suppresses empty content. Consider aligning behavior by skipping the callback when the content is empty to avoid emitting spurious empty stream chunks to clients.

Suggested change
on_chunk(resp.content.clone());
if !resp.content.is_empty() {
on_chunk(resp.content.clone());
}

Comment thread src/agent/dispatcher.rs
Comment on lines +633 to +636
{
let (chunk_tx, mut chunk_rx) =
tokio::sync::mpsc::channel::<String>(256);
let channels = Arc::clone(&self.agent.channels);

Copilot AI Apr 23, 2026


The streaming bridge uses a hard-coded channel capacity of 256. If this value is expected to be tuned (or kept consistent with other buffering limits), consider extracting it to a named constant or config to avoid a magic number here and make operational tuning easier.
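For instance (the constant name is illustrative):

```rust
/// Chunks buffered between the LLM stream and the channel forwarder before
/// overflow chunks are dropped.
const STREAM_CHUNK_BUFFER: usize = 256;

let (chunk_tx, mut chunk_rx) =
    tokio::sync::mpsc::channel::<String>(STREAM_CHUNK_BUFFER);
```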


Labels

contributor: new (First-time contributor), risk: medium (Business logic, config, or moderate-risk modules), scope: agent (Agent core: agent loop, router, scheduler), scope: llm (LLM integration), scope: workspace (Persistent memory / workspace), size: XL (500+ changed lines)


2 participants