GenAI: support SSE streaming responses and raise HTTP capture limit to 256KB#2394
GenAI: support SSE streaming responses and raise HTTP capture limit to 256KB#2394NameHaibinZhang wants to merge 7 commits into
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2394 +/- ##
==========================================
+ Coverage 69.35% 69.54% +0.19%
==========================================
Files 321 326 +5
Lines 42164 42928 +764
==========================================
+ Hits 29243 29856 +613
- Misses 11243 11356 +113
- Partials 1678 1716 +38
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
175999a to
bc09073
Compare
|
In Do we need to update |
mariomac
left a comment
There was a problem hiding this comment.
Great addition! Added few suggestions
rafaelroquetto
left a comment
There was a problem hiding this comment.
🤖 Review by Claude (Claude Code), posted by @rafaelroquetto on its behalf. This is an automated, ownership-focused review aligned with
AGENTS.mdand the repo's PR-review guidance — not Rafael's personal review (he has left separate comments inline).
The SSE parsing and the 256KB cross-syscall accumulation are sound at their core. My main concern is scope and ownership, per AGENTS.md ("Keep changes minimal and scoped to the task. Do not include unrelated edits"). Concrete code-level bugs are left as inline comments; the structural points are below.
1. Scope creep — the PR does much more than its title (please split)
Title: "support SSE streaming responses and raise HTTP capture limit to 256KB", but the diff also bundles features the description never mentions:
- Rerank provider-format support —
RerankRequest.NestedInput/NestedParams,GetQuery/GetDocuments/GetTopN,RerankResponse.NestedOutput,GetResults/GetID,VendorRerank.GetInput/GetOutput,gen_ai.rerank.top_n(span.go,tracesgen.go). - Retrieval top-k —
RetrievalRequest.TopK/TopKSnake/Limit,GetTopK,gen_ai.retrieval.top_k. - Embedding dimensions —
GetEmbeddingDimensions()+ a duplicated emit block intracesgen.go. - MCP tool-call argument/result capture —
mcp.go,span.go,attrs.go,tracesgen.go. - Two SSL-capture band-aids —
recoverJSONBodyFromBufferand the TLS-ciphertext drop, both in the finalmerge llmcommit.
None of rerank/retrieval/embedding/SSL is needed for SSE + 256KB. Each is reasonable on its own, but bundled they are hard to review and to revert independently. Please split into focused PRs, each explaining why the approach fits this codebase.
2. Duplication instead of reuse
The JSON-vs-SSE response branch is copy-pasted near-verbatim in openai.go and qwen.go (and Anthropic now also uses looksLikeJSON). AGENTS.md: "Extend or adapt existing code instead of duplicating functionality." Please extract one helper, e.g. parseOpenAICompatibleResponse(respB) (request.VendorOpenAI, []request.ToolCall). (Inline note on the openai.go site.)
3. Smaller items
large_buffers.h: the pre-existing "must equal the lte= values" comment is now false for HTTP (64KB const vs 256KB ceiling), and the appended paragraph frames cross-syscall accumulation as HTTP-specific — every protocol accumulates identically; only HTTP's configured ceiling differs. A one-liner would be clearer.MaxCapturedPayloadBytesis now HTTP-only in meaning but keeps a generic name and is consumed solely as the decompression budget inresponses.go; the 64KB→256KB bump silently 4×'s that budget. Consider a dedicated decompression constant.extractJSONRawFieldfinds the field via a plainbytes.Index("\"messages\""), which can match inside a string value or nested key rather than the documented top-level field.- Commit messages (
fix test,fix attr,merge llm,fix stream) don't convey intent;merge llmbundles the two unrelated SSL changes.
Asks
- Split the unrelated features (rerank/retrieval/embedding/SSL) into separate PRs.
- Deduplicate the OpenAI-compatible response branch.
- Fix the inline-flagged bugs (scanner cap, dead
args,UnsafeViewlifetime). - Confirm local validation (
make verify,make test) on the trimmed change.
|
No — |
…o 256KB - Add SSE (Server-Sent Events) stream parser for OpenAI-compatible APIs - Support streaming responses in OpenAI, Qwen, and Anthropic handlers - Extract shared parseOpenAICompatibleResponse() helper to reduce duplication - Raise HTTP payload capture limit to 256KB (accumulated across multiple recv calls) - Add looksLikeJSON() helper for content-type detection - Update configuration docs and schema Relates to open-telemetry#2267
2a62e17 to
00b8fb1
Compare
PR Split SummaryPer @rafaelroquetto's review feedback on scope and ownership, this PR has been split into focused, independently reviewable changes:
All PRs relate to #2267. Each is self-contained and can be reviewed/merged independently. |
mariomac
left a comment
There was a problem hiding this comment.
Thanks for addressing! Happy with the changes related to this PR (will check the others later). Will wait for Rafael's approval before merging.
rafaelroquetto
left a comment
There was a problem hiding this comment.
This looks much better, many thanks for all of this. I have one comment left regarding MaxCapturedPayloadBytes, once we address that we can merge this
|
@rafaelroquetto done |
mmat11
left a comment
There was a problem hiding this comment.
has this been tested manually? can we also add an integration test which exercise the "super large buffer" path?
| // which enforces the same ceiling at configuration time. | ||
| k_large_buf_payload_max_size = 1 << 14, | ||
| k_large_buf_max_size = 1 << 15, | ||
| k_large_buf_max_http_captured_bytes = 1 << 16, |
There was a problem hiding this comment.
this is gated to 64KB, which means
static __always_inline int http_send_large_buffer(http_info_t *req,
const void *u_buf,
u32 bytes_len,
u8 packet_type,
u8 direction,
enum large_buf_action action) {
if (http_max_captured_bytes > k_large_buf_max_http_captured_bytes) {
bpf_dbg_printk("BUG: http_max_captured_bytes exceeds maximum allowed value.");
}
will fail if the value is higher than that?
There was a problem hiding this comment.
Good question — it won't cause a functional failure. Here's the layered design:
http_max_captured_bytes(set from userspace, up to 256KB) is the total per-request per-direction budget. It controls when to stop accumulating data (bytes_sent >= http_max_captured_bytes).k_large_buf_max_http_captured_bytes(64KB) is the per-syscall-event cap, used inbpf_clamp_umax(max_available_bytes, k_large_buf_max_http_captured_bytes)to bound each ring buffer submission. This is critical for the BPF verifier to prove memory safety.- The 256KB total budget is reached by accumulating multiple 64KB chunks across successive
tcp_recvmsgevents.
The bpf_dbg_printk("BUG: ...") check is a debug-only assertion (gated by g_bpf_debug, compiled out in production) that predates the HTTP limit raise. It's now stale for HTTP — I can clean it up or update the condition to reflect the new semantic. It does not affect correctness since the code continues past it regardless and the actual safety bound is enforced by bpf_clamp_umax.
There was a problem hiding this comment.
I thought there was a return statement. I think we should clean it up if it's pointless
There was a problem hiding this comment.
Done — removed the stale assertion. It was comparing the per-request budget (256KB) against the per-syscall cap (64KB), which is now expected behavior in the multi-chunk design.
There was a problem hiding this comment.
I don't think this is the right approach. IMHO the statement is correct: it is a bug to set a value that is lager than k_large_buf_max_http_captured_bytes because k_large_buf_max_http_captured_bytes is used to bound max_available_bytes for the verifier. So if max_available_bytes is larger than k_large_buf_max_http_captured_bytes, we end up truncating the buffer and throwing away the remainder of the payload silently.
So IMHO we should at least log that, or try to remove the clamp and see if the verifier is happy.
There was a problem hiding this comment.
@rafaelroquetto You're absolutely right — this was a real bug. If a single tcp_recvmsg delivers more than 64KB, BPF truncates the chunk, and blindly appending subsequent truncated chunks would create holes in the reassembled buffer.
Fixed: the userspace reassembly now detects truncation by checking whether the accumulated data for a single emission hits exactly the per-syscall cap (64KB) with a full final chunk (16KB payload). When that pattern is detected, the buffer is "sealed" and subsequent chunks for that direction are discarded — ensuring the assembled data is always a contiguous prefix (tail truncation only, no holes).
Added unit tests covering: truncation detection, non-truncation pass-through, and seal reset on new request.
There was a problem hiding this comment.
@NameHaibinZhang thanks for iterating this but I don't think this is the right approach and I'd rather we don't go down the userspace-heuristic path at all.
The userspace simply has no way of knowing whether a buffer was actually truncated. A syscall that got clamped at 64KB and a perfectly healthy one that just happens to land on a 64KB boundary produce exactly the same sequence of chunks, i.e same lengths, same actions, nothing to tell them apart. So len % 64KB == 0 isn't detecting truncation, it's guessing, and it'll guess wrong on legitimate traffic, silently capping perfectly healthy buffers at 64KB, which is the exact regression we're trying to avoid.
The truncation only exists as a fact inside BPF - that's the one place that knows bytes_len > cap so that's where it has to be dealt with.
I do think we can do this properly in eBPF with some effort. The verifier makes it fiddly, I know, and that's probably why we kept everything bounded at 64KB in the first place, but I'd much rather we take the time to respect the semantics and get it right than merge a workaround that quietly regresses the common case.
There was a problem hiding this comment.
@rafaelroquetto You're right — the userspace heuristic can't reliably distinguish truncation from legitimate 64KB-aligned traffic. I'll revert the heuristic and implement this properly in BPF: add a truncated flag to the event metadata set when bytes_len > cap, so userspace gets an explicit signal to stop accumulation. Will push the BPF-side fix shortly.
There was a problem hiding this comment.
@NameHaibinZhang: no, I don't mean it like that - a truncated flag is a workaround. We should be able to ship full 256KB from ebpf, the usual semantics. This will require some ebpf work. I'd suggest first trying to simply increasing the constant and see what the verifier says. In the likely event it complains, you will need to chunk the buffers. I'd rather us take some time to get it right than rush it and get it wrong, there's no urgency here.
There was a problem hiding this comment.
@rafaelroquetto I've already tested raising k_large_buf_max_http_captured_bytes to 256KB directly — it fails the BPF verifier on 5.10 kernels. The loop unrolling in large_buf_emit_chunks goes from 4 iterations (4×16KB = 64KB) to 16 iterations (16×16KB = 256KB), which exceeds the instruction count limit on older kernels.
So the path forward would be a multi-emission approach (e.g. tail calls or multiple ring buffer submissions across separate BPF program invocations) to stay within verifier bounds while delivering the full 256KB. I'll work on that — it'll take a bit more time to get right.
|
Yes, this has been tested manually — I verified SSE streaming responses with payloads exceeding 64KB (up to ~200KB) and confirmed the large buffer path correctly accumulates chunks across multiple Regarding integration tests: agreed, this would be valuable. However, writing a proper integration test for the "super large buffer" path requires setting up an SSE/streaming server that produces >64KB responses and verifying the assembled payload in userspace — which is a non-trivial addition to the test infra. I'll track this as a follow-up. Would that be acceptable, or would you prefer it gated in this PR? @mmat11 |
follow-up is fine! |
The check compared the per-request budget (now 256KB) against the per-syscall cap (64KB), which is expected in the new multi-chunk accumulation design. The assertion was debug-only and had no return statement, making it pointless.
rafaelroquetto
left a comment
There was a problem hiding this comment.
This looks better, thanks for the changes! There are a few failing tests that need to be addressed before approving, and I've reopened one of the discussions regarding k_large_buf_max_http_captured_bytes
Adopted reviewer's suggestion to use json.NewDecoder for extractJSONRawField.
This fixes a bug where string values could be falsely matched as field names
(e.g. {"label":"field","field":99} would match the value "field" instead
of the key). Also removed the now-unnecessary extractJSONRawValue helper.
…erence When a single tcp_recvmsg delivers more data than the per-syscall BPF cap (64KB), the captured chunk is truncated. Previously, subsequent chunks would still be appended, creating holes in the reassembled buffer. Now we detect truncation (captured == per-syscall cap while the event is not terminal) and stop accumulating, ensuring the userspace buffer is always a contiguous prefix — tail truncation only, no holes.
Summary
Fixes #2267
This PR adds SSE (Server-Sent Events) streaming response support for GenAI spans and raises the HTTP payload capture limit to 256KB (accumulated across multiple recv calls).
Changes
SSE streaming support:
parseOpenAIStream()to detect and parse OpenAI/Qwen SSE streaming responses, accumulating content deltas, tool calls, and usage statistics across multiple chunks.looksLikeJSON()instead of a naive first-byte check, handling responses with leading whitespace.Large response body support:
MaxCapturedPayloadBytesfrom 64KB to 256KB for HTTP. Other protocols (MySQL, Kafka, Postgres, etc.) remain at 64KB.k_large_buf_max_http_captured_bytes = 64KB) is unchanged — the 256KB connection-level limit is achieved by accumulating across multiple recv calls (up to 4 × 64KB chunks).Truncated body resilience:
extractJSONRawField()to recover complete JSON fields (e.g.messages) from truncated request bodies, improving GenAI attribute extraction when payloads exceed capture limits.Validation