Skip to content

GenAI: support SSE streaming responses and raise HTTP capture limit to 256KB#2394

Open
NameHaibinZhang wants to merge 7 commits into
open-telemetry:mainfrom
NameHaibinZhang:feature/llm-stream
Open

GenAI: support SSE streaming responses and raise HTTP capture limit to 256KB#2394
NameHaibinZhang wants to merge 7 commits into
open-telemetry:mainfrom
NameHaibinZhang:feature/llm-stream

Conversation

@NameHaibinZhang

@NameHaibinZhang NameHaibinZhang commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #2267

This PR adds SSE (Server-Sent Events) streaming response support for GenAI spans and raises the HTTP payload capture limit to 256KB (accumulated across multiple recv calls).

Changes

SSE streaming support:

  • Add parseOpenAIStream() to detect and parse OpenAI/Qwen SSE streaming responses, accumulating content deltas, tool calls, and usage statistics across multiple chunks.
  • Improve Anthropic JSON vs SSE classification using looksLikeJSON() instead of a naive first-byte check, handling responses with leading whitespace.
  • Add Qwen support for OpenAI-compatible endpoints (e.g. vLLM deployments) with SSE streaming.

Large response body support:

  • Raise MaxCapturedPayloadBytes from 64KB to 256KB for HTTP. Other protocols (MySQL, Kafka, Postgres, etc.) remain at 64KB.
  • The BPF layer per-syscall limit (k_large_buf_max_http_captured_bytes = 64KB) is unchanged — the 256KB connection-level limit is achieved by accumulating across multiple recv calls (up to 4 × 64KB chunks).

Truncated body resilience:

  • Add extractJSONRawField() to recover complete JSON fields (e.g. messages) from truncated request bodies, improving GenAI attribute extraction when payloads exceed capture limits.

Validation

@NameHaibinZhang NameHaibinZhang requested a review from a team as a code owner June 16, 2026 08:41
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 65.84158% with 69 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.54%. Comparing base (92525d9) to head (a607881).
⚠️ Report is 28 commits behind head on main.

Files with missing lines Patch % Lines
pkg/ebpf/common/http/openai_stream.go 68.93% 28 Missing and 4 partials ⚠️
pkg/ebpf/common/http/qwen.go 48.64% 17 Missing and 2 partials ⚠️
pkg/ebpf/common/http/partial_json.go 65.71% 10 Missing and 2 partials ⚠️
pkg/ebpf/common/tcp_large_buffer.go 64.28% 5 Missing ⚠️
pkg/ebpf/common/http/openai.go 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2394      +/-   ##
==========================================
+ Coverage   69.35%   69.54%   +0.19%     
==========================================
  Files         321      326       +5     
  Lines       42164    42928     +764     
==========================================
+ Hits        29243    29856     +613     
- Misses      11243    11356     +113     
- Partials     1678     1716      +38     
Flag Coverage Δ
integration-test 50.67% <4.45%> (-0.63%) ⬇️
integration-test-arm 29.75% <0.99%> (+1.24%) ⬆️
integration-test-vm-5.15-lts 30.33% <1.27%> (+1.21%) ⬆️
integration-test-vm-6.18-lts 28.02% <1.27%> (-2.50%) ⬇️
k8s-integration-test 37.88% <1.27%> (-0.72%) ⬇️
oats-test 36.16% <31.84%> (-0.71%) ⬇️
unittests 62.40% <82.16%> (+0.27%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mariomac

Copy link
Copy Markdown
Contributor

In large_buffer.h, there is a bunch of definitions like:

    k_large_buf_max_http_captured_bytes = 1 << 16,
    k_large_buf_max_mysql_captured_bytes = 1 << 16,
    k_large_buf_max_postgres_captured_bytes = 1 << 16,
    k_large_buf_max_kafka_captured_bytes = 1 << 16,
    k_large_buf_max_mssql_captured_bytes = 1 << 16,
    k_large_buf_max_tcp_captured_bytes = 1 << 16,

Do we need to update k_large_buf_max_http_captured_bytes at least to fit 256Kb?

@mariomac mariomac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great addition! Added few suggestions

Comment thread pkg/ebpf/common/http/openai_stream.go
Comment thread pkg/ebpf/common/http_transform.go Outdated
Comment thread pkg/ebpf/common/http_transform.go Outdated
Comment thread pkg/ebpf/common/http/openai_stream.go Outdated
Comment thread bpf/common/large_buffers.h Outdated
Comment thread bpf/common/large_buffers.h Outdated
Comment thread devdocs/config/CONFIG.md Outdated
Comment thread devdocs/features.md Outdated
Comment thread pkg/config/ebpf_tracer.go Outdated
Comment thread pkg/config/ebpf_tracer.go Outdated
Comment thread pkg/config/ebpf_tracer.go Outdated
Comment thread pkg/ebpf/common/http/openai.go Outdated
Comment thread pkg/ebpf/common/tcp_large_buffer.go Outdated

@rafaelroquetto rafaelroquetto left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Review by Claude (Claude Code), posted by @rafaelroquetto on its behalf. This is an automated, ownership-focused review aligned with AGENTS.md and the repo's PR-review guidance — not Rafael's personal review (he has left separate comments inline).

The SSE parsing and the 256KB cross-syscall accumulation are sound at their core. My main concern is scope and ownership, per AGENTS.md ("Keep changes minimal and scoped to the task. Do not include unrelated edits"). Concrete code-level bugs are left as inline comments; the structural points are below.

1. Scope creep — the PR does much more than its title (please split)

Title: "support SSE streaming responses and raise HTTP capture limit to 256KB", but the diff also bundles features the description never mentions:

  • Rerank provider-format supportRerankRequest.NestedInput/NestedParams, GetQuery/GetDocuments/GetTopN, RerankResponse.NestedOutput, GetResults/GetID, VendorRerank.GetInput/GetOutput, gen_ai.rerank.top_n (span.go, tracesgen.go).
  • Retrieval top-kRetrievalRequest.TopK/TopKSnake/Limit, GetTopK, gen_ai.retrieval.top_k.
  • Embedding dimensionsGetEmbeddingDimensions() + a duplicated emit block in tracesgen.go.
  • MCP tool-call argument/result capturemcp.go, span.go, attrs.go, tracesgen.go.
  • Two SSL-capture band-aidsrecoverJSONBodyFromBuffer and the TLS-ciphertext drop, both in the final merge llm commit.

None of rerank/retrieval/embedding/SSL is needed for SSE + 256KB. Each is reasonable on its own, but bundled they are hard to review and to revert independently. Please split into focused PRs, each explaining why the approach fits this codebase.

2. Duplication instead of reuse

The JSON-vs-SSE response branch is copy-pasted near-verbatim in openai.go and qwen.go (and Anthropic now also uses looksLikeJSON). AGENTS.md: "Extend or adapt existing code instead of duplicating functionality." Please extract one helper, e.g. parseOpenAICompatibleResponse(respB) (request.VendorOpenAI, []request.ToolCall). (Inline note on the openai.go site.)

3. Smaller items

  • large_buffers.h: the pre-existing "must equal the lte= values" comment is now false for HTTP (64KB const vs 256KB ceiling), and the appended paragraph frames cross-syscall accumulation as HTTP-specific — every protocol accumulates identically; only HTTP's configured ceiling differs. A one-liner would be clearer.
  • MaxCapturedPayloadBytes is now HTTP-only in meaning but keeps a generic name and is consumed solely as the decompression budget in responses.go; the 64KB→256KB bump silently 4×'s that budget. Consider a dedicated decompression constant.
  • extractJSONRawField finds the field via a plain bytes.Index("\"messages\""), which can match inside a string value or nested key rather than the documented top-level field.
  • Commit messages (fix test, fix attr, merge llm, fix stream) don't convey intent; merge llm bundles the two unrelated SSL changes.

Asks

  1. Split the unrelated features (rerank/retrieval/embedding/SSL) into separate PRs.
  2. Deduplicate the OpenAI-compatible response branch.
  3. Fix the inline-flagged bugs (scanner cap, dead args, UnsafeView lifetime).
  4. Confirm local validation (make verify, make test) on the trimmed change.

Comment thread pkg/ebpf/common/http/openai_stream.go
Comment thread pkg/ebpf/common/http/openai_stream.go
Comment thread pkg/ebpf/common/http_transform.go Outdated
Comment thread pkg/ebpf/common/http/openai.go Outdated
@NameHaibinZhang

NameHaibinZhang commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

No — k_large_buf_max_http_captured_bytes is intentionally kept at 64 KB because it is the per-syscall emission cap, not the connection-level limit. The 256 KB budget is reached by accumulating across multiple tcp_recvmsg calls: each call emits up to 64 KB (4 × 16 KB ring-buffer events), and the running totals lb_req_bytes / lb_res_bytes track consumption until http_max_captured_bytes (256 KB) is exhausted. Keeping the per-call cap at 64 KB bounds the large_buf_emit_chunks() loop to 4 iterations, which is important for verifier complexity. @mariomac

…o 256KB

- Add SSE (Server-Sent Events) stream parser for OpenAI-compatible APIs
- Support streaming responses in OpenAI, Qwen, and Anthropic handlers
- Extract shared parseOpenAICompatibleResponse() helper to reduce duplication
- Raise HTTP payload capture limit to 256KB (accumulated across multiple recv calls)
- Add looksLikeJSON() helper for content-type detection
- Update configuration docs and schema

Relates to open-telemetry#2267
@NameHaibinZhang

Copy link
Copy Markdown
Member Author

PR Split Summary

Per @rafaelroquetto's review feedback on scope and ownership, this PR has been split into focused, independently reviewable changes:

# PR Scope
1 #2394 (this PR, reset) Core: SSE streaming response parsing + 256KB HTTP capture limit
2 #2407 JSON body recovery from truncated TLS capture buffer
3 #2408 MCP tool-call arguments and results capture
4 #2409 Retrieval top_k extraction (multi-vendor)
5 #2410 Rerank nested request/response format support
6 #2411 Embedding dimensions derivation from response data

All PRs relate to #2267. Each is self-contained and can be reviewed/merged independently.

@mariomac mariomac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing! Happy with the changes related to this PR (will check the others later). Will wait for Rafael's approval before merging.

Comment thread pkg/config/ebpf_tracer.go Outdated

@rafaelroquetto rafaelroquetto left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks much better, many thanks for all of this. I have one comment left regarding MaxCapturedPayloadBytes, once we address that we can merge this

@NameHaibinZhang

Copy link
Copy Markdown
Member Author

@rafaelroquetto done

@mmat11 mmat11 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has this been tested manually? can we also add an integration test which exercise the "super large buffer" path?

Comment thread bpf/common/large_buffers.h
// which enforces the same ceiling at configuration time.
k_large_buf_payload_max_size = 1 << 14,
k_large_buf_max_size = 1 << 15,
k_large_buf_max_http_captured_bytes = 1 << 16,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is gated to 64KB, which means

static __always_inline int http_send_large_buffer(http_info_t *req,
                                                  const void *u_buf,
                                                  u32 bytes_len,
                                                  u8 packet_type,
                                                  u8 direction,
                                                  enum large_buf_action action) {
    if (http_max_captured_bytes > k_large_buf_max_http_captured_bytes) {
        bpf_dbg_printk("BUG: http_max_captured_bytes exceeds maximum allowed value.");
    }

will fail if the value is higher than that?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question — it won't cause a functional failure. Here's the layered design:

  • http_max_captured_bytes (set from userspace, up to 256KB) is the total per-request per-direction budget. It controls when to stop accumulating data (bytes_sent >= http_max_captured_bytes).
  • k_large_buf_max_http_captured_bytes (64KB) is the per-syscall-event cap, used in bpf_clamp_umax(max_available_bytes, k_large_buf_max_http_captured_bytes) to bound each ring buffer submission. This is critical for the BPF verifier to prove memory safety.
  • The 256KB total budget is reached by accumulating multiple 64KB chunks across successive tcp_recvmsg events.

The bpf_dbg_printk("BUG: ...") check is a debug-only assertion (gated by g_bpf_debug, compiled out in production) that predates the HTTP limit raise. It's now stale for HTTP — I can clean it up or update the condition to reflect the new semantic. It does not affect correctness since the code continues past it regardless and the actual safety bound is enforced by bpf_clamp_umax.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought there was a return statement. I think we should clean it up if it's pointless

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — removed the stale assertion. It was comparing the per-request budget (256KB) against the per-syscall cap (64KB), which is now expected behavior in the multi-chunk design.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right approach. IMHO the statement is correct: it is a bug to set a value that is lager than k_large_buf_max_http_captured_bytes because k_large_buf_max_http_captured_bytes is used to bound max_available_bytes for the verifier. So if max_available_bytes is larger than k_large_buf_max_http_captured_bytes, we end up truncating the buffer and throwing away the remainder of the payload silently.

So IMHO we should at least log that, or try to remove the clamp and see if the verifier is happy.

@NameHaibinZhang NameHaibinZhang Jun 18, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelroquetto You're absolutely right — this was a real bug. If a single tcp_recvmsg delivers more than 64KB, BPF truncates the chunk, and blindly appending subsequent truncated chunks would create holes in the reassembled buffer.

Fixed: the userspace reassembly now detects truncation by checking whether the accumulated data for a single emission hits exactly the per-syscall cap (64KB) with a full final chunk (16KB payload). When that pattern is detected, the buffer is "sealed" and subsequent chunks for that direction are discarded — ensuring the assembled data is always a contiguous prefix (tail truncation only, no holes).

Added unit tests covering: truncation detection, non-truncation pass-through, and seal reset on new request.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NameHaibinZhang thanks for iterating this but I don't think this is the right approach and I'd rather we don't go down the userspace-heuristic path at all.

The userspace simply has no way of knowing whether a buffer was actually truncated. A syscall that got clamped at 64KB and a perfectly healthy one that just happens to land on a 64KB boundary produce exactly the same sequence of chunks, i.e same lengths, same actions, nothing to tell them apart. So len % 64KB == 0 isn't detecting truncation, it's guessing, and it'll guess wrong on legitimate traffic, silently capping perfectly healthy buffers at 64KB, which is the exact regression we're trying to avoid.

The truncation only exists as a fact inside BPF - that's the one place that knows bytes_len > cap so that's where it has to be dealt with.

I do think we can do this properly in eBPF with some effort. The verifier makes it fiddly, I know, and that's probably why we kept everything bounded at 64KB in the first place, but I'd much rather we take the time to respect the semantics and get it right than merge a workaround that quietly regresses the common case.

@NameHaibinZhang NameHaibinZhang Jun 18, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelroquetto You're right — the userspace heuristic can't reliably distinguish truncation from legitimate 64KB-aligned traffic. I'll revert the heuristic and implement this properly in BPF: add a truncated flag to the event metadata set when bytes_len > cap, so userspace gets an explicit signal to stop accumulation. Will push the BPF-side fix shortly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NameHaibinZhang: no, I don't mean it like that - a truncated flag is a workaround. We should be able to ship full 256KB from ebpf, the usual semantics. This will require some ebpf work. I'd suggest first trying to simply increasing the constant and see what the verifier says. In the likely event it complains, you will need to chunk the buffers. I'd rather us take some time to get it right than rush it and get it wrong, there's no urgency here.

@NameHaibinZhang NameHaibinZhang Jun 19, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelroquetto I've already tested raising k_large_buf_max_http_captured_bytes to 256KB directly — it fails the BPF verifier on 5.10 kernels. The loop unrolling in large_buf_emit_chunks goes from 4 iterations (4×16KB = 64KB) to 16 iterations (16×16KB = 256KB), which exceeds the instruction count limit on older kernels.

So the path forward would be a multi-emission approach (e.g. tail calls or multiple ring buffer submissions across separate BPF program invocations) to stay within verifier bounds while delivering the full 256KB. I'll work on that — it'll take a bit more time to get right.

@NameHaibinZhang

NameHaibinZhang commented Jun 18, 2026

Copy link
Copy Markdown
Member Author

Yes, this has been tested manually — I verified SSE streaming responses with payloads exceeding 64KB (up to ~200KB) and confirmed the large buffer path correctly accumulates chunks across multiple tcp_recvmsg events.

Regarding integration tests: agreed, this would be valuable. However, writing a proper integration test for the "super large buffer" path requires setting up an SSE/streaming server that produces >64KB responses and verifying the assembled payload in userspace — which is a non-trivial addition to the test infra. I'll track this as a follow-up. Would that be acceptable, or would you prefer it gated in this PR? @mmat11

@mmat11

mmat11 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Yes, this has been tested manually — I verified SSE streaming responses with payloads exceeding 64KB (up to ~200KB) and confirmed the large buffer path correctly accumulates chunks across multiple tcp_recvmsg events.

Regarding integration tests: agreed, this would be valuable. However, writing a proper integration test for the "super large buffer" path requires setting up an SSE/streaming server that produces >64KB responses and verifying the assembled payload in userspace — which is a non-trivial addition to the test infra. I'll track this as a follow-up. Would that be acceptable, or would you prefer it gated in this PR? @mmat11

follow-up is fine!

The check compared the per-request budget (now 256KB) against the
per-syscall cap (64KB), which is expected in the new multi-chunk
accumulation design. The assertion was debug-only and had no return
statement, making it pointless.
Comment thread pkg/ebpf/common/http/partial_json.go

@rafaelroquetto rafaelroquetto left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks better, thanks for the changes! There are a few failing tests that need to be addressed before approving, and I've reopened one of the discussions regarding k_large_buf_max_http_captured_bytes

Adopted reviewer's suggestion to use json.NewDecoder for extractJSONRawField.
This fixes a bug where string values could be falsely matched as field names
(e.g. {"label":"field","field":99} would match the value "field" instead
of the key). Also removed the now-unnecessary extractJSONRawValue helper.
…erence

When a single tcp_recvmsg delivers more data than the per-syscall BPF
cap (64KB), the captured chunk is truncated. Previously, subsequent
chunks would still be appended, creating holes in the reassembled
buffer. Now we detect truncation (captured == per-syscall cap while
the event is not terminal) and stop accumulating, ensuring the
userspace buffer is always a contiguous prefix — tail truncation only,
no holes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

GenAI: support streaming (SSE) responses and large response bodies (>64KB) in HTTP body capture

4 participants