
feat(tracing): add chunked tail-call traceparent scanner for large HTTP headers #1988

Open
smehboub wants to merge 8 commits into open-telemetry:main from smehboub:feat/chunked-traceparent-scanner

Conversation

@smehboub

@smehboub smehboub commented May 4, 2026

Contributor

Summary

Find traceparent headers buried after large HTTP headers (auth tokens, cookies, …).

Fixes: #1381


Description

When a service places large headers before traceparent, the header can land beyond
the first ~1 KB that OBI scans. OBI then misses it and generates a new trace ID,
silently breaking the distributed trace.

This PR fixes that by scanning the request in successive 1024-byte windows, advanced
in 956-byte steps, until the traceparent header is found or the end of the HTTP
headers is reached. The scan stops before the request body and never reads more than
the configured budget (OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB, default 4 KB, max 27 KB).
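
The scan described above can be modelled in plain userspace C. This is only a sketch of the windowing logic, not the BPF code from this PR: the function names are hypothetical, and the only values taken from the description are the 1024-byte window, the 956-byte step, and the stop-at-end-of-headers rule.

```c
#include <stddef.h>
#include <string.h>

/* Values taken from the PR description; all names are hypothetical. */
#define WINDOW_SIZE 1024u /* bytes read per scan */
#define CHUNK_STEP   956u /* distance between window starts */

/* Offset of needle in buf[0..len), trying start positions < max_start only. */
static long find_in_window(const char *buf, size_t len, const char *needle,
                           size_t max_start) {
    size_t n = strlen(needle);
    for (size_t i = 0; i < max_start && i + n <= len; i++) {
        if (memcmp(buf + i, needle, n) == 0) return (long)i;
    }
    return -1;
}

/*
 * Userspace model of the windowed scan. Returns the absolute offset of
 * "traceparent:" in req, or -1 when the end of the headers or the byte
 * budget is reached first.
 */
static long scan_traceparent(const char *req, size_t req_len, size_t budget) {
    size_t limit = req_len < budget ? req_len : budget;
    for (size_t off = 0; off < limit; off += CHUNK_STEP) {
        size_t win = limit - off < WINDOW_SIZE ? limit - off : WINDOW_SIZE;
        /* The header name is only searched at start positions below the step. */
        long tp = find_in_window(req + off, win, "traceparent:", CHUNK_STEP);
        long eoh = find_in_window(req + off, win, "\r\n\r\n", win);
        if (tp >= 0 && (eoh < 0 || tp < eoh)) return (long)off + tp;
        if (eoh >= 0) return -1; /* end of headers: never scan the body */
    }
    return -1;
}
```

Calling `scan_traceparent(req, req_len, 4096)` on a request whose header sits near byte 2000 walks three windows before returning the offset; the same request with a 1024-byte budget returns -1.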

The feature is active when OTEL_EBPF_CONTEXT_PROPAGATION or
OTEL_EBPF_TRACK_REQUEST_HEADERS is enabled and requires kernel 5.17 or newer.
On older kernels OBI falls back to the existing first-chunk behaviour automatically.


Design

Below is the design explanation needed for review: what is scanned, where it comes
from, what state is carried between scans, which boundary cases are handled, and how
this relates to the tpinjector path.

What is scanned, where it comes from, and when it is copied

The scanner reuses the request buffer pointer already captured during
return_recvmsg. On kernel ≥ 6.0, ITER_UBUF exposes a single contiguous
user-space pointer (iov_ctx->ubuf), stored as args->orig_buf. On kernels
5.17–5.x, ITER_IOVEC is used and iov_ctx->iov[0].iov_base serves as that base
(first segment only — see Limitations below). The SSL / Java-TLS path sets
u_buf_is_user = 1 and passes the already-decrypted user-space buffer unchanged;
orig_buf is not set on these paths, so chunked iterations read u_buf + offset directly
(the SSL record buffer) rather than orig_buf + offset.

The initial pass in __obi_continue_protocol_http_tp reads up to TRACE_BUF_SIZE - 1
bytes using bpf_probe_read (generic — handles both kernel and user memory) from
args->u_buf into tp_char_buf_mem(). This is the existing first-chunk read.

Each subsequent chunked scan in __tp_chunk_scan performs one read of up to
1024 bytes into the same per-CPU scratch buffer. The helper function is selected at
runtime based on u_buf_is_user:

  • bpf_probe_read_user when reading from a user-space buffer (ITER_UBUF, SSL, or
    Java-TLS path)
  • bpf_probe_read_kernel when reading from a kernel-side iovec

The scanner does not write any new persistent state. With a 4 KB budget it issues at
most 4 additional reads beyond the initial pass, and only for requests where
traceparent was not found in the first kilobyte.

Initial pass:
  args->u_buf                          →  tp_char_buf_mem()   (per-CPU, 1024 B)
  via bpf_probe_read (generic)

Chunk N (niter ≥ 1):
  orig_buf + (niter × 956)             →  tp_char_buf_mem()   (per-CPU, 1024 B)
  via bpf_probe_read_user or _kernel
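
As a quick check of the "at most 4 additional reads" figure, the following sketch counts how many chunk reads follow the initial pass for a given budget. It assumes, as the tail-call budget section of this description suggests, that a chunk is read only when its start offset (niter × 956) is below the byte budget and that niter never exceeds 28. The function name is hypothetical.

```c
#include <stddef.h>

#define CHUNK_STEP 956u /* distance between window starts */
#define MAX_NITER   28u /* highest niter mentioned in the description */

/* Number of chunk reads issued after the initial pass for a budget in KB. */
static unsigned extra_reads_for_budget(unsigned budget_kb) {
    size_t budget = (size_t)budget_kb * 1024u;
    unsigned reads = 0;
    for (unsigned niter = 1; niter <= MAX_NITER; niter++) {
        if ((size_t)niter * CHUNK_STEP >= budget) break; /* chunk starts past budget */
        reads++;
    }
    return reads;
}
```

Under these assumptions a 4 KB budget gives 4 extra reads (offsets 956, 1912, 2868, 3824) and the 27 KB maximum gives 28.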

State carried between tail calls

There is no new BPF map. State stays in call_protocol_args_t, the existing per-CPU
scratch accessed via protocol_args(). Four fields were added:

| Field | Set by | Purpose |
| --- | --- | --- |
| niter | caller; incremented | iteration counter; drives offset = niter × 956 |
| orig_buf | return_recvmsg | base user-space pointer for the full buffer |
| full_bytes_len | return_recvmsg | total buffer length (may exceed bytes_len) |
| u_buf_is_user | handle_buf_* | call bpf_probe_read_user (1) vs _kernel (0) |

Between iterations, the only mutation is niter++ inside __tp_chunk_scan when it
returns k_tp_scan_continue. Everything else (orig_buf, full_bytes_len,
u_buf_is_user) is fixed for the lifetime of the event.
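
To make the carried state concrete, here is a hypothetical sketch of the four fields and of how the next window address follows from them. The field names come from the table above; the types, the struct name, and the helper are assumptions for illustration and do not reflect the real layout of call_protocol_args_t.

```c
#include <stdint.h>

#define CHUNK_STEP 956u /* distance between window starts */

/* Hypothetical layout: field names from the description, types assumed. */
struct tp_scan_fields {
    uint8_t  niter;          /* iteration counter, the only field that changes */
    uint8_t  u_buf_is_user;  /* 1: user-space read helper, 0: kernel-side helper */
    uint64_t orig_buf;       /* base pointer of the full buffer, kept as an integer */
    uint32_t full_bytes_len; /* total buffer length */
};

/* Address of the window for the current niter, or 0 when it starts past the buffer. */
static uint64_t tp_scan_window_addr(const struct tp_scan_fields *s) {
    uint32_t off = (uint32_t)s->niter * CHUNK_STEP;
    if (off >= s->full_bytes_len) return 0;
    return s->orig_buf + off;
}
```

Because only niter changes between iterations, each tail call can recompute its read address from the same fixed base and length.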

Programs and tail-call budget

Two new programs are registered in the existing jump_table:

Program 13 — obi_parse_traceparent_http (initial-request path)

Triggered from __obi_continue_protocol_http_tp when the first-pass
bpf_strstr_tp_loop returns nothing and the total buffer length exceeds 1023 bytes.
Starts at niter = 1 (chunk 0 already scanned). Runs up to 28 times (niter=1..28;
the 28th invocation reads but does not emit a further tail-call), then falls through
to __obi_continue2_protocol_http which is inlined so it always runs even when the
tail-call budget is exhausted. When connection_meta_by_direction returns NULL (no
direction metadata available), the program tail-calls k_tail_continue2_protocol_http
directly without scanning.

Program 14 — obi_parse_traceparent_http_append (large-buffer append path)

Triggered from obi_handle_buf_with_args and __obi_protocol_http when
still_reading(info) is true — i.e., when OBI is accumulating a multi-recv-call
HTTP request. Starts at niter = 0; base_offset = info->len − args->bytes_len
gives the position of this chunk within the total request. Calls
http_send_large_buffer at the end on all paths where info is available
(including when meta is NULL), so the streaming semantics are preserved.

Tail-call budget:

Program 13: 3 preceding TCs (handle_buf_with_args → protocol_http → continue_protocol_http_tp)
            + 28 invocations of TC13 (niter=1..28) = 31
Program 14: 3 preceding TCs (handle_buf_with_args → protocol_http → prog14) — tightest path
            + 29 invocations of TC14 (niter=0..28) = 32 = BPF_MAX_TAIL_CALL_CNT (5.8–5.9)
            (direct handle_buf_with_args path: 2 preceding + 29 invocations = 31)

The 27 KB upper bound on OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB comes from the
prog 14 budget (tightest path). Prog 14 runs for niter=0..28 (29 invocations); at
niter=28 the bottom guard niter+1 < 29 is false and __tp_chunk_scan returns
k_tp_scan_done without emitting a further tail-call. The last read starts at offset
28 × 956 = 26,768 bytes and is capped to the remaining budget:
27,648 − 26,768 = 880 bytes. Last byte covered: index 26,768 + 879 = 27,647
(= 27 × 1024 − 1), covering 27,648 bytes = 27 KB total (indices 0..27,647).
29 × 956 = 27,724 is the start of the next chunk, which is never read: the
bottom guard's budget check (27,724 < 27,648) is false, so no tail-call is issued.
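
The arithmetic above can be reproduced with a small helper that returns the length of the read issued at a given niter, capped to the remaining budget. This is a sketch with a hypothetical function name; only the 1024-byte window and the 956-byte step come from the description.

```c
#include <stddef.h>

#define WINDOW_SIZE 1024u /* bytes read per scan */
#define CHUNK_STEP   956u /* distance between window starts */

/* Length of the read issued at iteration niter, capped to the remaining budget. */
static size_t chunk_read_len(unsigned niter, size_t budget) {
    size_t off = (size_t)niter * CHUNK_STEP;
    if (off >= budget) return 0; /* chunk starts past the budget: never read */
    size_t remaining = budget - off;
    return remaining < WINDOW_SIZE ? remaining : WINDOW_SIZE;
}
```

With a 27 KB budget (27,648 bytes), niter = 28 yields an 880-byte read and niter = 29 yields 0, which matches the figures in the paragraph above.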

If the initial tail call to prog 13 fails (for example because the tail-call table is
not loaded yet at startup), __obi_continue_protocol_http_tp falls through to its
normal completion path. In that case no extra chunk scan happens, but the existing
first-chunk result is preserved. Prog 13 then inlines __obi_continue2_protocol_http
at the end regardless of how the scan ended, so the request is never dropped.

State machine

The chunked scanner only runs on the bpf_strstr_tp_loop code path (kernel ≥ 5.17
with bpf_loop support). On older kernels, OBI uses bpf_strstr_tp_loop__legacy, so
the condition tp_loop_fn == bpf_strstr_tp_loop is never satisfied and the scanner is
never triggered, regardless of OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB.

tp_state_machine
Initial pass (bpf_strstr_tp_loop):
  traceparent found                → DONE (existing path, no change)
  not found, total_len ≤ 1023     → DONE (no room for more chunks)
  not found, k<5.17               → DONE (bpf_strstr_tp_loop__legacy path, never reaches scanner)
  not found, budget = 0           → DONE (feature administratively disabled)
  not found, total_len > 1023     → SCANNING (niter=1, tail-call prog 13)

__tp_chunk_scan (prog 13 or 14):
  traceparent found  → FOUND  (set_trace_info_for_connection called, continue2)
  \r\n\r\n found     → DONE   (end-of-headers, no traceparent, continue2)
  niter ≥ 29         → DONE   (tail-call budget exhausted, continue2)
  budget exhausted   → DONE   (byte budget exhausted)
  none of above      → SCANNING (niter++, self-tail-call)
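
The per-chunk transitions listed above can be written as a single decision function. This is a userspace sketch, not the code of __tp_chunk_scan: the enum and function names are hypothetical, and it assumes the scan continues only when niter + 1 < 29 and the next chunk would start below the byte budget.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHUNK_STEP 956u /* distance between window starts */
#define NITER_CAP   29u /* scanning stops once niter + 1 reaches this value */

enum tp_scan_result { TP_SCANNING, TP_FOUND, TP_DONE };

/* Decide what happens after one window has been searched at iteration niter. */
static enum tp_scan_result tp_scan_step(bool tp_found, bool eoh_found,
                                        unsigned niter, size_t budget) {
    if (tp_found) return TP_FOUND;            /* header located in this window */
    if (eoh_found) return TP_DONE;            /* end of headers, no traceparent */
    if (niter + 1 >= NITER_CAP) return TP_DONE; /* iteration cap reached */
    if ((size_t)(niter + 1) * CHUNK_STEP >= budget) return TP_DONE; /* byte budget */
    return TP_SCANNING;                       /* caller increments niter and repeats */
}
```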

Why the chunk boundary case does not need partial matching

bpf_strstr_tp_eoh only searches for traceparent: at positions
index < TRACE_BUF_SIZE − TRACE_PARENT_HEADER_LEN (= 956). A header starting at
offset 955 within the 1024-byte window ends at offset 1022, fully contained. The
header is never split across two reads. This is also why k_tp_chunk_step = 956
rather than 1024: the overlap ensures any match is always complete in one window.
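
The overlap argument can be checked numerically. The sketch below maps an absolute header start position to the window that finds it and to the last byte the header occupies inside that window. The struct and function names are hypothetical; the 68-byte header length is derived from 1024 − 956 as stated above.

```c
#define WINDOW_SIZE   1024u /* bytes read per scan */
#define CHUNK_STEP     956u /* distance between window starts */
#define TP_HEADER_LEN   68u /* WINDOW_SIZE - CHUNK_STEP */

struct tp_window_hit {
    unsigned niter; /* window that searches this start position */
    unsigned index; /* start position relative to that window */
    unsigned end;   /* last header byte relative to that window */
};

/* Map an absolute header start position to its window and in-window span. */
static struct tp_window_hit window_for_header(unsigned pos) {
    struct tp_window_hit h;
    h.niter = pos / CHUNK_STEP;
    h.index = pos % CHUNK_STEP;
    h.end = h.index + TP_HEADER_LEN - 1;
    return h;
}
```

Since index is always below 956, end never exceeds 1022, so a complete header fits in the window that finds it.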

The integration test testWithTraceparentAtChunkBoundary exercises exactly this case:
internal/test/integration/traceparent_extraction_test.go. The test constructs a
request where traceparent starts at exactly byte 956 of the wire data:

GET /with-huge-tp HTTP/1.1\r\n   = 28 bytes
Host: localhost:6000\r\n         = 22 bytes
A-Filler: <894 bytes>\r\n        = 12 + 894 bytes
traceparent: <value>             ← starts at byte 956
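
The byte count in this layout can be verified with a few lines of C. This is a standalone sketch, not code from the integration test: the function name is hypothetical and the strings are copied from the layout above.

```c
#include <string.h>

/* Wire offset at which the traceparent header starts for a given filler size. */
static size_t traceparent_wire_offset(size_t filler_len) {
    static const char request_line[] = "GET /with-huge-tp HTTP/1.1\r\n"; /* 28 bytes */
    static const char host[] = "Host: localhost:6000\r\n";              /* 22 bytes */
    static const char filler_name[] = "A-Filler: ";                     /* 10 bytes */
    /* The trailing 2 accounts for the CRLF that ends the filler header. */
    return strlen(request_line) + strlen(host) + strlen(filler_name) + filler_len + 2;
}
```

With an 894-byte filler value the function returns 956, the first byte of the second window.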

The integration test suite passes with OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB=4
on kernel 6.x. It includes 5 sub-tests covering: no traceparent, traceparent in first
chunk, forwarded traceparent, large headers (> 1 KB), and chunk boundary.


Relationship with tpinjector

tpinjector and the chunked scanner sit on opposite traffic directions and on
different attach points:

  • Chunked scanner — kretprobe on return_recvmsg (ingress). Extracts the
    incoming traceparent from request headers and stores it via
    set_trace_info_for_connection.
  • tpinjector — sk_msg / sockops on the egress path. Reads from
    outgoing_trace_map to inject a new traceparent into outgoing requests.

The ingress trace-info store and outgoing_trace_map are distinct maps. The scanner
never writes to outgoing_trace_map, and the injector never reads from the ingress
trace-info store. They do not interfere.

In the proxy / middleware scenario (service receives a request then makes an
outgoing call), the scanner extracts the incoming trace ID on ingress, OBI creates
a child span, and the injector propagates that child's trace ID on egress. This is
the correct W3C Trace Context behaviour and it is unchanged by this PR.


Limitations (out of scope for this PR)

| Case | Reason |
| --- | --- |
| HTTP/2 | Headers are HPACK-compressed; traceparent does not appear as plaintext |
| HTTP/3 / QUIC | Transport-layer encryption |
| TLS without userspace offloading | Socket buffer is ciphertext at the kprobe |
| Multi-segment TCP (traceparent in a later recv call) | Scanner covers one recvmsg call; if traceparent arrives in a separate segment OBI does not see it |
| Scatter-gather I/O, kernel 5.17–5.x | The chunked rescan relies on orig_buf, which is only captured when nr_segs == 1. When nr_segs > 1 the initial read_iovec_ctx copy still covers up to 16 segments (total-capped at 8 KB), but no chunked rescan is available for subsequent chunks. Common clients (curl, Go net/http, Java HttpClient, nginx, envoy) send all headers in one buffer. On kernel ≥ 6.0 (ITER_UBUF) this limitation does not apply. |
| traceparent beyond configured budget | By design: bounded to cap per-request CPU cost. Configurable up to 27 KB. |

The scanner only activates when OTEL_EBPF_TRACK_REQUEST_HEADERS or
OTEL_EBPF_CONTEXT_PROPAGATION is enabled. Without either, g_bpf_traceparent_enabled
is false and no scan takes place.


Performance

These benchmarks were run against openresty in a local single-node kind cluster
(kernel 6.x). Traffic went through kubectl port-forward (loopback, same node), so
the absolute latency figures (7–37 ms) include port-forward overhead and should not be
read as production latency numbers. The relative differences between scenarios are
the useful signal here. OBI was deployed as a DaemonSet after each image rebuild from
a clean --no-cache build followed by make generate. Load was generated with
[oha 1.14.0](https://github.com/hatoo/oha) (100 concurrent connections, 30 s).
BPF program runtime metrics were collected via the kernel bpf_stats_enabled
interface (/proc/sys/kernel/bpf_stats_enabled).

Three scenarios were measured on this branch:

  • baseline: traceparent in the first 100 bytes; chunked scanner not triggered.
  • chunked: 2.5 KB filler header before traceparent; scanner activates,
    finds header in chunk 3.
  • boundary: filler sized so traceparent starts at exactly byte 956; found at
    niter = 1, at index 0 of the second window.

oha results

Kernel 6.17.0-19-generic, kind cluster, kubectl port-forward, oha 1.14.0,
100 connections, 30 s, OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB=27.

| Scenario | req size | p50 | p95 | p99 | req/sec | avg | success |
| --- | --- | --- | --- | --- | --- | --- | --- |
| main baseline | ~200 B | 7.9 ms | 14.7 ms | 19.6 ms | 11,617 | 8.6 ms | 100% |
| feat: baseline (scanner not triggered) | ~200 B | 7.4 ms | 14.2 ms | 18.7 ms | 12,338 | 8.1 ms | 100% |
| feat: chunked (2.5 KB filler, ~3 chunks) | ~2.6 KB | 20.2 ms | 30.4 ms | 37.0 ms | 4,636 | 21.6 ms | 100% |
| feat: boundary (traceparent at byte 956) | ~1.1 KB | 15.4 ms | 22.5 ms | 28.7 ms | 6,114 | 16.4 ms | 100% |

All four runs were collected in the same session, on the same kernel
(6.17.0-19-generic), through the same kubectl port-forward, with images built from
clean --no-cache builds after make generate. The main takeaway is that main and
feat: baseline show equivalent throughput (~11-12K req/sec, p99 ≈ 19 ms), which is
what we would expect if the fast path in __obi_continue_protocol_http_tp adds no
measurable overhead when total_len ≤ TRACE_BUF_SIZE. The higher latency in the
chunked and boundary scenarios tracks the larger request size (~2.6 KB and ~1.1 KB),
not BPF runtime.

BPF program runtime (bpf_stats)

obi_parse_traceparent_http (prog 13) and obi_parse_traceparent_http_append
(prog 14) only run on requests where traceparent was not found in the first ~1 KB.
Their average runtime per invocation is measured below.

[Images: three dashboard screenshots]

The dashboards above show, per branch and scenario: BPF runs/sec and avg runtime per
program, BPF map memory, OBI process VmRSS, and Go heap allocation rate.

Summary across all four scenarios (same session, kernel 6.17.0-19-generic, --no-cache builds):

| Scenario | req size | p50 | p95 | p99 | req/sec | avg |
| --- | --- | --- | --- | --- | --- | --- |
| main baseline | ~200 B | 7.9 ms | 14.7 ms | 19.6 ms | 11,617 | 8.6 ms |
| feat: baseline (scanner not triggered) | ~200 B | 7.4 ms | 14.2 ms | 18.7 ms | 12,338 | 8.1 ms |
| feat: chunked (scanner active) | ~2.6 KB | 20.2 ms | 30.4 ms | 37.0 ms | 4,636 | 21.6 ms |
| feat: boundary (scanner active) | ~1.1 KB | 15.4 ms | 22.5 ms | 28.7 ms | 6,114 | 16.4 ms |

main and feat: baseline are equivalent within the precision of this setup (11,617 vs
12,338 req/sec; p50, p95, and p99 within 1 ms of each other). The higher latency in
the chunked and boundary scenarios is explained by the larger request sizes (more
bytes through the port-forward per request), not by BPF overhead.

Expected overhead

For the common case (scanner not triggered — traceparent in first chunk or feature
disabled), there is no additional overhead beyond the guard checks already present in
__obi_continue_protocol_http_tp. When the scanner does activate, the cost is one
bpf_probe_read_user of 1024 bytes per 956-byte step, bounded by the configured
budget (default 4 KB = 4 reads max).


Test environments

All tests were run with OBI deployed as a DaemonSet in a local single-node kind
cluster, instrumenting an openresty application. Each environment was a fresh
deployment rebuilt from this branch.

| Kernel | Distribution | Chunked scanner | TestTraceparentExtraction |
| --- | --- | --- | --- |
| 5.15.0-91-generic | Ubuntu 22.04.3 LTS | disabled (< 5.17, legacy path) | 3/5 PASS (chunked tests fail: scanner stubbed out, trace ID not extracted) |
| 5.19.17-051917-generic | Ubuntu 22.04.3 LTS | enabled | 5/5 PASS |
| 6.17.0-19-generic | Ubuntu 24.04.4 LTS | enabled | 5/5 PASS |

On 5.15, FixupSpec swaps obi_continue_protocol_http and
obi_parse_traceparent_http{,_append} for stubs. The three baseline tests
(without_traceparent, with_traceparent, with_forwarded_traceparent) pass
normally; the two chunked tests fail as expected because bpf_loop (helper #181)
is not available on that kernel. OTEL_EBPF_OVERRIDE_BPF_LOOP_ENABLED must
not be set on unmodified 5.15 kernels: it bypasses the version check and makes
the verifier reject the program at load time.

Performance benchmarks (oha load tests) were run on the 6.17.0 kernel only.

Validation

@grcevski

grcevski commented May 4, 2026

Contributor

This PR contains a number of unrelated changes. If you want to implement a fix for this scenario, the fix needs to be scoped to the smallest change possible. Also, the PR doesn't provide a test which we can use to prove the changes are good.

You've provided an example with large payload, but not an example where the traceparent can be split between chunks.

@rafaelroquetto

Contributor

On top of what @grcevski said, whilst we are not against the usage of AI, I encourage you to review our contributing guidelines in relation to code ownership and understanding.

I will highlight the 2 important points here:

Contributors must review, test, and understand all changes before submitting a PR. This applies equally to manually written and tool-generated code. Use of AI or other tools does not transfer responsibility — the contributor is fully accountable for the final patch.

and

Code Ownership

AI tools are permitted, but they do not change what is expected of contributors. Every line in a PR is your responsibility, regardless of how it was produced. If you cannot explain a change, do not submit it.

Reviewers will ask you to walk through your changes. Inability to explain the rationale, the approach, or the details of any part of a PR is grounds for rejection. This includes changes generated or suggested by AI tools.

Avoid reimplementing existing code. AI tools frequently generate new implementations of functionality that already exists in the codebase. Before introducing any new utility, helper, abstraction, or pattern, search the codebase first. Reviewers will reject code that duplicates existing functionality.

Vet AI-generated plans and issue reports before filing. If you use an AI tool to draft an issue, design proposal, or implementation plan, read it critically before submitting. Check that it accurately reflects the codebase, does not contradict existing architecture, and does not propose work that is already done. Unvetted AI output creates noise and wastes reviewer time.

This PR description, the related issue, and the description of other related PRs are all AI generated. The code is mostly AI generated as well - but it does not follow the guidelines set by AGENTS.md - I encourage you to point your AI agent there - and also at copilot-instructions.md, ebpf.instructions.md and friends.

In particular, I'd highly encourage you to actually handcraft the PR descriptions, as that helps convey that the changes being proposed are well understood. Otherwise, it gets really difficult for us on the reviewing end to establish whether we are reviewing code pre-vetted by the author, or just some AI slop.

@rafaelroquetto rafaelroquetto left a comment

Contributor

Aside from my previous comment, I gave this a real look and I can't review it fairly in its current state. The scope and structure need to be addressed first.

obi_parse_traceparent_http alone is a good example of what I mean (it's overly complicated, there's a flag controlling I guess 5 different branches, etc...).

Comment thread bpf/generictracer/protocol_http.h Outdated
Comment thread bpf/generictracer/protocol_http.h Outdated
Comment thread bpf/common/http_types.h Outdated
@codecov

codecov Bot commented May 14, 2026


Codecov Report

❌ Patch coverage is 76.31579% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.39%. Comparing base (bff2bcd) to head (8b0bb0d).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/internal/ebpf/gotracer/gotracer.go | 75.00% | 3 Missing and 1 partial ⚠️ |
| pkg/internal/ebpf/generictracer/generictracer.go | 80.00% | 3 Missing ⚠️ |
| pkg/ebpf/common/common.go | 50.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1988      +/-   ##
==========================================
+ Coverage   69.27%   69.39%   +0.11%     
==========================================
  Files         297      297              
  Lines       38419    38443      +24     
==========================================
+ Hits        26615    26676      +61     
+ Misses      10351    10317      -34     
+ Partials     1453     1450       -3     
| Flag | Coverage Δ | |
| --- | --- | --- |
| integration-test | 52.09% <65.21%> (-0.71%) | ⬇️ |
| integration-test-arm | 29.42% <57.89%> (+0.63%) | ⬆️ |
| integration-test-vm-5.15-lts | 31.85% <65.21%> (+1.70%) | ⬆️ |
| integration-test-vm-6.18-lts | 30.19% <65.21%> (+0.09%) | ⬆️ |
| k8s-integration-test | 39.89% <65.21%> (+0.22%) | ⬆️ |
| oats-test | 37.70% <65.21%> (-0.18%) | ⬇️ |
| unittests | 61.55% <47.82%> (+0.09%) | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@smehboub smehboub force-pushed the feat/chunked-traceparent-scanner branch from 0ca8973 to 9f69b7b Compare May 18, 2026 17:50
@smehboub

smehboub commented May 18, 2026

Contributor Author

> This PR contains a number of unrelated changes. If you want to implement a fix for this scenario, the fix needs to be scoped to the smallest change possible. Also, the PR doesn't provide a test which we can use to prove the changes are good.
>
> You've provided an example with large payload, but not an example where the traceparent can be split between chunks.

Thanks for the feedback!

I agree that the scope is too broad. I'm working on splitting the PR into smaller, focused changes.

As for the test, I added testWithTraceparentAtChunkBoundary to target byte 956, which is where the eBPF scanner transitions from chunk 0 to chunk 1.

@smehboub smehboub force-pushed the feat/chunked-traceparent-scanner branch from 9f69b7b to fed9ed4 Compare May 18, 2026 18:34
@smehboub

Contributor Author

Hi @rafaelroquetto @grcevski

Sorry, the PR description was not clear enough about what original_trace_id was actually doing.

original_trace_id was a workaround for a map key mismatch that only happens when OTEL_EBPF_BPF_BUFFER_SIZE_HTTP > 0 (http_max_captured_bytes).

When the large-buffer path is active, http_send_large_buffer writes body chunks into a Go map keyed on largeBufferKey{traceID, ...}, using tp.trace_id as it exists at the time the chunk is sent to the ring buffer. Later, when the chunked scanner finds the traceparent in the HTTP headers, it overwrites tp.trace_id with the client's value. By the time the final HTTP event arrives on the Go side, event.Tp.TraceId no longer matches the key that was used to store the chunks, so the lookup fails and OBI logs:

DEBUG missing large buffer for HTTP request traceID=<client_trace_id> ...

original_trace_id was a stable copy of tp.trace_id captured before the scanner ran, so both the BPF write and the Go lookup used the same key. It never affected which trace ID was propagated because tp.trace_id was always overwritten unconditionally at decode_hex(tp_p->tp.trace_id, ...) in both paths.

I hit this because I had OTEL_EBPF_BPF_BUFFER_SIZE_HTTP set in my test environment, which is why I introduced it.

Since this path is only reachable when that variable is explicitly configured (disabled by default), I removed original_trace_id entirely to keep this PR focused.

@smehboub

smehboub commented May 18, 2026

Contributor Author

> Aside from my previous comment, I gave this a real look and I can't review it fairly in its current state. The scope and structure need to be addressed first.
>
> obi_parse_traceparent_http alone is a good example of what I mean (it's overly complicated, there's a flag controlling I guess 5 different branches, etc...).

obi_parse_traceparent_http is now split into two programs (index 13 for the initial-request path, index 14 for the append path), each with a single tp source and no branches on mode. Both delegate to __tp_chunk_scan.

@smehboub smehboub force-pushed the feat/chunked-traceparent-scanner branch from fed9ed4 to 0340299 Compare May 18, 2026 18:54
@smehboub smehboub marked this pull request as ready for review May 18, 2026 19:15
@smehboub smehboub requested a review from rafaelroquetto May 18, 2026 19:15
@rafaelroquetto

Contributor

Hi @smehboub, just before I dive into it, would you mind explaining to me how exactly this works (please no AI generated prompts)?

(One feedback I can give you upfront - the comments are too verbose - please see our guidelines and the agents instructions).

@smehboub

Contributor Author

> Hi @smehboub, just before I dive into it, would you mind explaining to me how exactly this works (please no AI generated prompts)?
>
> (One feedback I can give you upfront - the comments are too verbose - please see our guidelines and the agents instructions).

Hi @rafaelroquetto, I updated the description to make it clearer with these explanations and removed the obsolete comments.

@rafaelroquetto

Contributor

@smehboub I saw the updated description, but it is still not enough for reviewing this change.

The main issue is that maintainers still have to reverse-engineer the state machine, memory model, and tradeoffs from the diff before we can even start reviewing the implementation. This PR is large and touches core HTTP parsing logic, so the design needs to be clear upfront.

What is missing is the actual design explanation. The description says the request is scanned in "successive windows" and mentions the new limit, but it does not explain the mechanism clearly enough: what data is scanned, where it comes from, when it is copied, what state is carried between scans, and which boundary cases this actually handles.

For example, "split across windows" can mean several different things here: scan chunks, BPF scratch buffers, application buffers, send/recv calls, TCP segments, or the large-buffer append path. These are not equivalent, and the PR needs to say which cases are handled and which are not.

Please also explain the performance side. If this copies request data while scanning, describe from where to where, how often, and whether it was profiled. This is a hot path.

Finally, I'd also like to understand how this relates to the existing traceparent handling in the tpinjector path. I didn't see that being touched.

The goal here is not to add process for its own sake. We need the high-level design before line-by-line review.

@smehboub

Contributor Author

> @smehboub I saw the updated description, but it is still not enough for reviewing this change.
>
> The main issue is that maintainers still have to reverse-engineer the state machine, memory model, and tradeoffs from the diff before we can even start reviewing the implementation. This PR is large and touches core HTTP parsing logic, so the design needs to be clear upfront.
>
> What is missing is the actual design explanation. The description says the request is scanned in "successive windows" and mentions the new limit, but it does not explain the mechanism clearly enough: what data is scanned, where it comes from, when it is copied, what state is carried between scans, and which boundary cases this actually handles.
>
> For example, "split across windows" can mean several different things here: scan chunks, BPF scratch buffers, application buffers, send/recv calls, TCP segments, or the large-buffer append path. These are not equivalent, and the PR needs to say which cases are handled and which are not.
>
> Please also explain the performance side. If this copies request data while scanning, describe from where to where, how often, and whether it was profiled. This is a hot path.
>
> Finally, I'd also like to understand how this relates to the existing traceparent handling in the tpinjector path. I didn't see that being touched.
>
> The goal here is not to add process for its own sake. We need the high-level design before line-by-line review.

Thank you very much for this feedback @rafaelroquetto. The DoD and your expectations are much clearer to me now.

I completely agree that a high-level design explanation is necessary before a line-by-line review, especially for a hot path touching core HTTP parsing.

I am going to convert this PR into a draft for now (I have a few personal constraints over the next few days). I will put in the necessary effort to address all your points (the state machine, memory model, performance impact, and the tpinjector relationship).

Thanks again for guiding me through the process!

@smehboub smehboub marked this pull request as draft May 19, 2026 22:06
@rafaelroquetto

Contributor

@smehboub thank you! Just make sure you don't spend time/effort implementing something before we have the chance to discuss and vet it. Because you are touching a core part of the code, it's important that we are all on the same page about the approach chosen. It's best to hash out the high-level details now, and assess the feasibility of what is being proposed, than you spending time on something that may not take off.

Feel free to ask any questions here - happy to answer them.

@marctc

marctc commented May 20, 2026

Contributor

@smehboub I didn't review the code either, but I read the title and clicked through to the issue you are trying to fix. I'm not familiar with this piece of the code, but to me it feels that, to solve the problem, it'd be great to see different proposals on how to fix this in the original issue, with pros/cons and high-level detail of each, rather than creating this gigantic PR that is hard for anyone to reason about.

@smehboub

Contributor Author

> @smehboub I didn't review the code either, but I read the title and clicked through to the issue you are trying to fix. I'm not familiar with this piece of the code, but to me it feels that, to solve the problem, it'd be great to see different proposals on how to fix this in the original issue, with pros/cons and high-level detail of each, rather than creating this gigantic PR that is hard for anyone to reason about.

Hi @marctc,

Absolutely, that makes complete sense. I would rather we get the design right than rush something into a core path.

To give you a bit of context on where I'm coming from: I work in an e-commerce business unit that does end-to-end tracing, from frontends down to data services, using higher-level libraries and agents. I see a lot of potential in OBI precisely because of its language-agnostic nature. The problem is that on the frontend side, headers tend to be large by default (partner cookies, auth tokens, etc.), which means this limitation effectively blocks end-to-end tracing for our use cases. That's what motivated me to file the bug/limitation in the first place.

I should also be transparent: I'm exploring this on my own time, out of personal interest and curiosity. I'm not paid or mandated by my employer to work on this. My only real goal is that this use case gets covered, regardless of whether this particular implementation is the one that lands.

I will work on putting together a proper design doc covering the state machine, memory model, performance side and the tpinjector relationship before touching any more code. If there are alternative approaches you think are worth considering before I go further, I'm very open to hearing them.

Thanks in advance.

@smehboub smehboub force-pushed the feat/chunked-traceparent-scanner branch from df838c4 to 8b0bb0d Compare June 1, 2026 00:31
@smehboub

smehboub commented Jun 1, 2026

Contributor Author

> @smehboub I saw the updated description, but it is still not enough for reviewing this change.
>
> The main issue is that maintainers still have to reverse-engineer the state machine, memory model, and tradeoffs from the diff before we can even start reviewing the implementation. This PR is large and touches core HTTP parsing logic, so the design needs to be clear upfront.
>
> What is missing is the actual design explanation. The description says the request is scanned in "successive windows" and mentions the new limit, but it does not explain the mechanism clearly enough: what data is scanned, where it comes from, when it is copied, what state is carried between scans, and which boundary cases this actually handles.
>
> For example, "split across windows" can mean several different things here: scan chunks, BPF scratch buffers, application buffers, send/recv calls, TCP segments, or the large-buffer append path. These are not equivalent, and the PR needs to say which cases are handled and which are not.
>
> Please also explain the performance side. If this copies request data while scanning, describe from where to where, how often, and whether it was profiled. This is a hot path.
>
> Finally, I'd also like to understand how this relates to the existing traceparent handling in the tpinjector path. I didn't see that being touched.
>
> The goal here is not to add process for its own sake. We need the high-level design before line-by-line review.

Hi @rafaelroquetto

I updated the PR description to focus on the design first, before another line-by-line review.

It now explains the scan path end to end: what buffer is being scanned, how bytes are copied, what state is preserved across tail calls, which cases are handled versus out of scope, how the append path fits in, and how this interacts with the state machine, memory model, performance characteristics, and the existing tpinjector path.

I also tightened the wording around older kernels and scatter-gather I/O so it matches the current code, tests, and measured behavior.

If there is a part you would like me to make more explicit before code-level review, I’m happy to do that.

Thanks in advance.

@smehboub smehboub marked this pull request as ready for review June 1, 2026 01:03


Development

Successfully merging this pull request may close these issues.

Bug: traceparent header not detected when located beyond the eBPF capture buffer (large headers offset)

4 participants