feat(tracing): add chunked tail-call traceparent scanner for large HTTP headers#1988
feat(tracing): add chunked tail-call traceparent scanner for large HTTP headers#1988smehboub wants to merge 8 commits into
Conversation
|
This PR contains number of unrelated changes, if you want to implement a fix for this scenario the fix needs to be scoped to the smallest change possible. Also, the PR doesn't provide a test which we can use to prove the changes are good. You've provided an example with large payload, but not an example where the traceparent can be split between chunks. |
|
On top of what @grcevski said, whilst we are not against the usage of AI, I encourage you to review our contributing guidelines in relation to code ownership and understanding. I will highlight the 2 important points here:
and
This PR description, the related issue, and the description of other related PRs are all AI generated. The code is mostly AI generated as well - but it does not follow the guidelines set by In particular, I'd highly encourage you to actually handcraft the PR descriptions, as that help convey the changes being proposed are well-understood - otherwise, it gets really difficult for us on the reviewing end to establish whether we are reviewing a pre-vetted code by the author, or just some AI slop. |
rafaelroquetto
left a comment
There was a problem hiding this comment.
Aside from my previous comment, I gave this a real look and I can't review it fairly in its current state. the scope and structure need to be addressed first.
obi_parse_traceparent_http alone is a good example of what I mean (it's overly complicated, there's a flag controlling I guess 5 different branches, etc...).
f72b8c1 to
0ca8973
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1988 +/- ##
==========================================
+ Coverage 69.27% 69.39% +0.11%
==========================================
Files 297 297
Lines 38419 38443 +24
==========================================
+ Hits 26615 26676 +61
+ Misses 10351 10317 -34
+ Partials 1453 1450 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
0ca8973 to
9f69b7b
Compare
Thanks for the feedback! I agree that the scope is too broad. I'm working on splitting the PR into smaller, focused changes. As for the test, I added ̀ |
9f69b7b to
fed9ed4
Compare
|
Sorry, the PR description was not clear enough about what
When the large-buffer path is active,
I hit this because I had Since this path is only reachable when that variable is explicitly configured (disabled by default), I removed |
|
fed9ed4 to
0340299
Compare
|
Hi @smehboub, just before I dive into it, would you mind explaining to me how exactly this works (please no AI generated prompts)? (One feedback I can give you upfront - the comments are too verbose - please see our guidelines and the agents instructions). |
Hi @rafaelroquetto, I updated the description to make it clearer with these explanations and removed the obsolete comments. |
|
@smehboub I saw the updated description, but it is still not enough for reviewing this change. The main issue is that maintainers still have to reverse-engineer the state machine, memory model, and tradeoffs from the diff before we can even start reviewing the implementation. This PR is large and touches core HTTP parsing logic, so the design needs to be clear upfront. What is missing is the actual design explanation. The description says the request is scanned in "successive windows" and mentions the new limit, but it does not explain the mechanism clearly enough: what data is scanned, where it comes from, when it is copied, what state is carried between scans, and which boundary cases this actually handles. For example, "split across windows" can mean several different things here: scan chunks, BPF scratch buffers, application buffers, send/recv calls, TCP segments, or the large-buffer append path. These are not equivalent, and the PR needs to say which cases are handled and which are not. Please also explain the performance side. If this copies request data while scanning, describe from where to where, how often, and whether it was profiled. This is a hot path. Finally, I'd also like to understand how this relates to the existing traceparent handling in the tpinjector path. I didn't see that being touched. The goal here is not to add process for its own sake. We need the high-level design before line-by-line review. |
Thank you very much for this feedback @rafaelroquetto. The DoD and your expectations are much clearer to me now. I completely agree that a high-level design explanation is necessary before a line-by-line review, especially for a hot path touching core HTTP parsing. I am going to convert this PR into a draft for now (I have a few personal constraints over the next few days). I will put in the necessary effort to address all your points (the state machine, memory model, performance impact, and the tpinjector relationship). Thanks again for guiding me through the process! |
|
@smehboub thank you! Just make sure you don't spend time/effort implementing something before we have the chance to discuss and vet it. Because you are touching a core part of the code, it's important that we are all on the same page about the approach chosen. It's best to hash out the high-level details now, and assess the feasibility of what is being proposed, than you spending time on something that may not take off. Feel free to ask any questions here - happy to answer them. |
|
@smehboub I didn't review the code either, but I read the title and clicked thru the issue you are trying to fix. I'm not aware of this piece of the code, but to me feels that solve the problem i'd great to see in the original issue different proposals on how to fix this, pros/cons and high level detail of each, rather than creating this gigantic PR, hard to reason for anyone |
Hi @marctc, Absolutely, that makes complete sense. I would rather we get the design right than rush something into a core path. To give you a bit of context on where I'm coming from: I work in an e-commerce business unit that does end-to-end tracing, from frontends down to data services, using higher-level libraries and agents. I see a lot of potential in OBI precisely because of its language-agnostic nature. The problem is that on the frontend side, headers tend to be large by default (partner cookies, auth tokens, etc.), which means this limitation effectively blocks end-to-end tracing for our use cases. That's what motivated me to file the bug/limitation in the first place. I should also be transparent: I'm exploring this on my own time, out of personal interest and curiosity. I'm not paid or mandated by my employer to work on this. My only real goal is that this use case gets covered, regardless of whether this particular implementation is the one that lands. I will work on putting together a proper design doc covering the state machine, memory model, performance side and the tpinjector relationship before touching any more code. If there are alternative approaches you think are worth considering before I go further, I'm very open to hearing them. Thanks in advance. |
df838c4 to
8b0bb0d
Compare
I updated the PR description to focus on the design first, before another line-by-line review. It now explains the scan path end to end: what buffer is being scanned, how bytes are copied, what state is preserved across tail calls, which cases are handled versus out of scope, how the append path fits in, and how this interacts with the state machine, memory model, performance characteristics, and the existing tpinjector path. I also tightened the wording around older kernels and scatter-gather I/O so it matches the current code, tests, and measured behavior. If there is a part you would like me to make more explicit before code-level review, I’m happy to do that. Thanks in advance. |
Summary
Find
traceparentheaders buried after large HTTP headers (auth tokens, cookies, …).Fixes: #1381
Description
When a service places large headers before
traceparent, the header can land beyondthe first ~1 KB that OBI scans. OBI then misses it and generates a new trace ID,
silently breaking the distributed trace.
This PR fixes that by scanning the request in successive 956-byte windows until the
traceparentheader is found or the end of the HTTP headers is reached. The scanstops before the request body and never reads more than the configured budget
(
OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB, default 4 KB, max 27 KB).The feature is active when
OTEL_EBPF_CONTEXT_PROPAGATIONorOTEL_EBPF_TRACK_REQUEST_HEADERSis enabled and requires kernel 5.17 or newer.On older kernels OBI falls back to the existing first-chunk behaviour automatically.
Design
Below is the design explanation needed for review: what is scanned, where it comes
from, what state is carried between scans, which boundary cases are handled, and how
this relates to the tpinjector path.
What is scanned, where it comes from, and when it is copied
The scanner reuses the request buffer pointer already captured during
return_recvmsg. On kernel ≥ 6.0,ITER_UBUFexposes a single contiguoususer-space pointer (
iov_ctx->ubuf), stored asargs->orig_buf. On kernels5.17–5.x,
ITER_IOVECis used andiov_ctx->iov[0].iov_baseserves as that base(first segment only — see Limitations below). The SSL / Java-TLS path sets
u_buf_is_user = 1and passes the already-decrypted user-space buffer unchanged;orig_bufis not set on these paths, so chunked iterations readu_buf + offsetdirectly(the SSL record buffer) rather than
orig_buf + offset.The initial pass in
__obi_continue_protocol_http_tpreads up toTRACE_BUF_SIZE - 1bytes using
bpf_probe_read(generic — handles both kernel and user memory) fromargs->u_bufintotp_char_buf_mem(). This is the existing first-chunk read.Each subsequent chunked scan in
__tp_chunk_scanperforms one read of up to1024 bytes into the same per-CPU scratch buffer. The helper function is selected at
runtime based on
u_buf_is_user:bpf_probe_read_userwhen reading from a user-space buffer (ITER_UBUF, SSL, orJava-TLS path)
bpf_probe_read_kernelwhen reading from a kernel-side iovecThe scanner does not write any new persistent state. With a 4 KB budget it issues at
most 4 additional reads beyond the initial pass, and only for requests where
traceparentwas not found in the first kilobyte.State carried between tail calls
There is no new BPF map. State stays in
call_protocol_args_t, the existing per-CPUscratch accessed via
protocol_args(). Four fields were added:niteroffset = niter × 956orig_bufreturn_recvmsgfull_bytes_lenreturn_recvmsgbytes_len)u_buf_is_userhandle_buf_*callbpf_probe_read_user(1) vs_kernel(0)Between iterations, the only mutation is
niter++inside__tp_chunk_scanwhen itreturns
k_tp_scan_continue. Everything else (orig_buf,full_bytes_len,u_buf_is_user) is fixed for the lifetime of the event.Programs and tail-call budget
Two new programs are registered in the existing
jump_table:Program 13 —
obi_parse_traceparent_http(initial-request path)Triggered from
__obi_continue_protocol_http_tpwhen the first-passbpf_strstr_tp_loopreturns nothing and the total buffer length exceeds 1023 bytes.Starts at
niter = 1(chunk 0 already scanned). Runs up to 28 times (niter=1..28;the 28th invocation reads but does not emit a further tail-call), then falls through
to
__obi_continue2_protocol_httpwhich is inlined so it always runs even when thetail-call budget is exhausted. When
connection_meta_by_directionreturns NULL (nodirection metadata available), the program tail-calls
k_tail_continue2_protocol_httpdirectly without scanning.
Program 14 —
obi_parse_traceparent_http_append(large-buffer append path)Triggered from
obi_handle_buf_with_argsand__obi_protocol_httpwhenstill_reading(info)is true — i.e., when OBI is accumulating a multi-recv-callHTTP request. Starts at
niter = 0;base_offset = info->len − args->bytes_lengives the position of this chunk within the total request. Calls
http_send_large_bufferat the end on all paths whereinfois available(including when
metais NULL), so the streaming semantics are preserved.Tail-call budget:
The 27 KB upper bound on
OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KBcomes from theprog 14 budget (tightest path). Prog 14 runs for niter=0..28 (29 invocations); at
niter=28 the bottom guard
niter+1 < 29is false and__tp_chunk_scanreturnsk_tp_scan_donewithout emitting a further tail-call. The last read starts at offset28 × 956 = 26,768bytes and is capped to the remaining budget:27,648 − 26,768 = 880 bytes. Last byte covered: index 26,768 + 879 = 27,647(= 27 × 1024 − 1), covering 27,648 bytes = 27 KB total (indices 0..27,647).
29 × 956 = 27,724is the start of the next chunk, which is never read: thebottom guard's budget check (
27,724 < 27,648) is false, so no tail-call is issued.If the initial tail call to prog 13 fails (for example because the tail-call table is
not loaded yet at startup),
__obi_continue_protocol_http_tpfalls through to itsnormal completion path. In that case no extra chunk scan happens, but the existing
first-chunk result is preserved. Prog 13 then inlines
__obi_continue2_protocol_httpat the end regardless of how the scan ended, so the request is never dropped.
State machine
The chunked scanner only runs on the
bpf_strstr_tp_loopcode path (kernel ≥ 5.17with
bpf_loopsupport). On older kernels, OBI usesbpf_strstr_tp_loop__legacy, sothe condition
tp_loop_fn == bpf_strstr_tp_loopis never satisfied and the scanner isnever triggered, regardless of
OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB.Why the chunk boundary case does not need partial matching
bpf_strstr_tp_eohonly searches fortraceparent:at positionsindex < TRACE_BUF_SIZE − TRACE_PARENT_HEADER_LEN(= 956). A header starting atoffset 955 within the 1024-byte window ends at offset 1022, fully contained. The
header is never split across two reads. This is also why
k_tp_chunk_step = 956rather than 1024: the overlap ensures any match is always complete in one window.
The integration test
testWithTraceparentAtChunkBoundaryexercises exactly this case:internal/test/integration/traceparent_extraction_test.go. The test constructs arequest where
traceparentstarts at exactly byte 956 of the wire data:The integration test suite passes with
OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB=4on kernel 6.x. It includes 5 sub-tests covering: no traceparent, traceparent in first
chunk, forwarded traceparent, large headers (> 1 KB), and chunk boundary.
Relationship with tpinjector
tpinjectorand the chunked scanner sit on opposite traffic directions and ondifferent attach points:
return_recvmsg(ingress). Extracts theincoming
traceparentfrom request headers and stores it viaset_trace_info_for_connection.outgoing_trace_mapto inject a newtraceparentinto outgoing requests.These two maps are distinct. The scanner never writes to
outgoing_trace_map, and theinjector never reads from the ingress trace-info store. They do not interfere.
In the proxy / middleware scenario (service receives a request then makes an
outgoing call), the scanner extracts the incoming trace ID on ingress, OBI creates
a child span, and the injector propagates that child's trace ID on egress. This is
the correct W3C Trace Context behaviour and it is unchanged by this PR.
Limitations (out of scope for this PR)
traceparentdoes not appear as plaintextrecvmsgcall; iftraceparentarrives in a separate segment OBI does not see itorig_buf, which is only captured whennr_segs == 1. Whennr_segs > 1the initialread_iovec_ctxcopy still covers up to 16 segments (total-capped at 8 KB), but no chunked rescan is available for subsequent chunks. Common clients (curl, Gonet/http, JavaHttpClient, nginx, envoy) send all headers in one buffer. On kernel ≥ 6.0 (ITER_UBUF) this limitation does not apply.traceparentbeyond configured budgetThe scanner only activates when
OTEL_EBPF_TRACK_REQUEST_HEADERSorOTEL_EBPF_CONTEXT_PROPAGATIONis enabled. Without either,g_bpf_traceparent_enabledis false and no scan takes place.
Performance
These benchmarks were run against openresty in a local single-node kind cluster
(kernel 6.x). Traffic went through
kubectl port-forward(loopback, same node), sothe absolute latency figures (7–37 ms) include port-forward overhead and should not be
read as production latency numbers. The relative differences between scenarios are
the useful signal here. OBI was deployed as a DaemonSet after each image rebuild from
a clean
--no-cachebuild followed bymake generate. Load was generated with[oha 1.14.0](https://github.com/hatoo/oha) (100 concurrent connections, 30 s).
BPF program runtime metrics were collected via the kernel
bpf_stats_enabledinterface (
/proc/sys/kernel/bpf_stats_enabled).Three scenarios were measured on this branch:
traceparentin the first 100 bytes; chunked scanner not triggered.traceparent; scanner activates,finds header in chunk 3.
traceparentstarts at exactly byte 956; found atniter = 1, chunk index 0 of the second window.oha results
Kernel 6.17.0-19-generic, kind cluster,
kubectl port-forward, oha 1.14.0,100 connections, 30 s,
OTEL_EBPF_BPF_MAX_REQUEST_TP_PARSE_SIZE_KB=27.All four runs were collected in the same session, on the same kernel
(
6.17.0-19-generic), through the samekubectl port-forward, with images built fromclean
--no-cachebuilds aftermake generate. The main takeaway is thatmainandfeat: baselineshow equivalent throughput (~11-12K req/sec, p99 ≈ 19 ms), which iswhat we would expect if the fast path in
__obi_continue_protocol_http_tpadds nomeasurable overhead when
total_len ≤ TRACE_BUF_SIZE. The higher latency in thechunked and boundary scenarios tracks the larger request size (~2.6 KB and ~1.1 KB),
not BPF runtime.
BPF program runtime (bpf_stats)
obi_parse_traceparent_http(prog 13) andobi_parse_traceparent_http_append(prog 14) only run on requests where
traceparentwas not found in the first ~1 KB.Their average runtime per invocation is measured below.
The dashboards above show, per branch and scenario: BPF runs/sec and avg runtime per
program, BPF map memory, OBI process VmRSS, and Go heap allocation rate.
Summary across all four scenarios (same session, kernel 6.17.0-19-generic,
--no-cachebuilds):mainandfeat: baselineare statistically equivalent. The higher latency in thechunked and boundary scenarios is explained by the larger request sizes (more bytes
through the port-forward per request), not by BPF overhead.
Expected overhead
For the common case (scanner not triggered —
traceparentin first chunk or featuredisabled), there is no additional overhead beyond the guard checks already present in
__obi_continue_protocol_http_tp. When the scanner does activate, the cost is onebpf_probe_read_userof 1024 bytes per 956-byte step, bounded by the configuredbudget (default 4 KB = 4 reads max).
Test environments
All tests were run with OBI deployed as a DaemonSet in a local single-node kind
cluster, instrumenting an openresty application. Each environment was a fresh
deployment rebuilt from this branch.
TestTraceparentExtractionOn 5.15,
FixupSpecswapsobi_continue_protocol_httpandobi_parse_traceparent_http{,_append}for stubs. The three baseline tests(
without_traceparent,with_traceparent,with_forwarded_traceparent) passnormally; the two chunked tests fail as expected because
bpf_loop(helper #181)is not available on that kernel.
OTEL_EBPF_OVERRIDE_BPF_LOOP_ENABLEDmustnot be set on unmodified 5.15 kernels: it bypasses the version check and makes
the verifier reject the program at load time.
Performance benchmarks (oha load tests) were run on the 6.17.0 kernel only.
Validation