fix(tracing): correct trace-id not-found sentinel and use Copy() for dummy BPF program specs#1986
Conversation
…dummy BPF program specs - Replace the implicit 0 sentinel in bpf_strstr_tp_loop with an explicit k_tp_pos_not_found = 0xffffffffu constant. A traceparent at byte offset 0 would previously be silently dropped because pos=0 was treated as 'not found'. Using 0xffffffff avoids the ambiguity since it falls outside the valid TRACE_BUF_SIZE range. - Use dummy.Copy() when inserting placeholder BPF program specs in FixupSpec. The previous code shared a single *ebpf.ProgramSpec pointer across two map entries; the ebpf library mutates the spec during load and a shared pointer causes non-deterministic verifier failures depending on load order. - Add TestToRequestTraceMissingLargeBuffers to cover the missing-large- buffers branch in ReadHTTPInfoIntoSpan. These fixes are independent of the chunked traceparent scanner and can be reviewed and merged on their own.
Can you explain in which circumstance the traceparent header will be in 0 position? How does this happen with regular HTTP protocol? |
|
Hi @smehboub these 3 changes are unrelated to each other, I think it'd be better to split them in different PRs.
|
|
Hey @smehboub, could you further split into atomic PRs, as per guidelines? "Changes must be small and focused. Avoid unrelated edits, cleanup, or formatting changes." We've got at least 3 orthogonal changes here. By splitting into atomic PRs, not only does it make it easy for them to be reviewed, but it also brings about atomic commits that can be more easily reversed or bisected. Thanks. |
rafaelroquetto
left a comment
There was a problem hiding this comment.
Needs to be split
Actually, the main idea is to prevent offset 0 from being interpreted as 'not found'. The regular HTTP request starts with the request line (ex: GET / HTTP/1.1\r\n) and not with the Traceparent: header, but the Traceparent: header can fall right at the start of a window when scanning in chunks. If I keep the initialization at 0, I cannot tell if it was found at the very beginning or if it's missing entirely (which is something I noticed during my tests). I preferred to clear up this ambiguity in a separate PR (for future contributions, because it would have made my life easier if this was already there for my recent work) by switching to 0xffffffff. This ensures offset 0 remains a valid result for bpf_strstr_tp_eoh (especially with #1988). |
Hi @mmat11 Ok to split! I will open separate PRs for the sentinel fix and the tests. Regarding the spec.Copy(), here is the reasoning: Since these 'dummy' programs are shared pointers to the same instance, any mutation would silently affect both entries (unless I am mistaken, the cilium/ebpf library performs bytecode adjustments at load time, which would be a mutation). Even if they are currently unused, wouldn't using .Copy() be a safer, more defensive approach to ensure proper memory isolation and prevent any potential 'double-patching' or rejection by the kernel during the loading process ? That said, if you prefer to keep the original shared reference for simplicity, I’m happy to remove it in the new PR. Let me know! |
these are dummy single instruction programs, there's no mutation/adjustment going on so -- unless you had some practical issue while running OBI because of this -- I don't think this change is needed |
Hi @grcevski. I have opened a separate PR #2095 for this change and adjusted the commit type, as it's more of a minor improvement to anticipate future needs. Thanks for the review! |
fix(tracing): correct trace-id not-found sentinel and use Copy() for dummy BPF program specs
Summary
Three independent bug fixes with no behaviour change for end users. Safe to
merge before the chunked traceparent scanner (follow-up PR).
Changes
bpf/common/trace_util.h— sentinel fix inbpf_strstr_tp_loopThe callback initialised
pos = 0and returned early whenposwas non-zeroafter the
bpf_looppass. Any traceparent header located at byte offset 0 inthe capture buffer was therefore silently dropped:
pos = 0wasindistinguishable from "not found".
k_tp_pos_not_found = 0xffffffffuis introduced as an explicit sentinel.0xffffffffcannot be a valid offset because it falls well outside theTRACE_BUF_SIZErange. Thebpf_strstr_tp_loopinitialiser and its returncheck are updated to use it.
pkg/ebpf/common/common.go—dummy.Copy()inFixupSpecFixupSpecinserted the same*ebpf.ProgramSpecpointer into two separateslots of
spec.Programs. The ebpf library mutates the spec in place duringcollection load; sharing a pointer means the second insertion overwrites the
first, causing non-deterministic verifier failures depending on load order.
.Copy()gives each slot an independent copy.pkg/ebpf/common/http_transform_test.go—TestToRequestTraceMissingLargeBuffersAdds a unit test for the
ReadHTTPInfoIntoSpanpath whereHasLargeBuffers = 1but no entry is found in the large-buffer map (e.g.connection torn down before the second fragment arrived). Previously untested.
Testing
All three fixes are covered by existing or new unit tests. No integration test
changes are required.
How to review
trace_util.h: one hunk adds the sentinel enum, one hunk updatesbpf_strstr_tp_loopto use it — no algorithm change.common.go: two lines change= dummyto= dummy.Copy()— no logicchange.
http_transform_test.go: new test only, no production code change.