Skip to content

fix(tracing): correct trace-id not-found sentinel and use Copy() for dummy BPF program specs#1986

Closed
smehboub wants to merge 1 commit into
open-telemetry:mainfrom
smehboub:fix/trace-id-sentinel-and-dummy-copy
Closed

fix(tracing): correct trace-id not-found sentinel and use Copy() for dummy BPF program specs#1986
smehboub wants to merge 1 commit into
open-telemetry:mainfrom
smehboub:fix/trace-id-sentinel-and-dummy-copy

Conversation

@smehboub

@smehboub smehboub commented May 4, 2026

Copy link
Copy Markdown
Contributor

fix(tracing): correct trace-id not-found sentinel and use Copy() for dummy BPF program specs

Summary

Three independent bug fixes with no behaviour change for end users. Safe to
merge before the chunked traceparent scanner (follow-up PR).

Changes

bpf/common/trace_util.h — sentinel fix in bpf_strstr_tp_loop

The callback initialised pos = 0 and returned early when pos was non-zero
after the bpf_loop pass. Any traceparent header located at byte offset 0 in
the capture buffer was therefore silently dropped: pos = 0 was
indistinguishable from "not found".

k_tp_pos_not_found = 0xffffffffu is introduced as an explicit sentinel.
0xffffffff cannot be a valid offset because it falls well outside the
TRACE_BUF_SIZE range. The bpf_strstr_tp_loop initialiser and its return
check are updated to use it.

pkg/ebpf/common/common.godummy.Copy() in FixupSpec

FixupSpec inserted the same *ebpf.ProgramSpec pointer into two separate
slots of spec.Programs. The ebpf library mutates the spec in place during
collection load; sharing a pointer means the second insertion overwrites the
first, causing non-deterministic verifier failures depending on load order.
.Copy() gives each slot an independent copy.

pkg/ebpf/common/http_transform_test.goTestToRequestTraceMissingLargeBuffers

Adds a unit test for the ReadHTTPInfoIntoSpan path where
HasLargeBuffers = 1 but no entry is found in the large-buffer map (e.g.
connection torn down before the second fragment arrived). Previously untested.

Testing

make test

All three fixes are covered by existing or new unit tests. No integration test
changes are required.

How to review

  • trace_util.h: one hunk adds the sentinel enum, one hunk updates
    bpf_strstr_tp_loop to use it — no algorithm change.
  • common.go: two lines change = dummy to = dummy.Copy() — no logic
    change.
  • http_transform_test.go: new test only, no production code change.

…dummy BPF program specs

- Replace the implicit 0 sentinel in bpf_strstr_tp_loop with an explicit
  k_tp_pos_not_found = 0xffffffffu constant. A traceparent at byte offset 0
  would previously be silently dropped because pos=0 was treated as 'not
  found'. Using 0xffffffff avoids the ambiguity since it falls outside the
  valid TRACE_BUF_SIZE range.

- Use dummy.Copy() when inserting placeholder BPF program specs in
  FixupSpec. The previous code shared a single *ebpf.ProgramSpec pointer
  across two map entries; the ebpf library mutates the spec during load and
  a shared pointer causes non-deterministic verifier failures depending on
  load order.

- Add TestToRequestTraceMissingLargeBuffers to cover the missing-large-
  buffers branch in ReadHTTPInfoIntoSpan.

These fixes are independent of the chunked traceparent scanner and can be
reviewed and merged on their own.
@grcevski

grcevski commented May 4, 2026

Copy link
Copy Markdown
Contributor

The callback initialised pos = 0 and returned early when pos was non-zero
after the bpf_loop pass. Any traceparent header located at byte offset 0 in
the capture buffer was therefore silently dropped: pos = 0 was
indistinguishable from "not found".

Can you explain in which circumstance the traceparent header will be in 0 position? How does this happen with regular HTTP protocol?

@mmat11

mmat11 commented May 4, 2026

Copy link
Copy Markdown
Contributor

Hi @smehboub these 3 changes are unrelated to each other, I think it'd be better to split them in different PRs.

  1. the sentinel fix lgtm
  2. I don't understand what copying the spec would accomplish there
  3. the test lgtm

@rafaelroquetto

Copy link
Copy Markdown
Contributor

Hey @smehboub, could you further split into atomic PRs, as per guidelines? "Changes must be small and focused. Avoid unrelated edits, cleanup, or formatting changes."

We've got at least 3 orthogonal changes here.

By splitting into atomic PRs, not only does it make it easy for them to be reviewed, but it also brings about atomic commits that can be more easily reversed or bisected.

Thanks.

@rafaelroquetto rafaelroquetto left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be split

@smehboub

Copy link
Copy Markdown
Contributor Author

The callback initialised pos = 0 and returned early when pos was non-zero
after the bpf_loop pass. Any traceparent header located at byte offset 0 in
the capture buffer was therefore silently dropped: pos = 0 was
indistinguishable from "not found".

Can you explain in which circumstance the traceparent header will be in 0 position? How does this happen with regular HTTP protocol?

Actually, the main idea is to prevent offset 0 from being interpreted as 'not found'.

The regular HTTP request starts with the request line (ex: GET / HTTP/1.1\r\n) and not with the Traceparent: header, but the Traceparent: header can fall right at the start of a window when scanning in chunks. If I keep the initialization at 0, I cannot tell if it was found at the very beginning or if it's missing entirely (which is something I noticed during my tests).

I preferred to clear up this ambiguity in a separate PR (for future contributions, because it would have made my life easier if this was already there for my recent work) by switching to 0xffffffff. This ensures offset 0 remains a valid result for bpf_strstr_tp_eoh (especially with #1988).

@smehboub

Copy link
Copy Markdown
Contributor Author

Hi @smehboub these 3 changes are unrelated to each other, I think it'd be better to split them in different PRs.

  1. the sentinel fix lgtm
  2. I don't understand what copying the spec would accomplish there
  3. the test lgtm

Hi @mmat11

Ok to split! I will open separate PRs for the sentinel fix and the tests.

Regarding the spec.Copy(), here is the reasoning:

Since these 'dummy' programs are shared pointers to the same instance, any mutation would silently affect both entries (unless I am mistaken, the cilium/ebpf library performs bytecode adjustments at load time, which would be a mutation).

Even if they are currently unused, wouldn't using .Copy() be a safer, more defensive approach to ensure proper memory isolation and prevent any potential 'double-patching' or rejection by the kernel during the loading process ?

That said, if you prefer to keep the original shared reference for simplicity, I’m happy to remove it in the new PR. Let me know!

@mmat11

mmat11 commented May 11, 2026

Copy link
Copy Markdown
Contributor

Regarding the spec.Copy(), here is the reasoning:

Since these 'dummy' programs are shared pointers to the same instance, any mutation would silently affect both entries (unless I am mistaken, the cilium/ebpf library performs bytecode adjustments at load time, which would be a mutation).

Even if they are currently unused, wouldn't using .Copy() be a safer, more defensive approach to ensure proper memory isolation and prevent any potential 'double-patching' or rejection by the kernel during the loading process ?

That said, if you prefer to keep the original shared reference for simplicity, I’m happy to remove it in the new PR. Let me know!

these are dummy single instruction programs, there's no mutation/adjustment going on so -- unless you had some practical issue while running OBI because of this -- I don't think this change is needed

@smehboub

Copy link
Copy Markdown
Contributor Author

The callback initialised pos = 0 and returned early when pos was non-zero
after the bpf_loop pass. Any traceparent header located at byte offset 0 in
the capture buffer was therefore silently dropped: pos = 0 was
indistinguishable from "not found".

Can you explain in which circumstance the traceparent header will be in 0 position? How does this happen with regular HTTP protocol?

Actually, the main idea is to prevent offset 0 from being interpreted as 'not found'.

The regular HTTP request starts with the request line (ex: GET / HTTP/1.1\r\n) and not with the Traceparent: header, but the Traceparent: header can fall right at the start of a window when scanning in chunks. If I keep the initialization at 0, I cannot tell if it was found at the very beginning or if it's missing entirely (which is something I noticed during my tests).

I preferred to clear up this ambiguity in a separate PR (for future contributions, because it would have made my life easier if this was already there for my recent work) by switching to 0xffffffff. This ensures offset 0 remains a valid result for bpf_strstr_tp_eoh (especially with #1988).

Hi @grcevski.

I have opened a separate PR #2095 for this change and adjusted the commit type, as it's more of a minor improvement to anticipate future needs.

Thanks for the review!

@smehboub

Copy link
Copy Markdown
Contributor Author

Hi @mmat11 @rafaelroquetto.

Following your advice, I have split this PR into two separate ones (#2095 and #2096).

Thanks for the review!

@smehboub smehboub closed this May 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants