Skip to content

feat: add distributed tracing with timing spans and trace context propagation#1543

Merged
dirgim merged 1 commit into
konflux-ci:mainfrom
ci-operator:distributed-tracing
Jun 25, 2026
Merged

feat: add distributed tracing with timing spans and trace context propagation#1543
dirgim merged 1 commit into
konflux-ci:mainfrom
ci-operator:distributed-tracing

Conversation

@ci-operator

@ci-operator ci-operator commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Propagate W3C trace context from build PipelineRuns through Snapshots,
integration PipelineRuns, and Release CRs. Emit waitDuration and
executeDuration timing spans on PipelineRun completion across both
ComponentGroup and Application model flows. When a Snapshot is skipped
because a newer one for the same Application has already released,
emit a supersede waitDuration so the trace distinguishes deliberate
dedup from a broken chain.

Annotation constants live in pkg/tracing: TimingEmittedAnnotation
under delivery.tekton.dev (controller-internal), SpanContextAnnotation
under tekton.dev (cross-service contract).

See docs/tracing.md.

Assisted-by: Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request integrates distributed tracing using OpenTelemetry, enabling the emission of wait and execution duration spans for PipelineRuns and the propagation of trace context across Snapshots and Releases. The review feedback identifies several logic and efficiency improvements, such as ensuring spans are only marked as emitted upon PipelineRun completion to prevent partial traces, optimizing TaskRun lookups using List operations instead of sequential Gets, handling malformed sampler arguments, and utilizing a composite propagator to maintain support for standard headers like Baggage.

Comment thread internal/controller/buildpipeline/buildpipeline_adapter.go Outdated
Comment thread pkg/tracing/timing_spans.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go
@ci-operator ci-operator force-pushed the distributed-tracing branch 4 times, most recently from 0cb7da2 to dc66ed5 Compare April 30, 2026 20:09
@kasemAlem

Copy link
Copy Markdown
Contributor

/ok-to-test

@ci-operator ci-operator changed the title feat: add distributed tracing with timing spans and trace context propagation [DRAFT] feat: add distributed tracing with timing spans and trace context propagation May 4, 2026
@ci-operator ci-operator force-pushed the distributed-tracing branch from dc66ed5 to 3db0d33 Compare May 6, 2026 22:48
@ci-operator

Copy link
Copy Markdown
Contributor Author

Found and split out an unrelated regression while testing this PR end-to-end on CRC: #1555. Without it, Snapshot -> Release stalls on Application-model tenants. Not a blocker for this PR; merging order doesn't matter.

@ci-operator ci-operator changed the title [DRAFT] feat: add distributed tracing with timing spans and trace context propagation feat: add distributed tracing with timing spans and trace context propagation May 11, 2026
@ci-operator ci-operator force-pushed the distributed-tracing branch from ddc4a5a to a5ecfdb Compare May 14, 2026 13:44
Comment thread internal/controller/buildpipeline/buildpipeline_adapter.go
Comment thread internal/controller/buildpipeline/buildpipeline_adapter.go
@kasemAlem

Copy link
Copy Markdown
Contributor

@ci-operator ,
looks good at all , this is a massive effort thanks you ,

integration-service under implementation of new component model which is one the core components in service
the implementations are made on phases this why we are now maintaining both old and new componentGroup and application, once the whole implementation we will be with only componentGroup only
I bet that would much easier.
That said if you want to continue with implementing this distributed tracing feature you should take care of both flows , for me I just saw two places it's not emitted.
I'll ask from our team colleagues to review

@codecov-commenter

codecov-commenter commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 90.03559% with 28 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.58%. Comparing base (5b4f7e5) to head (659b591).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/tracing/tracing.go 89.53% 6 Missing and 3 partials ⚠️
cmd/main.go 16.66% 4 Missing and 1 partial ⚠️
internal/controller/snapshot/snapshot_adapter.go 83.33% 2 Missing and 2 partials ⚠️
release/releaseplan.go 20.00% 3 Missing and 1 partial ⚠️
.../controller/buildpipeline/buildpipeline_adapter.go 85.71% 1 Missing and 1 partial ⚠️
...integrationpipeline/integrationpipeline_adapter.go 90.90% 1 Missing and 1 partial ⚠️
tekton/integration_pipeline.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1543      +/-   ##
==========================================
+ Coverage   65.67%   73.58%   +7.91%     
==========================================
  Files          65       68       +3     
  Lines        8923     9200     +277     
==========================================
+ Hits         5860     6770     +910     
+ Misses       2366     1750     -616     
+ Partials      697      680      -17     
Flag Coverage Δ
e2e-tests 45.79% <11.38%> (?)
unit-tests 66.39% <89.67%> (+0.71%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
gitops/snapshot.go 78.45% <100.00%> (+5.03%) ⬆️
pkg/tracing/emit_and_mark.go 100.00% <100.00%> (ø)
pkg/tracing/timing_spans.go 100.00% <100.00%> (ø)
snapshot/create.go 63.22% <100.00%> (+0.30%) ⬆️
.../controller/buildpipeline/buildpipeline_adapter.go 72.82% <85.71%> (+6.96%) ⬆️
...integrationpipeline/integrationpipeline_adapter.go 80.00% <90.90%> (+2.34%) ⬆️
tekton/integration_pipeline.go 85.53% <0.00%> (+2.93%) ⬆️
internal/controller/snapshot/snapshot_adapter.go 70.54% <83.33%> (+4.68%) ⬆️
release/releaseplan.go 85.71% <20.00%> (-9.94%) ⬇️
cmd/main.go 67.92% <16.66%> (+67.92%) ⬆️
... and 1 more

... and 31 files with indirect coverage changes


Continue to review full report in Codecov by Harness.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b4f7e5...659b591. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jsztuka

jsztuka commented May 18, 2026

Copy link
Copy Markdown
Contributor

Where was this requested? Is there story related to those changes? Why and what for do we need this change?

@jdcasey

jdcasey commented May 18, 2026

Copy link
Copy Markdown

@jsztuka here's the ADR that's being implemented, part of which is this PR: https://github.com/konflux-ci/architecture/blob/main/ADR/0062-distributed-tracing.md

@ci-operator ci-operator force-pushed the distributed-tracing branch from a5ecfdb to 86e8bc3 Compare May 28, 2026 16:19
@ci-operator

Copy link
Copy Markdown
Contributor Author

Thanks for the review and for flagging the gaps. Both flows are now covered - emitBuildTimingSpans() is called on every terminal exit path in both EnsureSnapshotExists (ComponentGroup model) and EnsureSnapshotExistsApplication (Application model): success, failure, and the already-annotated early return. The dedup annotation prevents double emission on requeue.

Also folded in the spec.application ReleasePlan cache index fix (#1555) since without it the Snapshot controller crashes on Application-model tenants and the trace-context propagation chain breaks at that stage.

@ci-operator

Copy link
Copy Markdown
Contributor Author

Update: folded the #1555 fix into this PR since Application-model trace propagation depends on it. #1555 can be closed.

@ci-operator ci-operator force-pushed the distributed-tracing branch from a641fe6 to f2c7151 Compare June 4, 2026 20:33
@fullsend-ai-review fullsend-ai-review Bot added requires-manual-review Review requires human judgment and removed requires-manual-review Review requires human judgment labels Jun 4, 2026
@dirgim

dirgim commented Jun 11, 2026

Copy link
Copy Markdown
Member

/ok-to-test

Comment thread cmd/main.go Fixed
@ci-operator ci-operator force-pushed the distributed-tracing branch from f2c7151 to cd6cef3 Compare June 15, 2026 13:39
@fullsend-ai-review

Copy link
Copy Markdown

🤖 Review · Started 1:40 PM UTC
Commit: ffde3b2 · View workflow run →

@ci-operator

Copy link
Copy Markdown
Contributor Author

Resolved a rebase conflict in snapshot_adapter_test.go where main's
new ComponentGroup test cases landed in the same Describe block as the
supersede-trace test. Kept main's additions and adapted the
supersede-trace test to use the sibling-snapshot mock the new supersede
path expects.

Also added a few tests around samplerFromEnv, EmitAndMarkTimingSpans,
and the build/test emit paths to cover patch and refetch failures.

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/emit_and_mark.go
Comment thread internal/controller/integrationpipeline/integrationpipeline_adapter.go Outdated
Comment thread tekton/consts/consts.go Outdated
Comment thread pkg/tracing/timing_spans.go Outdated
Comment thread pkg/tracing/tracing.go Outdated
Comment thread pkg/tracing/emit_and_mark.go
Comment thread pkg/tracing/timing_spans.go
@fullsend-ai-review fullsend-ai-review Bot removed the requires-manual-review Review requires human judgment label Jun 15, 2026
@fullsend-ai-review

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 1:40 PM UTC · Completed 1:56 PM UTC
Commit: ffde3b2 · View workflow run →

@ci-operator ci-operator force-pushed the distributed-tracing branch from cd6cef3 to 53aeaf6 Compare June 17, 2026 19:48
@fullsend-ai-review

fullsend-ai-review Bot commented Jun 17, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 7:49 PM UTC · Completed 8:03 PM UTC
Commit: 218f229 · View workflow run →

@fullsend-ai-review fullsend-ai-review Bot added the requires-manual-review Review requires human judgment label Jun 17, 2026
Propagate W3C trace context from build PipelineRuns to Snapshots,
integration PipelineRuns, and Release CRs. Emit waitDuration and
executeDuration timing spans on completed PipelineRuns across both
ComponentGroup and Application model flows.

Tracing is opt-in via OTEL_EXPORTER_OTLP_ENDPOINT; without it set, the
service uses a noop tracer. The sampler family is selected via
OTEL_TRACES_SAMPLER.

Assisted-by: Claude Code
Signed-off-by: Josiah England <jengland@redhat.com>
@ci-operator ci-operator force-pushed the distributed-tracing branch from 53aeaf6 to 659b591 Compare June 23, 2026 18:35
@ci-operator

Copy link
Copy Markdown
Contributor Author

@dirgim - rebased on main; conflicts were additive only (loader/tracing imports, condition consts) so no behavior change since your last /ok-to-test. Mind re-running it on the new HEAD so CI picks up?

@fullsend-ai-review

fullsend-ai-review Bot commented Jun 23, 2026

Copy link
Copy Markdown

🤖 Finished Review · ✅ Success · Started 6:39 PM UTC · Completed 6:55 PM UTC
Commit: ec21706 · View workflow run →

@fullsend-ai-review fullsend-ai-review Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the review comment for full details.

Comment thread pkg/tracing/tracing.go
@fullsend-ai-review fullsend-ai-review Bot removed the requires-manual-review Review requires human judgment label Jun 23, 2026
@dirgim

dirgim commented Jun 24, 2026

Copy link
Copy Markdown
Member

/ok-to-test

@dirgim dirgim left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kasemAlem kasemAlem self-requested a review June 25, 2026 14:23
@dirgim dirgim merged commit 554730a into konflux-ci:main Jun 25, 2026
30 of 31 checks passed
@fullsend-ai-retro

fullsend-ai-retro Bot commented Jun 25, 2026

Copy link
Copy Markdown

🤖 Finished Retro · ✅ Success · Started 9:00 PM UTC · Completed 9:09 PM UTC
Commit: ec21706 · View workflow run →

@fullsend-ai-retro

Copy link
Copy Markdown

Retro Analysis: PR #1543 — Distributed Tracing

This XXL PR (2100+ lines, 23 files) was open for 2 months with 10 review agent iterations, 5 gemini-code-assist suggestions, and human review from 4 maintainers.

What went well

  • fullsend-ai-review caught a genuine HIGH security issue — the go-billy v5.8.0 downgrade that silently swallowed path-traversal errors. Author fixed immediately.
  • fullsend-ai-review correctly identified a race condition in finalizer removal, which the author addressed with RetryOnConflict.
  • gemini-code-assist's ParseFloat error handling suggestion was accepted and improved the code.
  • CI pipeline caught a useless ctx assignment via CodeQL.
  • Test coverage reached 90.04% patch coverage.

What could go better

  • The most critical bugs were caught only by a human reviewer (kasemAlem): two missing emission paths in the Application model flow of buildpipeline_adapter.go. All automated reviewers missed these across 10+ iterations because they lacked context about the repo's dual-model architecture (ComponentGroup vs Application flows).
  • 10 review iterations with repeated findings: The "missing-authorization" / Jira ticket finding appeared in 6 consecutive runs; the annotation-namespace finding persisted across runs after the author explained the ADR rationale. These are already tracked by existing upstream issues (#1013, #1500, #1672).
  • Large PR size concerns are already tracked upstream (#2118, #2034).

Proposal filed

  1. Document dual-model completeness requirement in AGENTS.md (konflux-ci/integration-service) — Adding explicit guidance that feature implementations must cover both Application and ComponentGroup flows would close the gap between human and automated review quality for this class of bug.

Proposals filed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants