Introduce ADR: Schema-governed correlation ID for workflows #280

ci-operator · 2025-11-17T23:37:48Z

This ADR introduces a schema-governed correlation ID for tracking workflows across multiple stages in Konflux, addressing the need for a deterministic identifier spanning multiple Tekton PipelineRuns. It outlines the generation and propagation of correlation IDs (ECID and SCID) without modifying existing CRDs, ensuring compatibility and observability.

ralphbean · 2025-11-18T12:49:38Z

ADR/0056-schema-goverened-correlation-id.md

+Two identifiers are defined:
+
+**Event Correlation ID (ECID).**
+Generated by Pipelines-as-Code (PaC) per webhook event. Each namespace may define a `konflux-correlation-policy` Secret containing a UUIDv5 namespace, a selected set of normalized SCM fields (provider, repository, commit, PR number, event ID, etc.), and a canonical JSON template. PaC watches and caches this policy; for each webhook event, it normalizes the selected fields, renders the template, computes a UUIDv5 hash, and labels all PipelineRuns created for that event with the resulting ECID.


Have you proposed the correlation policy feature to the pipelines-as-code maintainers yet? This depends on a new feature in PaC, which is a project independent of the konflux-ci community umbrella. At minimum, we'll need their maintainers' feedback here. Even better would be if we can point to a discussion in their project, upstream from Konflux.

ci-operator · 2025-11-21

Thanks for pointing me in the right direction! I proposed this concept to @vdemeester for initial review. Now that it is better defined, I've started the discussion in the PaC repository.

ralphbean · 2025-11-18T12:51:22Z

ADR/0056-schema-goverened-correlation-id.md

+Two identifiers are defined:
+
+**Event Correlation ID (ECID).**
+Generated by Pipelines-as-Code (PaC) per webhook event. Each namespace may define a `konflux-correlation-policy` Secret containing a UUIDv5 namespace, a selected set of normalized SCM fields (provider, repository, commit, PR number, event ID, etc.), and a canonical JSON template. PaC watches and caches this policy; for each webhook event, it normalizes the selected fields, renders the template, computes a UUIDv5 hash, and labels all PipelineRuns created for that event with the resulting ECID.


I assume that we cannot expect PaC to hardcode the name konflux-correlation-policy. You could propose adding a new field to the .spec of the Repository resource that declares the name of a Secret that should be used by PaC to find the correlation policy.

If that correlation policy is not defined, what should the behavior be? No ECIDs created?

ci-operator · 2025-11-21

Good call: setting this in the Repository spec (e.g. `spec.correlationPolicy.secretRef.name`) is the right place, even if it breaks our “no CRD modifications” guarantee 😅. That also lets each Repository pick its own correlation policy, which is useful both for mixed SCM providers (different webhook payload shapes) and for teams that want different correlation semantics (e.g. per-webhook vs per-PR+commit) even within the same namespace.

If that field isn’t set, PaC simply doesn’t label PipelineRuns with ECIDs for that repository. If it is set but the Secret is missing or invalid, PaC treats correlation as disabled for that repo and logs a warning rather than silently falling back to a default policy. Implementation specifics like these can be worked out in more detail in the PaC-specific discussion.
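To make the three cases above concrete (field unset, Secret missing, Secret invalid), here is a minimal Python sketch of the lookup. It is not PaC code: the `resolve_policy` helper, the plain-dict stand-ins for the Repository and Secret objects, and the `namespace`/`fields` keys inside the Secret are all assumptions made for illustration. Only the `spec.correlationPolicy.secretRef.name` path comes from the suggestion above.

```python
import logging
import uuid
from typing import Optional

log = logging.getLogger("correlation")


def resolve_policy(repository: dict, secrets: dict) -> Optional[dict]:
    """Return a validated policy, or None when correlation is disabled.

    `repository` mimics a Repository CR as a dict and `secrets` maps Secret
    names to decoded data. Both shapes are hypothetical.
    """
    spec = repository.get("spec", {})
    ref = spec.get("correlationPolicy", {}).get("secretRef", {}).get("name")
    if not ref:
        # Field unset: no ECID labels for this repository, nothing to warn about.
        return None

    secret = secrets.get(ref)
    if secret is None:
        log.warning("correlation policy Secret %r not found; correlation disabled", ref)
        return None

    try:
        namespace = uuid.UUID(secret["namespace"])  # assumed key: UUIDv5 namespace
        fields = list(secret["fields"])             # assumed key: selected SCM fields
        if not fields:
            raise ValueError("empty field list")
    except (KeyError, ValueError, TypeError, AttributeError) as err:
        # Invalid policy: disable rather than fall back to a default policy.
        log.warning("correlation policy Secret %r is invalid (%s); correlation disabled", ref, err)
        return None

    return {"namespace": namespace, "fields": fields}
```

Returning `None` in all three cases keeps the caller simple: it labels PipelineRuns only when it gets a policy back, and the warning log is the only difference between "not configured" and "misconfigured".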

Longer term, I imagine PaC could support a cluster-level default policy that applies only when the Repository doesn’t specify one and correlation is enabled globally, but that feels additive and not required for a first iteration.

ralphbean · 2025-11-18T12:52:37Z

ADR/0056-schema-goverened-correlation-id.md

+
+Konflux CI orchestrates multi-stage workflows involving Build PipelineRuns for individual `Component` resources, followed by creation and reconciliation of `Snapshot` resources, execution of `IntegrationTestScenario` PipelineRuns, and finally Release CRs and Release PipelineRuns. Although these steps represent a single logical workflow, Konflux currently lacks a deterministic way to correlate all PipelineRuns generated by the same webhook event or logical change.
+
+Tekton’s built-in identifiers (such as event IDs) are random and cannot be reproduced. CEL templates can extract event metadata but cannot compute stable hashes. PipelineRuns may be short-lived or created in remote clusters, preventing reliable correlation based on their presence in the cluster. Downstream observability systems that consume PipelineRun CloudEvents require a durable, deterministic identifier spanning the full Build -> Integration -> Release lifecycle.


Why is determinism important to the use case?

ci-operator · 2025-11-21

Determinism matters here because the schema lets each policy define what one “unit of work” is and get a stable correlation key for it. For example, if the schema uses {provider, repo, event_id} we get a per-webhook key; if it uses {provider, repo, pr_number, commit_sha} we get a per-PR-and-commit key; if it uses {provider, repo, branch, commit_sha} we get a per-commit-on-branch key. In each case the same inputs always produce the same ID, unlike a random value such as tekton.dev/triggers-eventid, which is different on each webhook event.
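A small Python sketch of that idea, under stated assumptions: the namespace UUID, the `normalize` rule (stringify, trim, lowercase), and the compact sorted-key JSON used as the canonical form are illustrative choices, not the ADR's template format. The field names follow the examples in this comment.

```python
import json
import uuid

# Hypothetical policy namespace; a real policy would carry its own UUID.
POLICY_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/correlation-policy")


def normalize(value) -> str:
    """Assumed normalization: stringify, trim whitespace, lowercase."""
    return str(value).strip().lower()


def correlation_id(event: dict, fields: list[str],
                   namespace: uuid.UUID = POLICY_NAMESPACE) -> str:
    """Hash the selected, normalized fields into a UUIDv5."""
    selected = {name: normalize(event[name]) for name in fields}
    # Sorted keys and fixed separators make the serialized form reproducible.
    canonical = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return str(uuid.uuid5(namespace, canonical))


# Two webhook deliveries for the same PR and commit.
event = {"provider": "GitHub", "repo": "org/app", "pr_number": 42,
         "commit_sha": "ABC123", "event_id": "delivery-1"}
redelivery = dict(event, event_id="delivery-2")

per_pr_commit = ["provider", "repo", "pr_number", "commit_sha"]
per_webhook = ["provider", "repo", "event_id"]
```

With the `per_pr_commit` schema both deliveries map to the same ID, while `per_webhook` separates them. That is the policy-level choice of "unit of work" described above.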

Because that key is a deterministic function of fields that can include the commit, you can safely treat “group by correlation-id” as “group by this change/delivery” and reliably join to commit timestamps to compute per-change or per-delivery durations and SLO-style metrics. For example, by deterministically keying on a per-PR-and-commit schema for SCM events that carry the commit timestamp, we can reliably use that timestamp to expose metric spans which begin at commit time.
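As a rough illustration of the "group by correlation-id, join to commit time" step, the Python sketch below computes one duration per correlation ID from a list of PipelineRun completion records. The record shape (`correlation_id`, `finished`) and the `commit_times` mapping are hypothetical; a real consumer would read these from PipelineRun CloudEvents and SCM data.

```python
from datetime import datetime


def change_durations(events: list[dict], commit_times: dict[str, str]) -> dict[str, float]:
    """Seconds from commit time to the last finished PipelineRun, per correlation ID."""
    latest: dict[str, datetime] = {}
    for ev in events:
        cid = ev["correlation_id"]
        finished = datetime.fromisoformat(ev["finished"])
        if cid not in latest or finished > latest[cid]:
            latest[cid] = finished
    return {
        cid: (end - datetime.fromisoformat(commit_times[cid])).total_seconds()
        for cid, end in latest.items()
        if cid in commit_times  # skip IDs we cannot join to a commit
    }


# Build and Release runs for one change share the same correlation ID.
events = [
    {"correlation_id": "ecid-1", "finished": "2025-11-18T10:05:00+00:00"},
    {"correlation_id": "ecid-1", "finished": "2025-11-18T10:30:00+00:00"},
]
commit_times = {"ecid-1": "2025-11-18T10:00:00+00:00"}
durations = change_durations(events, commit_times)
```

The join only works because every run of the same change carries the same key; with a random per-event ID the two records would land in separate groups.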

I've added this context into the ADR - does this help clarify the concept?

arewm · 2025-11-19T03:35:26Z

ADR/0056-schema-goverened-correlation-id.md

+Generated by Pipelines-as-Code (PaC) per webhook event. Each namespace may define a `konflux-correlation-policy` Secret containing a UUIDv5 namespace, a selected set of normalized SCM fields (provider, repository, commit, PR number, event ID, etc.), and a canonical JSON template. PaC watches and caches this policy; for each webhook event, it normalizes the selected fields, renders the template, computes a UUIDv5 hash, and labels all PipelineRuns created for that event with the resulting ECID.
+
+**Snapshot Correlation ID (SCID).**
+Generated by Integration when a Snapshot contains components produced from multiple ECIDs. Build-service records the ECID in each Component’s build metadata when a Build PipelineRun completes. When Integration creates a Snapshot, it copies these ECIDs into the Snapshot’s component list. On reconciliation, if all components share the same ECID, that value becomes the Snapshot’s correlation ID; if multiple ECIDs are present, Integration computes a UUIDv5 over the sorted distinct ECIDs and uses it as the SCID. Parent ECIDs are stored in paged Snapshot annotations along with page count, total count, and a fingerprint.


Build-service records the ECID in each Component’s build metadata when a Build PipelineRun completes.

What build metadata are you referring to? The build service doesn't propagate anything to resources on the cluster. It is actually the integration service which updates the Component CR with the latest known image build, but this is only so that it knows what references to use for the Snapshot generation.

ci-operator · 2025-11-21

You’re right: build-service doesn’t itself write to cluster resources. The ADR text was imprecise about this and I've updated it for clarity. The flow we’re actually proposing is that when a Build PipelineRun completes, integration-service reads the correlation ID from it and writes it (the ECID) into the Component CR as an annotation.

As for extending the Component beyond its current use, that's a fair concern. Today the Component is just the orchestration pointer to the latest image, and what we’re proposing is a very small, aligned extension of that role. By adding the ECID for that same latest build we’re not turning Component into a provenance database. We’re just letting the object that already declares which build is used for Snapshots also carry the one extra piece of correlation metadata that makes it usable as a live observation hook for exposing build->snapshot->release flows and SLO-ish metrics.

arewm · 2025-11-19T03:41:43Z

ADR/0056-schema-goverened-correlation-id.md

+Generated by Pipelines-as-Code (PaC) per webhook event. Each namespace may define a `konflux-correlation-policy` Secret containing a UUIDv5 namespace, a selected set of normalized SCM fields (provider, repository, commit, PR number, event ID, etc.), and a canonical JSON template. PaC watches and caches this policy; for each webhook event, it normalizes the selected fields, renders the template, computes a UUIDv5 hash, and labels all PipelineRuns created for that event with the resulting ECID.
+
+**Snapshot Correlation ID (SCID).**
+Generated by Integration when a Snapshot contains components produced from multiple ECIDs. Build-service records the ECID in each Component’s build metadata when a Build PipelineRun completes. When Integration creates a Snapshot, it copies these ECIDs into the Snapshot’s component list. On reconciliation, if all components share the same ECID, that value becomes the Snapshot’s correlation ID; if multiple ECIDs are present, Integration computes a UUIDv5 over the sorted distinct ECIDs and uses it as the SCID. Parent ECIDs are stored in paged Snapshot annotations along with page count, total count, and a fingerprint.


On reconciliation, if all components share the same ECID, that value becomes the Snapshot’s correlation ID; if multiple ECIDs are present, Integration computes a UUIDv5 over the sorted distinct ECIDs and uses it as the SCID. Parent ECIDs are stored in paged Snapshot annotations along with page count, total count, and a fingerprint.

The integration service has some functionality for creating group snapshots for all components built from the same commit, which feels similar to some of the desired functionality here. This is generally not data that the integration service has available to it, since its only state is the set of latest Component builds as recorded in the Component CR.

How do you envision that the ECIDs would be stored in the Snapshot? We don't have a database available to ensure that we have all possible Snapshots available. If you are proposing to use existing CRs for this data, we are generally trying to reduce the amount of data stored in etcd across all controllers. This might be problematic for getting some of this data unless the integration service is aware of KubeArchive and can fetch data out of it.

ci-operator · 2025-11-21

The intent is that Integration still only uses the state it has today: the latest build info on Component plus the current Snapshot. When a build PipelineRun completes, Integration reads the ECID from the PipelineRun and writes it as an annotation on the corresponding Component. When a Snapshot is reconciled, it looks up those Components, reads their ECIDs, and computes the distinct ECID set for that Snapshot only. It then either reuses the single ECID or derives an SCID and writes the SCID plus the parent ECID list as paged annotations on that Snapshot. There’s no extra database, no need to discover “all Snapshots”, and no KubeArchive lookup in the reconcile path.
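The reuse-or-derive decision can be sketched in a few lines of Python. This is an illustration only: the `SCID_NAMESPACE` value and the newline-joined input to the hash are assumptions, and the handling of Snapshots whose Components carry no ECID at all (returning `None`) is a guess, since the thread does not specify it.

```python
import uuid
from typing import Optional

# Hypothetical namespace for derived Snapshot IDs.
SCID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/scid")


def snapshot_correlation_id(component_ecids: list[str]) -> tuple[Optional[str], list[str]]:
    """Return (correlation_id, parent_ecids) for one Snapshot.

    A single distinct ECID is reused as-is with no parents recorded.
    Several distinct ECIDs yield a derived UUIDv5 plus the sorted parent list.
    """
    distinct = sorted({ecid for ecid in component_ecids if ecid})
    if not distinct:
        return None, []           # assumed behavior: nothing to correlate
    if len(distinct) == 1:
        return distinct[0], []    # single-origin Snapshot keeps the ECID
    # Sorting first makes the result independent of component order.
    scid = uuid.uuid5(SCID_NAMESPACE, "\n".join(distinct))
    return str(scid), distinct
```

Because the input is only the ECIDs already present on the Snapshot's own Components, the function needs no knowledge of other Snapshots or of archived data.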

On the etcd side, this is bounded to one correlation label per object and parent ECID annotations only on multi-origin Snapshots and Release PipelineRuns, with paging over the distinct ECIDs to stay under a configurable size cap. KubeArchive is still useful for long-term archival, but for this we need correlation metadata on live CRs and PipelineRun events so we can do near-real-time aggregation and alerting, which makes depending on an eventually consistent archive in the reconcile loop a non-starter.
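For a sense of what "paged annotations under a size cap" could look like, here is a hedged Python sketch. The annotation key prefix, the comma-separated page format, the default cap, and the SHA-256 fingerprint are all hypothetical choices; the ADR only states that pages, a page count, a total count, and a fingerprint are stored.

```python
import hashlib

# Hypothetical annotation key prefix.
PREFIX = "example.invalid/parent-ecids"


def paged_annotations(ecids: list[str], max_bytes: int = 1024) -> dict[str, str]:
    """Split the sorted distinct ECIDs into annotation values of at most max_bytes each."""
    distinct = sorted(set(ecids))
    pages: list[list[str]] = []
    current: list[str] = []
    for ecid in distinct:
        candidate = current + [ecid]
        # Start a new page when adding this ECID would exceed the cap.
        # A single oversized ECID still gets a page of its own.
        if current and len(",".join(candidate).encode()) > max_bytes:
            pages.append(current)
            current = [ecid]
        else:
            current = candidate
    if current:
        pages.append(current)

    annotations = {f"{PREFIX}.page-{i}": ",".join(page) for i, page in enumerate(pages)}
    annotations[f"{PREFIX}.page-count"] = str(len(pages))
    annotations[f"{PREFIX}.total-count"] = str(len(distinct))
    # The fingerprint lets a reader check that it reassembled every page.
    annotations[f"{PREFIX}.fingerprint"] = hashlib.sha256("\n".join(distinct).encode()).hexdigest()
    return annotations
```

A consumer would read `page-count`, concatenate the pages in order, and compare the recomputed fingerprint against the stored one before trusting the parent list.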

ci-operator requested a review from a team as a code owner November 17, 2025 23:37

ralphbean reviewed Nov 18, 2025

arewm mentioned this pull request Nov 18, 2025

Proposal: Schema-Governed Correlation ID Propagation konflux-ci/community#44

Closed

arewm reviewed Nov 19, 2025

ci-operator added 3 commits November 20, 2025 08:41

Clarify that not build-service but integration-service pathes Components with ECID annotations.

cb4a4a8

Name Konflux services explicitly

d6b097c

Detail rationale for defining 'unit of work' event with deterministic field hash

4e1c6b1

arewm modified the milestone: 2025-12-04 Dec 2, 2025
