Introduce ADR: Schema-governed correlation ID for workflows #280
This ADR introduces a schema-governed correlation ID for tracking workflows across multiple stages in Konflux, addressing the need for a deterministic identifier spanning multiple Tekton PipelineRuns. It outlines the generation and propagation of correlation IDs (ECID and SCID) without modifying existing CRDs, ensuring compatibility and observability.
Two identifiers are defined:

**Event Correlation ID (ECID).**
Generated by Pipelines-as-Code (PaC) per webhook event. Each namespace may define a `konflux-correlation-policy` Secret containing a UUIDv5 namespace, a selected set of normalized SCM fields (provider, repository, commit, PR number, event ID, etc.), and a canonical JSON template. PaC watches and caches this policy; for each webhook event, it normalizes the selected fields, renders the template, computes a UUIDv5 hash, and labels all PipelineRuns created for that event with the resulting ECID.
Have you proposed the correlation policy feature to the pipelines-as-code maintainers yet? This depends on a new feature in PaC, which is a project independent of the konflux-ci community umbrella. At minimum, we'll need their maintainers' feedback here. Even better would be if we can point to a discussion in their project, upstream from Konflux.
Thanks for pointing me in the right direction! I have proposed this concept to @vdemeester for initial review - with it now better defined, I've started the discussion in the PaC repository.
I assume that we cannot expect PaC to hardcode the name konflux-correlation-policy. You could propose adding a new field to the .spec of the Repository resource that declares the name of a Secret that should be used by PaC to find the correlation policy.
If that correlation policy is not defined, what should the behavior be? No ECIDs created?
Good call – setting this in the Repository spec (e.g. spec.correlationPolicy.secretRef.name) is the right place, even if it breaks our “no CRD modifications” guarantee 😅. That also lets each Repository pick its own correlation policy, which is useful both for mixed SCM providers (different webhook payload shapes) and for teams that want different correlation semantics (e.g. per-webhook vs per-PR+commit) even within the same namespace.
If that field isn’t set, PaC simply doesn’t label PipelineRuns with ECIDs for that repository. If it is set but the Secret is missing or invalid, PaC treats correlation as disabled for that repo and logs a warning rather than silently falling back to a default policy. Implementation specifics like these can be worked out in more detail in the PaC-specific discussion.
Longer term, I imagine PaC could support a cluster-level default policy that applies only when the Repository doesn’t specify one and correlation is enabled globally, but that feels additive and not required for a first iteration.
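To make the fallback behavior described above concrete, here is a minimal Python sketch of the lookup. It is not PaC code: the function name `resolve_policy`, the `Policy` type, and the Secret keys `namespace` and `fields` are assumptions for illustration, and plain dicts stand in for the Repository spec and the cluster's Secrets.

```python
import logging
import uuid
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("correlation-policy")


@dataclass(frozen=True)
class Policy:
    namespace: uuid.UUID  # UUIDv5 namespace from the policy Secret
    fields: tuple         # names of the SCM fields that feed the ECID


def resolve_policy(repo_name: str, repo_spec: dict, secrets: dict) -> Optional[Policy]:
    """Return the correlation policy for a Repository, or None if disabled.

    `repo_spec` mimics the proposed spec.correlationPolicy.secretRef.name field.
    `secrets` maps Secret names to decoded data; the key names inside the
    Secret ("namespace", "fields") are hypothetical.
    """
    ref = ((repo_spec.get("correlationPolicy") or {}).get("secretRef") or {}).get("name")
    if not ref:
        # Field not set: no ECID labels for this repository, and no warning.
        return None

    secret = secrets.get(ref)
    if secret is None:
        log.warning("repository %s: policy Secret %r not found; correlation disabled",
                    repo_name, ref)
        return None

    try:
        namespace = uuid.UUID(str(secret["namespace"]))
        fields = tuple(secret["fields"])
        if not fields or not all(isinstance(f, str) for f in fields):
            raise ValueError("fields must be a non-empty list of strings")
    except (KeyError, TypeError, ValueError) as err:
        # Invalid policy: disable correlation instead of using a default policy.
        log.warning("repository %s: policy Secret %r is invalid (%s); correlation disabled",
                    repo_name, ref, err)
        return None

    return Policy(namespace, fields)
```

A caller would skip ECID labeling whenever this returns `None`, so the three cases (field unset, Secret missing, Secret invalid) all leave the PipelineRuns unlabeled and only the last two log a warning.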
Konflux CI orchestrates multi-stage workflows involving Build PipelineRuns for individual `Component` resources, followed by creation and reconciliation of `Snapshot` resources, execution of `IntegrationTestScenario` PipelineRuns, and finally Release CRs and Release PipelineRuns. Although these steps represent a single logical workflow, Konflux currently lacks a deterministic way to correlate all PipelineRuns generated by the same webhook event or logical change.

Tekton’s built-in identifiers (such as event IDs) are random and cannot be reproduced. CEL templates can extract event metadata but cannot compute stable hashes. PipelineRuns may be short-lived or created in remote clusters, preventing reliable correlation based on their presence in the cluster. Downstream observability systems that consume PipelineRun CloudEvents require a durable, deterministic identifier spanning the full Build -> Integration -> Release lifecycle.
Why is determinism important to the use case?
Determinism matters here because the schema lets each policy determine what one “unit of work” is and get a stable correlation key for it. For example, if the schema uses {provider, repo, event_id} we get a per-webhook key; if it uses {provider, repo, pr_number, commit_sha} we get a per-PR-and-commit key; if it uses {provider, repo, branch, commit_sha} we get a per-commit-on-branch key – and in each case the same inputs always produce the same ID, unlike a random value like tekton.dev/triggers-eventid that's different on each webhook event.
Because that key is a deterministic function of fields that can include the commit, you can safely treat “group by correlation-id” as “group by this change/delivery” and reliably join to commit timestamps to compute per-change or per-delivery durations and SLO-style metrics. Basically, by deterministically keying on, for example, a "per-PR-and-commit" schema over SCM events that carry the commit timestamp, we can reliably use that timestamp to expose metric spans that begin at commit time.
I've added this context into the ADR - does this help clarify the concept?
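As a rough illustration of the point about schemas, the sketch below derives an ECID from a selected field set using Python's standard `uuid.uuid5`. The normalization rule (trim and lowercase), the sorted-key JSON rendering that stands in for the ADR's canonical JSON template, the field names, and the namespace value are all assumptions, not the ADR's actual definitions.

```python
import json
import uuid


def normalize(value) -> str:
    # Assumed normalization: stringify, trim whitespace, lowercase.
    return str(value).strip().lower()


def compute_ecid(namespace: uuid.UUID, fields: list, event: dict) -> str:
    """Derive a deterministic ID from the selected fields of one SCM event."""
    selected = {name: normalize(event[name]) for name in fields}
    # Sorted keys and fixed separators keep the rendered text byte-stable,
    # which is what makes the UUIDv5 reproducible.
    canonical = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return str(uuid.uuid5(namespace, canonical))


# Hypothetical policy namespace and two webhook deliveries for the same PR and commit.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/correlation-policy")
delivery_1 = {"provider": "github", "repo": "org/app", "pr_number": 7,
              "commit_sha": "abc123", "event_id": "delivery-1"}
delivery_2 = dict(delivery_1, event_id="delivery-2")

per_webhook = ["provider", "repo", "event_id"]
per_pr_and_commit = ["provider", "repo", "pr_number", "commit_sha"]

# Per-webhook schema: each delivery gets its own key.
assert compute_ecid(NAMESPACE, per_webhook, delivery_1) != \
    compute_ecid(NAMESPACE, per_webhook, delivery_2)
# Per-PR-and-commit schema: both deliveries map to the same key.
assert compute_ecid(NAMESPACE, per_pr_and_commit, delivery_1) == \
    compute_ecid(NAMESPACE, per_pr_and_commit, delivery_2)
```

The same event data yields different grouping semantics purely by changing the field list, while any fixed field list always reproduces the same ID for the same inputs.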
**Snapshot Correlation ID (SCID).**
Generated by Integration when a Snapshot contains components produced from multiple ECIDs. Build-service records the ECID in each Component’s build metadata when a Build PipelineRun completes. When Integration creates a Snapshot, it copies these ECIDs into the Snapshot’s component list. On reconciliation, if all components share the same ECID, that value becomes the Snapshot’s correlation ID; if multiple ECIDs are present, Integration computes a UUIDv5 over the sorted distinct ECIDs and uses it as the SCID. Parent ECIDs are stored in paged Snapshot annotations along with page count, total count, and a fingerprint.
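The reconciliation rule in this hunk (reuse a single shared ECID, otherwise hash the sorted distinct set) could look roughly like the following Python sketch. The function name, the newline join used to serialize the ECID set, and the error on an empty input are assumptions; the hunk does not specify them.

```python
import uuid


def snapshot_correlation_id(namespace: uuid.UUID, component_ecids: list) -> str:
    """Pick or derive the correlation ID for one Snapshot.

    One distinct ECID is reused as-is; several are folded into a UUIDv5
    computed over the sorted distinct values (the SCID).
    """
    distinct = sorted(set(component_ecids))
    if not distinct:
        raise ValueError("snapshot has no component ECIDs")
    if len(distinct) == 1:
        return distinct[0]
    # Sorting first makes the result independent of component order.
    return str(uuid.uuid5(namespace, "\n".join(distinct)))
```

Because the input is deduplicated and sorted before hashing, two reconciliations that see the same components in a different order arrive at the same SCID.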
> Build-service records the ECID in each Component’s build metadata when a Build PipelineRun completes.
What build metadata are you referring to? The build service doesn't propagate anything to resources on the cluster. It is actually the integration service which updates the Component CR with the latest known image build, but this is only so that it knows what references to use for the Snapshot generation.
You’re right - build-service doesn’t, itself, write to cluster resources. The ADR text was imprecise about this and I've updated it for clarity. The actual flow we’re proposing is this: when a Build PipelineRun completes, integration-service reads the correlation-id from it and writes it (the ECID) into the Component CR as an annotation.
As for extending the Component's "only" use, that's a fair concern. Today the Component is just the orchestration pointer to the latest image, and what we’re proposing is a very small, aligned extension of that role. By adding the ECID for that same latest build, we’re not turning Component into a provenance database; we’re just letting the object that already declares which build is used for Snapshots also carry the one extra piece of correlation metadata. That makes it usable as a live observation hook for exposing build->snapshot->release flows and SLO-ish metrics.
> On reconciliation, if all components share the same ECID, that value becomes the Snapshot’s correlation ID; if multiple ECIDs are present, Integration computes a UUIDv5 over the sorted distinct ECIDs and uses it as the SCID. Parent ECIDs are stored in paged Snapshot annotations along with page count, total count, and a fingerprint.
The integration service has some functionality for creating group snapshots for all components built from the same commit, which feels similar to some of the desired functionality here. This is generally not data that the integration service has available to it, since its only state is the set of latest Component builds as recorded in the Component CR.
How do you envision that the ECIDs would be stored in the Snapshot? We don't have a database available to ensure that we have all possible Snapshots available. If you are proposing to use existing CRs for this data, we are generally trying to reduce the amount of data stored in etcd across all controllers. This might be problematic for getting some of this data unless the integration service is aware of KubeArchive and can fetch data out of it.
The intent is that Integration still only uses the state it has today: the latest build info on Component plus the current Snapshot. When a build PipelineRun completes, Integration reads the ECID from the PipelineRun and writes it as an annotation on the corresponding Component; when a Snapshot is reconciled, it looks up those Components, reads their ECIDs, computes the distinct ECID set for that Snapshot only, and either reuses the single ECID or derives an SCID and writes the SCID plus the parent ECID list as paged annotations on that Snapshot. There’s no extra database, no need to discover “all Snapshots”, and no KubeArchive lookup in the reconcile path.
On the etcd side, this is bounded to one correlation label per object and parent ECID annotations only on multi-origin Snapshots and Release PipelineRuns, with paging over the distinct ECIDs to stay under a configurable size cap. KubeArchive is still useful for long-term archival, but for this we need correlation metadata on live CRs and PipelineRun events so we can do near-real-time aggregation and alerting, which makes depending on an eventually consistent archive in the reconcile loop a non-starter.
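For a sense of how the paged parent-ECID annotations might stay under a size cap, here is a small Python sketch. Everything concrete in it is hypothetical: the annotation prefix `example.invalid/parent-ecids`, the key suffixes, the comma-separated page format, the default cap, and the use of SHA-256 for the fingerprint are placeholders, since the thread only says that pages, a page count, a total count, and a fingerprint are stored.

```python
import hashlib

PREFIX = "example.invalid/parent-ecids"  # hypothetical annotation prefix


def page_parent_ecids(ecids: list, max_bytes: int = 1024) -> dict:
    """Split the distinct parent ECIDs into annotation pages under a size cap."""
    distinct = sorted(set(ecids))
    pages, current = [], ""
    for ecid in distinct:
        candidate = ecid if not current else current + "," + ecid
        if current and len(candidate.encode()) > max_bytes:
            # Adding this ECID would exceed the cap: close the page, start a new one.
            pages.append(current)
            current = ecid
        else:
            current = candidate
    if current:
        pages.append(current)

    annotations = {f"{PREFIX}.page-{i}": page for i, page in enumerate(pages)}
    annotations[f"{PREFIX}.page-count"] = str(len(pages))
    annotations[f"{PREFIX}.total-count"] = str(len(distinct))
    # The fingerprint lets a consumer check that it reassembled every page.
    annotations[f"{PREFIX}.fingerprint"] = hashlib.sha256(
        ",".join(distinct).encode()).hexdigest()
    return annotations
```

A consumer can read `page-count`, join the pages in order, and compare the SHA-256 of the result with `fingerprint` to detect a missing or truncated page. Single-origin Snapshots would not need any of these annotations, which keeps the etcd footprint limited to the multi-origin case.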