
Rewrite artifact promoter pipeline #1701

@saschagrunert

Description

This issue supersedes the open spikes on project board #171: #1126, #1127, #1128, #1129, #1130, #1131, #1132, #1133.

Context

The artifact promoter promotes container images from staging registries (gcr.io/k8s-staging-*) to production (registry.k8s.io). This issue answers all open spikes and describes the new architecture.


Implementation Phases

  • Rate limiting: Rewrite rate limiter with budget allocation and adaptive backoff #1702
    Adaptive rate limiter with 429 backoff, budget allocator with named sub-budgets and rebalancing
  • Registry provider
    registry.Provider interface with ReadRegistries/CopyImage, crane implementation, in-memory fake
  • Auth interfaces
    auth.IdentityTokenProvider/ServiceActivator interfaces, GCP implementation, static/noop for testing
  • Vulnerability scanner
    vuln.Scanner interface, GrafeasScanner (GCP Container Analysis), NoopScanner
  • Provenance
    provenance.Verifier/Generator interfaces, CosignVerifier, PromotionGenerator (SLSA v1.0), NoopVerifier
  • Pipeline engine
    Generic pipeline engine with phases, ErrStopPipeline, pipeline phases wired as closures in promoter.go
  • Remove legacy pipeline
    Removed legacy RunChecks/PreCheck pipeline from SyncContext, decoupled signing from gcloud package
  • Remove legacy deps
    Replace SyncContext.Promote() with registry.Provider.CopyImage(), move PromotionEdge/ImageTag to promoter/image/promotion/, rewrite snapshot path, inline gcloud CLI calls, delete audit/e2e/CLI legacy packages
  • Delete legacy monolith
    Delete inventory.go, types.go, checks.go, gcloud/, stream/, json/, reqcounter/, container/, timewrapper/

Spike Answers

#1126 — What would be an ideal length for a K8s promotion?

Data (from #1124):

  • Current K8s core image promotion job: 33min+, frequently failing on signature replication (429 errors)
  • For <40 images: signing = 50-75% of time, promotion = 17-35%
  • For >100 images: promotion = 50-75%, signing = 15-30%
  • Signature validation: consistently <3% — never the bottleneck

Root cause: The HTTP rate limiter was a global singleton shared between promotion and signing. Set to 50 req/sec with burst=1, it limited only GET/HEAD requests; writes were unlimited but still counted against Artifact Registry quotas.

Target: <10 minutes for K8s core images. Achieved by:

  1. Separating rate limit budgets for promotion vs signing (70/30 split)
  2. Rate-limiting writes (not just reads)
  3. Adding adaptive backoff on 429 responses (10s backoff, 15s cooldown)
  4. Rebalancing budget after promotion completes (give signing 100%)

See promoter/image/ratelimit/.
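
A minimal sketch of the budget idea, assuming golang.org/x/time/rate; the names (Budget, Wait, Rebalance) and the numbers are illustrative, not the actual promoter/image/ratelimit API:

```go
package ratelimit

import (
	"context"
	"sync"

	"golang.org/x/time/rate"
)

// Budget splits a total requests-per-second allowance into named
// sub-budgets, e.g. "promote" (70%) and "sign" (30%).
type Budget struct {
	mu       sync.Mutex
	totalRPS float64
	limiters map[string]*rate.Limiter
}

// NewBudget creates one limiter per named share of the total rate.
func NewBudget(totalRPS float64, shares map[string]float64) *Budget {
	b := &Budget{totalRPS: totalRPS, limiters: map[string]*rate.Limiter{}}
	for name, share := range shares {
		b.limiters[name] = rate.NewLimiter(rate.Limit(totalRPS*share), 1)
	}
	return b
}

// Wait blocks until the named sub-budget permits one more request.
func (b *Budget) Wait(ctx context.Context, name string) error {
	b.mu.Lock()
	l := b.limiters[name]
	b.mu.Unlock()
	return l.Wait(ctx)
}

// Rebalance hands the full allowance to one sub-budget, e.g. giving
// signing 100% of the rate once promotion has completed.
func (b *Budget) Rebalance(name string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if l, ok := b.limiters[name]; ok {
		l.SetLimit(rate.Limit(b.totalRPS))
	}
}
```

Construction would look like `NewBudget(50, map[string]float64{"promote": 0.7, "sign": 0.3})`, with `Rebalance("sign")` called once the promote phase finishes; the adaptive 429 backoff would sit on top of this in the HTTP transport.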

#1127 — Do we need to validate signatures from staging in parallel?

No, this is not a bottleneck. Signature validation takes <3% of total time. The real performance problem was the shared rate limiter between promotion and signing. Keep signature validation as-is.

#1128 — Can we promote images not built by Google Cloud Build?

Yes, the code already supports this. The promotion pipeline uses crane (OCI-generic) for all image operations. The "GCB requirement" is an infrastructure constraint (who can push to staging registries), not a code constraint. Any image in a staging registry with a matching digest in the manifest YAML gets promoted.

GCP coupling has been abstracted behind interfaces:

  • auth.IdentityTokenProvider / auth.ServiceActivator — abstract OIDC and service account activation
  • registry.Provider — abstract registry listing and image copying
  • vuln.Scanner — abstract vulnerability scanning (replaces GCP-only Container Analysis)

See promoter/image/auth/, promoter/image/registry/, promoter/image/vuln/.
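
As an illustration only, the interfaces might look roughly like this (collapsed into a single snippet; apart from ReadRegistries/CopyImage, the method names and signatures are assumptions rather than the actual promoter code):

```go
package promoter

import "context"

// IdentityTokenProvider abstracts OIDC identity token retrieval
// (GCP implementation, static/noop variants for testing).
type IdentityTokenProvider interface {
	IdentityToken(ctx context.Context, audience string) (string, error)
}

// ServiceActivator abstracts service account activation.
type ServiceActivator interface {
	Activate(ctx context.Context, serviceAccount string) error
}

// Provider abstracts registry listing and image copying; a crane-based
// implementation and an in-memory fake would both satisfy it.
type Provider interface {
	// ReadRegistries lists images per registry (the return type here is a
	// simplification of the real inventory types).
	ReadRegistries(ctx context.Context, registries []string) (map[string][]string, error)
	// CopyImage copies a single image by digest from src to dst.
	CopyImage(ctx context.Context, srcRef, dstRef string) error
}

// Scanner abstracts vulnerability scanning (GrafeasScanner for GCP
// Container Analysis, NoopScanner everywhere else).
type Scanner interface {
	Scan(ctx context.Context, imageRef string) ([]Finding, error)
}

// Finding is a hypothetical portable vulnerability record.
type Finding struct {
	CVE      string
	Severity string
}
```

Because every GCP dependency sits behind one of these interfaces, a non-GCP deployment only needs alternative implementations; the static/noop variants double as test fakes.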

#1129 — How to split validating data from validating signatures?

The promotion flow is formalized into independent pipeline phases:

| Phase | Name | Behavior |
| --- | --- | --- |
| 1 | setup | ValidateOptions, ActivateServiceAccounts, PrewarmTUFCache |
| 2 | plan | ParseManifests, GetPromotionEdges. Stops early if --parse-only. |
| 3 | provenance | Optional provenance verification via Verifier interface. Skipped when --require-provenance=false (default). |
| 4 | validate | ValidateStagingSignatures. Stops early if not --confirm (dry-run). |
| 5 | promote | Copy images. Rebalances rate budget to give signing 100% capacity. |
| 6 | sign | Cosign signing + signature replication. |
| 7 | attest | Copy pre-generated SBOMs from staging to production, generate promotion provenance. |

Each phase gets its own rate limit budget and error handling. The pipeline engine is generic (promoter/image/pipeline/) — phases are closures in promoter.go that capture shared state.
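
One plausible shape for that engine, sketched under the assumption that a phase is just a named closure and ErrStopPipeline is a sentinel error:

```go
package pipeline

import (
	"context"
	"errors"
	"fmt"
)

// ErrStopPipeline ends the run early without reporting a failure
// (e.g. after plan when --parse-only is set, or before promote in dry-run).
var ErrStopPipeline = errors.New("stop pipeline")

// Phase is one named step of the promotion pipeline.
type Phase struct {
	Name string
	Run  func(ctx context.Context) error
}

// Run executes the phases in order. ErrStopPipeline stops the run cleanly;
// any other error aborts it with the failing phase's name attached.
func Run(ctx context.Context, phases []Phase) error {
	for _, p := range phases {
		if err := p.Run(ctx); err != nil {
			if errors.Is(err, ErrStopPipeline) {
				return nil
			}
			return fmt.Errorf("phase %q: %w", p.Name, err)
		}
	}
	return nil
}
```

In promoter.go each phase would then be a closure over the shared options, promotion edges, and rate budget, e.g. `{Name: "plan", Run: func(ctx context.Context) error { ... }}`.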

#1130 — Security risks of breaking down the image-promoter

  1. Unsigned window: Images exist in production unsigned between the promote and sign phases.
    • Mitigation: Signing runs immediately after promotion with budget rebalancing. The window is minutes, not hours, and matches the current behavior.
  2. Partial failure recovery: If promotion succeeds but signing fails, some images remain unsigned.
    • Mitigation: Signing is idempotent: SignImages() skips images that already carry signatures, so a follow-up signing job completes the work (see the sketch after this list).
  3. Credential scope: A single auth.IdentityTokenProvider is injected into all phases, so credentials are handled in one place.
  4. Race conditions: crane.Copy is idempotent (digest-based), so two promoters copying the same digest are harmless.
  5. Supply chain: All phases run in the same process; the pipeline is an in-process abstraction, not a distributed system.
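
A minimal sketch of the idempotency check behind risk 2, assuming the standard cosign tag convention (a signature lives at `<repo>:sha256-<hex>.sig`); alreadySigned is a hypothetical helper, not the actual SignImages() code:

```go
package signcheck

import (
	"strings"

	"github.com/google/go-containerregistry/pkg/crane"
)

// alreadySigned reports whether a cosign signature already exists for the
// given image digest. Cosign stores signatures at <repo>:sha256-<hex>.sig,
// so if that tag resolves, a follow-up signing job can safely skip the image.
func alreadySigned(repo, digest string) bool {
	sigTag := strings.Replace(digest, ":", "-", 1) + ".sig"
	_, err := crane.Digest(repo + ":" + sigTag)
	return err == nil
}
```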

#1131 — Do we verify image digest and does the reference exist?

  • Digest format: Validated via regex ^sha256:[0-9a-f]{64}$
  • Reference existence: Checked during inventory; the staging registry is read and the digest's presence is confirmed (see the sketch after this list)
  • Provenance: Optional verification via provenance.Verifier interface. CosignVerifier checks SLSA attestation tags and verifies builder/source repo against policy. Enabled with --require-provenance.
  • Vulnerability scanning: Optional via vuln.Scanner interface. GrafeasScanner wraps GCP Container Analysis; NoopScanner for non-GCP. Controlled by --vuln-severity-threshold.
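
For illustration, the first two checks could be combined roughly as follows, using go-containerregistry's crane package; verifyStagingImage is a hypothetical helper:

```go
package validate

import (
	"fmt"
	"regexp"

	"github.com/google/go-containerregistry/pkg/crane"
)

// digestRE matches the expected image digest format.
var digestRE = regexp.MustCompile(`^sha256:[0-9a-f]{64}$`)

// verifyStagingImage validates the digest format and confirms that the
// staging repository actually serves a manifest for that digest.
func verifyStagingImage(stagingRepo, digest string) error {
	if !digestRE.MatchString(digest) {
		return fmt.Errorf("invalid digest format: %q", digest)
	}
	// crane.Digest resolves the manifest; an error means the reference
	// does not exist in the staging registry.
	if _, err := crane.Digest(stagingRepo + "@" + digest); err != nil {
		return fmt.Errorf("digest %s not found in %s: %w", digest, stagingRepo, err)
	}
	return nil
}
```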

#1132 — Formalize the artifact validation process

The formal process maps directly to the pipeline phases:

1. MANIFEST VALIDATION (plan phase)
   - Parse YAML manifest, validate digest/tag format, registry names, overlapping edges

2. INVENTORY CHECK (plan phase)
   - Read staging/production registries, compute promotion edges (set difference)

3. PROVENANCE VERIFICATION (provenance phase, optional)
   - Check SLSA attestation on staging images
   - Verify builder identity and source repo against allowed lists

4. SIGNATURE VALIDATION (validate phase)
   - Verify cosign signatures exist on staging images

5. VULNERABILITY SCANNING (optional, via vuln.Scanner interface)
   - Scan staging images for CVEs above severity threshold

6. PROMOTION (promote phase)
   - Copy images from staging to production, rate-limited

7. SIGNING (sign phase)
   - Sign promoted images, replicate signatures to mirrors, rate-limited

8. ATTESTATION (attest phase)
   - Copy SBOMs from staging to production (cosign tag convention)
   - Generate SLSA v1.0 promotion provenance (--generate-promotion-provenance)

Steps 3, 4, 5 are opt-in gates.
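
To make the early stops and opt-in gates concrete, here is a sketch of how the first few phases could be wired; Options, its fields, and the stubbed closures are assumptions for illustration, and only the plan, provenance, and validate steps are shown:

```go
package promoter

import (
	"context"
	"errors"
)

// Local copies of the engine sketch's types, so this snippet stands alone.
type Phase struct {
	Name string
	Run  func(ctx context.Context) error
}

var ErrStopPipeline = errors.New("stop pipeline")

// Options holds an illustrative subset of the flags that control the gates.
type Options struct {
	ParseOnly         bool // --parse-only: stop after plan
	Confirm           bool // --confirm: actually promote (otherwise dry-run)
	RequireProvenance bool // --require-provenance: enable step 3
}

// gatedPhases sketches how the opt-in gates and early stops map onto
// pipeline phases; the closure bodies are stubs, not the real promoter code.
func gatedPhases(opts Options) []Phase {
	return []Phase{
		{Name: "plan", Run: func(ctx context.Context) error {
			// ... parse manifests, compute promotion edges ...
			if opts.ParseOnly {
				return ErrStopPipeline
			}
			return nil
		}},
		{Name: "provenance", Run: func(ctx context.Context) error {
			if !opts.RequireProvenance {
				return nil // gate is off by default
			}
			// ... verify SLSA attestations via provenance.Verifier ...
			return nil
		}},
		{Name: "validate", Run: func(ctx context.Context) error {
			// ... verify staging signatures ...
			if !opts.Confirm {
				return ErrStopPipeline // dry-run stops before promote
			}
			return nil
		}},
		// promote, sign, and attest phases follow as in the table above.
	}
}
```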

#1133 — User research

A survey was conducted by @lasomethingsomething, but conclusions were not drawn from it. Key user categories:

  • Hyperscalers (build from source, less concerned with promotion integrity)
  • Custom installer builders (KubeSpray etc., need signed images)
  • End users building K8s environments (need trust chain)
  • Sovereign cloud operators (need SLSA compliance)

All new validation gates default to off for backwards compatibility.


Architecture

```mermaid
graph TD
    P["promoter.go (orchestrator)"]
    P --> PI["PromoteImages()"]
    P --> SN["Snapshot()"]
    P --> SS["SecurityScan()"]

    PI --> PE["Pipeline Engine"]
    PE --> S1["setup"]
    PE --> S2["plan"]
    PE --> S3["provenance"]
    PE --> S4["validate"]
    PE --> S5["promote"]
    PE --> S6["sign"]
    PE --> S7["attest"]

    PI & SN & SS --> IMPL["promoterImplementation interface"]

    IMPL --> RP["registry.Provider"]
    IMPL --> AUTH["auth.IdentityTokenProvider + ServiceActivator"]
    IMPL --> VS["vuln.Scanner"]

    RP --> CRANE["CraneProvider (go-containerregistry)"]
    AUTH --> GCP["GCP Auth (gcloud CLI)"]
    VS --> GRAFEAS["GrafeasScanner (Container Analysis)"]

    IMPL -.-> RL["ratelimit.Budget (70/30 promote/sign)"]
    IMPL -.-> PROV["provenance.Verifier + Generator (SLSA v1.0)"]
```

Shared types kept from legacy: schema.Manifest, registry.RegInvImage, registry.Context

Documentation

The existing docs in docs/ have been updated to reflect the rewrite:

  • docs/image-promotion.md — OCI-generic support, pipeline phases, CLI flags (--require-provenance, --allowed-builders, --allowed-source-repos, --generate-promotion-provenance), rate limiting, signing, SBOM copying, provenance generation.
  • docs/checks.md — vuln.Scanner interface, portable severity model, --vuln-severity-threshold.

Labels

area/release-eng, kind/feature, lifecycle/active, sig/release
