Description
This issue supersedes the open spikes on project board #171: #1126, #1127, #1128, #1129, #1130, #1131, #1132, #1133.
Context
The artifact promoter promotes container images from staging registries (gcr.io/k8s-staging-*) to production (registry.k8s.io). This issue answers all open spikes and describes the new architecture.
Implementation Phases
- Rate limiting: Rewrite rate limiter with budget allocation and adaptive backoff (#1702). Adaptive rate limiter with 429 backoff, budget allocator with named sub-budgets and rebalancing.
- Registry provider: `registry.Provider` interface with `ReadRegistries`/`CopyImage`, crane implementation, in-memory fake.
- Auth interfaces: `auth.IdentityTokenProvider`/`ServiceActivator` interfaces, GCP implementation, static/noop for testing.
- Vulnerability scanner: `vuln.Scanner` interface, `GrafeasScanner` (GCP Container Analysis), `NoopScanner`.
- Provenance: `provenance.Verifier`/`Generator` interfaces, `CosignVerifier`, `PromotionGenerator` (SLSA v1.0), `NoopVerifier`.
- Pipeline engine: generic pipeline engine with phases, `ErrStopPipeline`, pipeline phases wired as closures in `promoter.go`.
- Remove legacy pipeline: removed the legacy `RunChecks`/`PreCheck` pipeline from `SyncContext`, decoupled signing from the `gcloud` package.
- Remove legacy deps: replace `SyncContext.Promote()` with `registry.Provider.CopyImage()`, move `PromotionEdge`/`ImageTag` to `promoter/image/promotion/`, rewrite the snapshot path, inline gcloud CLI calls, delete the audit/e2e/CLI legacy packages.
- Delete legacy monolith: delete `inventory.go`, `types.go`, `checks.go`, `gcloud/`, `stream/`, `json/`, `reqcounter/`, `container/`, `timewrapper/`.
Spike Answers
#1126 — What would be an ideal length for a K8s promotion?
Data (from #1124):
- Current K8s core image promotion job: 33min+, frequently failing on signature replication (429 errors)
- For <40 images: signing = 50-75% of time, promotion = 17-35%
- For >100 images: promotion = 50-75%, signing = 15-30%
- Signature validation: consistently <3% — never the bottleneck
Root cause: The HTTP rate limiter was a global singleton shared between promotion and signing. Set to 50 req/sec with burst=1, it only limited GET/HEAD requests — writes were unlimited but still counted against Artifact Registry quotas.
Target: <10 minutes for K8s core images. Achieved by:
- Separating rate limit budgets for promotion vs signing (70/30 split)
- Rate-limiting writes (not just reads)
- Adding adaptive backoff on 429 responses (10s backoff, 15s cooldown)
- Rebalancing budget after promotion completes (give signing 100%)
See promoter/image/ratelimit/.
#1127 — Do we need to validate signatures from staging in parallel?
No, this is not a bottleneck. Signature validation takes <3% of total time. The real performance problem was the shared rate limiter between promotion and signing. Keep signature validation as-is.
#1128 — Can we promote images not built by Google Cloud Build?
Yes, the code already supports this. The promotion pipeline uses crane (OCI-generic) for all image operations. The "GCB requirement" is an infrastructure constraint (who can push to staging registries), not a code constraint. Any image in a staging registry with a matching digest in the manifest YAML gets promoted.
GCP coupling has been abstracted behind interfaces:
- `auth.IdentityTokenProvider`/`auth.ServiceActivator` — abstract OIDC and service account activation
- `registry.Provider` — abstract registry listing and image copying
- `vuln.Scanner` — abstract vulnerability scanning (replaces GCP-only Container Analysis)
See promoter/image/auth/, promoter/image/registry/, promoter/image/vuln/.
#1129 — How to split validating data from validating signatures?
The promotion flow is formalized into independent pipeline phases:
| Phase | Name | Behavior |
|---|---|---|
| 1 | setup | ValidateOptions, ActivateServiceAccounts, PrewarmTUFCache |
| 2 | plan | ParseManifests, GetPromotionEdges. Stops early if --parse-only. |
| 3 | provenance | Optional provenance verification via Verifier interface. Skipped when --require-provenance=false (default). |
| 4 | validate | ValidateStagingSignatures. Stops early if not --confirm (dry-run). |
| 5 | promote | Copy images. Rebalances rate budget to give signing 100% capacity. |
| 6 | sign | Cosign signing + signature replication. |
| 7 | attest | Copy pre-generated SBOMs from staging to production, generate promotion provenance. |
Each phase gets its own rate limit budget and error handling. The pipeline engine is generic (promoter/image/pipeline/) — phases are closures in promoter.go that capture shared state.
#1130 — Security risks of breaking down the image-promoter
- Unsigned window: Images exist in production unsigned between the promote and sign phases.
  - Mitigation: Signing runs immediately after promotion with budget rebalancing. The window is minutes, not hours. This is already the current behavior.
- Partial failure recovery: If promotion succeeds but signing fails, some images are unsigned.
  - Mitigation: Signing is idempotent — `SignImages()` skips images with existing signatures. A follow-up signing job completes the work.
- Credential scope: Use a single `auth.IdentityTokenProvider` injected into all phases.
- Race conditions: `crane.Copy` is idempotent (digest-based). Two promoters copying the same digest are harmless.
- Supply chain: All phases run in the same process. The pipeline is an in-process abstraction, not a distributed system.
#1131 — Do we verify image digest and does the reference exist?
- Digest format: Validated via the regex `^sha256:[0-9a-f]{64}$`
- Reference existence: Checked during inventory — the staging registry is read and digest existence is confirmed
- Provenance: Optional verification via the `provenance.Verifier` interface. `CosignVerifier` checks SLSA attestation tags and verifies the builder/source repo against policy. Enabled with `--require-provenance`.
- Vulnerability scanning: Optional via the `vuln.Scanner` interface. `GrafeasScanner` wraps GCP Container Analysis; `NoopScanner` for non-GCP. Controlled by `--vuln-severity-threshold`.
#1132 — Formalize the artifact validation process
The formal process maps directly to the pipeline phases:
1. MANIFEST VALIDATION (plan phase)
- Parse YAML manifest, validate digest/tag format, registry names, overlapping edges
2. INVENTORY CHECK (plan phase)
- Read staging/production registries, compute promotion edges (set difference)
3. PROVENANCE VERIFICATION (provenance phase, optional)
- Check SLSA attestation on staging images
- Verify builder identity and source repo against allowed lists
4. SIGNATURE VALIDATION (validate phase)
- Verify cosign signatures exist on staging images
5. VULNERABILITY SCANNING (optional, via vuln.Scanner interface)
- Scan staging images for CVEs above severity threshold
6. PROMOTION (promote phase)
- Copy images from staging to production, rate-limited
7. SIGNING (sign phase)
- Sign promoted images, replicate signatures to mirrors, rate-limited
8. ATTESTATION (attest phase)
- Copy SBOMs from staging to production (cosign tag convention)
- Generate SLSA v1.0 promotion provenance (--generate-promotion-provenance)
Steps 3, 4, 5 are opt-in gates.
#1133 — User research
A survey was conducted by @lasomethingsomething, but conclusions were never drawn from it. Key user categories:
- Hyperscalers (build from source, less concerned with promotion integrity)
- Custom installer builders (KubeSpray etc., need signed images)
- End users building K8s environments (need trust chain)
- Sovereign cloud operators (need SLSA compliance)
All new validation gates default to off for backwards compatibility.
Architecture
```mermaid
graph TD
    P["promoter.go (orchestrator)"]
    P --> PI["PromoteImages()"]
    P --> SN["Snapshot()"]
    P --> SS["SecurityScan()"]
    PI --> PE["Pipeline Engine"]
    PE --> S1["setup"]
    PE --> S2["plan"]
    PE --> S3["provenance"]
    PE --> S4["validate"]
    PE --> S5["promote"]
    PE --> S6["sign"]
    PE --> S7["attest"]
    PI & SN & SS --> IMPL["promoterImplementation interface"]
    IMPL --> RP["registry.Provider"]
    IMPL --> AUTH["auth.IdentityTokenProvider + ServiceActivator"]
    IMPL --> VS["vuln.Scanner"]
    RP --> CRANE["CraneProvider (go-containerregistry)"]
    AUTH --> GCP["GCP Auth (gcloud CLI)"]
    VS --> GRAFEAS["GrafeasScanner (Container Analysis)"]
    IMPL -.-> RL["ratelimit.Budget (70/30 promote/sign)"]
    IMPL -.-> PROV["provenance.Verifier + Generator (SLSA v1.0)"]
```

Shared types kept from legacy: `schema.Manifest`, `registry.RegInvImage`, `registry.Context`
Documentation
The existing docs in docs/ have been updated to reflect the rewrite:
- `docs/image-promotion.md` — OCI-generic support, pipeline phases, CLI flags (`--require-provenance`, `--allowed-builders`, `--allowed-source-repos`, `--generate-promotion-provenance`), rate limiting, signing, SBOM copying, provenance generation.
- `docs/checks.md` — `vuln.Scanner` interface, portable severity model, `--vuln-severity-threshold`.