Conversation
Consolidated Tests Results 2026-04-02 - 11:35:16Test ResultsDetails
test-reporter: Run #3724
🎉 All tests passed!TestsView All Tests
🍂 No flaky tests in this run. Github Test Reporter by CTRF 💚 🔄 This comment has been updated |
…ai/kms into fred/ci/testing-rolling-update-mix
maksymsur
left a comment
There was a problem hiding this comment.
LGTM! Interesting approach
There was a problem hiding this comment.
Pull request overview
Adds a GitHub Actions workflow and supporting CI scripts to exercise mixed-version rolling upgrades for the thresholdWithEnclave KMS deployment, including per-party AWS KMS policy updates and TLS attestation compatibility.
Changes:
- Introduce a dedicated
rolling-upgrade-testingGitHub Actions workflow that deploys an all-old cluster, performs two upgrade waves (5/13 then 9/13), and runs Argo perf tests in mixed-version states. - Add
ci/scripts/rolling_upgrade.shplus new helper functions to update TKMS infra (per-party PCR0) and Helm-release upgrades with dualtrustedReleases. - Update perf-testing values and workflow documentation to support the new rolling-upgrade test path.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
ci/scripts/rolling_upgrade.sh |
New script orchestrating PCR extraction + TKMS infra update + selective Helm upgrades for rolling upgrade tests. |
ci/scripts/lib/kms_deployment.sh |
Adds upgrade_parties() to update trustedReleases cluster-wide and apply per-party tag/chart selection. |
ci/scripts/lib/infrastructure.sh |
Adds TKMS infra upgrade helper and readiness wait for per-party key-policy reconciliation. |
ci/scripts/deploy.sh |
Adjusts deploy logging to print TLS value. |
ci/perf-testing/thresholdWithEnclave/kms-ci/kms-service/values-kms-service-init-kms-enclave-ci.yaml |
Adds kubeUtils image configuration needed by the perf deployment. |
ci/perf-testing/thresholdWithEnclave/kms-ci/kms-service/values-kms-service-gen-keys-kms-enclave-ci.yaml |
Adds kubeUtils image configuration needed by the perf deployment. |
ci/perf-testing/thresholdWithEnclave/kms-ci/kms-service/values-kms-enclave-ci.yaml |
Adds kubeUtils image configuration needed by the perf deployment. |
ci/perf-testing/argo-workflow/thresholdWithEnclave-rolling-upgrade-workflow-kms-enclave-ci.yaml |
New Argo workflow for decrypt-only perf runs against mixed-version clusters. |
.github/zizmor.yml |
Adds zizmor rule ignores for the new rolling-upgrade workflow lines. |
.github/workflows/rolling-upgrade-testing.yml |
New end-to-end manual workflow to deploy, roll forward in batches, run Argo tests, collect logs, and clean up. |
.github/workflows/README.md |
Documents the new rolling upgrade workflow and its parameters/jobs. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if ! command -v docker &> /dev/null; then | ||
| log_error "Docker is required to fetch PCR values from images" | ||
| exit 1 | ||
| fi | ||
|
|
||
| local IMAGE_REPO="hub.zama.org/ghcr/zama-ai/kms" | ||
| local FULL_IMAGE="${IMAGE_REPO}/core-service-enclave:${tag}" | ||
|
|
||
| log_info "Pulling ${FULL_IMAGE}..." | ||
| docker pull "${FULL_IMAGE}" > /dev/null 2>&1 || { | ||
| log_error "Failed to pull image: ${FULL_IMAGE}" | ||
| exit 1 | ||
| } | ||
|
|
||
| local pcr0 pcr1 pcr2 | ||
| pcr0=$(docker inspect "${FULL_IMAGE}" | jq -r '.[0].Config.Labels["zama.kms.eif_pcr0"]' || echo "") | ||
| pcr1=$(docker inspect "${FULL_IMAGE}" | jq -r '.[0].Config.Labels["zama.kms.eif_pcr1"]' || echo "") | ||
| pcr2=$(docker inspect "${FULL_IMAGE}" | jq -r '.[0].Config.Labels["zama.kms.eif_pcr2"]' || echo "") | ||
|
|
||
| if [[ -z "${pcr0}" || "${pcr0}" == "null" ]]; then | ||
| log_error "Failed to extract PCR0 from image ${FULL_IMAGE}" | ||
| exit 1 |
There was a problem hiding this comment.
fetch_pcrs_for_tag relies on jq to extract PCR labels but only checks for Docker. If jq is missing (or any of the labels are absent), PCR1/PCR2 can become empty/"null" and the script will continue, leading to invalid trustedReleases/attestation config. Add an explicit jq availability check and validate all required PCRs (0/1/2) are non-empty and not "null" before proceeding.
| eval "export ${prefix}_PCR0='${pcr0}'" | ||
| eval "export ${prefix}_PCR1='${pcr1}'" | ||
| eval "export ${prefix}_PCR2='${pcr2}'" | ||
|
|
There was a problem hiding this comment.
The eval "export ${prefix}_PCR*='...'" pattern is risky and unnecessary. Prefer printf -v / declare -g to assign to the dynamic variable names and then export them, which avoids eval and makes quoting rules clearer.
| eval "export ${prefix}_PCR0='${pcr0}'" | |
| eval "export ${prefix}_PCR1='${pcr1}'" | |
| eval "export ${prefix}_PCR2='${pcr2}'" | |
| local var_pcr0="${prefix}_PCR0" | |
| local var_pcr1="${prefix}_PCR1" | |
| local var_pcr2="${prefix}_PCR2" | |
| printf -v "${var_pcr0}" '%s' "${pcr0}" | |
| printf -v "${var_pcr1}" '%s' "${pcr1}" | |
| printf -v "${var_pcr2}" '%s' "${pcr2}" | |
| export "${var_pcr0}" "${var_pcr1}" "${var_pcr2}" |
| if [[ -z "${PARTIES_TO_UPGRADE}" ]]; then | ||
| log_error "--parties-to-upgrade is required (comma-separated party IDs)" | ||
| exit 1 | ||
| fi | ||
| if [[ -z "${ALL_UPGRADED_PARTIES}" ]]; then | ||
| ALL_UPGRADED_PARTIES="${PARTIES_TO_UPGRADE}" | ||
| fi |
There was a problem hiding this comment.
--parties-to-upgrade is required, but the script never uses PARTIES_TO_UPGRADE beyond logging and defaulting --all-upgraded-parties. This is confusing for callers and makes it easy to pass inconsistent inputs. Consider either removing --parties-to-upgrade (derive it from --all-upgraded-parties), or actually use it for the delta upgrade while keeping --all-upgraded-parties for cumulative policy/TLS config.
| log_info "Core Tag: ${KMS_CORE_TAG}" | ||
| log_info "Client Tag: ${KMS_CLIENT_TAG}" | ||
| log_info "TLS: ${ENABLE_TLS:-false}" | ||
| log_info "TLS: ${TLS}" |
There was a problem hiding this comment.
This log prints the TLS input variable, but the actual effective deployment toggle is ENABLE_TLS (derived later in deploy_kms). Logging TLS here can be misleading when users pass --enable-tls/--disable-tls (which set ENABLE_TLS). Consider logging the resolved/active TLS setting instead (or both).
| log_info "TLS: ${TLS}" | |
| log_info "TLS (input): ${TLS}" | |
| log_info "TLS (effective): ${ENABLE_TLS:-${TLS}}" |
| # Performs a partial upgrade of KMS parties during a rolling upgrade test. | ||
| # First updates trustedReleases on ALL parties (so they accept TLS from both | ||
| # old and new versions), then upgrades only the targeted parties' image tags. | ||
| # | ||
| # Required env vars: NAMESPACE, TARGET, DEPLOYMENT_TYPE, NUM_PARTIES, | ||
| # HELM_RELEASE_PREFIX, ENABLE_TLS | ||
| # Arguments: | ||
| # $1 - new image tag for upgraded parties | ||
| # $2 - old image tag for non-upgraded parties | ||
| # $3 - comma-separated list of party IDs to upgrade (e.g. "1,2,3,4,5") | ||
| # $4 - old PCR0 value | ||
| # $5 - old PCR1 value | ||
| # $6 - old PCR2 value | ||
| # $7 - new PCR0 value | ||
| # $8 - new PCR1 value | ||
| # $9 - new PCR2 value | ||
| # $10 - old KMS chart version (for non-upgraded parties) | ||
| # $11 - new KMS chart version (for upgraded parties) | ||
| #============================================================================= | ||
| upgrade_parties() { | ||
| local new_tag="$1" | ||
| local old_tag="$2" | ||
| local parties_csv="$3" | ||
| local old_pcr0="$4" |
There was a problem hiding this comment.
The function/docs say $3 is the list of party IDs to upgrade, but the implementation treats it as the full set of parties that should be on the new version (it selects party_tag/party_chart_version from that set). Please rename parties_csv (e.g., upgraded_parties_csv) and update the docstring so callers don't accidentally pass only the delta batch (which would cause previously-upgraded parties to be reverted).
| helm_upgrade_with_version "${release_name}" "${helm_chart_location}" \ | ||
| "${HELM_ARGS[@]}" \ | ||
| "${wait_args[@]}" & | ||
| done | ||
|
|
||
| log_info "Waiting for all ${NUM_PARTIES} party upgrades to complete..." | ||
| wait | ||
|
|
There was a problem hiding this comment.
Helm upgrades are launched in the background and then a bare wait is used. wait returns only the last job's status, so failures in earlier upgrades can be missed and the script may proceed with a partially-upgraded cluster. Track PIDs and check each exit status (or use wait -n in a loop) so any failed party upgrade fails the rolling-upgrade step.
| IFS=',' read -ra UPGRADED_IDS <<< "${upgraded_parties_csv}" | ||
| local override_idx=0 | ||
| for party_id in "${UPGRADED_IDS[@]}"; do | ||
| party_id=$(echo "${party_id}" | tr -d ' ') | ||
| OVERRIDE_ARGS+=( | ||
| --set "kmsParties.partyOverrides[${override_idx}].partyIndex=${party_id}" | ||
| --set "kmsParties.partyOverrides[${override_idx}].awsKms.recipientAttestationImageSHA384=${new_pcr0}" | ||
| ) | ||
| override_idx=$((override_idx + 1)) | ||
| done |
There was a problem hiding this comment.
If upgraded_parties_csv contains empty entries (e.g., trailing comma or an explicitly empty input), this loop will generate partyIndex= overrides and likely make the Helm upgrade fail in a non-obvious way. Filter out empty IDs (and ideally validate they are numeric and within NUM_PARTIES) before appending to OVERRIDE_ARGS.
| kubectl wait --for=condition=ready Kmsparties "${resource_name}" \ | ||
| -n "${NAMESPACE}" --timeout=300s || { | ||
| log_warn "Timeout waiting for ${resource_name}, checking status..." | ||
| kubectl get Kmsparties "${resource_name}" -n "${NAMESPACE}" -o yaml || true |
There was a problem hiding this comment.
On timeout, kubectl wait failure is swallowed by the || { ... } block, so the function always continues and the rolling upgrade can proceed even though key policy reconciliation never became Ready. If the intent is to gate the rest of the workflow on policy updates, return a non-zero status after the debug dump (or re-exit 1) so CI fails fast when reconciliation doesn't complete.
| kubectl get Kmsparties "${resource_name}" -n "${NAMESPACE}" -o yaml || true | |
| kubectl get Kmsparties "${resource_name}" -n "${NAMESPACE}" -o yaml || true | |
| return 1 |
| # Purpose | ||
| # ------- | ||
| # Validates partial rolling upgrades of KMS enclave nodes (thresholdWithEnclave) | ||
| # on the aws-perf cluster (namespace kms-ci). The cluster must behave correctly | ||
| # while some parties run the old image and others the new image. | ||
| # | ||
| # What mixed-version state exercises | ||
| # ---------------------------------- | ||
| # - Per-party AWS KMS key policy (recipientAttestationImageSHA384) | ||
| # - Dual trustedReleases PCR entries for TLS peer attestation | ||
| # - Per-party image tag upgrades (via ci/scripts/rolling_upgrade.sh) | ||
| # |
There was a problem hiding this comment.
The PR title mentions "kind testing", but this workflow (and the new scripts) target the aws-perf cluster via Tailscale and Argo. Either update the PR title/description to match the actual scope, or adjust the implementation if kind-based rolling upgrade testing was intended.
Description of changes
Issue ticket number and link
PR Checklist
I attest that all checked items are satisfied. Any deviation is clearly justified above.
chore: ...).TODO(#issue).unwrap/expect/paniconly in tests or for invariant bugs (documented if present).devopslabel + infra notified + infra-team reviewer assigned.!and affected teams notified.Zeroize+ZeroizeOnDropimplemented.unsafe; if unavoidable: minimal, justified, documented, and test/fuzz covered.Dependency Update Questionnaire (only if deps changed or added)
Answer in the
Cargo.tomlnext to the dependency (or here if updating):More details and explanations for the checklist and dependency updates can be found in CONTRIBUTING.md