fix(gke-kubeconfig): skip gracefully when gcloud is unavailable #345
Closes: #344 Assisted-by: claude-code/claude-opus-4-8
Pull request overview
This PR updates the GitLab CI .gke-kubeconfig template to avoid failing non-deploy jobs when ENABLE_GCP_WIF=1 and K8S_CLUSTER_NAME are set globally, by skipping gracefully when gcloud is missing (or present but unauthenticated). It also documents the new branching behavior in the associated OpenSpec change artifacts and records the fix in the changelog.
Changes:
- Refactors `templates/functions/gke-kubeconfig.yml` to gate execution on `ENABLE_GCP_WIF=1 && K8S_CLUSTER_NAME`, and to skip with exit 0 when `gcloud` is unavailable or unauthenticated.
- Extends the OpenSpec spec/tasks to cover the new skip/fail-fast scenarios.
- Adds a changelog entry describing the behavior change and its motivation.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| templates/functions/gke-kubeconfig.yml | Adds a gcloud presence check and rewrites the main decision ladder to skip gracefully when appropriate while preserving fail-fast for real generation errors. |
| openspec/changes/wif-gke-kubeconfig/tasks.md | Records validation coverage for the “no gcloud in image” skip path and preserved fail-fast behavior. |
| openspec/changes/wif-gke-kubeconfig/specs/wif-gke-kubeconfig/spec.md | Documents explicit scenarios for skipping on missing/unauthenticated gcloud and failing fast on real errors. |
| CHANGELOG.md | Adds a “Fixed” entry describing the new graceful-skip behavior and its impact on pipelines. |
    else
    echo "GKE kubeconfig generation skipped due to missing gcloud authentication."
    ENABLE_GCP_WIF="${ENABLE_GCP_WIF:-0}"
    if [ "${ENABLE_GCP_WIF}" = "1" ] && [ -n "${K8S_CLUSTER_NAME:-}" ]; then
I wouldn't link the generation of the kubeconfig to the federation. The principal running the runner might not need the federation but might already have the permissions to obtain a kubeconfig.
🤖 This was written by an AI agent on behalf of @paolomainardi.
Good point, agreed. Generation is now decoupled from ENABLE_GCP_WIF in commit a6cb2ff. The gate is K8S_CLUSTER_NAME alone, which signals cluster intent, and gcloud can be authenticated by any method (the runner's own service account, a service account key, or WIF). When gcloud is unavailable or unauthenticated the template skips without failing the job, and it only fails fast when gcloud is authenticated but a required variable is missing or get-credentials fails. The file header, the OpenSpec spec, and design Decision 4 were updated to reflect that cluster access and federation are separate concerns.
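The gating described in this reply can be sketched roughly as follows. This is a minimal illustration, not the contents of `templates/functions/gke-kubeconfig.yml`: the helper names `check_gcloud` and `check_gcloud_auth` come from the PR description, but their bodies, the wrapper function `gke_kubeconfig`, the log messages, the exact set of required variables, and the use of `--location` are all assumptions. The sketch returns status codes from a function instead of calling `exit`, so it can be sourced and exercised without ending the shell.

```shell
#!/bin/sh
# Sketch only: bodies and messages are assumptions, not the template's code.

# Pure predicate: is the gcloud command resolvable at all?
check_gcloud() {
  command -v gcloud >/dev/null 2>&1
}

# Pure predicate: does gcloud report at least one active account?
# Any authentication method counts (runner service account, key file, WIF).
check_gcloud_auth() {
  [ -n "$(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null)" ]
}

# Decision ladder: returns 0 on skip or success, 1 on a real error.
gke_kubeconfig() {
  if [ -z "${K8S_CLUSTER_NAME:-}" ]; then
    echo "K8S_CLUSTER_NAME is not set, skipping GKE kubeconfig generation."
    return 0
  fi
  if ! check_gcloud; then
    echo "gcloud is not available in this image, skipping GKE kubeconfig generation."
    return 0
  fi
  if ! check_gcloud_auth; then
    echo "gcloud is not authenticated, skipping GKE kubeconfig generation."
    return 0
  fi
  # From here on the job is treated as a real deploy job: fail fast.
  # Assumption: these three variables are the required ones.
  for var in K8S_LOCATION GCP_PROJECT_ID KUBE_NAMESPACE; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "ERROR: $var is required to generate the GKE kubeconfig." >&2
      return 1
    fi
  done
  if ! gcloud container clusters get-credentials "$K8S_CLUSTER_NAME" \
      --location "$K8S_LOCATION" --project "$GCP_PROJECT_ID"; then
    echo "ERROR: gcloud container clusters get-credentials failed." >&2
    return 1
  fi
  # Scope the generated context to the target namespace.
  kubectl config set-context --current --namespace="$KUBE_NAMESPACE"
}
```

Keeping the two helpers silent and letting the ladder print every message means each path logs exactly one line, which matches the stated goal of avoiding duplicate log output.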
Summary
The `.gke-kubeconfig` template runs in the `before_script` of every job through the `.global-setup` chain. When a project sets `K8S_CLUSTER_NAME` (and the related `K8S_LOCATION`, `GCP_PROJECT_ID`, `KUBE_NAMESPACE`) as project-level CI/CD variables, those apply to all jobs, including build and test jobs whose image does not ship `gcloud`/`kubectl`. Those jobs previously failed in `before_script` because the template called `exit 1` when `gcloud` was missing, even though they never needed cluster access.

This aligns `.gke-kubeconfig` with the existing `gcp-wif.yml` behaviour, which skips gracefully when `gcloud` is unavailable. No per-job overrides are required, so projects that set the variables globally do not need to rewrite their jobs.

Following review feedback, kubeconfig generation is decoupled from `ENABLE_GCP_WIF`. The principal running the job may already hold the permissions to fetch a kubeconfig without Workload Identity Federation (for example via the runner's own service account or a service account key), so the template no longer requires WIF. Cluster intent is signalled by `K8S_CLUSTER_NAME` alone, and gcloud authentication can come from any method. This is consistent with the original design decision that WIF authentication and cluster access are separate concerns.

Behaviour
Gated on `K8S_CLUSTER_NAME` being non-empty:
- `gcloud` not available in the job image: print a skip message, exit 0.
- `gcloud` present but not authenticated: print a skip message, exit 0.
- `gcloud` authenticated but a required variable is missing, or `gcloud container clusters get-credentials` fails: print an error, exit 1 (fail fast for real deploy jobs).
- `gcloud` authenticated and all variables present: generate the kubeconfig and scope it to `$KUBE_NAMESPACE`.

`K8S_CLUSTER_NAME` unset: print a skip message, exit 0.

Changes
- `templates/functions/gke-kubeconfig.yml`:
  - `check_gcloud()` is a pure predicate (command exists check); `check_gcloud_auth()` only checks for an active account. Helpers no longer echo on the skip paths, so the main decision ladder owns the user-facing messages (avoids duplicate log lines, consistent with `gcp-wif.yml`).
  - Gated on `K8S_CLUSTER_NAME`, decoupled from `ENABLE_GCP_WIF`.
- `openspec/changes/wif-gke-kubeconfig/specs/wif-gke-kubeconfig/spec.md`: updated the gating requirement to `K8S_CLUSTER_NAME` plus gcloud capability, with scenarios for the skip and fail-fast paths.
- `openspec/changes/wif-gke-kubeconfig/design.md`: revised Decision 4 to gate on `K8S_CLUSTER_NAME` alone and explain why generation is not tied to WIF.
- `openspec/changes/wif-gke-kubeconfig/tasks.md`: recorded the validation.
- `CHANGELOG.md`: added a Fixed entry.

Testing
Simulated all five paths locally with stubbed `gcloud`/`kubectl`, with no `ENABLE_GCP_WIF` set:
- `K8S_CLUSTER_NAME` unset: skip, exit 0.
- `K8S_CLUSTER_NAME` set, no `gcloud` on PATH: skip, exit 0.
- `gcloud` authenticated with all variables and successful `get-credentials`: exit 0.
- `gcloud` authenticated with a missing variable: exit 1.
- `gcloud` authenticated with all variables but failing `get-credentials`: exit 1.

Related
Closes #344
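
For anyone who wants to reproduce the local simulation listed under Testing, one possible way to stub `gcloud` is to put a fake executable on `PATH`. The PR does not show how the stubs were built, so this is only a sketch: `make_stub_dir`, `STUB_ACCOUNT`, and `STUB_GET_CREDENTIALS_RC` are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch only: make_stub_dir and the STUB_* variables are hypothetical helpers.

# Create a throwaway directory holding a fake gcloud and print its path.
make_stub_dir() {
  dir=$(mktemp -d) || return 1
  cat > "$dir/gcloud" <<'EOF'
#!/bin/sh
# Fake gcloud driven by environment variables:
#   STUB_ACCOUNT            account printed for "gcloud auth ..." (empty = unauthenticated)
#   STUB_GET_CREDENTIALS_RC exit code for "gcloud container ..." (default 0)
case "$1" in
  auth)      [ -n "${STUB_ACCOUNT:-}" ] && echo "$STUB_ACCOUNT"; exit 0 ;;
  container) exit "${STUB_GET_CREDENTIALS_RC:-0}" ;;
esac
exit 0
EOF
  chmod +x "$dir/gcloud"
  echo "$dir"
}

# Usage: run the commands under test in a subshell with the stub first on PATH.
stub_dir=$(make_stub_dir)
(
  PATH="$stub_dir:$PATH"; export PATH
  STUB_ACCOUNT="ci@example.com"; export STUB_ACCOUNT
  gcloud auth list   # prints the fake account, simulating an authenticated gcloud
  STUB_GET_CREDENTIALS_RC=1 gcloud container clusters get-credentials demo \
    || echo "get-credentials failed as configured"
)
rm -rf "$stub_dir"
```

Leaving `STUB_ACCOUNT` empty simulates an unauthenticated `gcloud`, and omitting the stub directory from `PATH` covers the "no `gcloud` on PATH" case, provided no real `gcloud` is installed on the machine.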