
fix(gke-kubeconfig): skip gracefully when gcloud is unavailable#345

Merged
paolomainardi merged 4 commits into master from fix/gke-kubeconfig-skip-without-gcloud on Jun 18, 2026

@paolomainardi paolomainardi commented Jun 18, 2026

Member

🤖 This was written by an AI agent on behalf of @paolomainardi.

Summary

The .gke-kubeconfig template runs in the before_script of every job through the .global-setup chain. When a project sets K8S_CLUSTER_NAME (and the related K8S_LOCATION, GCP_PROJECT_ID, KUBE_NAMESPACE) as project-level CI/CD variables, those apply to all jobs, including build and test jobs whose image does not ship gcloud/kubectl. Those jobs previously failed in before_script because the template called exit 1 when gcloud was missing, even though they never needed cluster access.

This aligns .gke-kubeconfig with the existing gcp-wif.yml behaviour, which skips gracefully when gcloud is unavailable. No per-job overrides are required, so projects that set the variables globally do not need to rewrite their jobs.

Following review feedback, kubeconfig generation is decoupled from ENABLE_GCP_WIF. The principal running the job may already hold the permissions to fetch a kubeconfig without Workload Identity Federation (for example via the runner's own service account or a service account key), so the template no longer requires WIF. Cluster intent is signalled by K8S_CLUSTER_NAME alone, and gcloud authentication can come from any method. This is consistent with the original design decision that WIF authentication and cluster access are separate concerns.

Behaviour

Gated on K8S_CLUSTER_NAME being non-empty:

  • gcloud not available in the job image: print a skip message, exit 0.
  • gcloud present but not authenticated: print a skip message, exit 0.
  • gcloud authenticated but a required variable is missing, or gcloud container clusters get-credentials fails: print an error, exit 1 (fail fast for real deploy jobs).
  • gcloud authenticated and all variables present: generate the kubeconfig and scope it to $KUBE_NAMESPACE.

K8S_CLUSTER_NAME unset: print a skip message, exit 0.
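
The ladder described above can be sketched roughly as follows. This is a minimal illustration, not the template's actual code: the helper bodies, the `gke_kubeconfig_main` and `generate_kubeconfig` names, the `--location`/`--project` flag choice, and most message wordings are assumptions. It uses `return` so it can be sourced and tested; a before_script would use `exit` instead.

```shell
#!/bin/sh
# Sketch of the skip-vs-fail ladder. Helper bodies are assumptions.

check_gcloud() {
  # Pure predicate: is a gcloud binary on PATH?
  command -v gcloud >/dev/null 2>&1
}

check_gcloud_auth() {
  # Assumed check: gcloud reports at least one active account.
  [ -n "$(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null)" ]
}

check_gke_env() {
  # Every variable needed to fetch and scope the kubeconfig must be set.
  [ -n "${K8S_CLUSTER_NAME:-}" ] && [ -n "${K8S_LOCATION:-}" ] &&
    [ -n "${GCP_PROJECT_ID:-}" ] && [ -n "${KUBE_NAMESPACE:-}" ]
}

generate_kubeconfig() {
  # Hypothetical helper: fetch credentials, then scope the context.
  gcloud container clusters get-credentials "${K8S_CLUSTER_NAME}" \
    --location "${K8S_LOCATION}" --project "${GCP_PROJECT_ID}" &&
    kubectl config set-context --current --namespace "${KUBE_NAMESPACE}"
}

gke_kubeconfig_main() {
  if [ -z "${K8S_CLUSTER_NAME:-}" ]; then
    echo "GKE kubeconfig generation skipped: K8S_CLUSTER_NAME is not set."
    return 0
  elif ! check_gcloud; then
    echo "GKE kubeconfig generation skipped: gcloud is not available."
    return 0
  elif ! check_gcloud_auth; then
    echo "GKE kubeconfig generation skipped: gcloud is not authenticated."
    return 0
  elif ! check_gke_env; then
    echo "GKE kubeconfig generation failed: required variables missing." >&2
    return 1
  elif ! generate_kubeconfig >/dev/null; then
    echo "GKE kubeconfig generation failed: get-credentials returned an error." >&2
    return 1
  fi
  echo "GKE kubeconfig generated for namespace ${KUBE_NAMESPACE}."
}

# Demo: with no cluster intent the ladder skips and succeeds.
( unset K8S_CLUSTER_NAME; gke_kubeconfig_main )
```

Keeping the helpers silent and letting the ladder print the messages means each path logs exactly one line.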

Changes

  • templates/functions/gke-kubeconfig.yml:
    • Added a check_gcloud() pure predicate (command exists check); check_gcloud_auth() only checks for an active account. Helpers no longer echo on the skip paths, so the main decision ladder owns the user-facing messages (avoids duplicate log lines, consistent with gcp-wif.yml).
    • Rewrote the main execution block into a decision ladder gated on K8S_CLUSTER_NAME, decoupled from ENABLE_GCP_WIF.
    • Documented the WIF decoupling and the skip-vs-fail policy in the file header.
  • openspec/changes/wif-gke-kubeconfig/specs/wif-gke-kubeconfig/spec.md: updated the gating requirement to K8S_CLUSTER_NAME plus gcloud capability, with scenarios for the skip and fail-fast paths.
  • openspec/changes/wif-gke-kubeconfig/design.md: revised Decision 4 to gate on K8S_CLUSTER_NAME alone and explain why generation is not tied to WIF.
  • openspec/changes/wif-gke-kubeconfig/tasks.md: recorded the validation.
  • CHANGELOG.md: added a Fixed entry.

Testing

Simulated all five paths locally with stubbed gcloud/kubectl, with no ENABLE_GCP_WIF set:

  • K8S_CLUSTER_NAME unset: skip, exit 0.
  • K8S_CLUSTER_NAME set, no gcloud on PATH: skip, exit 0.
  • gcloud authenticated with all variables and successful get-credentials: exit 0.
  • gcloud authenticated with a missing variable: exit 1.
  • gcloud authenticated with all variables but failing get-credentials: exit 1.
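
One way such a local simulation could be set up is by controlling PATH so that a fake gcloud is the only one visible. The sketch below is an assumption about the approach, not the script used for this PR: `make_stub_dir`, `probe`, and the `STUB_ACCOUNT` / `STUB_GET_CREDENTIALS_RC` variables are hypothetical names, and `probe` is only a stand-in for the template's checks.

```shell
#!/bin/sh
# Sketch: emulate "gcloud present" and "gcloud missing" images via PATH.

make_stub_dir() {
  # Create a directory with a fake gcloud driven by STUB_* variables.
  dir=$(mktemp -d) || return 1
  cat > "${dir}/gcloud" <<'EOF'
#!/bin/sh
case "$1 $2" in
  "auth list") [ -n "${STUB_ACCOUNT:-}" ] && echo "${STUB_ACCOUNT}" ;;
  "container clusters") exit "${STUB_GET_CREDENTIALS_RC:-0}" ;;
esac
exit 0
EOF
  chmod +x "${dir}/gcloud"
  echo "${dir}"
}

probe() {
  # Stand-in for the template: report which path would be taken.
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "no-gcloud"
  elif [ -z "$(gcloud auth list 2>/dev/null)" ]; then
    echo "unauthenticated"
  elif ! gcloud container clusters get-credentials demo >/dev/null 2>&1; then
    echo "get-credentials-failed"
  else
    echo "ok"
  fi
}

stub_dir=$(make_stub_dir)
empty_dir=$(mktemp -d)

# Each case runs in a subshell so PATH changes do not leak.
( PATH="${empty_dir}"; probe )   # prints "no-gcloud"
( PATH="${stub_dir}"; STUB_ACCOUNT="ci@example.test"; export STUB_ACCOUNT; probe )   # prints "ok"
( PATH="${stub_dir}"; STUB_ACCOUNT="ci@example.test"; STUB_GET_CREDENTIALS_RC=1
  export STUB_ACCOUNT STUB_GET_CREDENTIALS_RC; probe )   # prints "get-credentials-failed"

rm -rf "${stub_dir}" "${empty_dir}"
```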

Related

Closes #344

Closes: #344
Assisted-by: claude-code/claude-opus-4-8
Copilot AI review requested due to automatic review settings June 18, 2026 15:55
@sparkfabrik-ai-bot


PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis ✅

344 - PR Code Verified

Compliant requirements:

  • Skip gracefully when gcloud is not available in the job image
  • Skip gracefully when gcloud is present but not authenticated
  • Fail fast when gcloud is available and authenticated but a required variable is missing or get-credentials fails
  • Restored ENABLE_GCP_WIF=1 gate alongside K8S_CLUSTER_NAME check

Requires further human verification:

  • End-to-end pipeline testing: confirming that jobs without gcloud actually skip without failing in a real GitLab CI environment (task 4.6 is marked as "simulated locally" only)
  • Confirming that jobs with gcloud authenticated and all variables set still successfully generate the kubeconfig
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Silent skip on auth failure

When gcloud is present but not authenticated (check_gcloud_auth returns non-zero), the template now silently skips kubeconfig generation and exits 0. If WIF authentication was supposed to run earlier in the chain but silently failed, deploy jobs will proceed without a valid kubeconfig and fail later with a less obvious error (e.g., kubectl commands failing mid-job). The ticket explicitly calls this out as acceptable, but it means a misconfigured WIF setup will no longer surface a clear error at the kubeconfig step for jobs that actually need cluster access.

elif ! check_gcloud_auth; then
  echo "GKE kubeconfig generation skipped: gcloud is not authenticated."
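
If the late, less obvious failure described here is a concern, a deploy job could assert that a kubeconfig context exists before it runs any cluster command. The following is only a sketch of that idea and is not part of this PR; `require_kube_context` is a hypothetical helper name and the messages are illustrative.

```shell
#!/bin/sh
# Sketch: explicit guard for jobs that really need cluster access.

require_kube_context() {
  # Fail early, with a clear message, when kubectl is missing.
  if ! command -v kubectl >/dev/null 2>&1; then
    echo "Deploy aborted: kubectl is not available in this image." >&2
    return 1
  fi
  # kubectl exits non-zero when no current context is configured.
  if ! kubectl config current-context >/dev/null 2>&1; then
    echo "Deploy aborted: no active kubeconfig context (was gcloud authenticated?)." >&2
    return 1
  fi
}

# Usage in a deploy script (commented out so sourcing has no side effects):
# require_kube_context || exit 1
```

This keeps the template's skip behaviour for build and test jobs while giving deploy jobs a clear failure point of their own.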

@sparkfabrik-ai-bot


PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
General
Improve error message clarity for missing variables

According to the spec, when gcloud is authenticated but required variables are
missing, the template should "fail fast" with a non-zero exit. However, the
CHANGELOG and spec also state that generation is gated on ENABLE_GCP_WIF=1 and
K8S_CLUSTER_NAME, which is already handled. The current message says "failed" but
the condition is about missing variables — the message should clearly distinguish a
configuration error from a runtime failure to help users debug. Consider updating
the message to be more descriptive about which variables are missing.

templates/functions/gke-kubeconfig.yml [100-102]

 elif ! check_gke_env; then
-      echo "GKE kubeconfig generation failed: required variables missing."
+      echo "GKE kubeconfig generation failed: one or more required variables (K8S_CLUSTER_NAME, K8S_LOCATION, GCP_PROJECT_ID, KUBE_NAMESPACE) are missing."
       exit 1
Suggestion importance[1-10]: 4

Why: The suggestion improves the error message to list the specific required variables, which aids debugging. However, check_gke_env() already prints a descriptive error per the spec, so the outer message is secondary and this is a minor readability improvement.
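
For illustration, a variable check that names exactly which variables are missing might look like the sketch below. The real check_gke_env() body is not shown in this conversation, so this version and its message wording are assumptions.

```shell
#!/bin/sh
# Sketch: report each missing variable instead of a generic failure.

check_gke_env() {
  missing=""
  for name in K8S_CLUSTER_NAME K8S_LOCATION GCP_PROJECT_ID KUBE_NAMESPACE; do
    # Indirect lookup; safe because the names are a fixed list.
    eval "value=\${${name}:-}"
    [ -n "${value}" ] || missing="${missing} ${name}"
  done
  if [ -n "${missing}" ]; then
    echo "Missing required variables:${missing}" >&2
    return 1
  fi
}

# Usage (commented out so sourcing has no side effects):
# check_gke_env || exit 1
```

With only K8S_CLUSTER_NAME and GCP_PROJECT_ID set, this prints `Missing required variables: K8S_LOCATION KUBE_NAMESPACE` to stderr and returns 1.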

Low
Guard against non-standard truthy values for flag

The old code path (before this PR) only gated on K8S_CLUSTER_NAME, meaning jobs with
ENABLE_GCP_WIF unset but K8S_CLUSTER_NAME set would still attempt kubeconfig
generation. The new code correctly gates on both, but the ENABLE_GCP_WIF variable is
set locally just before the check. If ENABLE_GCP_WIF is already exported as a CI/CD
variable with a value other than "1" (e.g. "true" or "yes"), the check will silently
skip. Consider documenting or asserting that only the value "1" is accepted, or
expand the check to handle common truthy values.

templates/functions/gke-kubeconfig.yml [94-95]

 ENABLE_GCP_WIF="${ENABLE_GCP_WIF:-0}"
-  if [ "${ENABLE_GCP_WIF}" = "1" ] && [ -n "${K8S_CLUSTER_NAME:-}" ]; then
+  if [ "${ENABLE_GCP_WIF}" != "1" ] && [ "${ENABLE_GCP_WIF}" != "true" ] && [ -n "${K8S_CLUSTER_NAME:-}" ]; then
+    echo "GKE kubeconfig generation skipped (ENABLE_GCP_WIF must be set to '1', got '${ENABLE_GCP_WIF}')."
+  elif [ "${ENABLE_GCP_WIF}" = "1" ] && [ -n "${K8S_CLUSTER_NAME:-}" ]; then
Suggestion importance[1-10]: 2

Why: The improved_code contains a logical error — the condition [ "${ENABLE_GCP_WIF}" != "1" ] && [ "${ENABLE_GCP_WIF}" != "true" ] && [ -n "${K8S_CLUSTER_NAME:-}" ] would trigger the skip message even when ENABLE_GCP_WIF is "0" and K8S_CLUSTER_NAME is set, which is not the intended behavior. The existing documentation consistently uses "1" as the only accepted value, making this change unnecessary and the proposed code incorrect.
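
To make the logical error concrete: a flag check that warns about unexpected values needs to separate "explicitly disabled" from "unrecognized". The sketch below shows one way to do that. It is hypothetical: `classify_flag` is an invented name, the accepted values are assumptions (the documentation only mentions "1"), and the merged change ended up not gating on ENABLE_GCP_WIF at all.

```shell
#!/bin/sh
# Sketch: three-way classification of a feature flag value.

classify_flag() {
  case "${1:-0}" in
    1|true) echo "enabled" ;;
    0|false) echo "disabled" ;;       # unset or empty also lands here
    *) echo "unrecognized" ;;
  esac
}

case "$(classify_flag "${ENABLE_GCP_WIF:-}")" in
  enabled) echo "WIF flag is enabled." ;;
  disabled) : ;;                       # stay quiet: the user opted out
  unrecognized)
    echo "ENABLE_GCP_WIF has an unrecognized value '${ENABLE_GCP_WIF}'; expected '1'." >&2 ;;
esac
```

Only the unrecognized branch prints a warning, so a value of "0" no longer triggers the misleading skip message.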

Low

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the GitLab CI .gke-kubeconfig template to avoid failing non-deploy jobs when ENABLE_GCP_WIF=1 and K8S_CLUSTER_NAME are set globally, by skipping gracefully when gcloud is missing (or present but unauthenticated). It also documents the new branching behavior in the associated OpenSpec change artifacts and records the fix in the changelog.

Changes:

  • Refactors templates/functions/gke-kubeconfig.yml to gate execution on ENABLE_GCP_WIF=1 && K8S_CLUSTER_NAME, and to skip with exit 0 when gcloud is unavailable or unauthenticated.
  • Extends the OpenSpec spec/tasks to cover the new skip/fail-fast scenarios.
  • Adds a changelog entry describing the behavior change and its motivation.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File | Description
templates/functions/gke-kubeconfig.yml | Adds a gcloud presence check and rewrites the main decision ladder to skip gracefully when appropriate while preserving fail-fast for real generation errors.
openspec/changes/wif-gke-kubeconfig/tasks.md | Records validation coverage for the “no gcloud in image” skip path and preserved fail-fast behavior.
openspec/changes/wif-gke-kubeconfig/specs/wif-gke-kubeconfig/spec.md | Documents explicit scenarios for skipping on missing/unauthenticated gcloud and failing fast on real errors.
CHANGELOG.md | Adds a “Fixed” entry describing the new graceful-skip behavior and its impact on pipelines.

Comment thread templates/functions/gke-kubeconfig.yml
Comment thread templates/functions/gke-kubeconfig.yml
Refs: #344
Assisted-by: claude-code/claude-opus-4-8
Refs: #344
Assisted-by: claude-code/claude-opus-4-8
Comment thread templates/functions/gke-kubeconfig.yml Outdated
else
echo "GKE kubeconfig generation skipped due to missing gcloud authentication."
ENABLE_GCP_WIF="${ENABLE_GCP_WIF:-0}"
if [ "${ENABLE_GCP_WIF}" = "1" ] && [ -n "${K8S_CLUSTER_NAME:-}" ]; then

Contributor

I wouldn't link the generation of the kubeconfig to the federation. The principal running the runner might not need the federation but might already have the permissions to obtain a kubeconfig.

Member Author

🤖 This was written by an AI agent on behalf of @paolomainardi.

Good point, agreed. Generation is now decoupled from ENABLE_GCP_WIF in commit a6cb2ff. The gate is K8S_CLUSTER_NAME alone, which signals cluster intent, and gcloud can be authenticated by any method (the runner's own service account, a service account key, or WIF). When gcloud is unavailable or unauthenticated the template skips without failing the job, and it only fails fast when gcloud is authenticated but a required variable is missing or get-credentials fails. The file header, the OpenSpec spec, and design Decision 4 were updated to reflect that cluster access and federation are separate concerns.

@Monska85 Monska85 left a comment

Contributor

LGTM 👍

@paolomainardi paolomainardi merged commit 484f176 into master Jun 18, 2026
2 checks passed
@paolomainardi paolomainardi deleted the fix/gke-kubeconfig-skip-without-gcloud branch June 18, 2026 22:37
Development

Successfully merging this pull request may close these issues.

GKE kubeconfig before_script fails jobs without gcloud