Commit 484f176

fix(gke-kubeconfig): skip gracefully when gcloud is unavailable (#345)
* fix(gke-kubeconfig): skip gracefully when gcloud is unavailable #344
  Closes: #344
  Assisted-by: claude-code/claude-opus-4-8
* docs(gke-kubeconfig): document WIF gate and skip-vs-fail policy #344
  Refs: #344
  Assisted-by: claude-code/claude-opus-4-8
* refactor(gke-kubeconfig): make gcloud checks pure predicates #344
  Refs: #344
  Assisted-by: claude-code/claude-opus-4-8
* fix(gke-kubeconfig): decouple kubeconfig generation from ENABLE_GCP_WIF #344
  Refs: #344
  Assisted-by: claude-code/claude-opus-4-8
1 parent bb97550 commit 484f176

5 files changed

Lines changed: 58 additions & 35 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -12,6 +12,10 @@ to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 - [2026.05.08] - New portable GitLab CI template `templates/functions/gke-kubeconfig.yml` (`.gke-kubeconfig`) that generates a namespace-scoped GKE kubeconfig using WIF-authenticated gcloud credentials. Activated when `ENABLE_GCP_WIF=1` and `K8S_CLUSTER_NAME` is set. Supports `K8S_USE_DNS_ENDPOINT=1` for private clusters. Runs after `setup-gitlab-agent` so the gcloud context always takes precedence. Remotely includable, no Docker image dependency.
 
+### Fixed
+
+- [2026.06.18] - `.gke-kubeconfig` now skips gracefully (without failing the job) when `gcloud` is not available in the job image or is not authenticated, instead of exiting non-zero. This prevents build and test jobs that inherit the global `before_script` but do not need cluster access from failing when `K8S_CLUSTER_NAME` is set as a global CI/CD variable. Generation is gated on `K8S_CLUSTER_NAME` alone and is no longer coupled to `ENABLE_GCP_WIF`, so any gcloud authentication method (WIF, runner service account, service account key) is supported. It still fails fast when `gcloud` is authenticated but a required variable is missing or `get-credentials` fails.
+
 ### Changed
 
 - [2025.10.23] - add support for additional docker registry via `ADDITIONAL_DOCKER_REGISTRY` variable.

openspec/changes/wif-gke-kubeconfig/design.md

Lines changed: 5 additions & 3 deletions
@@ -64,11 +64,13 @@ Running `.gke-kubeconfig` last — after `setup-gitlab-agent` — ensures the gc
 
 **Rationale:** The image scripts are called by `scripts/kubectl` and `scripts/destroy` for token-based pipelines only. Adding WIF awareness there breaks the portability pattern — the logic should live in a template, not baked into the image. The template approach covers all use cases without image changes.
 
-### 4. Branch condition: `ENABLE_GCP_WIF=1 && K8S_CLUSTER_NAME` is set
+### 4. Branch condition: `K8S_CLUSTER_NAME` is set, gated by gcloud capability
 
-**Decision:** Gate `generate_gke_kubeconfig` on both `ENABLE_GCP_WIF=1` (explicit opt-in) and `K8S_CLUSTER_NAME` being non-empty.
+**Decision:** Gate `generate_gke_kubeconfig` on `K8S_CLUSTER_NAME` being non-empty alone. Do NOT couple it to `ENABLE_GCP_WIF`. When `K8S_CLUSTER_NAME` is set but `gcloud` is unavailable or unauthenticated, skip without failing the job; fail fast only when gcloud is authenticated but a required variable is missing or credential fetching fails.
 
-**Rationale:** Same reasoning as before — `ENABLE_GCP_WIF=1` alone is insufficient since a project might use WIF for GCR/Artifact Registry without needing GKE access. `K8S_CLUSTER_NAME` unambiguously signals cluster intent.
+**Rationale:** `K8S_CLUSTER_NAME` unambiguously signals cluster intent. Tying generation to `ENABLE_GCP_WIF=1` would wrongly exclude principals that are already authenticated to gcloud by other means (the runner's own service account, a service account key, etc.) and only need a kubeconfig, not federation. This is consistent with Decision 1 (WIF authentication and cluster access are separate concerns) and was raised in review of the resilience fix.
+
+**Alternative considered:** Also requiring `ENABLE_GCP_WIF=1`. Rejected because it couples cluster access to a specific authentication method and breaks non-WIF gcloud auth. Resilience for non-deploy jobs is instead provided by skipping gracefully when gcloud is absent or unauthenticated.
 
 ### 5. `K8S_USE_DNS_ENDPOINT` as a boolean flag
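Reviewer note: the skip-versus-fail policy in Decision 4 can be read as a small decision ladder. The following is a minimal sketch, not the template's actual code. `decide_gke_action` and its 0/1 arguments are hypothetical names introduced here so the policy can be exercised without gcloud installed. It does not model a `get-credentials` failure, which the design also treats as a hard failure.

```shell
# Hypothetical helper (not part of the template): maps the inputs of Decision 4
# to one of three outcomes. Flags are "1" (true) or "0" (false).
#   $1 = value of K8S_CLUSTER_NAME (may be empty)
#   $2 = gcloud is on PATH
#   $3 = gcloud has an active account
#   $4 = all required GKE variables are set
decide_gke_action() {
  cluster_name="$1"; has_gcloud="$2"; is_authed="$3"; env_ok="$4"
  if [ -z "${cluster_name}" ]; then
    echo "skip"       # no cluster intent
  elif [ "${has_gcloud}" != "1" ]; then
    echo "skip"       # job image has no Cloud SDK
  elif [ "${is_authed}" != "1" ]; then
    echo "skip"       # no active gcloud account
  elif [ "${env_ok}" != "1" ]; then
    echo "fail"       # intent and capability present, but misconfigured
  else
    echo "generate"
  fi
}

decide_gke_action ""           1 1 1   # prints: skip
decide_gke_action "my-cluster" 0 0 0   # prints: skip
decide_gke_action "my-cluster" 1 1 0   # prints: fail
decide_gke_action "my-cluster" 1 1 1   # prints: generate
```

Note that `ENABLE_GCP_WIF` does not appear among the inputs, which is the point of the decision: the authentication method is irrelevant as long as some account is active.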

openspec/changes/wif-gke-kubeconfig/specs/wif-gke-kubeconfig/spec.md

Lines changed: 16 additions & 8 deletions
@@ -7,21 +7,29 @@ The `templates/functions/gke-kubeconfig.yml` template SHALL define all its logic
 - **WHEN** a project includes `gke-kubeconfig.yml` via `include: remote:` without using the spark-k8s-deployer image
 - **THEN** all functions (`check_gke_env`, `generate_gke_kubeconfig`) SHALL be available in `before_script` without errors
 
-### Requirement: GKE kubeconfig generation gated on ENABLE_GCP_WIF and K8S_CLUSTER_NAME
-The `.gke-kubeconfig` `before_script` SHALL generate a GKE kubeconfig only when `ENABLE_GCP_WIF=1` AND `K8S_CLUSTER_NAME` is non-empty. In all other cases it SHALL skip silently with an informational message.
+### Requirement: GKE kubeconfig generation gated on K8S_CLUSTER_NAME and gcloud capability
+The `.gke-kubeconfig` `before_script` SHALL attempt to generate a GKE kubeconfig whenever `K8S_CLUSTER_NAME` is non-empty. Generation SHALL NOT be tied to `ENABLE_GCP_WIF`: the principal running the job may already hold the permissions to fetch a kubeconfig without Workload Identity Federation, so any gcloud authentication method is supported. When `K8S_CLUSTER_NAME` is set but `gcloud` is unavailable or unauthenticated, the template SHALL skip without failing the job.
 
-#### Scenario: Kubeconfig generated when both conditions are met
-- **WHEN** `ENABLE_GCP_WIF=1` and `K8S_CLUSTER_NAME` is set
+#### Scenario: Kubeconfig generated when cluster intent and gcloud capability are present
+- **WHEN** `K8S_CLUSTER_NAME` is set and `gcloud` is available and authenticated (by any method)
 - **THEN** `generate_gke_kubeconfig` SHALL be called and a valid kubeconfig SHALL be produced
 
-#### Scenario: Skipped when ENABLE_GCP_WIF is not 1
-- **WHEN** `ENABLE_GCP_WIF` is unset, empty, or `"0"`
+#### Scenario: Skipped when K8S_CLUSTER_NAME is absent
+- **WHEN** `K8S_CLUSTER_NAME` is unset or empty
 - **THEN** the template SHALL print a skip message and exit without error
 
-#### Scenario: Skipped when K8S_CLUSTER_NAME is absent
-- **WHEN** `ENABLE_GCP_WIF=1` but `K8S_CLUSTER_NAME` is unset or empty
+#### Scenario: Skipped when gcloud is not available in the job image
+- **WHEN** `K8S_CLUSTER_NAME` is set but the `gcloud` command is not on `PATH` (e.g. a build or test job using an image without the Cloud SDK)
+- **THEN** the template SHALL print a skip message and exit without error, so non-deploy jobs that inherit the global `before_script` are not failed
+
+#### Scenario: Skipped when gcloud is present but not authenticated
+- **WHEN** `K8S_CLUSTER_NAME` is set and `gcloud` is available but no account is active (e.g. no authentication step ran or it did not succeed)
 - **THEN** the template SHALL print a skip message and exit without error
 
+#### Scenario: Fails fast on a real generation error
+- **WHEN** `K8S_CLUSTER_NAME` is set, `gcloud` is available and authenticated, but a required variable is missing or `gcloud container clusters get-credentials` fails
+- **THEN** the template SHALL print a descriptive error and exit non-zero
+
 ### Requirement: GKE variable validation
 `check_gke_env()` SHALL validate that `K8S_CLUSTER_NAME`, `K8S_LOCATION`, `GCP_PROJECT_ID`, and `KUBE_NAMESPACE` are all non-empty. On any missing variable it SHALL print a descriptive error and return non-zero.
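Reviewer note: the body of `check_gke_env()` is not part of this diff. The sketch below is one possible shape that would satisfy the "GKE variable validation" requirement as worded, assuming the four variable names listed there. The error text and the loop structure are illustrative assumptions, not the template's real implementation.

```shell
# Illustrative validation function; the real check_gke_env() in the template
# may differ. Returns 0 when all four variables are non-empty, 1 otherwise.
check_gke_env() {
  missing=0
  for name in K8S_CLUSTER_NAME K8S_LOCATION GCP_PROJECT_ID KUBE_NAMESPACE; do
    # Indirect lookup of the variable named by $name. The names come from the
    # fixed list above, so eval is never fed untrusted input.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "ERROR: ${name} is required to generate a GKE kubeconfig but is not set." >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Placeholder values, not real resources. KUBE_NAMESPACE is left empty on
# purpose to show the failure path.
K8S_CLUSTER_NAME=demo-cluster
K8S_LOCATION=europe-west1
GCP_PROJECT_ID=demo-project
KUBE_NAMESPACE=
if check_gke_env; then
  echo "all required variables present"
else
  echo "validation failed"
fi
```

Reporting every missing variable before returning, rather than stopping at the first one, lets a misconfigured job be fixed in a single iteration.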

openspec/changes/wif-gke-kubeconfig/tasks.md

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@
 - [ ] 4.3 Confirm a pipeline with `ENABLE_GCP_WIF=1` but without `K8S_CLUSTER_NAME` skips silently without errors
 - [ ] 4.4 Confirm that when both GitLab Agent and WIF+GKE are configured, the gcloud context is active after `before_script` completes
 - [ ] 4.5 Test `K8S_USE_DNS_ENDPOINT=1` on a private cluster to confirm `--dns-endpoint` is passed and connectivity succeeds
+- [x] 4.6 Confirm a job with `K8S_CLUSTER_NAME` set but no `gcloud` in the image skips without failing (simulated locally); fail-fast preserved when gcloud is authenticated but vars missing or generation fails. Generation is decoupled from `ENABLE_GCP_WIF` so any gcloud auth method works.
 
 ## 5. Documentation
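Reviewer note: task 4.6 mentions a local simulation without saying how it was done. One way to reproduce the two skip cases, offered as an assumption rather than the author's actual procedure, is to manipulate `PATH`: point it at an empty directory to mimic an image without the Cloud SDK, and prepend a stub `gcloud` that prints nothing to mimic an unauthenticated one. The two predicates mirror the ones in the template, rewritten without `&>` so the sketch also runs under a plain POSIX `sh`.

```shell
check_gcloud() {
  command -v gcloud >/dev/null 2>&1
}

check_gcloud_auth() {
  active_account=$(gcloud auth list --filter="status=ACTIVE" --format="value(account)" 2>/dev/null)
  [ -n "${active_account}" ]
}

# Case 1: image without the Cloud SDK. PATH points at an empty directory,
# inside a subshell so the change does not leak into the rest of the script.
empty_dir=$(mktemp -d)
if ( PATH="${empty_dir}"; check_gcloud ); then
  case1="found"
else
  case1="missing"
fi

# Case 2: gcloud present but no active account. The stub ignores its
# arguments and prints nothing, so the account lookup comes back empty.
stub_dir=$(mktemp -d)
printf '#!/bin/sh\nexit 0\n' > "${stub_dir}/gcloud"
chmod +x "${stub_dir}/gcloud"
if ( PATH="${stub_dir}:${PATH}"; check_gcloud && check_gcloud_auth ); then
  case2="authenticated"
else
  case2="unauthenticated"
fi

echo "case 1: gcloud ${case1}; case 2: gcloud ${case2}"
rm -rf "${empty_dir}" "${stub_dir}"
```

Both cases should land on the skip branches of the template, so the job would continue instead of exiting non-zero.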

templates/functions/gke-kubeconfig.yml

Lines changed: 32 additions & 24 deletions
@@ -1,7 +1,21 @@
 # This template generates a namespace-scoped kubeconfig for GKE clusters.
-# It is an alternative to the GitLab Agent approach and can work with any
-# gcloud authentication method (WIF, service account key, etc.).
-# It requires gcloud to be already authenticated before this template runs.
+# It is an alternative to the GitLab Agent approach and works with any gcloud
+# authentication method (WIF, runner service account, service account key, etc.).
+# It is deliberately NOT tied to WIF: the principal running the job may already
+# hold the permissions to fetch a kubeconfig without federation.
+#
+# Generation is attempted whenever K8S_CLUSTER_NAME is non-empty (the signal of
+# cluster intent) and expects gcloud to be already authenticated by some earlier
+# step (e.g. gcp-wif.yml in the before_script chain, or the runner's own
+# credentials).
+#
+# This template is included in the global before_script and therefore runs in
+# every job. It is resilient by design: when gcloud is not present in the job
+# image, or is present but not authenticated, it skips without failing the job,
+# so build/test jobs that inherit K8S_CLUSTER_NAME as a global CI variable but do
+# not need cluster access are not broken. It fails the job (exit 1) only when
+# gcloud is authenticated and a kubeconfig was clearly intended but a required
+# variable is missing or credential fetching fails.
 #
 # Example:
 # include:
@@ -19,17 +33,14 @@
   before_script:
     # Functions
     - |
-      check_gcloud_auth() {
-        if ! command -v gcloud &> /dev/null; then
-          echo "The gcloud command is not available. Cannot generate GKE kubeconfig."
-          return 1
-        fi
+      check_gcloud() {
+        command -v gcloud &> /dev/null
+      }
 
+      check_gcloud_auth() {
         local active_account
         active_account=$(gcloud auth list --filter="status=ACTIVE" --format="value(account)" 2>/dev/null)
         if [ -z "${active_account}" ]; then
-          echo "No active gcloud authenticated account found. Cannot generate GKE kubeconfig."
-          echo "Authenticate gcloud before using this template (e.g. via gcp-wif.yml)."
           return 1
         fi
 
@@ -89,21 +100,18 @@
         print-banner "GKE KUBECONFIG"
       fi
       if [ -n "${K8S_CLUSTER_NAME:-}" ]; then
-        if check_gcloud_auth; then
-          if check_gke_env; then
-            if generate_gke_kubeconfig; then
-              echo "GKE kubeconfig generated and scoped to namespace ${KUBE_NAMESPACE}."
-            else
-              echo "GKE kubeconfig generation failed."
-              exit 1
-            fi
-          else
-            echo "GKE kubeconfig generation skipped due to missing variables."
-            exit 1
-          fi
-        else
-          echo "GKE kubeconfig generation skipped due to missing gcloud authentication."
+        if ! check_gcloud; then
+          echo "GKE kubeconfig generation skipped: gcloud is not available in this job image."
+        elif ! check_gcloud_auth; then
+          echo "GKE kubeconfig generation skipped: gcloud is not authenticated."
+        elif ! check_gke_env; then
+          echo "GKE kubeconfig generation failed: required variables missing."
           exit 1
+        elif ! generate_gke_kubeconfig; then
+          echo "GKE kubeconfig generation failed."
+          exit 1
+        else
+          echo "GKE kubeconfig generated and scoped to namespace ${KUBE_NAMESPACE}."
         fi
       else
         echo "GKE kubeconfig generation skipped (K8S_CLUSTER_NAME not set)."
