The spark-k8s-deployer ops-base image is used in GitLab CI/CD pipelines to deploy applications to Kubernetes. It currently provides two methods for cluster access:
- Token-based (create_kubeconfig): uses KUBE_URL + KUBE_TOKEN + KUBE_CA_PEM to build a kubeconfig from static credentials. Lives in scripts/src/functions.bash (baked into the Docker image).
- GitLab Agent (setup-gitlab-agent): uses kubectl config use-context to switch to an agent-managed context. Also lives in scripts/src/functions.bash.
WIF authentication to GCP is handled by templates/functions/gcp-wif.yml (create_wif()), which runs early in the before_script chain and leaves gcloud fully authenticated. This template is self-contained and remotely includable — it has no dependency on the Docker image.
The existing template portability pattern is:
- templates/functions/*.yml → functions defined inline in YAML before_script, includable via include: remote: independently of the image.
- scripts/src/functions.bash → functions baked into the image, only available when running inside the deployer container.
The initial implementation of this change broke this pattern by adding WIF+GKE logic to functions.bash. This design corrects that by introducing a new portable template following the same convention as gcp-wif.yml.
The before_script execution chain (defined in templates/.gitlab-ci-template.yml) runs in this order:
1. .gitlab-helper-functions (section_start/end helpers)
2. .gcp-wif (WIF auth → gcloud authenticated)
3. .default-setup (sources functions.bash → setup-gitlab-agent → job info)
4. .gke-kubeconfig (GKE kubeconfig generation — NEW, runs last)
Running .gke-kubeconfig last — after setup-gitlab-agent — ensures the gcloud-based kubeconfig always overrides the agent context, providing a safety net for users who forget to set DISABLE_GITLAB_AGENT=1.
Goals:
- Introduce a portable, remotely-includable gke-kubeconfig.yml template that generates a GKE kubeconfig using WIF-authenticated gcloud, with no image dependency.
- Run kubeconfig generation after setup-gitlab-agent so the gcloud context always wins.
- Support K8S_USE_DNS_ENDPOINT=1 for private clusters.
- Scope the kubeconfig to $KUBE_NAMESPACE.
- Emit a CI log banner section for visibility.
- Revert functions.bash to its pre-change state, with no WIF awareness in image scripts.
- Preserve full backward compatibility.
Non-Goals:
- Changes to gcp-wif.yml (handles GCP auth, stays as-is).
- Changes to setup-gitlab-agent() (works correctly, stays as-is).
- Supporting non-GKE clusters via WIF.
- Generator-side changes (handled in board#4348).
Decision: New file templates/functions/gke-kubeconfig.yml with anchor .gke-kubeconfig.
Rationale: Separation of concerns. gcp-wif.yml handles GCP authentication; gke-kubeconfig.yml handles cluster access. They can be used independently — a project might use WIF for Artifact Registry access only, without needing GKE access. A separate template makes each file's responsibility explicit and mirrors the existing split between gcp-wif.yml and gitlab-helper-functions.yml.
Alternative considered: Extending gcp-wif.yml with the kubeconfig logic. Rejected because it couples two distinct concerns (auth vs. cluster access) in one file, making the template harder to reason about and reuse independently.
Decision: !reference [.gke-kubeconfig, before_script] is added as the final step in .global-setup, after .default-setup (which contains setup-gitlab-agent).
Rationale: When a user forgets to set DISABLE_GITLAB_AGENT=1, the agent may configure its own kubeconfig context. Running .gke-kubeconfig after ensures the gcloud-based context always overrides, making WIF+GKE the authoritative path when configured. This is a deliberate safety net.
Alternative considered: Running before setup-gitlab-agent (inside .gcp-wif or between steps 2 and 3). Rejected because agent setup could then override the gcloud context, producing the opposite of the desired behavior.
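The ordering guarantee can be checked at the end of the chain. The sketch below is an illustration rather than part of the template: the function name active_context_is_gke is hypothetical, and it relies on the assumption that the active context was written by gcloud, which names GKE contexts with a gke_ prefix (gke_PROJECT_LOCATION_CLUSTER).

```shell
#!/usr/bin/env bash
# Hypothetical helper: not defined by the deployer image or its templates.

# Returns 0 when the active kubeconfig context looks like a gcloud-generated
# GKE context, and 1 (with a message on stderr) otherwise.
active_context_is_gke() {
  local ctx
  # An empty result (no kubeconfig, no current context) falls through to the error branch.
  ctx="$(kubectl config current-context 2>/dev/null || true)"
  case "$ctx" in
    gke_*) return 0 ;;
    *)     echo "active context is '${ctx}', not a gke_ context" >&2; return 1 ;;
  esac
}
```

A job could call this after .global-setup has run to detect a later step that switched the context back to the agent.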
Decision: Revert create_kubeconfig() and ensure_deploy_variables() to their pre-change state. No WIF branch, no GKE variables checked.
Rationale: The image scripts are called by scripts/kubectl and scripts/destroy for token-based pipelines only. Adding WIF awareness there breaks the portability pattern — the logic should live in a template, not baked into the image. The template approach covers all use cases without image changes.
Decision: Gate generate_gke_kubeconfig on K8S_CLUSTER_NAME being non-empty alone. Do NOT couple it to ENABLE_GCP_WIF. When K8S_CLUSTER_NAME is set but gcloud is unavailable or unauthenticated, skip without failing the job; fail fast only when gcloud is authenticated but a required variable is missing or credential fetching fails.
Rationale: K8S_CLUSTER_NAME unambiguously signals cluster intent. Tying generation to ENABLE_GCP_WIF=1 would wrongly exclude principals that are already authenticated to gcloud by other means (the runner's own service account, a service account key, etc.) and only need a kubeconfig, not federation. This is consistent with Decision 1 (WIF authentication and cluster access are separate concerns) and was raised in review of the resilience fix.
Alternative considered: Also requiring ENABLE_GCP_WIF=1. Rejected because it couples cluster access to a specific authentication method and breaks non-WIF gcloud auth. Resilience for non-deploy jobs is instead provided by skipping gracefully when gcloud is absent or unauthenticated.
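The gate described in this decision can be sketched as a small decision function. This is a minimal illustration, not the template's actual code: the function name gke_kubeconfig_decision, the printed labels, and the K8S_CLUSTER_LOCATION variable are assumptions introduced here; only K8S_CLUSTER_NAME comes from the design.

```shell
#!/usr/bin/env bash
# Sketch only: names other than K8S_CLUSTER_NAME are hypothetical.

# Prints the decision and returns non-zero only in the fail-fast case
# (gcloud authenticated, but a required variable is missing).
gke_kubeconfig_decision() {
  # Gate: K8S_CLUSTER_NAME alone signals cluster intent; ENABLE_GCP_WIF is not consulted.
  if [ -z "${K8S_CLUSTER_NAME:-}" ]; then
    echo "skip-no-cluster"
    return 0
  fi
  # Resilience: a job whose image has no gcloud is skipped, not failed.
  if ! command -v gcloud >/dev/null 2>&1; then
    echo "skip-no-gcloud"
    return 0
  fi
  # An empty active-account list means gcloud is installed but unauthenticated.
  local account
  account="$(gcloud auth list --filter=status:ACTIVE --format='value(account)' 2>/dev/null || true)"
  if [ -z "$account" ]; then
    echo "skip-unauthenticated"
    return 0
  fi
  # Fail fast: gcloud is authenticated, so a missing input is a configuration error.
  if [ -z "${K8S_CLUSTER_LOCATION:-}" ]; then
    echo "missing-K8S_CLUSTER_LOCATION"
    return 1
  fi
  echo "generate"
}
```

Because the check asks gcloud for an active account instead of reading ENABLE_GCP_WIF, a runner authenticated through its own service account or a key file reaches the generate branch as well.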
Decision: When K8S_USE_DNS_ENDPOINT=1, append --dns-endpoint to the gcloud container clusters get-credentials command.
Rationale: The platform generator sets this as "1"/"0" (board#4348). The --dns-endpoint flag instructs gcloud to use the cluster's DNS-based endpoint (Private Service Connect), required for private GKE clusters.
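As an illustration of the flag handling, the sketch below assembles the get-credentials arguments without invoking gcloud. The function name gke_get_credentials_args and the K8S_CLUSTER_LOCATION variable are hypothetical; the design only specifies K8S_CLUSTER_NAME, K8S_USE_DNS_ENDPOINT, and the --dns-endpoint flag.

```shell
#!/usr/bin/env bash
# Sketch only: K8S_CLUSTER_LOCATION and the function name are assumptions.

# Prints the gcloud argument list, one argument per line, without running gcloud.
gke_get_credentials_args() {
  printf '%s\n' container clusters get-credentials "$K8S_CLUSTER_NAME" \
    "--location=$K8S_CLUSTER_LOCATION"
  # The generator emits "1" or "0"; only the exact string "1" enables the DNS endpoint.
  if [ "${K8S_USE_DNS_ENDPOINT:-0}" = "1" ]; then
    printf '%s\n' --dns-endpoint
  fi
}
```

In bash the output can be read into an array with mapfile -t args < <(gke_get_credentials_args) and passed on as gcloud "${args[@]}", which keeps each argument intact. Comparing against the literal "1" means unset, empty, and "0" all leave the default endpoint in place.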
- gcloud must be authenticated before .gke-kubeconfig runs → in WIF pipelines this is guaranteed by .gcp-wif running earlier in the chain. If K8S_CLUSTER_NAME is set but gcloud is absent or unauthenticated (for example, ENABLE_GCP_WIF=0 and no other gcloud credentials), the template skips without failing the job. No risk.
- Execution order dependency → .gke-kubeconfig must always be the last !reference in .global-setup. If a future template is added after it, it could inadvertently switch the active context. Mitigation: document the ordering constraint clearly.
- Namespace scoping is advisory, not enforced by RBAC → kubectl config set-context --current --namespace sets the default namespace for commands but does not prevent cross-namespace access if the service account has cluster-wide permissions. True enforcement requires RBAC on the GCP side.
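The namespace scoping step can be sketched as follows. The function name scope_kubeconfig_namespace is hypothetical, and leaving the context untouched when KUBE_NAMESPACE is empty is an assumption of this sketch, not behavior stated in the design.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the empty-namespace behavior are assumptions.

# Sets the default namespace on the active context. This changes only the
# client-side default; it grants or removes no permissions.
scope_kubeconfig_namespace() {
  if [ -z "${KUBE_NAMESPACE:-}" ]; then
    echo "KUBE_NAMESPACE is empty; leaving the context namespace unchanged"
    return 0
  fi
  kubectl config set-context --current --namespace="$KUBE_NAMESPACE"
}
```

A command such as kubectl auth can-i get pods --namespace kube-system reports whether the authenticated principal can still reach another namespace, which is one way to confirm that the scoping is advisory only.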
No migration required. The change is purely additive for new WIF-enabled pipelines. The revert of functions.bash removes the previously merged WIF branches, restoring the original behavior for all existing pipelines.
Rollback: Remove gke-kubeconfig.yml and its include: + !reference entries from .gitlab-ci-template.yml. No state is persisted between runs.
Open questions: None. All design decisions from the previous iteration are resolved.