feat(dev): add lightweight KFP local dev cluster via k3d#753

Open
cordeirops wants to merge 3 commits into kubeflow:main from cordeirops:feat/k3d-local-dev-cluster

Conversation

@cordeirops
Contributor

Summary

Running a full Kubeflow Pipelines stack for local development currently requires minikube (or an equivalent), a kubectl port-forward, and enough memory to host both the Kubernetes control plane (~2 GB) and the KFP services (~2 GB). This is impractical for contributors on resource-constrained machines, and for those who only need to work on Kale's frontend or backend rather than on actual pipeline execution.

This PR introduces a k3d-based local development cluster as a lightweight, one-command alternative.

k3d runs k3s (a certified, minimal Kubernetes distribution) inside Docker containers. Its control plane uses an embedded SQLite store instead of etcd, reducing cluster overhead from ~2 GB (minikube) to ~512 MB, bringing the total footprint for a full KFP stack down to approximately 2.5 GB.

Changes

scripts/kfp-dev-setup.sh (new)

An idempotent Bash script that handles the full first-time setup:

  1. Verifies docker and kubectl are available
  2. Installs k3d automatically via its official install script if not present
  3. Creates a k3d cluster named kale-kfp (Traefik ingress disabled to save memory)
  4. Deploys KFP v2.16 using the official platform-agnostic kustomize manifests from github.com/kubeflow/pipelines
  5. Waits for all KFP pods to reach Ready state
  6. Starts a kubectl port-forward in the background, tracking its PID in .kfp-dev-pf.pid for clean teardown
  7. Smoke-tests the UI endpoint and prints the next steps

The script is safe to re-run — it skips cluster creation and KFP deployment if they already exist.
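The re-run safety described above can be sketched roughly as follows. The helper names, defaults, and k3d flags here are illustrative assumptions based on the PR description, not the script's actual contents:

```shell
#!/usr/bin/env bash
set -euo pipefail

CLUSTER_NAME="${1:-kale-kfp}"   # hypothetical default, per the description above

require() {
  # Step 1: verify a prerequisite binary is on PATH
  command -v "$1" >/dev/null 2>&1 || { echo "ERROR: $1 not found" >&2; exit 1; }
}

cluster_exists() {
  # k3d prints one cluster per line with the name in the first column
  k3d cluster list 2>/dev/null | grep -q "^${CLUSTER_NAME}[[:space:]]"
}

create_cluster() {
  # Step 3: create only if absent, so re-runs are harmless
  if cluster_exists; then
    echo "Cluster '${CLUSTER_NAME}' already exists, skipping creation."
  else
    # Traefik ingress disabled to save memory, as described above
    k3d cluster create "${CLUSTER_NAME}" --k3s-arg "--disable=traefik@server:0"
  fi
}

wait_for_kfp() {
  # Step 5: block until every pod in the kubeflow namespace is Ready
  kubectl wait --for=condition=Ready pod --all -n kubeflow --timeout=600s
}
```

The same "check, then skip or act" shape applies to the KFP deployment step, which is what makes the whole script idempotent.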

Makefile (updated)

Five new targets under a dedicated "KFP Local Cluster" section:

Target               Description
make kfp-dev-setup   First-time cluster creation and KFP deployment (~5 min)
make kfp-dev-start   Daily driver: start the cluster and port-forward the UI to localhost:8080
make kfp-dev-stop    Stop the port-forward and pause the cluster (all pipeline data is preserved)
make kfp-dev-delete  Wipe the cluster entirely and free all resources
make kfp-dev-status  Show k3d cluster state and KFP pod status

All targets are configurable via Makefile variables (KFP_CLUSTER_NAME, KFP_PIPELINE_VERSION, KFP_LOCAL_PORT).

The clean target is updated to also remove the port-forward PID file.
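The PID-file pattern behind the port-forward and its teardown might look like this. The service name (ml-pipeline-ui), namespace, and file path are assumptions inferred from the description, not the actual script:

```shell
PID_FILE="${KFP_PID_FILE:-.kfp-dev-pf.pid}"
LOCAL_PORT="${KFP_LOCAL_PORT:-8080}"

start_port_forward() {
  # Forward the KFP UI service to localhost in the background
  kubectl -n kubeflow port-forward svc/ml-pipeline-ui "${LOCAL_PORT}:80" \
    >/dev/null 2>&1 &
  echo $! > "${PID_FILE}"   # track the PID for clean teardown
}

stop_port_forward() {
  # Kill the tracked process if still running, then remove the PID file
  if [ -f "${PID_FILE}" ]; then
    kill "$(cat "${PID_FILE}")" 2>/dev/null || true
    rm -f "${PID_FILE}"
  fi
}
```

Tracking the PID in a file is what lets both make kfp-dev-stop and make clean tear the forward down without guessing at process names.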

Developer workflow

# First time only (~5 min)
make kfp-dev-setup

# Every day
make kfp-dev-start
make kfp-run NB=examples/base/candies_sharing.ipynb KFP_HOST=http://localhost:8080

# End of day
make kfp-dev-stop

# When done with the cluster entirely
make kfp-dev-delete

Requirements

  • Docker (Docker Desktop on macOS/Windows, or Docker Engine on Linux)
  • kubectl (available via brew install kubectl or bundled with Docker Desktop)
  • k3d — installed automatically by kfp-dev-setup if not present

Resource comparison

Setup                Cluster overhead   Total (cluster + KFP)
minikube + KFP       ~2 GB              ~4–6 GB
k3d + KFP (this PR)  ~512 MB            ~2.5 GB

Introduce a k3d-based local development cluster as a low-resource
alternative to running minikube + KFP for contributors who only need
to work on Kale's frontend or backend.

k3d (k3s in Docker) uses ~512 MB of cluster overhead compared to ~2 GB
for minikube, bringing the total footprint for KFP down to ~2.5 GB vs
~4–6 GB with the previous approach. No Kubernetes knowledge is required
beyond installing Docker and kubectl.

Changes:
- Add scripts/kfp-dev-setup.sh: idempotent setup script that installs
  k3d if missing, creates a cluster, deploys KFP v2.16 via the official
  platform-agnostic kustomize manifests, waits for all pods to be Ready,
  and starts a background port-forward with PID tracking
- Add five Makefile targets under a new "KFP Local Cluster" section:
    make kfp-dev-setup   # first-time cluster creation (~5 min)
    make kfp-dev-start   # daily: start cluster + port-forward to :8080
    make kfp-dev-stop    # pause cluster, preserves all pipeline data
    make kfp-dev-delete  # wipe cluster entirely
    make kfp-dev-status  # inspect k3d and KFP pod status
- Update clean target to remove the port-forward PID file

After kfp-dev-start, the KFP UI and API are available at
http://localhost:8080, compatible with the existing kfp-run target.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pedro Sbaraini Cordeiro <pedro.sbarainicordeiro@gmail.com>
@google-oss-prow google-oss-prow Bot requested a review from ederign April 13, 2026 00:44
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ederign for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Collaborator

@ada333 ada333 left a comment

Hi, @cordeirops I really like this idea - thank you for making this PR!
I left a few suggestions.

Comment thread: scripts/kfp-dev-setup.sh (Outdated)

deploy_kfp() {
# Idempotent: if the kubeflow namespace already has the ML pipeline CRD, skip
if kubectl get namespace kubeflow >/dev/null 2>&1 && \
Collaborator

I would do this differently so KFP can be upgraded on an existing cluster without the need to delete it; or maybe a new function like upgrade_kfp could be added?

Comment thread: scripts/kfp-dev-setup.sh
if k3d cluster list 2>/dev/null | grep -q "^${CLUSTER_NAME}[[:space:]]"; then
warn "Cluster '${CLUSTER_NAME}' already exists — skipping creation."
info "Starting cluster in case it was stopped..."
k3d cluster start "${CLUSTER_NAME}"
Collaborator

Before this we should switch the kubectl context, something like:
kubectl config use-context k3d-kale-kfp

Comment thread: Makefile
@bash scripts/kfp-dev-setup.sh "$(KFP_CLUSTER_NAME)" "$(KFP_PIPELINE_VERSION)" "$(KFP_LOCAL_PORT)" "$(KFP_PID_FILE)"

kfp-dev-start: ## Start existing cluster and port-forward KFP UI to localhost:8080
@printf "$(BLUE)Starting k3d cluster '$(KFP_CLUSTER_NAME)'...\n$(NC)"
Collaborator

also here: kubectl config use-context k3d-kale-kfp

Comment thread: Makefile

kfp-dev-status: ## Show cluster and KFP pod status
@printf "$(BLUE)k3d clusters:\n$(NC)"
@k3d cluster list 2>/dev/null || printf "$(YELLOW)k3d not installed\n$(NC)"
Collaborator

also here: kubectl config use-context k3d-kale-kfp

- Switch kubectl context to k3d-{cluster} after cluster start/create
  to ensure all subsequent kubectl commands target the correct cluster,
  regardless of the active context before the script runs.

- Extract shared manifest-apply logic into _apply_kfp_manifests() and
  introduce upgrade_kfp() alongside the existing deploy_kfp(). The
  deploy function skips if KFP is already present; the upgrade function
  always re-applies the manifests, allowing in-place version bumps
  without deleting the cluster and losing experiment/run history.

- Add make kfp-dev-upgrade target (accepts KFP_PIPELINE_VERSION=X.Y.Z)
  to expose the upgrade path from the Makefile.

- Add kubectl context switch to kfp-dev-start Makefile target for
  consistency with the setup script.

Signed-off-by: Pedro Sbaraini Cordeiro <pedro.sbarainicordeiro@gmail.com>
@cordeirops
Contributor Author

Thanks for the thorough review, @ada333! I've addressed all three suggestions in the latest commit (933ff5b).


Changes made

1. kubectl config use-context k3d-{cluster} — two locations

Added an explicit context switch in both places flagged:

  • In create_cluster() (script) — after cluster creation or start, kubectl is immediately pointed at k3d-${CLUSTER_NAME} before any kubectl apply or kubectl wait calls. This prevents commands from silently targeting a different cluster (e.g. a leftover minikube context).
  • In kfp-dev-start (Makefile) — same switch added right after k3d cluster start, so the daily-driver workflow is also safe for users who switch between multiple clusters.
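For a cluster named kale-kfp, k3d registers the kubeconfig context as k3d-kale-kfp, so the switch is a one-liner; sketched here with an assumed helper name:

```shell
CLUSTER_NAME="${CLUSTER_NAME:-kale-kfp}"

ensure_context() {
  # k3d names its kubeconfig contexts "k3d-<cluster-name>"
  kubectl config use-context "k3d-${CLUSTER_NAME}"
}
```

Calling this immediately after cluster create/start guarantees every later kubectl apply and kubectl wait hits the k3d cluster, whatever context was active before.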

2. KFP upgrade without deleting the cluster

Refactored deploy_kfp() by extracting the shared manifest-apply logic into a private helper _apply_kfp_manifests(). Two distinct functions now exist:

  • deploy_kfp() — unchanged behaviour: skips if KFP is already present, avoids re-applying on every setup run.
  • upgrade_kfp() — always calls _apply_kfp_manifests(), allowing kubectl apply -k to reconcile the diff between the currently installed version and the target version in-place, preserving all experiment and run history.
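The split might be sketched as follows. The manifest paths follow the upstream KFP standalone kustomize layout, and the deployment-presence check (ml-pipeline) is an assumption rather than the script's actual test:

```shell
PIPELINE_VERSION="${KFP_PIPELINE_VERSION:-2.16.0}"
MANIFESTS="github.com/kubeflow/pipelines/manifests/kustomize"

_apply_kfp_manifests() {
  # Shared by both paths: cluster-scoped resources first, then the
  # platform-agnostic environment, both pinned to the requested version
  kubectl apply -k "${MANIFESTS}/cluster-scoped-resources?ref=${PIPELINE_VERSION}"
  kubectl apply -k "${MANIFESTS}/env/platform-agnostic?ref=${PIPELINE_VERSION}"
}

deploy_kfp() {
  # First-time setup: skip if KFP already looks deployed
  if kubectl get deployment ml-pipeline -n kubeflow >/dev/null 2>&1; then
    echo "KFP already deployed, skipping."
    return 0
  fi
  _apply_kfp_manifests
}

upgrade_kfp() {
  # Always re-apply so the diff to the target version is reconciled in place
  _apply_kfp_manifests
}
```

Because kubectl apply is declarative, re-applying newer manifests upgrades the installation without touching the persistent volumes that hold experiment and run history.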

A new make kfp-dev-upgrade Makefile target exposes this:

# Upgrade to a newer KFP version on an existing cluster
make kfp-dev-upgrade KFP_PIPELINE_VERSION=2.17.0

Let me know if anything else needs adjusting!

@cordeirops cordeirops requested a review from ada333 April 13, 2026 13:30
Signed-off-by: Pedro Sbaraini Cordeiro <pedro.sbarainicordeiro@gmail.com>
@cordeirops
Contributor Author

cordeirops commented Apr 13, 2026

Build failure (browser_check timeout)

The build job failed with a page.waitForSelector: Timeout 100000ms exceeded error in uv run python -m jupyterlab.browser_check. This step spins up a headless browser to verify that JupyterLab loads; it is unrelated to the changes in this PR (a Makefile and a bash script; no frontend code is touched).

The same workflow passed on PR #754, which ran earlier the same day. This is known flaky behaviour in CI environments when the runner is under resource pressure.

I've pushed an empty commit to re-trigger the workflow.

@google-oss-prow google-oss-prow Bot added the lgtm label Apr 13, 2026