
fix(cluster): bound dev cluster disk usage (#244) #345

Open
Tomas2D wants to merge 1 commit into main from fix/dev-cluster-disk-bounds

Conversation


Tomas2D (Contributor) commented on Apr 27, 2026

Closes #244.

Summary

The local dev cluster reliably fills its VM disk after a week or two of normal use, taking pods (and occasionally the host laptop) down with it. Three independent mechanisms feed the same failure mode; this PR addresses each, and adds two safeguards (an orphan-PVC GC and a manual reclaim task) on top.

  • Image accumulation. cluster:install and the three cluster:build-* tasks now run k3s crictl rmi --prune after the new images are pinned to a running pod, so old :latest layers no longer pile up in containerd.
  • Unrotated container logs. k3s now reads /etc/rancher/k3s/config.yaml with container-log-max-size: 10Mi and container-log-max-files: 3 (cap ~30 MiB per pod). Lima provision writes it on new VMs; cluster:install ensures it on existing VMs (idempotent — restart fires once on the upgrade run).
  • Oversized / leaked PVCs. Mount gains an optional size field validated via resource.ParseQuantity, plumbed through the agent-template ConfigMap and Helm values (homeMountSize per template). Defaults: claude-code 5Gi, pi-agent / google-workspace 2Gi, code-guardian 10Gi (it clones repos). An empty or omitted size falls back to the historical 10Gi default.
  • Orphan PVCs. New ReconcileOrphanPVCs in the controller runs every 10 min, deletes any humr.ai/instance-labeled PVC whose instance ConfigMap is gone (re-reads from the API to dodge create-races). Covers controller crashes mid-delete and out-of-band kubectl delete cm.
  • Manual escape hatch. New mise run cluster:reclaim task: prunes images, deletes Succeeded/Failed pods cluster-wide, surfaces orphan PVCs, prints disk usage on the VM.
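The size-validation-with-default behavior described above can be sketched as follows. This is a stdlib-only sketch: the real controller validates via resource.ParseQuantity from k8s.io/apimachinery, and both resolveMountSize and the simplified quantity regex here are hypothetical stand-ins, not code from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultHomeMountSize mirrors the historical hard-coded PVC size.
const defaultHomeMountSize = "10Gi"

// quantityRe is a deliberately simplified stand-in for the grammar that
// resource.ParseQuantity accepts (digits with an optional SI / binary
// suffix). A real controller should call resource.ParseQuantity instead.
var quantityRe = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(Ki|Mi|Gi|Ti|k|M|G|T)?$`)

// resolveMountSize returns the size to use for a Mount's PVC: an empty
// value falls back to the historical 10Gi default, and a malformed value
// is rejected with an error rather than panicking at reconcile time.
func resolveMountSize(size string) (string, error) {
	if size == "" {
		return defaultHomeMountSize, nil
	}
	if !quantityRe.MatchString(size) {
		return "", fmt.Errorf("invalid mount size %q", size)
	}
	return size, nil
}

func main() {
	for _, s := range []string{"", "5Gi", "bogus"} {
		got, err := resolveMountSize(s)
		fmt.Printf("%q -> %q err=%v\n", s, got, err)
	}
}
```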

Production storage provisioning is unchanged beyond making the size knob configurable; today's templates that omit size still get the 10Gi default.
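The orphan-PVC selection logic can be sketched as below. This is a stdlib-only illustration of the sweep's decision rule under assumed names (orphanPVCs and the plain maps are hypothetical); the actual ReconcileOrphanPVCs uses client-go objects and, as noted above, re-reads instance ConfigMaps from the API server rather than an informer cache to avoid racing instance creation.

```go
package main

import (
	"fmt"
	"sort"
)

// orphanPVCs returns the names of PVCs whose humr.ai/instance label points
// at an instance that no longer exists. pvcByInstance maps PVC name ->
// instance label value; liveInstances is the set of instance ConfigMaps
// currently present (freshly listed from the API server).
func orphanPVCs(pvcByInstance map[string]string, liveInstances map[string]bool) []string {
	var orphans []string
	for pvc, instance := range pvcByInstance {
		if !liveInstances[instance] {
			orphans = append(orphans, pvc)
		}
	}
	sort.Strings(orphans) // deterministic order for logging and tests
	return orphans
}

func main() {
	pvcs := map[string]string{
		"home-claude-1": "claude-1", // instance still exists: retained
		"home-pi-7":     "pi-7",     // instance ConfigMap deleted: orphan
	}
	live := map[string]bool{"claude-1": true}
	fmt.Println(orphanPVCs(pvcs, live)) // [home-pi-7]
}
```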

Test plan

  • mise run test — controller, api-server, ui, agent-runtime suites all green (added MountSize parsing tests, per-PVC sizing test, orphan-GC test).
  • mise run check — eslint, go vet, tsc, helm lint, kubeconform render all clean.
  • Inspected rendered helm template output: size: "5Gi" appears under /home/agent for the default template.
  • End-to-end on a fresh dev VM: mise run cluster:install, run cluster:build-agent 5×, confirm image count and disk usage stay bounded.
  • On a pre-existing dev VM (no kubelet args): mise run cluster:install patches /etc/rancher/k3s/config.yaml, restarts k3s once, subsequent runs are no-ops.
  • Create an instance with size: 1Gi, confirm PVC is 1Gi; kubectl delete cm the instance directly, wait ≤ 10 min, confirm PVC is GC'd and controller logs the GC line.
  • mise run cluster:reclaim on a cluster with a stopped pod and an orphan PVC: pod deleted, orphan reported.
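For reference, one plausible layout for the log-rotation settings in /etc/rancher/k3s/config.yaml. k3s typically forwards kubelet flags via kubelet-arg entries; the exact key layout below is an assumption, not copied from this PR, so verify it against your k3s version.

```yaml
# Assumed layout — k3s forwards these to the kubelet; check the key names
# against your k3s release before relying on this.
kubelet-arg:
  - "container-log-max-size=10Mi"  # rotate each container log at 10 MiB
  - "container-log-max-files=3"    # keep at most 3 files (~30 MiB per pod)
```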

Three independent disk-exhaustion mechanisms in the local dev cluster
all routed to the same failure mode (filled VM disk → unstartable pods,
host crash). Tackle each:

- Auto-prune unreferenced containerd images after every cluster:install
  and cluster:build-* via `k3s crictl rmi --prune`. Old `:latest` layers
  no longer accumulate.
- Cap kubelet container logs at 10Mi × 3 files per pod via a k3s
  /etc/rancher/k3s/config.yaml. Lima provision writes it on new VMs;
  cluster:install ensures it on existing VMs (idempotent restart).
- Make per-mount PVC size configurable via a new optional `size` field
  on Mount, plumbed through the agent template ConfigMap and Helm
  values (homeMountSize per template). Defaults: claude-code 5Gi,
  pi-agent / google-workspace 2Gi, code-guardian 10Gi (clones repos).
  Empty preserves the historical 10Gi default.
- Add a periodic orphan-PVC GC in the controller that removes PVCs
  whose instance ConfigMap is gone — covers controller crashes
  mid-delete and out-of-band kubectl removals. Runs every 10m and
  re-reads from the API to avoid racing instance creation.
- New cluster:reclaim mise task as a manual escape hatch.

Signed-off-by: Tomas Dvorak <toomas2d@gmail.com>

xjacka (Contributor) commented on Apr 27, 2026

🛡️ Humr — Code Review

PR #345: fix(cluster): bound dev cluster disk usage (#244)

Author: Tomas2D | Branch: fix/dev-cluster-disk-bounds → main | Changes: +303 −3 (15 files)

Summary

Addresses persistent disk exhaustion in dev clusters (issue #244) by: adding configurable PVC sizes to Helm templates, setting kubelet container log rotation, adding periodic orphan PVC GC in the controller, and a cluster:reclaim task for manual recovery.

Findings

  • 🟡 Warning: resource.MustParse(m.Size) panics on an invalid Kubernetes quantity string. The comment says "validation happens in ParseAgentSpec" but that code isn't in this diff — if that validation is absent or incomplete, any malformed homeMountSize in values.yaml will crash the controller. (packages/controller/pkg/reconciler/resources.go)
  • 🟡 Warning: cluster:install writes /etc/rancher/k3s/config.yaml unconditionally — if an existing VM already has custom kubelet args in that file, they will be silently overwritten. (deploy/tasks.toml)
  • Looks good: Orphan PVC GC re-reads ConfigMaps from the API server (not the informer cache), correctly avoiding the create-PVC-before-finalize race.
  • Looks good: Tests cover both configurable PVC size and orphan sweep, including the "live PVC must be retained" invariant.
  • Looks good: Switching from fmt.Printf to slog is an improvement.

Verdict

APPROVE — well-designed fix with solid test coverage; warnings are minor dev-environment risks.


Review by Humr · automated code guardian


Successfully merging this pull request may close these issues.

[Bug]: Local dev cluster fills its disk and can crash the host laptop
