fix(cluster): bound dev cluster disk usage (#244) #345
Three independent disk-exhaustion mechanisms in the local dev cluster all routed to the same failure mode (filled VM disk → unstartable pods, host crash). Tackle each:

- Auto-prune unreferenced containerd images after every cluster:install and cluster:build-* via `k3s crictl rmi --prune`. Old `:latest` layers no longer accumulate.
- Cap kubelet container logs at 10Mi × 3 files per pod via the k3s /etc/rancher/k3s/config.yaml. Lima provision writes it on new VMs; cluster:install ensures it on existing VMs (idempotent restart).
- Make per-mount PVC size configurable via a new optional `size` field on Mount, plumbed through the agent template ConfigMap and Helm values (homeMountSize per template). Defaults: claude-code 5Gi, pi-agent / google-workspace 2Gi, code-guardian 10Gi (clones repos). Empty preserves the historical 10Gi default.
- Add a periodic orphan-PVC GC in the controller that removes PVCs whose instance ConfigMap is gone; covers controller crashes mid-delete and out-of-band kubectl removals. Runs every 10m and re-reads from the API to avoid racing instance creation.
- New cluster:reclaim mise task as a manual escape hatch.

Signed-off-by: Tomas Dvorak <toomas2d@gmail.com>
🛡️ Humr — Code Review

PR #345: fix(cluster): bound dev cluster disk usage (#244)
Author: Tomas2D | Branch: fix/dev-cluster-disk-bounds → main | Changes: +303 −3 (15 files)

Summary
Addresses persistent disk exhaustion in dev clusters (issue #244) by adding configurable PVC sizes to Helm templates, setting kubelet container log rotation, adding periodic orphan PVC GC in the controller, and a manual `cluster:reclaim` escape hatch.

Verdict
APPROVE — well-designed fix with solid test coverage; warnings are minor dev-environment risks.

Review by Humr · automated code guardian
Closes #244.
Summary
The local dev cluster reliably fills its VM disk after a week or two of normal use, taking pods (and occasionally the host laptop) down with it. Three independent mechanisms all feed the same failure mode — this PR addresses each.
- `cluster:install` and the three `cluster:build-*` tasks now run `k3s crictl rmi --prune` after the new images are pinned to a running pod, so old `:latest` layers no longer pile up in containerd.
- `/etc/rancher/k3s/config.yaml` now carries `container-log-max-size: 10Mi` and `container-log-max-files: 3` (caps logs at ~30 MiB per pod). Lima provision writes it on new VMs; `cluster:install` ensures it on existing VMs (idempotent: the restart fires once on the upgrade run).
- `Mount` gains an optional `size` field validated via `resource.ParseQuantity`, plumbed through the agent-template ConfigMap and Helm values (`homeMountSize` per template). Defaults: claude-code 5Gi, pi-agent / google-workspace 2Gi, code-guardian 10Gi (it clones repos). An empty `size` falls back to the historical 10Gi; see the parsing sketch after this list.
- A new `ReconcileOrphanPVCs` pass in the controller runs every 10 min and deletes any `humr.ai/instance`-labeled PVC whose instance ConfigMap is gone, re-reading from the API to dodge create races. Covers controller crashes mid-delete and out-of-band `kubectl delete cm`; sketched after this list.
- A new `mise run cluster:reclaim` task prunes images, deletes Succeeded/Failed pods cluster-wide, surfaces orphan PVCs, and prints disk usage on the VM.

Production storage provisioning is unchanged beyond making the size knob configurable; today's templates that omit `size` still get the 10Gi default.
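For concreteness, the size validation and 10Gi fallback might look roughly like this; a minimal sketch, assuming a string-typed `size` field — `pvcSize` and the package layout are illustrative, only `resource.ParseQuantity` and the default come from the PR:

```go
package controller

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// defaultMountSize preserves the historical 10Gi default for templates
// that omit `size`.
var defaultMountSize = resource.MustParse("10Gi")

// pvcSize (illustrative name) turns a Mount's optional size string into the
// quantity requested on the PVC, rejecting anything ParseQuantity cannot read.
func pvcSize(size string) (resource.Quantity, error) {
	if size == "" {
		return defaultMountSize, nil
	}
	q, err := resource.ParseQuantity(size)
	if err != nil {
		return resource.Quantity{}, fmt.Errorf("invalid mount size %q: %w", size, err)
	}
	return q, nil
}
```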
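And a sketch of what the orphan-PVC GC could look like with client-go. The `humr.ai/instance` label key and the 10-minute cadence are from the PR; the function names, namespace scoping, and wiring are assumptions:

```go
package controller

import (
	"context"
	"log"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// instanceLabel marks PVCs owned by an agent instance; its value is assumed
// here to be the name of the instance ConfigMap.
const instanceLabel = "humr.ai/instance"

// RunOrphanPVCGC calls ReconcileOrphanPVCs every interval until ctx is done.
func RunOrphanPVCGC(ctx context.Context, cs kubernetes.Interface, ns string, interval time.Duration) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if err := ReconcileOrphanPVCs(ctx, cs, ns); err != nil {
				log.Printf("orphan PVC GC: %v", err)
			}
		}
	}
}

// ReconcileOrphanPVCs deletes instance-labeled PVCs whose instance ConfigMap
// no longer exists.
func ReconcileOrphanPVCs(ctx context.Context, cs kubernetes.Interface, ns string) error {
	pvcs, err := cs.CoreV1().PersistentVolumeClaims(ns).List(ctx, metav1.ListOptions{
		LabelSelector: instanceLabel, // existence selector: any labeled PVC
	})
	if err != nil {
		return err
	}
	for _, pvc := range pvcs.Items {
		instance := pvc.Labels[instanceLabel]
		// Live GET against the API server, not an informer cache: only a
		// definitive NotFound marks the PVC as orphaned.
		_, err := cs.CoreV1().ConfigMaps(ns).Get(ctx, instance, metav1.GetOptions{})
		if err == nil {
			continue // instance still exists
		}
		if !apierrors.IsNotFound(err) {
			return err // transient API error: delete nothing, retry next tick
		}
		log.Printf("deleting orphan PVC %s (instance %s gone)", pvc.Name, instance)
		if err := cs.CoreV1().PersistentVolumeClaims(ns).Delete(ctx, pvc.Name, metav1.DeleteOptions{}); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```

The live re-read is the part doing the race-avoidance work: a PVC created moments before its ConfigMap would look orphaned in a stale cached view, whereas a fresh GET only returns NotFound once the instance is definitively gone.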
Test plan
- `mise run test`: controller, api-server, ui, agent-runtime suites all green (added `MountSize` parsing tests, a per-PVC sizing test, and an orphan-GC test).
- `mise run check`: eslint, `go vet`, `tsc`, `helm lint`, and the kubeconform render all clean.
- `helm template` output: `size: "5Gi"` appears under `/home/agent` for the default template.
- `mise run cluster:install`, then `cluster:build-agent` 5×: confirmed image count and disk usage stay bounded.
- `mise run cluster:install` patches `/etc/rancher/k3s/config.yaml`, restarts k3s once, and subsequent runs are no-ops.
- Set `size: 1Gi`, confirm the PVC is 1Gi; then `kubectl delete cm` the instance directly, wait ≤ 10 min, and confirm the PVC is GC'd and the controller logs the GC line.
- `mise run cluster:reclaim` on a cluster with a stopped pod and an orphan PVC: pod deleted, orphan reported.
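For reference, the orphan-GC test mentioned above could take roughly this shape with client-go's fake clientset, continuing the illustrative names from the sketch in the Summary (the PR's actual test may differ):

```go
package controller

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func TestOrphanPVCIsDeleted(t *testing.T) {
	ctx := context.Background()
	// A PVC labeled for an instance whose ConfigMap does not exist.
	orphan := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "agent-home-gone",
			Namespace: "humr",
			Labels:    map[string]string{instanceLabel: "gone"},
		},
	}
	cs := fake.NewSimpleClientset(orphan)

	if err := ReconcileOrphanPVCs(ctx, cs, "humr"); err != nil {
		t.Fatalf("reconcile: %v", err)
	}
	// The orphan should be gone; a NotFound on GET proves the delete landed.
	_, err := cs.CoreV1().PersistentVolumeClaims("humr").Get(ctx, "agent-home-gone", metav1.GetOptions{})
	if !apierrors.IsNotFound(err) {
		t.Fatalf("expected orphan PVC to be deleted, got %v", err)
	}
}
```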