
fix(cluster): bound dev cluster disk usage (#244) #345

Open
Tomas2D wants to merge 1 commit into main from fix/dev-cluster-disk-bounds

Conversation


Tomas2D (Contributor) commented on Apr 27, 2026

Closes #244.

Summary

The local dev cluster reliably fills its VM disk after a week or two of normal use, taking pods (and occasionally the host laptop) down with it. Three independent mechanisms feed the same failure mode; this PR addresses each, and adds two safeguards (an orphan-PVC GC and a manual reclaim task) on top.

  • Image accumulation. cluster:install and the three cluster:build-* tasks now run k3s crictl rmi --prune after the new images are pinned to a running pod, so old :latest layers no longer pile up in containerd.
  • Unrotated container logs. k3s now reads /etc/rancher/k3s/config.yaml with container-log-max-size: 10Mi and container-log-max-files: 3 (cap ~30 MiB per pod). Lima provision writes it on new VMs; cluster:install ensures it on existing VMs (idempotent — restart fires once on the upgrade run).
  • Oversized / leaked PVCs. Mount gains an optional size field validated via resource.ParseQuantity, plumbed through the agent-template ConfigMap and Helm values (homeMountSize per template). Defaults: claude-code 5Gi, pi-agent / google-workspace 2Gi, code-guardian 10Gi (it clones repos). An empty or omitted size falls back to the historical 10Gi default.
  • Orphan PVCs. New ReconcileOrphanPVCs in the controller runs every 10 min, deletes any humr.ai/instance-labeled PVC whose instance ConfigMap is gone (re-reads from the API to dodge create-races). Covers controller crashes mid-delete and out-of-band kubectl delete cm.
  • Manual escape hatch. New mise run cluster:reclaim task: prunes images, deletes Succeeded/Failed pods cluster-wide, surfaces orphan PVCs, prints disk usage on the VM.
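The size-validation-with-default behavior described above can be sketched as follows. This is a stdlib-only sketch: the real controller validates via resource.ParseQuantity from k8s.io/apimachinery, and both resolveMountSize and the simplified quantity regex here are hypothetical stand-ins, not code from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// defaultHomeMountSize mirrors the historical hard-coded PVC size.
const defaultHomeMountSize = "10Gi"

// quantityRe is a deliberately simplified stand-in for the grammar that
// resource.ParseQuantity accepts (digits with an optional SI / binary
// suffix). A real controller should call resource.ParseQuantity instead.
var quantityRe = regexp.MustCompile(`^[0-9]+(\.[0-9]+)?(Ki|Mi|Gi|Ti|k|M|G|T)?$`)

// resolveMountSize returns the size to use for a Mount's PVC: an empty
// value falls back to the historical 10Gi default, and a malformed value
// is rejected with an error rather than panicking at reconcile time.
func resolveMountSize(size string) (string, error) {
	if size == "" {
		return defaultHomeMountSize, nil
	}
	if !quantityRe.MatchString(size) {
		return "", fmt.Errorf("invalid mount size %q", size)
	}
	return size, nil
}

func main() {
	for _, s := range []string{"", "5Gi", "bogus"} {
		got, err := resolveMountSize(s)
		fmt.Printf("%q -> %q err=%v\n", s, got, err)
	}
}
```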

Production storage provisioning is unchanged beyond making the size knob configurable; today's templates that omit size still get the 10Gi default.
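The orphan-PVC selection logic can be sketched as below. This is a stdlib-only illustration of the sweep's decision rule under assumed names (orphanPVCs and the plain maps are hypothetical); the actual ReconcileOrphanPVCs uses client-go objects and, as noted above, re-reads instance ConfigMaps from the API server rather than an informer cache to avoid racing instance creation.

```go
package main

import (
	"fmt"
	"sort"
)

// orphanPVCs returns the names of PVCs whose humr.ai/instance label points
// at an instance that no longer exists. pvcByInstance maps PVC name ->
// instance label value; liveInstances is the set of instance ConfigMaps
// currently present (freshly listed from the API server).
func orphanPVCs(pvcByInstance map[string]string, liveInstances map[string]bool) []string {
	var orphans []string
	for pvc, instance := range pvcByInstance {
		if !liveInstances[instance] {
			orphans = append(orphans, pvc)
		}
	}
	sort.Strings(orphans) // deterministic order for logging and tests
	return orphans
}

func main() {
	pvcs := map[string]string{
		"home-claude-1": "claude-1", // instance still exists: retained
		"home-pi-7":     "pi-7",     // instance ConfigMap deleted: orphan
	}
	live := map[string]bool{"claude-1": true}
	fmt.Println(orphanPVCs(pvcs, live)) // [home-pi-7]
}
```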

Test plan

  • mise run test — controller, api-server, ui, agent-runtime suites all green (added MountSize parsing tests, per-PVC sizing test, orphan-GC test).
  • mise run check — eslint, go vet, tsc, helm lint, kubeconform render all clean.
  • Inspected rendered helm template output: size: "5Gi" appears under /home/agent for the default template.
  • End-to-end on a fresh dev VM: mise run cluster:install, run cluster:build-agent 5×, confirm image count and disk usage stay bounded.
  • On a pre-existing dev VM (no kubelet args): mise run cluster:install patches /etc/rancher/k3s/config.yaml, restarts k3s once, subsequent runs are no-ops.
  • Create an instance with size: 1Gi, confirm PVC is 1Gi; kubectl delete cm the instance directly, wait ≤ 10 min, confirm PVC is GC'd and controller logs the GC line.
  • mise run cluster:reclaim on a cluster with a stopped pod and an orphan PVC: pod deleted, orphan reported.
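For reference, one plausible layout for the log-rotation settings in /etc/rancher/k3s/config.yaml. k3s typically forwards kubelet flags via kubelet-arg entries; the exact key layout below is an assumption, not copied from this PR, so verify it against your k3s version.

```yaml
# Assumed layout — k3s forwards these to the kubelet; check the key names
# against your k3s release before relying on this.
kubelet-arg:
  - "container-log-max-size=10Mi"  # rotate each container log at 10 MiB
  - "container-log-max-files=3"    # keep at most 3 files (~30 MiB per pod)
```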

Three independent disk-exhaustion mechanisms in the local dev cluster
all routed to the same failure mode (filled VM disk → unstartable pods,
host crash). Tackle each:

- Auto-prune unreferenced containerd images after every cluster:install
  and cluster:build-* via `k3s crictl rmi --prune`. Old `:latest` layers
  no longer accumulate.
- Cap kubelet container logs at 10Mi × 3 files per pod via a k3s
  /etc/rancher/k3s/config.yaml. Lima provision writes it on new VMs;
  cluster:install ensures it on existing VMs (idempotent restart).
- Make per-mount PVC size configurable via a new optional `size` field
  on Mount, plumbed through the agent template ConfigMap and Helm
  values (homeMountSize per template). Defaults: claude-code 5Gi,
  pi-agent / google-workspace 2Gi, code-guardian 10Gi (clones repos).
  Empty preserves the historical 10Gi default.
- Add a periodic orphan-PVC GC in the controller that removes PVCs
  whose instance ConfigMap is gone — covers controller crashes
  mid-delete and out-of-band kubectl removals. Runs every 10m and
  re-reads from the API to avoid racing instance creation.
- New cluster:reclaim mise task as a manual escape hatch.

Signed-off-by: Tomas Dvorak <toomas2d@gmail.com>

xjacka (Contributor) commented on Apr 27, 2026

🛡️ Humr — Code Review

PR #345: fix(cluster): bound dev cluster disk usage (#244)

Author: Tomas2D | Branch: fix/dev-cluster-disk-bounds → main | Changes: +303 −3 (15 files)

Summary

Addresses persistent disk exhaustion in dev clusters (issue #244) by: adding configurable PVC sizes to Helm templates, setting kubelet container log rotation, adding periodic orphan PVC GC in the controller, and a cluster:reclaim task for manual recovery.

Findings

  • 🟡 Warning: resource.MustParse(m.Size) panics on an invalid Kubernetes quantity string. The comment says "validation happens in ParseAgentSpec" but that code isn't in this diff — if that validation is absent or incomplete, any malformed homeMountSize in values.yaml will crash the controller. (packages/controller/pkg/reconciler/resources.go)
  • 🟡 Warning: cluster:install writes /etc/rancher/k3s/config.yaml unconditionally — if an existing VM already has custom kubelet args in that file, they will be silently overwritten. (deploy/tasks.toml)
  • Looks good: Orphan PVC GC re-reads ConfigMaps from the API server (not the informer cache), correctly avoiding the create-PVC-before-finalize race.
  • Looks good: Tests cover both configurable PVC size and orphan sweep, including the "live PVC must be retained" invariant.
  • Looks good: Switching from fmt.Printf to slog is an improvement.

Verdict

APPROVE — well-designed fix with solid test coverage; warnings are minor dev-environment risks.


Review by Humr · automated code guardian


Successfully merging this pull request may close these issues.

[Bug]: Local dev cluster fills its disk and can crash the host laptop
