Attune: Kubernetes operator for in-place pod resource right-sizing (VPA replacement). Requires Kubernetes 1.32+ (In-Place Pod Resize; 1.32 alpha with feature gate, 1.33+ beta enabled by default). Built with Go 1.26, controller-runtime v0.24.1, Kubebuilder v4, K8s API v0.36.1.
Naming convention: "Attune" (capitalized) in prose and documentation.
attune (lowercase) in code, packages, namespaces, Prometheus metrics
(attune_resize_total), CLI commands (kubectl attune), API groups
(attune.io), Helm chart names, Docker images, and URLs.
- Install deps: go mod download
- Build: make build
- Build plugin: make build-plugin
- Build image: make docker-build IMG=attune:dev
- Test (unit): make test
- Test (single pkg): go test ./internal/resize/... -race -count=1
- Test (integration): make test-integration
- Test (E2E Chainsaw): NO_COLOR=1 make test-e2e (requires k3d cluster; NO_COLOR prevents raw ANSI codes in agent output)
- Test (E2E Go): make test-e2e-go (requires k3d cluster with operator + Prometheus)
- Test (E2E smoke): make test-e2e-smoke (requires deployed k3d/Kind cluster with operator + Prometheus)
- Test (fuzz): make test-fuzz
- Test (bench): make test-bench
- Lint: make lint
- Lint + fix: make lint-fix
- Format: make fmt
- Generate CRDs/RBAC: make manifests
- Generate deepcopy: make generate
- Helm chart docs: make helm-docs-gen
- Helm chart tests: make helm-unittest
- Helm lint + template validation: make helm-lint
- Doc defaults consistency check: make verify-doc-defaults
- Fast pre-commit checks: make verify-quick (no integration tests or govulncheck)
- All CI checks locally: make verify
- Clean build artifacts: make clean
- Local cluster (k3d): make k3d-create && make k3d-deploy IMG=attune:e2e
- Local cluster (Kind): make kind-create && make kind-deploy IMG=attune:e2e
- Full local test (auto-provisions k3d): make test-local
- Local smoke test (auto-provisions k3d): make test-local-smoke
- E2E tests: make test-e2e (requires local cluster with operator deployed)
- api/v1alpha1/ - CRD type definitions (AttunePolicy, AttuneDefaults)
- cmd/manager/ - Operator entry point
- cmd/kubectl-attune/ - kubectl plugin
- internal/controller/ - Reconciler (core business logic)
- internal/metrics/ - Metrics collection (Prometheus, Datadog, CloudWatch), QueryBuilder interface, rate limiting
- internal/recommendation/ - Composable estimator chain (percentile, margin, confidence, bounds, change filter)
- internal/resize/ - In-place pod resize engine via /resize subresource
- internal/safety/ - Post-resize safety observation and rollback
- internal/conflict/ - HPA conflict detection
- internal/webhook/ - Admission webhooks (defaulting + validation)
- internal/operatormetrics/ - Operator-level Prometheus metrics (init-registered)
- internal/validation/ - Shared validation (Prometheus address SSRF checks)
- internal/throttle/ - Shared throttle checker interface (breaks import cycle)
- pkg/defaults/ - Shared default-value and merge logic (used by controller + kubectl plugin)
- config/ - Kustomize manifests (CRDs, RBAC, manager deployment)
- charts/attune/ - Helm chart with cert-manager webhook support
- test/integration/ - envtest-based integration tests
- test/e2e/ - Chainsaw E2E test scenarios
- docs/ - MkDocs documentation site
Use these exact aliases; the linter rejects alternatives:
corev1 "k8s.io/api/core/v1"
appsv1 "k8s.io/api/apps/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
ctrl "sigs.k8s.io/controller-runtime"

Use logr structured logging exclusively. fmt.Print and fmt.Fprint are
forbidden by the linter (except in cmd/kubectl-attune/).
Use resource.ParseQuantity() (returns error) instead of resource.MustParse()
(panics). Use DecimalSI format for CPU, BinarySI for memory. Use Go time.Duration
for all durations (e.g., 168h not 7d).
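The duration rule can be checked with the standard library alone. The sketch below shows that time.ParseDuration accepts 168h but rejects 7d, because Go durations have no day unit. The helper name parseWindow is hypothetical and is not taken from the Attune codebase.

```go
package main

import (
	"fmt"
	"time"
)

// parseWindow is a hypothetical helper that parses a duration string
// with time.ParseDuration, which supports units up to hours ("h") only.
func parseWindow(s string) (time.Duration, error) {
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid duration %q: %w", s, err)
	}
	return d, nil
}

func main() {
	// Standalone demo; inside the repository use logr instead of fmt printing.
	for _, s := range []string{"168h", "7d"} {
		d, err := parseWindow(s)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", d)
	}
}
```

Returning the error instead of panicking mirrors the ParseQuantity-over-MustParse guidance: a malformed user-supplied value becomes a validation failure rather than a crash.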
controller-runtime v0.24.x uses typed generic interfaces. Register webhooks with:
// AttunePolicy: defaulting + validation
ctrl.NewWebhookManagedBy(mgr, &attunev1alpha1.AttunePolicy{}).
WithDefaulter(&webhook.AttunePolicyDefaulter{}).
WithValidator(&webhook.AttunePolicyValidator{}).
Complete()
// AttuneDefaults: validation only (costPricing fields)
ctrl.NewWebhookManagedBy(mgr, &attunev1alpha1.AttuneDefaults{}).
WithValidator(&webhook.AttuneDefaultsValidator{}).
Complete()

Kubebuilder markers like +kubebuilder:validation:Minimum=1 generate
OpenAPI schema constraints in the CRD. These are enforced by the API
server at admission time, before webhooks run. A zero value that
violates a CRD-level minimum is rejected even if the webhook would
accept it. When writing tests that create CRs, always respect CRD-level
constraints; webhook-level logic cannot override them.
The /resize subresource is not available via the controller-runtime client.
Use a typed kubernetes.Clientset and call UpdateResize():
clientset.CoreV1().Pods(ns).UpdateResize(ctx, name, pod, metav1.UpdateOptions{})

Wrap UpdateResize in retry.RetryOnConflict (kubelet and concurrent
container resizes bump resourceVersion).
See internal/resize/engine.go ResizePod().
K8s v1.33 memory limit restriction: Kubernetes v1.33 forbids decreasing a
container's memory limit in-place when the resize policy is NotRequired.
The operator handles this via ClampMemoryLimitForPolicy in
internal/resize/engine.go. K8s v1.34+ relaxed this restriction.
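The v1.33 restriction can be read as a small decision rule. The sketch below is only an illustration under stated assumptions and is not the actual ClampMemoryLimitForPolicy: the function name clampMemoryLimit, the byte-count parameters, and both boolean flags are hypothetical, and it assumes that "clamping" means keeping the current limit when an in-place decrease is not permitted.

```go
package main

import "fmt"

// clampMemoryLimit returns the memory limit (in bytes) to request.
//
// Assumptions (hypothetical, not the operator's real signature):
//   - restartOnResize: the container's memory resize policy restarts the
//     container, so the NotRequired restriction does not apply.
//   - decreaseAllowed: the cluster version permits in-place decreases
//     (v1.34+ according to the text above).
//
// Increases always pass through; a forbidden decrease keeps the current limit.
func clampMemoryLimit(current, desired int64, restartOnResize, decreaseAllowed bool) int64 {
	if desired >= current {
		return desired
	}
	if restartOnResize || decreaseAllowed {
		return desired
	}
	return current
}

func main() {
	// Standalone demo; inside the repository use logr instead of fmt printing.
	const mib = int64(1) << 20
	fmt.Println(clampMemoryLimit(512*mib, 256*mib, false, false) / mib) // decrease blocked on v1.33
	fmt.Println(clampMemoryLimit(512*mib, 256*mib, false, true) / mib)  // decrease allowed on v1.34+
}
```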
Run make manifests after changing CRD types or RBAC markers. Run
make generate after changing API types. Commit the generated output.
Any resource accessed via the controller-runtime client (r.Get(), r.List(),
r.Update()) goes through the informer cache by default. The cache needs
list and watch RBAC to start its reflector. If you add a new r.Get()
call for a resource type, check:
- Does the RBAC marker include list and watch? If not, add them.
- Is the resource in DisableFor (cmd/manager/main.go)? If yes, it bypasses the cache and only needs get.
When changing a client call's verb (e.g., r.Update() to r.Patch(),
or r.Get() to r.List()), the RBAC marker must also be updated. The
code compiles without the RBAC change; the failure only appears at runtime
as a "forbidden" error, which controller-runtime retries with exponential
backoff. This silently burns through timeouts and is hard to diagnose
without reading operator logs.
After changing RBAC markers, update three places:
- The kubebuilder marker in internal/controller/
- config/rbac/role.yaml (run make manifests)
- charts/attune/templates/clusterrole.yaml + its test
Currently, Secrets are the only resource in DisableFor (get-only is safe).
All other resources accessed via the client need list/watch.
Fields that should be overridable by AttuneDefaults must use
pointer types (*int32, *bool, *metav1.Duration) so nil=unset
is distinguishable from zero/false. Update all 7 locations:
- api/v1alpha1/attunepolicy_types.go - Add *T field with json:"name,omitempty" and // +optional
- api/v1alpha1/defaults.go - Add DefaultXxx constant
- pkg/defaults/defaults.go ApplyBuiltInDefaults() - Add nil check + default assignment
- pkg/defaults/defaults.go MergeDefaults() - Add merge clause (covers both controller and kubectl plugin)
- internal/webhook/validation.go - Add validation if needed
- Run make manifests && make generate to regenerate CRD + deepcopy
- cmd/kubectl-attune/main.go printEffectiveValues() - Add display line so kubectl attune explain shows the field
If the field also belongs in AttuneDefaults, add it to
api/v1alpha1/attunedefaults_types.go as well.
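To see why pointer types matter, the following sketch merges a policy with cluster-wide defaults and then with built-in defaults. Everything in it is hypothetical: the policySpec type, its two fields, the constant, and the mergeDefaults function are illustrative stand-ins, not the real AttunePolicy types or the real MergeDefaults.

```go
package main

import "fmt"

// policySpec is a hypothetical stand-in for an overridable field set.
// A nil pointer means "unset"; a pointer to zero/false is an explicit value.
type policySpec struct {
	MaxChangePercent *int32
	DryRun           *bool
}

// defaultMaxChangePercent is a hypothetical built-in default.
const defaultMaxChangePercent int32 = 25

// ptr returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// mergeDefaults fills unset (nil) fields from defaults first, then from
// built-in defaults. Explicit zero or false values in policy are preserved.
func mergeDefaults(policy, defaults policySpec) policySpec {
	out := policy
	if out.MaxChangePercent == nil && defaults.MaxChangePercent != nil {
		out.MaxChangePercent = ptr(*defaults.MaxChangePercent)
	}
	if out.MaxChangePercent == nil {
		out.MaxChangePercent = ptr(defaultMaxChangePercent)
	}
	if out.DryRun == nil && defaults.DryRun != nil {
		out.DryRun = ptr(*defaults.DryRun)
	}
	if out.DryRun == nil {
		out.DryRun = ptr(false)
	}
	return out
}

func main() {
	// Standalone demo; inside the repository use logr instead of fmt printing.
	policy := policySpec{MaxChangePercent: ptr(int32(0))} // explicit zero
	defaults := policySpec{MaxChangePercent: ptr(int32(50)), DryRun: ptr(true)}
	merged := mergeDefaults(policy, defaults)
	fmt.Println(*merged.MaxChangePercent, *merged.DryRun)
}
```

With a plain int32 field, the explicit zero in the policy would be indistinguishable from "unset" and the default of 50 would silently overwrite it.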
helm-docs reads # -- comments from values.yaml to generate README
parameter tables. Multi-line descriptions must use # -- only on the
first line; continuation lines use # without --:
# -- First line of the description.
# Continuation text on the second line.
# More continuation text.
someValue: "default"

Using # -- on every line causes helm-docs to treat each line as a
separate parameter description, producing garbled output.
The repository enforces semantic PR titles via .github/workflows/pr-title.yaml (the amannn/action-semantic-pull-request action).
- Allowed types: feat, fix, docs, ci, refactor, test, chore, perf, build, revert.
- The subject (text after type:) must start with a lowercase letter (subjectPattern: ^[a-z].+$).
  - Good: fix: e2e nightly RealisticLoad timeout + safe cache keys for secrets (no SHA256)
  - Bad: fix: E2E nightly ... (capital E fails the regex and blocks the PR immediately).
- The check runs on PR open/edit/synchronize and validates the PR title (and frequently the head commit message).
- Dependabot PRs are automatically exempted by the workflow.
When creating branches, commits, or PRs, make the first line a valid semantic title so the gate passes on the first attempt. This avoids immediate CI failures and repeated title edits.
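A title can be sanity-checked locally before opening a PR. The sketch below approximates the two rules listed above (allowed type, lowercase-first subject) with the standard library. It is an assumption-based approximation, not the action's implementation: the function validTitle is hypothetical, and optional scopes such as fix(ci): are deliberately not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// allowedTypes mirrors the type list from the workflow description.
var allowedTypes = map[string]bool{
	"feat": true, "fix": true, "docs": true, "ci": true, "refactor": true,
	"test": true, "chore": true, "perf": true, "build": true, "revert": true,
}

// subjectPattern is the subjectPattern quoted above.
var subjectPattern = regexp.MustCompile(`^[a-z].+$`)

// validTitle reports whether title has the form "type: subject" with an
// allowed type and a subject that starts with a lowercase letter.
// Assumption: scopes like "fix(ci): ..." are not supported by this sketch.
func validTitle(title string) bool {
	typ, subject, ok := strings.Cut(title, ":")
	if !ok || !allowedTypes[typ] {
		return false
	}
	return subjectPattern.MatchString(strings.TrimSpace(subject))
}

func main() {
	// Standalone demo; inside the repository use logr instead of fmt printing.
	for _, t := range []string{
		"fix: e2e nightly RealisticLoad timeout + safe cache keys for secrets (no SHA256)",
		"fix: E2E nightly timeout",
	} {
		fmt.Printf("%t  %s\n", validTitle(t), t)
	}
}
```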
MkDocs strict mode rejects relative links that resolve outside the docs/
directory. When referencing files elsewhere in the repo (e.g., charts/,
scripts/), use absolute GitHub URLs instead of relative paths:
<!-- BAD: relative path outside docs/ — MkDocs strict mode rejects this -->
[Helm README](../../charts/attune/README.md)
<!-- GOOD: absolute GitHub URL -->
[Helm README](https://github.com/attune-io/attune/tree/main/charts/attune#prometheusrule)

- Framework: testify (assert/require)
- Write table-driven tests for all logic
- Coverage threshold: 80% on ./internal/... (CI enforced)
- Generated files (zz_generated.deepcopy.go) are excluded from coverage
- CI uses gotestsum with --rerun-fails for flaky retry and JUnit XML reports
- Run with -race flag
- Use kubefake.NewSimpleClientset() to test resize operations
- Use fake.NewClientBuilder() for controller-runtime client mocking
- Integration tests use envtest (build tag: integration)
- E2E tests use Chainsaw v0.2.15 on k3d or Kind clusters (K8s 1.32, 1.33, 1.34, 1.35 matrix in CI)
- E2E tests that modify CRs mid-test must use a refetch/retry loop to handle optimistic concurrency conflicts (the operator reconciles the same object concurrently, causing "the object has been modified" errors on update)
- Go E2E tests (test/e2e-go/) must include t.Parallel() as the first line. Every test creates a unique namespace via uniqueNS(), so they are fully isolated. Without t.Parallel(), 13 tests run sequentially (~12 min); with it, they run concurrently (~2 min, bounded by OOMKill at 127s).
- E2E test policies must use Cooldown: 1m (the minimum) to avoid long requeue delays during data collection.
- E2E test pods should use Burstable QoS (requests only, no CPU/memory limits) unless testing QoS behavior specifically. Guaranteed QoS pods are harder to schedule when 13 parallel tests compete for ~4 allocatable CPUs on the k3d node. Keep CPU requests at or below 300m per test pod.
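The refetch/retry loop required for mid-test CR updates follows a simple shape: read the latest object, apply the change, attempt the update, and start over on a conflict. The sketch below demonstrates that shape against an in-memory stand-in for the API server so it runs without a cluster. The object, store, and updateWithRetry names are hypothetical; a real test would use the client's Get and Update calls (or client-go's retry.RetryOnConflict) in their place.

```go
package main

import (
	"errors"
	"fmt"
)

var errConflict = errors.New("the object has been modified")

// object is a hypothetical stand-in for a custom resource.
type object struct {
	ResourceVersion int
	Cooldown        string
}

// store is an in-memory stand-in for the API server.
type store struct {
	current               object
	pendingOperatorWrites int // simulated concurrent reconciles
}

func (s *store) get() object { return s.current }

// update succeeds only if the caller holds the latest resourceVersion.
func (s *store) update(o object) error {
	if s.pendingOperatorWrites > 0 {
		// Simulate the operator writing the same object first.
		s.pendingOperatorWrites--
		s.current.ResourceVersion++
	}
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRetry refetches the object before every attempt, so a stale
// resourceVersion from a previous attempt is never reused.
func updateWithRetry(s *store, attempts int, mutate func(*object)) error {
	for i := 0; i < attempts; i++ {
		o := s.get()
		mutate(&o)
		err := s.update(o)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, errConflict)
}

func main() {
	// Standalone demo; inside the repository use logr instead of fmt printing.
	s := &store{current: object{ResourceVersion: 1, Cooldown: "5m"}, pendingOperatorWrites: 2}
	err := updateWithRetry(s, 5, func(o *object) { o.Cooldown = "1m" })
	fmt.Println(err, s.current.Cooldown, s.current.ResourceVersion)
}
```

The key design point is that the mutation is reapplied to a freshly fetched copy on each attempt; retrying the update with the originally fetched object would fail every time.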
- Runs on GitHub-hosted runners (ubuntu-latest) by default
- Fuzz tests: 30s time-based per target (coverage-guided, not iteration count)
- E2E Nightly runs the full K8s version matrix (1.32, 1.33, 1.34, 1.35) sequentially; each version creates a fresh k3d cluster
- Concurrency groups use cancel-in-progress: false on main; PRs targeting main will not cancel in-flight CI runs
- Never commit secrets, API keys, or .env files (gitleaks runs in CI)
- Run make manifests && make generate before committing CRD/API changes
- Run make verify before committing (covers lint, test, helm-docs, CRD freshness)
- After running make deploy, make k3d-deploy, or make kind-deploy, restore the file with git checkout config/manager/kustomization.yaml before committing (kustomize edit set image mutates this file)
- Ask before adding new dependencies
- Ask before destructive cluster operations (delete namespaces, CRDs)
- The operator manages live pod resources; always test resize logic on a local cluster (make k3d-deploy or make kind-deploy) before pushing changes to resize or safety packages