All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Support for Kubernetes 1.32 with `InPlacePodVerticalScaling` alpha feature gate; the operator now falls back to the deprecated `pod.Status.Resize` field for resize status on clusters without the 1.33+ pod conditions
- Top-level `safetyObservationPeriod` field on `UpdateStrategy` for configuring post-resize safety watch duration (default 5m, minimum 1m); takes precedence over `canary.observationPeriod` and works in all modes
- Early OOMKill and crash loop detection during safety observation period: critical events trigger immediate revert without waiting for the full observation period
- `kubectl attune explain` now displays the effective observation period with source tracking
- Configurable `rateWindow` field for CPU PromQL queries; no longer hardcoded to `[5m]`, now tracks `queryStep` by default
- Effective cooldown with backoff multiplier exposed in policy status
- Recommendation staleness detection with `LastDataTime` and `Stale` fields; stale recommendations block resize execution
- `StaleRecommendationsTotal` metric for tracking Prometheus degradation
- `ScheduleBlocked` status condition when outside the configured resize window
- `SCHEDULE` column in `kubectl attune status` output
- Per-policy namespace/name labels on `ReconcileDuration` metric
- Per-policy reconcile duration panel in Grafana dashboard (p99/p50 by namespace and policy)
- ReplicaSet as a supported target workload kind with adapter, RBAC, and Helm clusterrole
- Cross-namespace Secret reference rejection in webhook validation
- `AttuneHighRevertRate` PrometheusRule alert in Helm chart
- Configurable `burstSensitivity` per resource: controls how much burst detection inflates recommendations (default 0.1, set 0 to disable)
- Canary auto-promotion resets on spec change: editing a policy restarts the observation cycle so new configuration is re-validated
- `attune_burst_factor` Prometheus metric and Grafana dashboard panel showing burst detection multiplier per workload
- Burst detection now influences recommendations via logarithmic safety-margin boost
- Canary auto-promotion: when `autoPromote: true`, the operator automatically promotes to full fleet resize after the observation period passes without safety violations
- VPA conflict detection E2E test (Chainsaw scenario with inline CRD)
- OOMKill safety revert Go E2E test (uses stress-ng for reliable OOMKill trigger)
- Helm values schema validation (`values.schema.json`) for catching typos at install time
- Pending workloads column in `kubectl attune status` output
- Secret name and key context in Prometheus auth failure messages
- Go E2E tests for bearer-token Secret rotation and recommendations without live pods
- Structured-output test coverage for kubectl plugin (`-o json`, `-o yaml`)
- Documentation for running the full Go E2E suite locally
- V(1) debug log when a resize is skipped because the container is already at the target resources
- Initial sizing webhook: Mutating admission webhook sets pod resource requests/limits at creation time based on existing AttunePolicy recommendations, eliminating the "deploy with bad defaults" gap. Requires namespace label `attune.io/initial-sizing=enabled` and `initialSizing: true` on the policy. Safety: `failurePolicy: Ignore`, confidence threshold 0.5, stale check.
- Directional change caps: `maxIncreasePercent` (default 50%) and `maxDecreasePercent` (default 30%) in ResourceConfig for asymmetric per-step caps (memory decreases are riskier than CPU increases)
- Memory-from-CPU derivation: `memoryFromCpuRatio` in ResourceConfig derives memory recommendation from CPU (e.g., `"2.0"` for JVM heap-bound workloads), skipping Prometheus memory queries
- Wizard `create` and `promote` flows now prompt for initial sizing when mode is Auto, OneShot, or Canary
- SLO-based guardrails: `updateStrategy.sloGuardrails[]` defines application-level PromQL checks (latency, error rate) evaluated after each resize during the safety observation period. Breaching a threshold triggers automatic revert. Supports template variables for namespace, workload, and pod name.
- VPA recommendation consumption: `metricsSource.vpa` consumes existing VerticalPodAutoscaler recommendations as an alternative to Prometheus queries, bridging VPA-only clusters into Attune's in-place resize engine
- GitOps diff command: `kubectl attune diff` outputs resource change recommendations in YAML diff format for GitOps workflows (ArgoCD, Flux). Supports `-o yaml` structured output.
- spec.paused: Boolean field on `AttunePolicySpec` that halts all reconciliation (metrics collection, recommendations, resizes) without reverting existing resizes. The operator sets `Ready=False` with `reason=Paused`. Modeled after Prometheus Operator and Flux `spec.suspend`.
- Webhook warnings for nonsensical config: 13 admission-time warnings detect ineffective settings (e.g., canary config in non-canary mode, SLO guardrails with VPA source, resize-only settings in Observe/Recommend mode)
- Runtime K8s events: 31 warning/event types (up from 3) for silent controller behaviors: `StaleRecommendation`, `CooldownActive`, `HPAConflict`, `VPAConflict`, `ConfigClamped`, `ExportFailed`, `ResizeSkipped`, `BudgetExhausted`, and more. All recurring events use 1-hour deduplication to prevent log spam.
- Warning suppression: `attune.io/suppress-warnings` annotation accepts a comma-separated list of event reasons to suppress (e.g., `HPAConflict,ConfigClamped`)
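
The warning-suppression annotation described above takes a comma-separated list of event reasons. A minimal sketch of how such a value could be parsed into a lookup set follows. The helper names `parseSuppressed` and `isSuppressed` are hypothetical, and trimming whitespace and skipping empty entries are assumptions of this sketch rather than confirmed operator behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// suppressAnnotation is the annotation key named in the changelog entry.
const suppressAnnotation = "attune.io/suppress-warnings"

// parseSuppressed splits the annotation value into a set of event reasons.
// Assumption: whitespace around entries is trimmed and empty entries are ignored.
func parseSuppressed(annotations map[string]string) map[string]struct{} {
	out := make(map[string]struct{})
	for _, reason := range strings.Split(annotations[suppressAnnotation], ",") {
		reason = strings.TrimSpace(reason)
		if reason != "" {
			out[reason] = struct{}{}
		}
	}
	return out
}

// isSuppressed reports whether an event with the given reason would be dropped.
func isSuppressed(annotations map[string]string, reason string) bool {
	_, ok := parseSuppressed(annotations)[reason]
	return ok
}

func main() {
	ann := map[string]string{suppressAnnotation: "HPAConflict, ConfigClamped"}
	fmt.Println(isSuppressed(ann, "HPAConflict"))
	fmt.Println(isSuppressed(ann, "StaleRecommendation"))
}
```

A set lookup keeps the check cheap when it runs on every emitted event; a missing annotation yields an empty set, so nothing is suppressed by default.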
- BREAKING: `safetyMargin` field renamed to `overhead` with percentage semantics. Old multiplier values must be converted: `(old - 1) * 100` (e.g., `safetyMargin: "1.2"` becomes `overhead: "20"`). Defaults changed from `"1.2"`/`"1.3"` to `"20"`/`"30"`. Validation bounds changed from `(0, 10.0]` to `[0, 900]`.
- BREAKING: `maxCpuChangePercent` and `maxMemoryChangePercent` moved from `updateStrategy` to `cpu`/`memory` as `maxChangePercent`. Groups all per-resource recommendation parameters in one place.
- BREAKING: `updateStrategy.mode` field renamed to `updateStrategy.type` to align with Kubernetes core conventions
- BREAKING: `bounds.min`/`bounds.max` renamed to `minAllowed`/`maxAllowed`, `InPlaceOrEvict` renamed to `InPlaceOrRecreate`, `excludeContainers` renamed to `excludedContainers`
- Shorter requeue interval during data collection phase for faster initial recommendation generation
- `canary.percentage` CRD minimum changed from 0 to 1 (a 0% canary is meaningless)
- `rateWindow` is inheritable via `AttuneDefaults` and `AttuneNamespaceDefaults`
- Deployment-owned ReplicaSets are filtered from target discovery to prevent double-resizing
- Reconcile predicate filters out self-triggered status and metadata updates, reducing kube-apiserver load by eliminating 2-3x reconcile amplification per cycle
- Recommendations no longer require live pods; historical Prometheus data is sufficient for recommend-only flows
- Secret-backed bearer tokens are refreshed on every reconcile instead of being cached until TTL expiry
- Collector cache identity uses hashed token values instead of plain presence markers
- Extracted `buildCollectorOptions` helper from the main `Reconcile` method
- Documentation now clarifies that `minimumDataPoints` counts Prometheus range-query samples, so wall-clock recommendation timing depends on `queryStep`
- Reserved Prometheus query parameters (`query`, `start`, `end`, `step`, `time`, `timeout`) are now rejected so operator-managed request keys cannot be overridden
- `golang.org/x/net` updated to v0.55.0 to fix GO-2026-5026 (Punycode validation vulnerability in `idna`)
- Trivy image scan CI failure on runners without BuildKit/buildx; the step now strips BuildKit-only Dockerfile directives and builds natively with the legacy builder
- `make docker-build` now sets `DOCKER_BUILDKIT=1` so the Dockerfile's `--platform=$BUILDPLATFORM` resolves on legacy Docker CLIs
- `kubectl attune explain` was missing `safetyObservationPeriod` merge from namespace/cluster defaults, showing wrong effective value
- `StaleRecommendationsTotal` metric label mismatch between registration and increment
- E2E test flakes: OOMKill timeout, GuaranteedQoS queryStep, ScaleUp timeout, Chainsaw poll intervals, rateWindow regression with short queryStep
- Status race condition where concurrent reconciles could reset `status.workloads.resized` to 0 after a successful resize; Resized count is now derived from resize history entries which survive optimistic concurrency conflicts
- `attune_throttle_deferred_total` metric now appears in the Grafana dashboard (was the only unvisualized operator metric)
- `AttuneNamespaceDefaults` CRD missing from `config/crd/kustomization.yaml`; kustomize deployments now include it
- Bearer-token cache prefix collision when one Prometheus address is a prefix of another
- `make test-local` now cleans up the k3d cluster even on mid-run failures
- Gitleaks PATH resolution on self-hosted runners
- `prometheus-unreachable` E2E test now accepts either `InsufficientData` or `PrometheusUnavailable` reason, fixing a flake where the first reconcile sets one reason and subsequent reconciles set another
- RevertPod now retries on 409 Conflict (matching ResizePod); previously a conflict during revert left the pod at unsafe resource levels until the next reconcile
- Datadog and CloudWatch collector caches now share the same TTL eviction, capacity bounds, and race-safe LoadOrStore as the Prometheus collector cache; previously they could leak memory and create duplicate collectors
- Startup boost expiry pre-check now includes memory values, preventing node allocatable safety check bypass when namespaces have memory LimitRange constraints
- Annotation cleanup in safety observation now retries on 409 Conflict (up to 3 attempts), matching the persistResizeAnnotations retry pattern
- Multi-container sequential resize: annotation persist now retries on 409 Conflict instead of reverting the second container
- Memory limit clamp for K8s v1.33: in-place memory limit decreases are skipped when the container's resize policy is `NotRequired`, preventing API server rejection
- Guaranteed QoS preservation with memory limit clamp: the clamp is applied before the QoS check so that Guaranteed pods are not incorrectly resized into Burstable
- `helm-unittest` download now uses dynamic OS/arch detection instead of hardcoded `linux-amd64`
- OOMKill E2E test: `RestartContainer` memory resize policy hides OOM evidence by overwriting `LastTerminationState` on resize-induced restarts; test now uses `NotRequired` policy
- Safety revert path now applies K8s v1.33 memory limit clamp (`ClampMemoryLimitForPolicy`), preventing revert failures when memory limits would decrease with `NotRequired` resize policy
- CI image builds switched from Docker/BuildKit to `ko`, eliminating Docker daemon dependency and containerd storage race conditions on macOS self-hosted runners
- k3d image import retry loops with pre-cleanup for macOS containerd storage flakes
- Confidence factor formula `(1+M/C)^E` produced a 4x multiplier at maximum confidence (7 days of data), inflating all recommendations well beyond the user's configured overhead. A workload with P95=200m and `overhead: "20"` converged to ~960m instead of the expected ~240m. Replaced with `1 + M*(1-C)^E` which gives factor=1.0 at full confidence and up to 1.8x at minimum confidence.
- `memoryFromCpuRatio` values above 10.0 (e.g., `"16.0"` for in-memory databases) were silently rejected by the shared `parseFloat64` parser, disabling the feature without any error or warning. The ratio now uses a dedicated parser with a 1000.0 ceiling.