diff --git a/keps/sig-autoscaling/5053-hpa-fallback/README.md b/keps/sig-autoscaling/5053-hpa-fallback/README.md new file mode 100644 index 00000000000..16b18c1b603 --- /dev/null +++ b/keps/sig-autoscaling/5053-hpa-fallback/README.md @@ -0,0 +1,893 @@ + +# KEP-5053: Fallback for HPA on failure to retrieve metrics + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +The [Horizontal Pod Autoscaler (HPA)][] relies on the controller manager to +fetch metrics from either the resource metrics API (for per-pod resource metrics) +or the custom metrics API (for other types of metrics). When these APIs experience +downtime, the HPA becomes unable to make scaling decisions, potentially leaving +workloads unmanaged. 
+
+This proposal introduces a new configuration parameter for the HPA, enabling
+users to define behavior in the event of metric retrieval failures. For example,
+users can opt to scale the target resource to the maximum number of replicas
+specified in the HPA, ensuring safer operation during metrics unavailability.
+
+[Horizontal Pod Autoscaler (HPA)]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
+
+## Motivation
+
+The Horizontal Pod Autoscaler (HPA) is a critical component for scaling Kubernetes
+workloads based on resource utilization or custom metrics. However, the current
+implementation depends entirely on the availability of the resource metrics API
+or custom metrics API to make scaling decisions. If these APIs experience
+downtime or degradation, the HPA cannot take any scaling actions, leaving
+workloads potentially overprovisioned, underprovisioned, or entirely unmanaged.
+
+In contrast, other autoscalers like [KEDA][] already provide mechanisms to define
+fallback strategies in the event of metric retrieval failures. These strategies
+mitigate the impact of API unavailability, enabling the autoscaler to maintain
+a functional scaling strategy even when metrics are temporarily inaccessible.
+
+By allowing users to configure fallback behavior in HPA, this proposal aims to
+reduce the criticality of the metrics APIs and improve the overall robustness
+of the autoscaling system. This change allows users to define safe scaling
+actions, such as scaling to a predefined maximum or holding the current scale
+(the current behavior), ensuring workloads remain operational and better aligned
+with user-defined requirements during unexpected disruptions.
+
+Additionally, the community has expressed interest in addressing this
+limitation in the past.
 ([#109214][])
+
+[KEDA]: https://keda.sh/docs/2.15/reference/scaledobject-spec/#fallback
+[#109214]: https://github.com/kubernetes/kubernetes/issues/109214
+
+### Goals
+
+- Allow users to optionally define the number of replicas to scale to in the case of metric retrieval failure.
+
+### Non-Goals
+
+- N/A
+
+## Proposal
+
+Heavily inspired by [KEDA][], we propose to add a new field to the existing [`HorizontalPodAutoscalerBehavior`][HorizontalPodAutoscalerBehavior] object:
+
+- `fallback`: an optional new object containing the following fields:
+  - `failureThreshold`: (integer) the number of consecutive failures fetching metrics required to trigger the fallback behavior. Must be greater than 0. This field is optional and defaults to 3 if not specified.
+  - `replicas`: (integer) the number of replicas to scale to in case of fallback. Must be greater than 0; this field is mandatory.
+
+To allow tracking of failures to fetch metrics, a new field should be added to the existing [`HorizontalPodAutoscalerStatus`][HorizontalPodAutoscalerStatus] object:
+
+- `consecutiveMetricRetrievalFailureCount`: (integer) tracks the number of consecutive failures in retrieving metrics.
+
+When the `behavior` field on the [`HorizontalPodAutoscalerSpec`][HorizontalPodAutoscalerSpec] or the `fallback` field in the [`HorizontalPodAutoscalerBehavior`][HorizontalPodAutoscalerBehavior]
+is not specified, the current behavior is preserved: no scaling operations will occur in the event of a metrics retrieval failure.
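+To illustrate, a minimal manifest sketch using the proposed field (the `web` Deployment name and metric values are hypothetical, and the exact API shape is subject to review):
+
+```yaml
+apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata:
+  name: web-hpa
+spec:
+  scaleTargetRef:
+    apiVersion: apps/v1
+    kind: Deployment
+    name: web
+  minReplicas: 2
+  maxReplicas: 10
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target:
+          type: Utilization
+          averageUtilization: 60
+  behavior:
+    fallback:
+      failureThreshold: 3  # optional; defaults to 3
+      replicas: 10         # mandatory; scale here once the threshold is reached
+```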
+
+[KEDA]: https://keda.sh/docs/2.15/reference/scaledobject-spec/#fallback
+[HorizontalPodAutoscalerBehavior]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#horizontalpodautoscalerbehavior-v2-autoscaling
+[HorizontalPodAutoscalerStatus]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#horizontalpodautoscalerstatus-v2-autoscaling
+[HorizontalPodAutoscalerSpec]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.31/#horizontalpodautoscalerspec-v2-autoscaling
+
+### Risks and Mitigations
+
+There should be minimal risk introduced by the proposed changes:
+
+- The new field is optional, and its absence results in no changes to the current autoscaling behavior.
+- If a change to the new field results in undesirable behavior, the change can be reverted by deploying the previous version of the HPA resource, or by removing the `fallback` field entirely.
+
+## Design Details
+
+The `HorizontalPodAutoscaler` API is updated to have a new object `HPAFallback`:
+
+```golang
+type HPAFallback struct {
+	// failureThreshold is the number of consecutive failures fetching metrics
+	// required to trigger the fallback behavior. Defaults to 3 if not specified.
+	// +optional
+	FailureThreshold *int32
+
+	// replicas is the number of replicas to scale to in case of fallback.
+	Replicas int32
+}
+```
+
+The `HorizontalPodAutoscaler` API is updated to add a new `fallback` field to the `HorizontalPodAutoscalerBehavior` object:
+
+```golang
+type HorizontalPodAutoscalerBehavior struct {
+	// fallback specifies the number of replicas to scale the object to during a
+	// fallback state and defines the threshold of errors required to enter the
+	// fallback state.
+	// +optional
+	Fallback *HPAFallback
+
+	// Existing fields.
+ ScaleUp *HPAScalingRules + ScaleDown *HPAScalingRules +} +``` + +The `HorizontalPodAutoscaler` API is updated to have a new description of the `behavior` field on the `HorizontalPodAutoscalerSpec` object: + +```golang +type HorizontalPodAutoscalerSpec struct { + // behavior configures the scaling behavior of the target, including + // scale-up and scale-down policies, as well as fallback behavior in case + // of metric retrieval failures. If not set, the default HPAScalingRules + // are used for scaling decisions, and no scaling operation will occur + // when metrics retrieval fails. + // +optional + Behavior *HorizontalPodAutoscalerBehavior + + // Existing fields. + ScaleTargetRef CrossVersionObjectReference + MinReplicas *int32 + MaxReplicas int32 + Metrics []MetricSpec +} +``` + +The `HorizontalPodAutoscaler` API is updated to add a new `fallback` field to the `HorizontalPodAutoscalerStatus` object: + +```golang +type HorizontalPodAutoscalerStatus struct { + // consecutiveMetricRetrievalFailureCount tracks the number of consecutive failures in retrieving metrics. + //+optional + ConsecutiveMetricRetrievalFailureCount int32 + + // Existing fields. + ObservedGeneration *int64 + LastScaleTime *metav1.Time + CurrentReplicas int32 + DesiredReplicas int32 + CurrentMetrics []MetricStatus + Conditions []HorizontalPodAutoscalerCondition +} +``` +The `HorizontalPodAutoscaler` API is updated to introduce a new FallbackActive condition to the `HorizontalPodAutoscalerConditionType`: + +```golang +const ( + // FallbackActive indicates that the HPA has entered the fallback state due to repeated + // metric retrieval failures and is applying the configured fallback behavior. 
FallbackActive HorizontalPodAutoscalerConditionType = "FallbackActive"
+
+	// Existing conditions
+	ScalingActive HorizontalPodAutoscalerConditionType = "ScalingActive"
+	AbleToScale HorizontalPodAutoscalerConditionType = "AbleToScale"
+	ScalingLimited HorizontalPodAutoscalerConditionType = "ScalingLimited"
+)
+```
+
+The new `fallback` field will be used in the autoscaling controller
+[horizontal.go][]. The current logic is:
+
+```golang
+if err != nil && metricDesiredReplicas == -1 {
+	a.setCurrentReplicasAndMetricsInStatus(hpa, currentReplicas, metricStatuses)
+	if err := a.updateStatusIfNeeded(ctx, hpaStatusOriginal, hpa); err != nil {
+		utilruntime.HandleError(err)
+	}
+	a.eventRecorder.Event(hpa, v1.EventTypeWarning, "FailedComputeMetricsReplicas", err.Error())
+	return fmt.Errorf("failed to compute desired number of replicas based on listed metrics for %s: %v", reference, err)
+}
+```
+
+It will be replaced by:
+
+```golang
+if err != nil && metricDesiredReplicas == -1 {
+	a.increaseConsecutiveMetricRetrievalFailureCount(hpa)
+	a.eventRecorder.Event(hpa, v1.EventTypeWarning, "FailedComputeMetricsReplicas", err.Error())
+
+	var inFallback bool
+
+	if hpa.Spec.Behavior != nil && hpa.Spec.Behavior.Fallback != nil {
+		fallback := hpa.Spec.Behavior.Fallback
+
+		// Default value when failureThreshold is not specified.
+		failureThreshold := int32(3)
+		if fallback.FailureThreshold != nil {
+			failureThreshold = *fallback.FailureThreshold
+		}
+
+		if hpa.Status.ConsecutiveMetricRetrievalFailureCount >= failureThreshold {
+			inFallback = true
+			metricDesiredReplicas = fallback.Replicas
+			a.eventRecorder.Event(hpa, v1.EventTypeWarning, "FallbackThresholdReached", err.Error())
+			setCondition(hpa, autoscalingv2.FallbackActive, v1.ConditionTrue, "FallbackThresholdReached", "%s", err.Error())
+		} else {
+			setCondition(hpa, autoscalingv2.FallbackActive, v1.ConditionFalse, "FallbackThresholdNotReached", "Threshold is set to %d failures. Current failure count is %d", failureThreshold, hpa.Status.ConsecutiveMetricRetrievalFailureCount)
+			inFallback = false
+		}
+	} else {
+		setCondition(hpa, autoscalingv2.FallbackActive, v1.ConditionFalse, "NoFallbackDefined", "No fallback behavior is defined")
+		inFallback = false
+	}
+
+	if !inFallback {
+		a.setCurrentReplicasAndMetricsInStatus(hpa, currentReplicas, metricStatuses)
+		if err := a.updateStatusIfNeeded(ctx, hpaStatusOriginal, hpa); err != nil {
+			utilruntime.HandleError(err)
+		}
+		return fmt.Errorf("failed to compute desired number of replicas based on listed metrics for %s: %v", reference, err)
+	}
+}
+// On success, reset the consecutive-failure counter and clear the fallback condition.
+a.resetConsecutiveMetricRetrievalFailureCount(hpa)
+setCondition(hpa, autoscalingv2.FallbackActive, v1.ConditionFalse, "SucceededToComputeDesiredReplicas", "the HPA controller was able to compute the desired replicas")
+```
+
+[horizontal.go]: https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/podautoscaler/horizontal.go
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
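+The fallback decision reduces to a small pure function that unit tests can target directly. A sketch of the intended semantics (the helper name is illustrative, not actual controller code): fallback is entered once the count of consecutive metric-retrieval failures reaches the threshold, which defaults to 3 when `failureThreshold` is unset.
+
+```go
+package main
+
+import "fmt"
+
+// shouldFallback reports whether the fallback state should be entered:
+// true once the number of consecutive metric-retrieval failures reaches
+// the configured threshold (a nil threshold means the default of 3).
+func shouldFallback(failureThreshold *int32, consecutiveFailures int32) bool {
+	threshold := int32(3) // default when failureThreshold is unset
+	if failureThreshold != nil {
+		threshold = *failureThreshold
+	}
+	return consecutiveFailures >= threshold
+}
+
+func main() {
+	fmt.Println(shouldFallback(nil, 2)) // below the default threshold
+	fmt.Println(shouldFallback(nil, 3)) // reaches the default threshold
+	custom := int32(5)
+	fmt.Println(shouldFallback(&custom, 4)) // below a custom threshold
+}
+```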
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- ``: `` - ``
+
+##### Integration tests
+
+- :
+
+##### e2e tests
+
+Will add the following [e2e autoscaling tests]:
+
+- Metric retrieval failures reaching the configured threshold scale the resource to the configured number of replicas
+- Success in retrieving metrics resets the `ConsecutiveMetricRetrievalFailureCount` in the `HorizontalPodAutoscalerStatus`
+- When `fallback` is not set, the resource does not scale when metric retrieval fails
+
+[e2e autoscaling tests]: https://github.com/kubernetes/kubernetes/tree/master/test/e2e/autoscaling
+
+### Graduation Criteria
+
+#### Alpha
+
+- Feature implemented behind an `HPAFallback` feature flag
+- Initial e2e tests completed and enabled
+
+### Upgrade / Downgrade Strategy
+
+When the feature flag is enabled, the `kube-controller-manager` should begin
+counting consecutive failures starting from 0. If the feature flag is disabled,
+the status should always report `ConsecutiveMetricRetrievalFailureCount` as 0.
+
+All logic related to metric retrieval failure and `ConsecutiveMetricRetrievalFailureCount`
+evaluation must be gated by the same feature flag. This means that if the feature
+flag is rolled back, any ongoing metrics retrieval failures will not affect scaling
+behavior, and the resource will continue with the same scale as it did prior to
+the feature being disabled.
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: HPAFallback
+  - Components depending on the feature gate: `kube-apiserver`, `kube-controller-manager`
+
+###### Does enabling the feature change any default behavior?
+
+No.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+Yes.
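+For example, rollback uses the standard feature-gate mechanism (flag shown for illustration; the component restart procedure depends on how the control plane is managed):
+
+```shell
+# Restart the controller manager with the gate disabled to roll back.
+kube-controller-manager --feature-gates=HPAFallback=false
+```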
+ +###### What happens if we reenable the feature if it was previously rolled back? + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+ + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-autoscaling/5053-hpa-fallback/kep.yaml b/keps/sig-autoscaling/5053-hpa-fallback/kep.yaml new file mode 100644 index 00000000000..fce1b70e160 --- /dev/null +++ b/keps/sig-autoscaling/5053-hpa-fallback/kep.yaml @@ -0,0 +1,38 @@ +title: Fallback for HPA on failure to retrieve metrics +kep-number: 5053 +authors: + - "@be0x74a" +owning-sig: sig-autoscaling +status: provisional +creation-date: 2025-01-20 +reviewers: + - TBD +approvers: + - TBD + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: TBD + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: TBD + beta: TBD + stable: TBD + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: HPAFallback + components: + - kube-apiserver + - kube-controller-manager +disable-supported: true + +# The following PRR answers are required at beta release +#metrics: +# - my_feature_metric