Description
Report
I have a ScaledObject configured with a CPU trigger and a Prometheus trigger, plus a fallback configuration in case Prometheus goes down.
I assumed that if the desired replica count based on CPU utilization is higher than the fallback, it would take precedence over the fallback.
However, when I simulated a Prometheus failure by changing the Prometheus URL to an invalid one, it turned out that the Deployment was scaled down to exactly fallback.replicas, despite the CPU usage indicating a higher desired replica count.
(Everything described above also happened during an actual Prometheus outage, without modifying the URL; I used the URL-modification approach to reproduce it and gather the details below.)
When that happens, the HPA's desired replica count is higher than the fallback, equal to what I'd expect based on CPU scaling.
However, the following can be seen in keda-operator's logs:
2025-03-17T17:04:41Z ERROR scale_handler error getting scale decision {"scaledObject.Namespace": "kube-system", "scaledObject.Name": "haproxy-internal-haproxy-ingress", "scaler": "prometheusScaler", "error": "Get \"http://prometheus-operato-prometheus.monitoring:9090//api/v1/query?query=%28%0A++vector%283%29%0A%29&time=2025-03-17T17:04:41Z\": dial tcp: lookup prometheus-operato-prometheus.monitoring on 172.20.0.10:53: no such host"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
/workspace/pkg/scaling/scale_handler.go:780
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
/workspace/pkg/scaling/scale_handler.go:633
2025-03-17T17:04:41Z INFO scaleexecutor Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas {"scaledobject.Name": "haproxy-internal-haproxy-ingress", "scaledObject.Namespace": "kube-system", "scaleTarget.Name": "haproxy-internal-haproxy-ingress", "Original Replicas Count": 6, "New Replicas Count": 4}
And kubectl describe on the Deployment being scaled shows the following events:
Normal ScalingReplicaSet 16m (x4 over 20m) deployment-controller Scaled up replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 7 from 4
Normal ScalingReplicaSet 16m (x5 over 20m) deployment-controller Scaled down replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 4 from 7
Normal ScalingReplicaSet 15m (x8 over 21m) deployment-controller Scaled up replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 6 from 4
Normal ScalingReplicaSet 2m7s (x30 over 21m) deployment-controller Scaled down replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 4 from 6
It seems like keda-operator is overriding the deployment.spec.replicas value set by the HPA and resetting it to fallback.replicas.
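To make the suspected behaviour concrete, here is a minimal sketch of what the operator appears to be doing (a guess with made-up names, not KEDA's actual code):

package sketch

// Observed behaviour: once the Prometheus scaler has failed more than
// fallback.failureThreshold times, the operator keeps resetting the
// Deployment to fallback.replicas (4 here), while the HPA, still driven by
// the healthy CPU trigger, keeps scaling it back up to 6-7.
func observedTargetReplicas(hpaDesired, fallbackReplicas int32, fallbackActive bool) int32 {
    if fallbackActive {
        return fallbackReplicas // wins unconditionally, even when hpaDesired is higher
    }
    return hpaDesired
}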
ScaledObject yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: [...]
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
    scalingModifiers: {}
  cooldownPeriod: 300
  fallback:
    failureThreshold: 20
    replicas: 4
  maxReplicaCount: 10
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: haproxy-internal-haproxy-ingress
  triggers:
  - metadata:
      value: "10"
    metricType: Utilization
    type: cpu
  - metadata:
      query: |
        (
          vector(3)
        )
      serverAddress: http://prometheus-operator-prometheus.monitoring:9090/
    threshold: "1"
    type: prometheus
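For context (based on my understanding of how the HPA evaluates KEDA's external metrics, so treat the exact formula as an assumption): with this query the Prometheus trigger reports a metric value of 3 against a threshold of 1, so on its own it would drive roughly ceil(3 / 1) = 3 replicas, below fallback.replicas = 4, while the CPU trigger was driving 6-7 replicas at the time of the events above.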
Expected Behavior
In case of a Prometheus failure, the Deployment is scaled based on max(cpu, fallback), i.e. the fallback acts as a floor that healthy triggers can still scale above.
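To spell that out as a sketch (illustrative only, not a proposal for the exact implementation):

package sketch

// Expected semantics: the fallback acts as a lower bound while a scaler is
// failing, so healthy triggers (CPU in this case) can still scale above it.
func expectedTargetReplicas(healthyDesired, fallbackReplicas int32, fallbackActive bool) int32 {
    if fallbackActive && fallbackReplicas > healthyDesired {
        return fallbackReplicas
    }
    return healthyDesired
}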
Actual Behavior
In case of a Prometheus failure, the Deployment is scaled down to fallback.replicas even if scaling purely based on CPU usage would result in more replicas.
Steps to Reproduce the Problem
- Create a ScaledObject with a CPU trigger and a Prometheus trigger, scaling a Deployment
- Set a fallback with a replica count higher than would result from the Prometheus trigger
- Generate load, so that the CPU trigger scales above fallback.replicas
- Change the Prometheus URL in the ScaledObject to an invalid one to simulate a Prometheus failure
- Observe the replica count of the Deployment being scaled (see the note after this list)
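One way to observe this (the exact commands are my suggestion, not part of the original setup): watch the Deployment with kubectl get deployment haproxy-internal-haproxy-ingress -n kube-system -w while checking the HPA that KEDA creates (named keda-hpa-<scaledobject-name> by default) with kubectl describe hpa; the HPA's desired replica count stays at the CPU-driven value while the Deployment keeps being reset to fallback.replicas.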
Logs from KEDA operator
2025-03-17T17:04:41Z ERROR scale_handler error getting scale decision {"scaledObject.Namespace": "kube-system", "scaledObject.Name": "haproxy-internal-haproxy-ingress", "scaler": "prometheusScaler", "error": "Get \"http://prometheus-operato-prometheus.monitoring:9090//api/v1/query?query=%28%0A++vector%283%29%0A%29&time=2025-03-17T17:04:41Z\": dial tcp: lookup prometheus-operato-prometheus.monitoring on 172.20.0.10:53: no such host"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
/workspace/pkg/scaling/scale_handler.go:780
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
/workspace/pkg/scaling/scale_handler.go:633
2025-03-17T17:04:41Z INFO scaleexecutor Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas {"scaledobject.Name": "haproxy-internal-haproxy-ingress", "scaledObject.Namespace": "kube-system", "scaleTarget.Name": "haproxy-internal-haproxy-ingress", "Original Replicas Count": 6, "New Replicas Count": 4}
KEDA Version
2.15.1
Kubernetes Version
1.30
Platform
Amazon Web Services
Scaler Details
cpu, prometheus
Anything else?
This behaviour is consistent with this code in v2.15.1:
keda/pkg/scaling/executor/scale_scaledobjects.go, lines 233 to 238 at commit 09a4951.
That code was removed in #6520 (commit 914163c) in order to fix #6053, but it seems to fix this issue as well:
I cherry-picked that commit on top of v2.15.1 and the resulting keda-operator no longer exhibits the problem I described.
However, I thought I'd still file this bug because I think it's more serious than the issue described in #6053, so others might want to know that it exists, which commit fixed it, and which release that commit will eventually end up in.
Also, I don't know KEDA's release schedule, but a point release of v2.15 or v2.16 with a backport of 914163c would be nice.