
Fallback overrides HPA's internal triggers #6659

Open
@wolf-cosmose

Report

I have a ScaledObject configured with a CPU trigger and a Prometheus trigger, plus a fallback configuration in case Prometheus goes down.

I assumed that if the desired replica count based on CPU utilization is higher than fallback.replicas, it would take precedence over the fallback.

However, when I simulated a Prometheus failure by changing the Prometheus URL to an invalid one, the Deployment was scaled down to exactly fallback.replicas, despite CPU usage indicating a higher desired replica count.

(Everything described above also happened during an actual Prometheus outage, without modifying the URL, but I used the URL-modification approach to reproduce it and obtain the details below.)

When that happens, the HPA's desired replica count is higher than fallback.replicas and matches what I'd expect from CPU-based scaling.

However, the following can be seen in keda-operator's logs:

2025-03-17T17:04:41Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "kube-system", "scaledObject.Name": "haproxy-internal-haproxy-ingress", "scaler": "prometheusScaler", "error": "Get \"http://prometheus-operato-prometheus.monitoring:9090//api/v1/query?query=%28%0A++vector%283%29%0A%29&time=2025-03-17T17:04:41Z\": dial tcp: lookup prometheus-operato-prometheus.monitoring on 172.20.0.10:53: no such host"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:780
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:633
2025-03-17T17:04:41Z    INFO    scaleexecutor    Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas    {"scaledobject.Name": "haproxy-internal-haproxy-ingress", "scaledObject.Namespace": "kube-system", "scaleTarget.Name": "haproxy-internal-haproxy-ingress", "Original Replicas Count": 6, "New Replicas Count": 4}

And kubectl describe on the Deployment being scaled shows the following events:

Normal  ScalingReplicaSet  16m (x4 over 20m)    deployment-controller  Scaled up replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 7 from 4
Normal  ScalingReplicaSet  16m (x5 over 20m)    deployment-controller  Scaled down replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 4 from 7
Normal  ScalingReplicaSet  15m (x8 over 21m)    deployment-controller  Scaled up replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 6 from 4
Normal  ScalingReplicaSet  2m7s (x30 over 21m)  deployment-controller  Scaled down replica set haproxy-internal-haproxy-ingress-7dd6f99979 to 4 from 6

It seems like keda-operator is overriding the deployment.spec.replicas value set by the HPA, forcing it to fallback.replicas.

ScaledObject YAML
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: [...]
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
    scalingModifiers: {}
  cooldownPeriod: 300
  fallback:
    failureThreshold: 20
    replicas: 4
  maxReplicaCount: 10
  minReplicaCount: 2
  pollingInterval: 30
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: haproxy-internal-haproxy-ingress
  triggers:
  - metadata:
      value: "10"
    metricType: Utilization
    type: cpu
  - metadata:
      query: |
        (
          vector(3)
        )
      serverAddress: http://prometheus-operator-prometheus.monitoring:9090/
      threshold: "1"
    type: prometheus

Expected Behavior

In case of a Prometheus failure, the Deployment is scaled based on max(cpu, fallback).
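
To make that concrete (a rough sketch of the behaviour I expected, not KEDA code; the numbers match my setup above, where CPU-based scaling wanted 6 replicas and fallback.replicas is 4):

// Illustrative sketch only, not KEDA source: the decision I expected
// when one scaler is failing and a fallback is configured.
package main

import "fmt"

func main() {
    desiredFromCPU := int32(6)   // what the healthy CPU trigger asks for
    fallbackReplicas := int32(4) // spec.fallback.replicas

    // Expected: the fallback only stands in for the failing Prometheus
    // trigger, so the final count is the max of the two.
    expected := desiredFromCPU
    if fallbackReplicas > expected {
        expected = fallbackReplicas
    }
    fmt.Println("expected replicas:", expected) // 6

    // Actual (v2.15.1): spec.replicas is patched straight to 4 instead.
}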

Actual Behavior

In case of a Prometheus failure, the Deployment is scaled down to fallback.replicas, even if scaling purely based on CPU usage would result in more replicas.

Steps to Reproduce the Problem

  1. Create a ScaledObject with a CPU trigger and a Prometheus trigger, scaling a Deployment
  2. Set a fallback with a replica count higher than what the Prometheus trigger alone would produce
  3. Generate load, so that the CPU trigger scales above fallback.replicas
  4. Change the Prometheus URL in the ScaledObject to an invalid one to simulate a Prometheus failure
  5. Observe the replica count of the Deployment being scaled

Logs from KEDA operator

2025-03-17T17:04:41Z    ERROR    scale_handler    error getting scale decision    {"scaledObject.Namespace": "kube-system", "scaledObject.Name": "haproxy-internal-haproxy-ingress", "scaler": "prometheusScaler", "error": "Get \"http://prometheus-operato-prometheus.monitoring:9090//api/v1/query?query=%28%0A++vector%283%29%0A%29&time=2025-03-17T17:04:41Z\": dial tcp: lookup prometheus-operato-prometheus.monitoring on 172.20.0.10:53: no such host"}
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScalerState
    /workspace/pkg/scaling/scale_handler.go:780
github.com/kedacore/keda/v2/pkg/scaling.(*scaleHandler).getScaledObjectState.func1
    /workspace/pkg/scaling/scale_handler.go:633
2025-03-17T17:04:41Z    INFO    scaleexecutor    Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas    {"scaledobject.Name": "haproxy-internal-haproxy-ingress", "scaledObject.Namespace": "kube-system", "scaleTarget.Name": "haproxy-internal-haproxy-ingress", "Original Replicas Count": 6, "New Replicas Count": 4}

KEDA Version

2.15.1

Kubernetes Version

1.30

Platform

Amazon Web Services

Scaler Details

cpu, prometheus

Anything else?

This behaviour is consistent with this code in v2.15.1:

func (e *scaleExecutor) doFallbackScaling(ctx context.Context, scaledObject *kedav1alpha1.ScaledObject, currentScale *autoscalingv1.Scale, logger logr.Logger, currentReplicas int32) {
    _, err := e.updateScaleOnScaleTarget(ctx, scaledObject, currentScale, scaledObject.Spec.Fallback.Replicas)
    if err == nil {
        logger.Info("Successfully set ScaleTarget replicas count to ScaledObject fallback.replicas",
            "Original Replicas Count", currentReplicas,
            "New Replicas Count", scaledObject.Spec.Fallback.Replicas)
        // [...]

That code was removed in #6520 (914163c) in order to fix #6053, but the change seems to fix this issue as well:

I cherry-picked that commit on top of v2.15.1 and the resulting keda-operator no longer exhibits the problem I described.
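
For what it's worth, my understanding of why removing doFallbackScaling helps (an assumption about the post-#6520 behaviour on my part, not something I've verified in the code): if the fallback is surfaced to the HPA as a metric value instead of a direct spec.replicas patch, the HPA evaluates each metric independently and scales to the highest result, so a healthy CPU trigger can still win. A minimal sketch of that HPA-side arithmetic (illustrative only, not KEDA or kube-controller-manager code):

package main

import (
    "fmt"
    "math"
)

// Standard HPA formula for a single metric:
// desired = ceil(currentReplicas * currentMetricValue / targetValue)
func hpaDesired(currentReplicas int32, current, target float64) int32 {
    return int32(math.Ceil(float64(currentReplicas) * current / target))
}

func main() {
    currentReplicas := int32(6)

    // CPU trigger: e.g. 13% average utilization against the 10% target.
    fromCPU := hpaDesired(currentReplicas, 13, 10)

    // Failing Prometheus trigger: assume KEDA reports a synthetic metric
    // value chosen so that this metric alone yields fallback.replicas (4).
    fromFallback := int32(4)

    // The HPA scales to the highest per-metric result, so CPU still wins.
    desired := max(fromCPU, fromFallback)
    fmt.Printf("cpu=%d fallback=%d -> desired=%d\n", fromCPU, fromFallback, desired)
}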

However, I'm still filing this bug because I think it's more serious than the issue described in #6053, so others might want to know that it exists, which commit fixed it, and which release that commit will eventually end up in.

Also, I don't know what KEDA's release schedule is, but a point release of v2.15 or v2.16 with a backport of 914163c would be nice.
