Conversation

@drolando-stripe drolando-stripe commented Sep 2, 2025

We'd like a way to monitor the number of KEDA errors caused by empty responses from Prometheus after enabling the ignoreNullValues flag for most of our Prometheus triggers.

Right now this error gets logged, but the error metric that KEDA emits is generic and doesn't differentiate by error type.

Tests

I tried adding an e2e test, but the e2e tests don't actually run Prometheus, so all queries fail with `dial tcp: lookup keda-prometheus.keda.svc.cluster.local on 10.96.0.10:53: no such host`. The only two tests in tests/sequential/prometheus_metrics/prometheus_metrics_test.go that use a prometheus trigger only look for errors and don't actually run the query.

I could add a test that shows that the metric exists and is zero, but that doesn't seem very useful.

Checklist

  • When introducing a new scaler, I agree with the scaling governance policy N/A
  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • A PR is opened to update our Helm chart (repo) (if applicable, ie. when deployment manifests are modified) N/A
  • A PR is opened to update the documentation on (repo) (if applicable) N/A
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #7062

github-actions bot commented Sep 2, 2025

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link the related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • Make sure GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested review from a team September 2, 2025 22:24
@drolando-stripe drolando-stripe force-pushed the drolando/add_empty_response_metric branch 2 times, most recently from 3f967e9 to 4a13d88 on September 3, 2025 18:56
@drolando-stripe drolando-stripe force-pushed the drolando/add_empty_response_metric branch from 7cd8675 to 6f1d422 on September 25, 2025 22:13
@drolando-stripe drolando-stripe force-pushed the drolando/add_empty_response_metric branch from 6f1d422 to d642098 on September 25, 2025 22:15
@drolando-stripe drolando-stripe marked this pull request as ready for review September 25, 2025 22:15
Contributor

@aliaqel-stripe aliaqel-stripe left a comment


Looks good to me. You should open a PR in the keda-docs repository to update the documentation to add this metric. My only question is whether the maintainers would like this metric to be more generic, but like you said, the prometheus scaler is the only scaler with isNullOrEmpty as a configurable field.

@wozniakjan I know you're in charge of cutting this release. Wanted to put this on your radar.

@JorTurFer
Member

JorTurFer commented Sep 29, 2025

Hello,
Personally, I'd not merge this as-is, since it exposes scaler-specific info as a KEDA metric in a way that only fits Prometheus.
I'd prefer to do it in a generic way, e.g. adding a new metric for empty upstream responses. With this, we can cover Prometheus but also other scalers with support for empty responses.

tbh, I think this PR is a great starting point, and just updating it a bit we can make it generic

@rickbrouwer
Member

> Personally, I'd not merge this as-is, since it exposes scaler-specific info as a KEDA metric in a way that only fits Prometheus. I'd prefer to do it in a generic way, e.g. adding a new metric for empty upstream responses. With this, we can cover Prometheus but also other scalers with support for empty responses.
>
> tbh, I think this PR is a great starting point, and just updating it a bit we can make it generic

Agree

```go
prometheus.CounterOpts{
	Namespace: DefaultPromMetricsNamespace,
	Subsystem: "prometheus",
	Name:      "metrics_empty_error_total",
```
Contributor


@rickbrouwer @JorTurFer Thoughts on naming this `empty_upstream_responses_total`?

Member


`upstream` is quite generic; what about `scaler`/`trigger` instead?

@drolando-stripe drolando-stripe force-pushed the drolando/add_empty_response_metric branch from a30f2bc to 3ad2203 Compare September 29, 2025 17:37
Signed-off-by: Daniele Rolando <[email protected]>
@zroubalik zroubalik mentioned this pull request Sep 30, 2025
Co-authored-by: Jorge Turrado Ferrero <[email protected]>
Signed-off-by: drolando-stripe <[email protected]>
@keda-automation keda-automation requested a review from a team October 3, 2025 19:09
Co-authored-by: Jorge Turrado Ferrero <[email protected]>
Signed-off-by: drolando-stripe <[email protected]>
Member

@JorTurFer JorTurFer left a comment


Looking nice! Only one small thing inline about the implementation.
Could you include this new metric during prometheus/otel e2e test?

Comment on lines +212 to +216
```go
func RecordEmptyUpstreamResponse() {
	for _, element := range collectors {
		element.RecordEmptyUpstreamResponse()
	}
}
```
Member


Let's enrich this metric with information about the ScaledJob|ScaledObject and trigger that recorded the empty response. You can see how it's done with other metrics like RecordCloudEventEmittedError.
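To illustrate the labeling the reviewer is asking for, here is a minimal, self-contained sketch of a per-trigger counter of empty responses. It uses a plain map rather than KEDA's actual Prometheus collector plumbing, and all names (`emptyResponseCounter`, the label set of namespace/ScaledObject/trigger) are illustrative assumptions, not KEDA's real API:

```go
package main

import (
	"fmt"
	"sync"
)

// emptyResponseCounter is a hypothetical stand-in for a labeled Prometheus
// counter: each (namespace, scaledObject, trigger) tuple gets its own series,
// so operators can see which scaler is returning empty results.
type emptyResponseCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newEmptyResponseCounter() *emptyResponseCounter {
	return &emptyResponseCounter{counts: make(map[string]int)}
}

// RecordEmptyUpstreamResponse increments the series for one labeled tuple.
func (c *emptyResponseCounter) RecordEmptyUpstreamResponse(namespace, scaledObject, trigger string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[namespace+"/"+scaledObject+"/"+trigger]++
}

// Value reads back the current count for one labeled tuple.
func (c *emptyResponseCounter) Value(namespace, scaledObject, trigger string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[namespace+"/"+scaledObject+"/"+trigger]
}

func main() {
	c := newEmptyResponseCounter()
	c.RecordEmptyUpstreamResponse("default", "my-scaledobject", "prometheus")
	c.RecordEmptyUpstreamResponse("default", "my-scaledobject", "prometheus")
	fmt.Println(c.Value("default", "my-scaledobject", "prometheus"))
}
```

In the real implementation the equivalent would be a `prometheus.CounterVec` with those three label names, incremented via `WithLabelValues`.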

Author


I thought about it. The problem is that the ExecutePromQuery function doesn't have access to the trigger or ScaledObject name, afaict.

Member


The ScaledObject name and the trigger name are passed to the scaler during the build process as part of the scalersconfig.ScalerConfig struct, so you will need to take them and store them in the scaler to have them available for the metric. I think the info is quite valuable for knowing which scaler is failing.
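A rough sketch of that suggestion, with simplified, hypothetical struct and field names (the real struct is KEDA's scalersconfig.ScalerConfig; only the pattern of capturing the names at construction time is what matters here):

```go
package main

import "fmt"

// scalerConfig is a stripped-down stand-in for scalersconfig.ScalerConfig,
// which carries the ScaledObject/ScaledJob and trigger identity at build time.
type scalerConfig struct {
	ScalableObjectName string
	TriggerName        string
	Namespace          string
}

// prometheusScaler stores the identifying names so that later error paths
// (e.g. an empty query result) can label the emitted metric.
type prometheusScaler struct {
	scaledObjectName string
	triggerName      string
	namespace        string
}

func newPrometheusScaler(cfg scalerConfig) *prometheusScaler {
	return &prometheusScaler{
		scaledObjectName: cfg.ScalableObjectName,
		triggerName:      cfg.TriggerName,
		namespace:        cfg.Namespace,
	}
}

// labels returns the tuple an empty-response metric would be recorded under.
func (s *prometheusScaler) labels() (namespace, scaledObject, trigger string) {
	return s.namespace, s.scaledObjectName, s.triggerName
}

func main() {
	s := newPrometheusScaler(scalerConfig{
		ScalableObjectName: "my-scaledobject",
		TriggerName:        "trigger-0",
		Namespace:          "default",
	})
	ns, so, tr := s.labels()
	fmt.Println(ns, so, tr)
}
```

With the names stored on the scaler, the empty-response path inside the query function no longer needs access to the ScaledObject itself.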

@drolando-stripe
Author

> Could you include this new metric during prometheus/otel e2e test?

@JorTurFer I left a note about this in the PR description. The end-to-end tests don't actually run Prometheus, so the scaler query always fails with `dial tcp: lookup keda-prometheus.keda.svc.cluster.local on 10.96.0.10:53: no such host`, and since KEDA runs in a separate container from the tests, I cannot mock it to return an empty value.

@zroubalik
Member

> Could you include this new metric during prometheus/otel e2e test?
>
> @JorTurFer I left a note about this in the PR description. The end-to-end tests don't actually run Prometheus, so the scaler query always fails with `dial tcp: lookup keda-prometheus.keda.svc.cluster.local on 10.96.0.10:53: no such host`, and since KEDA runs in a separate container from the tests, I cannot mock it to return an empty value.

@drolando-stripe What do you mean by that, please? The existing prom e2e tests use this helper to set up Prometheus: https://github.com/kedacore/keda/blob/main/tests/scalers/prometheus/prometheus_helper.go

@SpiritZhou
Contributor

I see there are other upstream error responses from this scaler. Is an empty response enough for tracking purposes?


Development

Successfully merging this pull request may close these issues.

Emit a metric tracking the number of empty responses from prometheus
