Add retry logic to Azure Service Bus scaler #7339

Open
ramon-carrasco wants to merge 1 commit into kedacore:main from ramon-carrasco:add-servicebus-retry-logic

Conversation


@ramon-carrasco ramon-carrasco commented Dec 22, 2025

  • Add configurable maxRetries parameter (default: 0)
  • Implement exponential backoff (2s → 60s cap)
  • Add unit tests for metadata parsing
  • Update schema files
  • Respect context cancellation

Fixes #7338

Relates to kedacore/keda-docs#1677

Checklist

  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Ensure make generate-scalers-schema has been run to update any outdated generated files.
  • Changelog has been updated and is aligned with our changelog requirements
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)


Signed-off-by: Ramon Carrasco <ramon.carrasco@duckcreek.com>
@ramon-carrasco ramon-carrasco requested a review from a team as a code owner December 22, 2025 20:24
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team December 22, 2025 20:24

snyk-io bot commented Dec 22, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@JorTurFer
Member

Hello
Although I understand the problem, I don't think this will work, because the HPA will hit timeouts during long backoffs. The current code holds the HPA request open during the backoff, so I'm sure timeouts will happen somewhere. The scenario you're trying to solve can be "mitigated" by setting useCachedMetrics at trigger level. This makes the operator request and cache your metric, so the HPA will always see a value for it.

Could this solve your problem? If not, we need to think of something like storing the last valid value, returning it during the "failing cycle", and deferring the retry to the next HPA cycle. WDYT?

@ramon-carrasco
Author

Thanks for the feedback! You're absolutely right about the HPA timeout issue, I missed that.

While I am fairly sure the useCachedMetrics approach would reduce the number of errors we are seeing, we would still not be getting the real-time queue length we need to decide whether scaling is needed, which is what we are trying to solve.

I'm happy to try to redesign this with a non-blocking approach (return last known value to HPA + background retry).

@JorTurFer
Member

> we would still not be getting the real-time queue length we need to decide whether scaling is needed, which is what we are trying to solve

What about setting a polling interval of 15 seconds? With this interval you will match the HPA's own cycle, and the retry will happen automatically on the next HPA cycle. You could also reduce it to 10 seconds, which would make 6 requests per minute to the upstream, the same amount as today (2 from the operator plus another 4 from the HPA; it would become 6 from the operator and 0 from the HPA).

@ramon-carrasco
Author

> What about setting a polling interval of 15 seconds? With this interval you will match the HPA's own cycle, and the retry will happen automatically on the next HPA cycle. You could also reduce it to 10 seconds, which would make 6 requests per minute to the upstream, the same amount as today (2 from the operator plus another 4 from the HPA; it would become 6 from the operator and 0 from the HPA).

We are currently using the default polling interval in all ScaledObjects, so everything should be polling every 30 seconds. I have always seen that as a fairly long interval which shouldn't be the cause of the errors we are seeing with Service Bus, and I would expect that decreasing it to 15 seconds would make the error count worse. Is that assumption correct? During periods when many instances are reading from the Service Bus and getting thousands of errors, I am worried about what increasing the polling frequency would cause.

@JorTurFer
Member

> We are currently using the default polling interval in all ScaledObjects, so everything should be polling every 30 seconds. I have always seen that as a fairly long interval which shouldn't be the cause of the errors we are seeing with Service Bus, and I would expect that decreasing it to 15 seconds would make the error count worse. Is that assumption correct? During periods when many instances are reading from the Service Bus and getting thousands of errors, I am worried about what increasing the polling frequency would cause.

The point here is that you are already pulling metrics 6 times per minute with the default values. pollingInterval is used by the KEDA operator to pull metrics for scale-from/to-zero purposes, and its default is 30 seconds, so the operator makes 2 requests per minute; but the HPA controller checks the metric every 15 seconds, so it makes another 4 requests per minute.

When you enable cached metrics, the KEDA operator stores the metric value it requested every polling interval and hands it to the HPA controller whenever it requests the metric. So if you reduce the polling interval from 30s to 15s and enable cached metrics, you will make 4 requests per minute from the operator and 0 from the metrics server, so you are actually reducing requests to the upstream (because the HPA controller queries every 15 seconds in any case).
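Concretely, the suggested setup might look like this sketch. pollingInterval and the per-trigger useCachedMetrics flag are real ScaledObject fields; the resource names and trigger metadata values are illustrative only:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject        # illustrative name
spec:
  scaleTargetRef:
    name: example-deployment        # illustrative name
  pollingInterval: 15               # match the HPA's 15s cycle
  triggers:
    - type: azure-servicebus
      useCachedMetrics: true        # operator polls & caches; HPA reads the cache
      metadata:
        queueName: example-queue    # illustrative
        namespace: example-sb-ns    # illustrative
```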

@ramon-carrasco
Author

> The point here is that you are already pulling metrics 6 times per minute with the default values. pollingInterval is used by the KEDA operator to pull metrics for scale-from/to-zero purposes, and its default is 30 seconds, so the operator makes 2 requests per minute; but the HPA controller checks the metric every 15 seconds, so it makes another 4 requests per minute.
>
> When you enable cached metrics, the KEDA operator stores the metric value it requested every polling interval and hands it to the HPA controller whenever it requests the metric. So if you reduce the polling interval from 30s to 15s and enable cached metrics, you will make 4 requests per minute from the operator and 0 from the metrics server, so you are actually reducing requests to the upstream (because the HPA controller queries every 15 seconds in any case).

Thank you so much for taking the time to explain this in detail. I see your point now and I think it's actually great advice. I'll give it a try and monitor the behavior.

@JorTurFer
Member

> Thank you so much for taking the time to explain this in detail. I see your point now and I think it's actually great advice. I'll give it a try and monitor the behavior.

Nice! Let us know how it goes 😄 Can we close this PR in the meantime? You can re-open it if needed.

@rickbrouwer rickbrouwer added the waiting-author-response All PR's or Issues where we are waiting for a response from the author label Jan 26, 2026


Development

Successfully merging this pull request may close these issues.

Azure Service Bus Scaler: Add retry logic for transient API failures
