Add retry logic to Azure Service Bus scaler #7339

Open
ramon-carrasco wants to merge 1 commit into kedacore:main from ramon-carrasco:add-servicebus-retry-logic

Conversation


@ramon-carrasco ramon-carrasco commented Dec 22, 2025

  • Add configurable maxRetries parameter (default: 0)
  • Implement exponential backoff (2s → 60s cap)
  • Add unit tests for metadata parsing
  • Update schema files
  • Respect context cancellation

Fixes #7338

Relates to kedacore/keda-docs#1677

Checklist

  • I have verified that my change is according to the deprecations & breaking changes policy
  • Tests have been added
  • Ensure make generate-scalers-schema has been run to update any outdated generated files.
  • Changelog has been updated and is aligned with our changelog requirements
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)


Signed-off-by: Ramon Carrasco <ramon.carrasco@duckcreek.com>
@ramon-carrasco ramon-carrasco requested a review from a team as a code owner December 22, 2025 20:24
@github-actions

Thank you for your contribution! 🙏

Please understand that we will do our best to review your PR and give you feedback as soon as possible, but please bear with us if it takes a little longer than expected.

While you are waiting, make sure to:

  • Add an entry in our changelog in alphabetical order and link related issue
  • Update the documentation, if needed
  • Add unit & e2e tests for your changes
  • GitHub checks are passing
  • Is the DCO check failing? Here is how you can fix DCO issues

Once the initial tests are successful, a KEDA member will ensure that the e2e tests are run. Once the e2e tests have been successfully completed, the PR may be merged at a later date. Please be patient.

Learn more about our contribution guide.

@keda-automation keda-automation requested a review from a team December 22, 2025 20:24

snyk-io bot commented Dec 22, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Passed | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

@JorTurFer
Member

Hello
Although I understand the problem, I don't think this will work, because the HPA will hit timeouts during long backoffs. The current code holds the HPA request open during the backoff, so I'm sure timeouts will happen somewhere. The scenario you're trying to solve can be "mitigated" by setting useCachedMetrics at trigger level. This makes the operator request and cache your metric, so the HPA will always see a value for it.

Could this solve your problem? If not, we need to think of something like storing the last valid value, returning it during the "failing cycle", and deferring the retry to the next HPA cycle. WDYT?

@ramon-carrasco
Author

Thanks for the feedback! You're absolutely right about the HPA timeout issue, I missed that.

While I am fairly sure the useCachedMetrics approach would reduce the number of errors we are seeing, we would still not be getting the real-time queue length we need to decide whether scaling is needed, which is what we are trying to solve.

I'm happy to try to redesign this with a non-blocking approach (return last known value to HPA + background retry).

@JorTurFer
Member

> we would still not be getting the real-time queue length we need to decide whether scaling is needed, which is what we are trying to solve

What about setting a polling interval of 15 seconds? With this interval you will match the HPA's own cycle, and the retry will happen automatically on the next HPA cycle. You could also reduce it to 10 seconds, which would make 6 requests per minute to the upstream, the same amount as today (2 from the operator plus another 4 from the HPA; it would become 6 from the operator and 0 from the HPA).

@ramon-carrasco
Author

> What about setting a polling interval of 15 seconds? With this interval you will match the HPA's own cycle, and the retry will happen automatically on the next HPA cycle. You could also reduce it to 10 seconds, which would make 6 requests per minute to the upstream, the same amount as today (2 from the operator plus another 4 from the HPA; it would become 6 from the operator and 0 from the HPA).

We are currently using the default polling interval in all ScaledObjects, so everything should be polling every 30 seconds. I have always seen that as a fairly long interval which shouldn't be the cause of the errors we are seeing with Service Bus, and I would expect that decreasing it to 15 seconds would make the error count worse. Is that assumption correct? During periods when many instances are reading from the Service Bus and getting thousands of errors, I am worried about what increasing the polling frequency would cause.

@JorTurFer
Member

> We are currently using the default polling interval in all ScaledObjects, so everything should be polling every 30 seconds. I have always seen that as a fairly long interval which shouldn't be the cause of the errors we are seeing with Service Bus, and I would expect that decreasing it to 15 seconds would make the error count worse. Is that assumption correct? During periods when many instances are reading from the Service Bus and getting thousands of errors, I am worried about what increasing the polling frequency would cause.

The point here is that you are already pulling metrics 6 times per minute with the default values. pollingInterval is used by the KEDA operator to pull metrics for scale-from/to-zero purposes, and its default is 30 seconds, so the operator makes 2 requests per minute; but the HPA controller checks the metric every 15 seconds, so it makes another 4 requests per minute.

When you enable cached metrics, the KEDA operator stores the metric value it requested every polling interval and hands it to the HPA controller whenever it requests the metric. So if you reduce the polling interval from 30s to 15s and enable cached metrics, you will make 4 requests per minute from the operator and 0 from the metrics server, so you are actually reducing requests to the upstream (because the HPA controller queries every 15 seconds in any case).
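Concretely, the suggested setup might look like this sketch. pollingInterval and the per-trigger useCachedMetrics flag are real ScaledObject fields; the resource names and trigger metadata values are illustrative only:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject        # illustrative name
spec:
  scaleTargetRef:
    name: example-deployment        # illustrative name
  pollingInterval: 15               # match the HPA's 15s cycle
  triggers:
    - type: azure-servicebus
      useCachedMetrics: true        # operator polls & caches; HPA reads the cache
      metadata:
        queueName: example-queue    # illustrative
        namespace: example-sb-ns    # illustrative
```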

@ramon-carrasco
Author

> The point here is that you are already pulling metrics 6 times per minute with the default values. pollingInterval is used by the KEDA operator to pull metrics for scale-from/to-zero purposes, and its default is 30 seconds, so the operator makes 2 requests per minute; but the HPA controller checks the metric every 15 seconds, so it makes another 4 requests per minute.
>
> When you enable cached metrics, the KEDA operator stores the metric value it requested every polling interval and hands it to the HPA controller whenever it requests the metric. So if you reduce the polling interval from 30s to 15s and enable cached metrics, you will make 4 requests per minute from the operator and 0 from the metrics server, so you are actually reducing requests to the upstream (because the HPA controller queries every 15 seconds in any case).

Thank you so much for taking the time to explain this in detail. I see your point now and I think it's actually great advice. I'll give it a try and monitor the behavior.

@JorTurFer
Member

> Thank you so much for taking the time to explain this in detail. I see your point now and I think it's actually great advice. I'll give it a try and monitor the behavior.

Nice! Let us know how it goes 😄 Can we close this PR in the meantime? You can re-open it if needed.

@rickbrouwer rickbrouwer added the waiting-author-response All PR's or Issues where we are waiting for a response from the author label Jan 26, 2026


Development

Successfully merging this pull request may close these issues.

Azure Service Bus Scaler: Add retry logic for transient API failures
