fix: ScalerCache gets the lock before operate the scalers #6739
Conversation
Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>
/run-e2e rabbit |
I think this fix deserves a mention in the Changelog as well, if only for traceability 🙂 Wdyt? |
I'll add it later on, yeah you're right |
Yes, I think it is. I can imagine the following:
This causes the panic (with "index out of range" error) |
Signed-off-by: Jorge Turrado <[email protected]>
/run-e2e rabbit |
good find! It's for sure going to make the code more robust. Regarding #6739 - I'm optimistic, it should imho also fix it, the panic stack trace aligns with the changes |
/run-e2e |
Pull Request Overview
This PR fixes a potential race condition in ScalersCache by acquiring the lock before operating on the scalers, ensuring that the slice is not modified concurrently.
- Updated lock acquisition in getScalerBuilder and GetPushScalers
- Adjusted the changelog entry to document the fix
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
pkg/scaling/cache/scalers_cache.go | Moved the read lock to the beginning of functions to prevent race conditions |
CHANGELOG.md | Updated changelog entry to document the fix for scaler locking |
Comments suppressed due to low confidence (1)
pkg/scaling/cache/scalers_cache.go:73
- [nitpick] Consider improving the error message text for clarity. For example, you might rephrase "scaler with id %d not found. Len = %d" to "Scaler with id %d not found; available scalers: %d" to better convey the issue.
if index < 0 || index >= len(c.Scalers) {
/run-e2e |
/run-e2e |
Should refreshScaler also be updated? As I look at it now, this method could possibly also get a race condition when the read lock in getScalerBuilder is released before acquiring the write lock in refreshScaler. During this window, another thread could call Close(), setting c.Scalers to nil and causing "scaler with id X not found. Len = 0" errors. But I may be wrong... Proposal:
|
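As a rough illustration of the window described in the comment above, here is a minimal Go sketch. It is not the snippet from the proposal and not KEDA's actual code: the `cache` and `scaler` types, the `generation` field, and the `refresh` and `closeCache` names are all hypothetical stand-ins. The idea it shows is to hold the write lock across the lookup and the replacement, so that no `Close()` can run between the two steps.

```go
package main

import (
	"fmt"
	"sync"
)

// scaler is a hypothetical stand-in for a cached scaler.
type scaler struct{ generation int }

// cache is a simplified, hypothetical model of a scalers cache.
type cache struct {
	mu      sync.RWMutex
	scalers []scaler
}

// refresh looks up and replaces a scaler while holding the write lock
// for the whole operation. Because the lock is never released between
// the bounds check and the assignment, closeCache cannot empty the
// slice in the middle of a refresh.
func (c *cache) refresh(index int) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if index < 0 || index >= len(c.scalers) {
		return fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.scalers))
	}
	old := c.scalers[index]
	c.scalers[index] = scaler{generation: old.generation + 1}
	return nil
}

// closeCache models a Close() that drops all scalers under the write lock.
func (c *cache) closeCache() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scalers = nil
}

func main() {
	c := &cache{scalers: []scaler{{generation: 1}}}

	// Refreshing an existing scaler succeeds.
	fmt.Println("refresh before close:", c.refresh(0))

	// After the cache is closed, refresh reports an error instead of
	// indexing into a nil slice.
	c.closeCache()
	fmt.Println("refresh after close:", c.refresh(0))
}
```

The trade-off in this sketch is that building the replacement happens under the write lock, which blocks readers for longer; whether that is acceptable depends on how expensive the real rebuild is.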
you have a very good point @rickbrouwer, the usage of mutexes in the |
Just FYI, I would like to enable Go's built-in race detector (#6760). It's not perfect, and there appear to be a large number of race conditions in KEDA currently, so by default it shouldn't block any PR from merging, but it might serve us well in the long run. It even reported the race condition resulting in #6725; the good news is that it seems to be happy with the fixes presented in this PR. |
I think that it can help, but I also think that it's not strictly necessary, as it just replaces one scaler with a new one. In any case, I'm updating the PR to include it, just in case. Good point! |
/run-e2e rabbit |
Signed-off-by: Jorge Turrado <[email protected]>
/run-e2e rabbit |
Done! Is it okay to merge, @wozniakjan? |
nice! also looks like the |
/run-e2e rabbit |
I've just rebased my branch just to ensure that |
) Signed-off-by: Jan Wozniak <[email protected]>
* fix: Admission Webhook blocks ScaledObject without metricType with fallback (#6702) * fix: Admission Webhook blocks ScaledObject without metricType with fallback Signed-off-by: rickbrouwer <[email protected]> * Add unit test Signed-off-by: Rick Brouwer <[email protected]> * Add e2e test Signed-off-by: rickbrouwer <[email protected]> * Add more unit tests for scaledobject_types Signed-off-by: Rick Brouwer <[email protected]> * Update changelog Signed-off-by: Rick Brouwer <[email protected]> * Update Signed-off-by: Rick Brouwer <[email protected]> --------- Signed-off-by: rickbrouwer <[email protected]> Signed-off-by: Rick Brouwer <[email protected]> Co-authored-by: Zbynek Roubalik <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * fix: AWS SQS Queue queueURLFromEnv not working (#6713) Signed-off-by: rickbrouwer <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * fix: Temporal scaler with API Key (#6707) Signed-off-by: Rick Brouwer <[email protected]> Signed-off-by: rickbrouwer <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * fix: add default Operation in Azure Service Bus scaler (#6731) Signed-off-by: Rick Brouwer <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * fix: ScalerCache gets the lock before operate the scalers (#6739) Signed-off-by: Jan Wozniak <[email protected]> * fix: Use pinned version for nginx image (#6737) * fix: Use pinned version for nginx image Signed-off-by: Jorge Turrado <[email protected]> * . 
Signed-off-by: Jorge Turrado <[email protected]> * fix panic in gcp scaler Signed-off-by: Jorge Turrado <[email protected]> --------- Signed-off-by: Jorge Turrado <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * Selenium Grid: Update metric name generated without part of empty (#6772) * Selenium Grid: Update metric name generated without part of empty Signed-off-by: Viet Nguyen Duc <[email protected]> * Update CHANGELOG with the PR Signed-off-by: Viet Nguyen Duc <[email protected]> --------- Signed-off-by: Viet Nguyen Duc <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> * chore: changelog and issue template v2.17.1 Signed-off-by: Jan Wozniak <[email protected]> --------- Signed-off-by: rickbrouwer <[email protected]> Signed-off-by: Rick Brouwer <[email protected]> Signed-off-by: Jan Wozniak <[email protected]> Signed-off-by: Jorge Turrado <[email protected]> Signed-off-by: Viet Nguyen Duc <[email protected]> Co-authored-by: rickbrouwer <[email protected]> Co-authored-by: Zbynek Roubalik <[email protected]> Co-authored-by: Jorge Turrado Ferrero <[email protected]> Co-authored-by: Viet Nguyen Duc <[email protected]>
With the current implementation, a race condition can happen because we check the size of the scalers slice before acquiring the lock. Hypothetically, the value of `c.Scalers` can change between the line where it is checked and the line where it is accessed. If `c.Scalers` is cleared in between, the index access panics because it is out of range.

Checking the code and considering the original problem in the issue, my hypothesis is that the scaler fails during the HPA metric request. The request calls GetScaledObjectMetrics and there, after the scaler failure, the whole ScaledObject is refreshed, triggering the only code that changes the value of `c.Scalers` from a valid slice to `nil` (the cache's Close()).

If an operator loop starts just after the metric request, the refresh action can be triggered on the cache item that is in the process of being revoked. The race condition arises because the length of `c.Scalers` has been verified without any lock (so in theory the index exists), and execution then waits until the write lock is released (by which time `c.Scalers` is already `nil`). As soon as Close() finishes, the lock is released and the code in the refresh function continues, trying to access an index that existed during the check but no longer exists at the time of access, because the read lock was not taken before checking `c.Scalers`.
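The ordering problem described above can be sketched in Go as follows. This is a simplified model under stated assumptions, not the code from `pkg/scaling/cache/scalers_cache.go`: the `cache` and `scaler` types and the `getUnsafe`, `get`, and `closeCache` names are hypothetical. `getUnsafe` shows the check-then-lock ordering that allows the panic, and `get` shows the lock-then-check ordering this PR moves to.

```go
package main

import (
	"fmt"
	"sync"
)

// scaler is a hypothetical stand-in for a cached scaler.
type scaler struct{ name string }

// cache is a simplified, hypothetical model of a scalers cache.
type cache struct {
	mu      sync.RWMutex
	scalers []scaler
}

// getUnsafe mirrors the racy ordering: the bounds check runs before the
// read lock is taken, so closeCache can empty the slice between the
// check and the access. It is shown for contrast and is not called
// concurrently in this program.
func (c *cache) getUnsafe(index int) (scaler, error) {
	if index < 0 || index >= len(c.scalers) { // unsynchronized read
		return scaler{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.scalers))
	}
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.scalers[index], nil // can panic if the slice was cleared after the check
}

// get takes the read lock first, so the bounds check and the access
// observe the same slice.
func (c *cache) get(index int) (scaler, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if index < 0 || index >= len(c.scalers) {
		return scaler{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.scalers))
	}
	return c.scalers[index], nil
}

// closeCache models a Close() that sets the slice to nil under the write lock.
func (c *cache) closeCache() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.scalers = nil
}

func main() {
	c := &cache{scalers: []scaler{{name: "rabbitmq"}}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		c.closeCache()
	}()
	go func() {
		defer wg.Done()
		// Depending on scheduling, this returns either the scaler or a
		// "not found" error, but it does not index into a cleared slice.
		if s, err := c.get(0); err != nil {
			fmt.Println("error:", err)
		} else {
			fmt.Println("got:", s.name)
		}
	}()
	wg.Wait()
}
```

Which of the two outcomes is printed depends on goroutine scheduling; the point is only that with the lock taken first, the caller gets an error rather than an "index out of range" panic.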
Checklist
Fixes #6725