fix: ScalerCache gets the lock before operate the scalers #6739


Merged
merged 6 commits into kedacore:main on May 13, 2025

Conversation

JorTurFer
Member

@JorTurFer JorTurFer commented Apr 27, 2025

With the current implementation, a race condition can happen because we check the size of c.Scalers before acquiring the lock, so hypothetically the value of c.Scalers can change between the line where it's checked and the line where it's accessed. This allows the case where c.Scalers is cleared and then the index is accessed, generating a panic from accessing an index beyond the slice length.

Checking the code and considering the original problem in the issue, my hypothesis is that the scaler fails during the HPA metric request. That request calls GetScaledObjectMetrics and there, after the scaler failure, the whole ScaledObject is refreshed, triggering the only code that changes the value of c.Scalers from a valid slice to nil (the cache's Close()).

If an operator loop starts just after the metric request, it can happen that the refresh action is triggered on a cache item that is in the process of being revoked. The race arises because the length of c.Scalers is verified without any lock (so in theory the index exists), and then execution waits for the write lock to be released; by the time it is, c.Scalers is already nil. As soon as the Close() code finishes, the lock is released and the code inside the refresh function continues, trying to access an index that existed during the check but not during the access, because the read lock was not taken before checking c.Scalers.
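
For illustration, a minimal self-contained sketch of the guarded lookup this PR moves towards; the type and field names follow the discussion above, but the types are reduced stand-ins, not the actual KEDA code:

package cache

import (
    "fmt"
    "sync"
)

// Reduced stand-ins for the real types; only what the locking pattern needs.
type ScalerBuilder struct{}

type ScalersCache struct {
    Scalers []ScalerBuilder
    mutex   sync.RWMutex
}

// getScalerBuilder sketch: the read lock is taken before the bounds check, so
// a concurrent Close() can no longer set c.Scalers to nil between the check
// and the access.
func (c *ScalersCache) getScalerBuilder(index int) (ScalerBuilder, error) {
    c.mutex.RLock()
    defer c.mutex.RUnlock()

    if index < 0 || index >= len(c.Scalers) {
        return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    return c.Scalers[index], nil
}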

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6725

@JorTurFer JorTurFer requested a review from a team as a code owner April 27, 2025 17:48
Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Apr 27, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer JorTurFer changed the title from "fix: ScalerCache holds the lock before operate the scalers" to "fix: ScalerCache gets the lock before operate the scalers" on Apr 27, 2025
@JorTurFer JorTurFer enabled auto-merge (squash) April 27, 2025 18:54
@rickbrouwer
Member

I think this fix deserves a mention in the Changelog as well, if only for traceability 🙂 Wdyt?

@JorTurFer JorTurFer disabled auto-merge April 28, 2025 07:25
@JorTurFer
Member Author

I'll add it later on, yeah you're right
Do you think that this could be the fix for the panic?

@rickbrouwer
Member

I'll add it later on, yeah you're right. Do you think that this could be the fix for the panic?

Yes, I think it is. I can imagine the following:

  1. Thread A checks if an index is valid in c.Scalers (without holding a lock)
  2. Before accessing the slice, Thread B acquires a write lock and sets c.Scalers = nil
  3. Thread A then acquires a read lock and tries to access c.Scalers[index], but c.Scalers is now nil

This causes the panic (with "index out of range" error)
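
A toy, single-goroutine simulation of that interleaving (purely illustrative, made-up names, not KEDA code) shows how the check-then-lock ordering ends in the panic:

package main

import (
    "fmt"
    "sync"
)

// The two "threads" below run sequentially here; in reality they run
// concurrently, which is exactly what makes this interleaving possible.
func main() {
    var mu sync.RWMutex
    scalers := []string{"rabbitmq"}
    index := 0

    // Thread A: checks the index without holding any lock.
    if index < len(scalers) {
        // Thread B: acquires the write lock and clears the slice (cache Close()).
        mu.Lock()
        scalers = nil
        mu.Unlock()

        // Thread A: now takes the read lock and accesses the slice it already checked.
        mu.RLock()
        defer mu.RUnlock()
        fmt.Println(scalers[index]) // panics: index out of range [0] with length 0
    }
}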

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Apr 28, 2025

/run-e2e rabbit
Update: You can check the progress here

@wozniakjan
Member

Do you think that this could be the fix for the panic?

good find! It's for sure going to make the code more robust. Regarding #6739 - I'm optimistic, it should imho also fix it, the panic stack trace aligns with the changes

@zroubalik
Member

zroubalik commented May 2, 2025

/run-e2e
Update: You can check the progress here

@zroubalik zroubalik requested a review from Copilot May 2, 2025 13:23
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a potential race condition in ScalersCache by acquiring the lock before operating on the scalers, ensuring that the slice is not modified concurrently.

  • Updated lock acquisition in getScalerBuilder and GetPushScalers
  • Adjusted the changelog entry to document the fix

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • pkg/scaling/cache/scalers_cache.go: Moved the read lock to the beginning of functions to prevent race conditions
  • CHANGELOG.md: Updated changelog entry to document the fix for scaler locking
Comments suppressed due to low confidence (1)

pkg/scaling/cache/scalers_cache.go:73

  • [nitpick] Consider improving the error message text for clarity. For example, you might rephrase "scaler with id %d not found. Len = %d" to "Scaler with id %d not found; available scalers: %d" to better convey the issue.
if index < 0 || index >= len(c.Scalers) {

@JorTurFer
Member Author

JorTurFer commented May 4, 2025

/run-e2e
Update: You can check the progress here

@zroubalik
Member

zroubalik commented May 5, 2025

/run-e2e
Update: You can check the progress here

@rickbrouwer
Member

rickbrouwer commented May 5, 2025

Should refreshScaler also be updated? As I look at it now, this method could possibly also get a race condition when the read lock in getScalerBuilder is released before acquiring the write lock in refreshScaler. During this window, another thread could call Close(), setting c.Scalers to nil and causing "scaler with id X not found. Len = 0" errors. But I may be wrong...

Proposal:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    
    if index < 0 || index >= len(c.Scalers) {
        return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    
    oldSb := c.Scalers[index]
    
    newScaler, sConfig, err := oldSb.Factory()
    if err != nil {
        return nil, err
    }
    
    c.Scalers[index] = ScalerBuilder{
        Scaler:       newScaler,
        ScalerConfig: *sConfig,
        Factory:      oldSb.Factory,
    }
    
    oldSb.Scaler.Close(ctx)
    
    return newScaler, nil
}

@wozniakjan
Member

Should refreshScaler also be updated?

you have a very good point @rickbrouwer, the usage of mutexes in the ScalersCache doesn't seem to be correct for the refreshScaler method either. Your proposed change essentially inlines the getScalerBuilder implementation, but with proper mutex usage, taking the write lock for the whole operation, which in this case is imho desired. Wdyt @JorTurFer?

@wozniakjan
Member

wozniakjan commented May 6, 2025

just fyi, I would like to enable Go's built-in race detector (#6760). It's not perfect, and there appears to be a large number of race conditions in KEDA currently, so by default it shouldn't block any PR from merging, but this might serve us well in the long run.

It even reported the race condition resulting in #6725; the good news is that it seems to be happy with the fixes presented in this PR.
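
For illustration, a minimal stand-alone test (hypothetical, not from the KEDA tree) showing the kind of unsynchronized access that go test -race ./... reports as a data race:

package cache_test

import (
    "sync"
    "testing"
)

// One goroutine clears a slice while another reads its length without any
// lock, mirroring the Close()/length-check interaction described in this PR.
func TestUnsynchronizedSliceAccess(t *testing.T) {
    scalers := []int{1, 2, 3}

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        scalers = nil // writer: analogous to Close() setting c.Scalers to nil
    }()
    go func() {
        defer wg.Done()
        _ = len(scalers) // reader: analogous to the unguarded length check
    }()
    wg.Wait()
}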

@wozniakjan wozniakjan mentioned this pull request May 6, 2025
@JorTurFer
Member Author

JorTurFer commented May 12, 2025

Should refreshScaler also be updated? As I look at it now, this method could possibly also get a race condition when the read lock in getScalerBuilder is released before acquiring the write lock in refreshScaler. During this window, another thread could call Close(), setting c.Scalers to nil and causing "scaler with id X not found. Len = 0" errors. But I may be wrong...

Proposal:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    
    if index < 0 || index >= len(c.Scalers) {
        return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    
    oldSb := c.Scalers[index]
    
    newScaler, sConfig, err := oldSb.Factory()
    if err != nil {
        return nil, err
    }
    
    c.Scalers[index] = ScalerBuilder{
        Scaler:       newScaler,
        ScalerConfig: *sConfig,
        Factory:      oldSb.Factory,
    }
    
    oldSb.Scaler.Close(ctx)
    
    return newScaler, nil
}

I think that it can help, but I also think that it's not totally necessary, as it just replaces one scaler with a new one. In any case, I'm updating the PR to include it, just in case. Good point!
I'm wondering if we should get rid of getScalerBuilder entirely, as it uses a read mutex inside but it can return a scaler that is closed immediately after being returned. There are some tricky conditions, so maybe we can start with this as suggested and iterate if needed.
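
A minimal sketch of that tricky condition, using made-up stand-in types rather than KEDA's internals: even if the lookup itself locks correctly, the returned scaler can still be closed by another caller once the lock is released.

package main

import (
    "fmt"
    "sync"
)

type scaler struct{ closed bool }

type cache struct {
    mu      sync.RWMutex
    scalers []*scaler
}

// get returns a scaler under the read lock, but the caller keeps using it
// after the lock has been released.
func (c *cache) get(i int) *scaler {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.scalers[i]
}

// closeAll mimics the cache Close(): it invalidates every scaler and clears
// the slice under the write lock.
func (c *cache) closeAll() {
    c.mu.Lock()
    defer c.mu.Unlock()
    for _, s := range c.scalers {
        s.closed = true
    }
    c.scalers = nil
}

func main() {
    c := &cache{scalers: []*scaler{{}}}

    s := c.get(0) // read lock already released here
    c.closeAll()  // in the real code this could run concurrently from another goroutine

    fmt.Println("scaler already closed:", s.closed) // prints: true
}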

@JorTurFer
Member Author

JorTurFer commented May 12, 2025

/run-e2e rabbit
Update: You can check the progress here

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented May 12, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer
Member Author

Done! Is it okay to merge, @wozniakjan?

@wozniakjan
Member

nice! also looks like the -race detector now has no complaints, so we can think about enabling it by default

@JorTurFer
Member Author

JorTurFer commented May 13, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer
Member Author

JorTurFer commented May 13, 2025

I've just rebased my branch to ensure that -race is executed. After all the checks pass, I'll merge it.

@JorTurFer JorTurFer enabled auto-merge (squash) May 13, 2025 06:56
@JorTurFer JorTurFer merged commit a8c0126 into kedacore:main May 13, 2025
18 checks passed
@JorTurFer JorTurFer deleted the fix-panic branch May 13, 2025 07:22
wozniakjan pushed a commit to wozniakjan/keda that referenced this pull request May 15, 2025
wozniakjan pushed a commit to wozniakjan/keda that referenced this pull request May 15, 2025
wozniakjan added a commit that referenced this pull request May 15, 2025
* fix: Admission Webhook blocks ScaledObject without metricType with fallback (#6702)

* fix: Admission Webhook blocks ScaledObject without metricType with fallback

Signed-off-by: rickbrouwer <[email protected]>

* Add unit test

Signed-off-by: Rick Brouwer <[email protected]>

* Add e2e test

Signed-off-by: rickbrouwer <[email protected]>

* Add more unit tests for scaledobject_types

Signed-off-by: Rick Brouwer <[email protected]>

* Update changelog

Signed-off-by: Rick Brouwer <[email protected]>

* Update

Signed-off-by: Rick Brouwer <[email protected]>

---------

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Rick Brouwer <[email protected]>
Co-authored-by: Zbynek Roubalik <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: AWS SQS Queue queueURLFromEnv not working (#6713)

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: Temporal scaler with API Key (#6707)

Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: add default Operation in Azure Service Bus scaler (#6731)

Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: ScalerCache gets the lock before operate the scalers (#6739)

Signed-off-by: Jan Wozniak <[email protected]>

* fix: Use pinned version for nginx image (#6737)

* fix: Use pinned version for nginx image

Signed-off-by: Jorge Turrado <[email protected]>

* .

Signed-off-by: Jorge Turrado <[email protected]>

* fix panic in gcp scaler

Signed-off-by: Jorge Turrado <[email protected]>

---------

Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* Selenium Grid: Update metric name generated without part of empty (#6772)

* Selenium Grid: Update metric name generated without part of empty

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update CHANGELOG with the PR

Signed-off-by: Viet Nguyen Duc <[email protected]>

---------

Signed-off-by: Viet Nguyen Duc <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* chore: changelog and issue template v2.17.1

Signed-off-by: Jan Wozniak <[email protected]>

---------

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Viet Nguyen Duc <[email protected]>
Co-authored-by: rickbrouwer <[email protected]>
Co-authored-by: Zbynek Roubalik <[email protected]>
Co-authored-by: Jorge Turrado Ferrero <[email protected]>
Co-authored-by: Viet Nguyen Duc <[email protected]>

Successfully merging this pull request may close these issues.

A slow ScaledObject's SQL query results in keda-operator in CrashLoopBackOff