fix: ScalerCache gets the lock before operate the scalers #6739


Merged
merged 6 commits into kedacore:main on May 13, 2025

Conversation

JorTurFer
Member

@JorTurFer JorTurFer commented Apr 27, 2025

With the current implementation, a race condition can happen because we check the size of c.Scalers before acquiring the lock, so hypothetically the value of c.Scalers can change between the line where it's checked and the line where it's accessed. This allows the case where c.Scalers is cleared and then the index is accessed, generating a panic from accessing an index beyond the slice length.

Checking the code and considering the original problem in the issue, my hypothesis is that the scaler fails during the HPA metric request. That request calls GetScaledObjectMetrics and there, after the scaler failure, the whole ScaledObject is refreshed, triggering the only code that changes the value of c.Scalers from a valid slice to nil (the cache's Close()).

If an operator loop starts just after the metric request, it can happen that the refresh action is triggered on a cache item that is in the process of being revoked. The race arises because the length of c.Scalers is verified without any lock (so in theory the index exists), and then execution waits for the write lock to be released; by the time it is, c.Scalers is already nil. As soon as the Close() code finishes, the lock is released and the code inside the refresh function continues, trying to access an index that existed during the check but not during the access, because the read lock was not taken before checking c.Scalers.
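
For illustration, a minimal self-contained sketch of the guarded lookup this PR moves towards; the type and field names follow the discussion above, but the types are reduced stand-ins, not the actual KEDA code:

package cache

import (
    "fmt"
    "sync"
)

// Reduced stand-ins for the real types; only what the locking pattern needs.
type ScalerBuilder struct{}

type ScalersCache struct {
    Scalers []ScalerBuilder
    mutex   sync.RWMutex
}

// getScalerBuilder sketch: the read lock is taken before the bounds check, so
// a concurrent Close() can no longer set c.Scalers to nil between the check
// and the access.
func (c *ScalersCache) getScalerBuilder(index int) (ScalerBuilder, error) {
    c.mutex.RLock()
    defer c.mutex.RUnlock()

    if index < 0 || index >= len(c.Scalers) {
        return ScalerBuilder{}, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    return c.Scalers[index], nil
}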

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6725

@JorTurFer JorTurFer requested a review from a team as a code owner April 27, 2025 17:48
Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Apr 27, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer JorTurFer changed the title from "fix: ScalerCache holds the lock before operate the scalers" to "fix: ScalerCache gets the lock before operate the scalers" on Apr 27, 2025
@JorTurFer JorTurFer enabled auto-merge (squash) April 27, 2025 18:54
@rickbrouwer
Member

I think this fix deserves a mention in the Changelog as well, if only for traceability 🙂 Wdyt?

@JorTurFer JorTurFer disabled auto-merge April 28, 2025 07:25
@JorTurFer
Member Author

I'll add it later on, yeah you're right
Do you think that this could be the fix for the panic?

@rickbrouwer
Member

I'll add it later on, yeah you're right. Do you think that this could be the fix for the panic?

Yes, I think it is. I can imagine the following:

  1. Thread A checks if an index is valid in c.Scalers (without holding a lock)
  2. Before accessing the slice, Thread B acquires a write lock and sets c.Scalers = nil
  3. Thread A then acquires a read lock and tries to access c.Scalers[index], but c.Scalers is now nil

This causes the panic (with "index out of range" error)
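
A toy, single-goroutine simulation of that interleaving (purely illustrative, made-up names, not KEDA code) shows how the check-then-lock ordering ends in the panic:

package main

import (
    "fmt"
    "sync"
)

// The two "threads" below run sequentially here; in reality they run
// concurrently, which is exactly what makes this interleaving possible.
func main() {
    var mu sync.RWMutex
    scalers := []string{"rabbitmq"}
    index := 0

    // Thread A: checks the index without holding any lock.
    if index < len(scalers) {
        // Thread B: acquires the write lock and clears the slice (cache Close()).
        mu.Lock()
        scalers = nil
        mu.Unlock()

        // Thread A: now takes the read lock and accesses the slice it already checked.
        mu.RLock()
        defer mu.RUnlock()
        fmt.Println(scalers[index]) // panics: index out of range [0] with length 0
    }
}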

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented Apr 28, 2025

/run-e2e rabbit
Update: You can check the progress here

@wozniakjan
Member

Do you think that this could be the fix for the panic?

good find! It's for sure going to make the code more robust. Regarding #6739 - I'm optimistic, it should imho also fix it, the panic stack trace aligns with the changes

@zroubalik
Member

zroubalik commented May 2, 2025

/run-e2e
Update: You can check the progress here

@zroubalik zroubalik requested a review from Copilot May 2, 2025 13:23
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a potential race condition in ScalersCache by acquiring the lock before operating on the scalers, ensuring that the slice is not modified concurrently.

  • Updated lock acquisition in getScalerBuilder and GetPushScalers
  • Adjusted the changelog entry to document the fix

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

  • pkg/scaling/cache/scalers_cache.go: Moved the read lock to the beginning of functions to prevent race conditions
  • CHANGELOG.md: Updated changelog entry to document the fix for scaler locking
Comments suppressed due to low confidence (1)

pkg/scaling/cache/scalers_cache.go:73

  • [nitpick] Consider improving the error message text for clarity. For example, you might rephrase "scaler with id %d not found. Len = %d" to "Scaler with id %d not found; available scalers: %d" to better convey the issue.
if index < 0 || index >= len(c.Scalers) {

@JorTurFer
Member Author

JorTurFer commented May 4, 2025

/run-e2e
Update: You can check the progress here

@zroubalik
Member

zroubalik commented May 5, 2025

/run-e2e
Update: You can check the progress here

@rickbrouwer
Member

rickbrouwer commented May 5, 2025

Should refreshScaler also be updated? As I look at it now, this method could possibly also get a race condition when the read lock in getScalerBuilder is released before acquiring the write lock in refreshScaler. During this window, another thread could call Close(), setting c.Scalers to nil and causing "scaler with id X not found. Len = 0" errors. But I may be wrong...

Proposal:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    
    if index < 0 || index >= len(c.Scalers) {
        return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    
    oldSb := c.Scalers[index]
    
    newScaler, sConfig, err := oldSb.Factory()
    if err != nil {
        return nil, err
    }
    
    c.Scalers[index] = ScalerBuilder{
        Scaler:       newScaler,
        ScalerConfig: *sConfig,
        Factory:      oldSb.Factory,
    }
    
    oldSb.Scaler.Close(ctx)
    
    return newScaler, nil
}

@wozniakjan
Member

Should refreshScaler also be updated?

you have a very good point @rickbrouwer, the usage of mutexes in the ScalersCache doesn't seem to be correct for the refreshScaler method either. Your proposed change essentially inlines the getScalerBuilder implementation, but with proper mutex usage, taking the write lock for the whole operation, which in this case is imho desired. Wdyt @JorTurFer?

@wozniakjan
Member

wozniakjan commented May 6, 2025

just fyi, I would like to enable Go's built-in race detector (#6760). It's not perfect, and there appears to be a large number of race conditions in KEDA currently, so by default it shouldn't block any PR from merging, but this might serve us well in the long run.

It even reported the race condition resulting in #6725; the good news is that it seems to be happy with the fixes presented in this PR.
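
For illustration, a minimal stand-alone test (hypothetical, not from the KEDA tree) showing the kind of unsynchronized access that go test -race ./... reports as a data race:

package cache_test

import (
    "sync"
    "testing"
)

// One goroutine clears a slice while another reads its length without any
// lock, mirroring the Close()/length-check interaction described in this PR.
func TestUnsynchronizedSliceAccess(t *testing.T) {
    scalers := []int{1, 2, 3}

    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        scalers = nil // writer: analogous to Close() setting c.Scalers to nil
    }()
    go func() {
        defer wg.Done()
        _ = len(scalers) // reader: analogous to the unguarded length check
    }()
    wg.Wait()
}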

@wozniakjan wozniakjan mentioned this pull request May 6, 2025
@JorTurFer
Member Author

JorTurFer commented May 12, 2025

Should refreshScaler also be updated? As I look at it now, this method could possibly also get a race condition when the read lock in getScalerBuilder is released before acquiring the write lock in refreshScaler. During this window, another thread could call Close(), setting c.Scalers to nil and causing "scaler with id X not found. Len = 0" errors. But I may be wrong...

Proposal:

func (c *ScalersCache) refreshScaler(ctx context.Context, index int) (scalers.Scaler, error) {
    c.mutex.Lock()
    defer c.mutex.Unlock()
    
    if index < 0 || index >= len(c.Scalers) {
        return nil, fmt.Errorf("scaler with id %d not found. Len = %d", index, len(c.Scalers))
    }
    
    oldSb := c.Scalers[index]
    
    newScaler, sConfig, err := oldSb.Factory()
    if err != nil {
        return nil, err
    }
    
    c.Scalers[index] = ScalerBuilder{
        Scaler:       newScaler,
        ScalerConfig: *sConfig,
        Factory:      oldSb.Factory,
    }
    
    oldSb.Scaler.Close(ctx)
    
    return newScaler, nil
}

I think that it can help, but I also think that it's not totally necessary, as it just replaces one scaler with a new one. In any case, I'm updating the PR to include it, just in case. Good point!
I'm wondering if we should get rid of getScalerBuilder entirely, as it uses a read mutex inside but it can return a scaler that is closed immediately after being returned. There are some tricky conditions, so maybe we can start with this as suggested and iterate if needed.
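
A minimal sketch of that tricky condition, using made-up stand-in types rather than KEDA's internals: even if the lookup itself locks correctly, the returned scaler can still be closed by another caller once the lock is released.

package main

import (
    "fmt"
    "sync"
)

type scaler struct{ closed bool }

type cache struct {
    mu      sync.RWMutex
    scalers []*scaler
}

// get returns a scaler under the read lock, but the caller keeps using it
// after the lock has been released.
func (c *cache) get(i int) *scaler {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.scalers[i]
}

// closeAll mimics the cache Close(): it invalidates every scaler and clears
// the slice under the write lock.
func (c *cache) closeAll() {
    c.mu.Lock()
    defer c.mu.Unlock()
    for _, s := range c.scalers {
        s.closed = true
    }
    c.scalers = nil
}

func main() {
    c := &cache{scalers: []*scaler{{}}}

    s := c.get(0) // read lock already released here
    c.closeAll()  // in the real code this could run concurrently from another goroutine

    fmt.Println("scaler already closed:", s.closed) // prints: true
}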

@JorTurFer
Member Author

JorTurFer commented May 12, 2025

/run-e2e rabbit
Update: You can check the progress here

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Member Author

JorTurFer commented May 12, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer
Member Author

Done! Is it okay to merge, @wozniakjan?

@wozniakjan
Member

nice! also looks like the -race detector now has no complaints, so we can think about enabling it by default

@JorTurFer
Member Author

JorTurFer commented May 13, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer
Member Author

JorTurFer commented May 13, 2025

I've just rebased my branch to ensure that -race is executed. After all the checks pass, I'll merge it.

@JorTurFer JorTurFer enabled auto-merge (squash) May 13, 2025 06:56
@JorTurFer JorTurFer merged commit a8c0126 into kedacore:main May 13, 2025
18 checks passed
@JorTurFer JorTurFer deleted the fix-panic branch May 13, 2025 07:22
wozniakjan pushed a commit to wozniakjan/keda that referenced this pull request May 15, 2025
wozniakjan pushed a commit to wozniakjan/keda that referenced this pull request May 15, 2025
wozniakjan added a commit that referenced this pull request May 15, 2025
* fix: Admission Webhook blocks ScaledObject without metricType with fallback (#6702)

* fix: Admission Webhook blocks ScaledObject without metricType with fallback

Signed-off-by: rickbrouwer <[email protected]>

* Add unit test

Signed-off-by: Rick Brouwer <[email protected]>

* Add e2e test

Signed-off-by: rickbrouwer <[email protected]>

* Add more unit tests for scaledobject_types

Signed-off-by: Rick Brouwer <[email protected]>

* Update changelog

Signed-off-by: Rick Brouwer <[email protected]>

* Update

Signed-off-by: Rick Brouwer <[email protected]>

---------

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Rick Brouwer <[email protected]>
Co-authored-by: Zbynek Roubalik <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: AWS SQS Queue queueURLFromEnv not working (#6713)

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: Temporal scaler with API Key (#6707)

Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: add default Operation in Azure Service Bus scaler (#6731)

Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* fix: ScalerCache gets the lock before operate the scalers (#6739)

Signed-off-by: Jan Wozniak <[email protected]>

* fix: Use pinned version for nginx image (#6737)

* fix: Use pinned version for nginx image

Signed-off-by: Jorge Turrado <[email protected]>

* .

Signed-off-by: Jorge Turrado <[email protected]>

* fix panic in gcp scaler

Signed-off-by: Jorge Turrado <[email protected]>

---------

Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* Selenium Grid: Update metric name generated without part of empty (#6772)

* Selenium Grid: Update metric name generated without part of empty

Signed-off-by: Viet Nguyen Duc <[email protected]>

* Update CHANGELOG with the PR

Signed-off-by: Viet Nguyen Duc <[email protected]>

---------

Signed-off-by: Viet Nguyen Duc <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>

* chore: changelog and issue template v2.17.1

Signed-off-by: Jan Wozniak <[email protected]>

---------

Signed-off-by: rickbrouwer <[email protected]>
Signed-off-by: Rick Brouwer <[email protected]>
Signed-off-by: Jan Wozniak <[email protected]>
Signed-off-by: Jorge Turrado <[email protected]>
Signed-off-by: Viet Nguyen Duc <[email protected]>
Co-authored-by: rickbrouwer <[email protected]>
Co-authored-by: Zbynek Roubalik <[email protected]>
Co-authored-by: Jorge Turrado Ferrero <[email protected]>
Co-authored-by: Viet Nguyen Duc <[email protected]>

Successfully merging this pull request may close these issues.

A slow ScaledObject's SQL query results in keda-operator in CrashLoopBackOff