fix: ScalerCache gets the lock before operate the scalers #6739


Open

wants to merge 3 commits into main
Conversation

JorTurFer
Copy link
Member

@JorTurFer JorTurFer commented Apr 27, 2025

With the current implementation, a race condition can happen because we check the scalers' length before taking the lock. Hypothetically, the value of c.Scalers can change between the line where it is checked and the line where it is accessed, allowing a case where c.Scalers is cleaned and then the index is accessed, causing a panic from an out-of-range slice access.

Checking the code and considering the original problem in the issue, my hypothesis is that a scaler fails during the HPA metric request. That request calls GetScaledObjectMetrics and there, after the scaler failure, the whole ScaledObject is refreshed, triggering the only code that changes c.Scalers from a valid slice to nil (the cache's Close()).

If an operator loop starts just after the metric request, the refresh can be triggered on a cache item that is already in the process of being revoked. The race arises because the length of c.Scalers is verified without holding any lock (so in theory the index exists), and the execution then waits for the write lock to be released, by which point c.Scalers is already nil. As soon as the Close() code ends, the lock is released, the refresh code continues and tries to access an index that existed during the check but not during the access, because the read lock was not taken before checking c.Scalers.
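The fix boils down to taking the read lock before the bounds check, so the check and the access see the same snapshot of the slice. A minimal sketch in Go (the ScalerCache type and names here are simplified stand-ins for illustration, not KEDA's actual structs):

```go
package main

import (
	"fmt"
	"sync"
)

// ScalerCache is a simplified stand-in for the scalers cache:
// a slice guarded by an RWMutex.
type ScalerCache struct {
	mu      sync.RWMutex
	Scalers []string
}

// GetScaler takes the read lock BEFORE the bounds check, so the check
// and the access observe the same value of c.Scalers.
func (c *ScalerCache) GetScaler(index int) (string, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if index < 0 || index >= len(c.Scalers) {
		return "", fmt.Errorf("scaler with index %d not found, cache has %d scalers", index, len(c.Scalers))
	}
	return c.Scalers[index], nil
}

// Close is the only code path that sets c.Scalers to nil, and it does
// so under the write lock, like the refresh path described above.
func (c *ScalerCache) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.Scalers = nil
}

func main() {
	c := &ScalerCache{Scalers: []string{"cpu", "rabbitmq"}}
	if s, err := c.GetScaler(1); err == nil {
		fmt.Println("found:", s)
	}
	c.Close()
	if _, err := c.GetScaler(1); err != nil {
		fmt.Println("after Close:", err) // graceful error instead of a panic
	}
}
```

With the lock held across both the check and the access, a concurrent Close() can no longer invalidate the slice between the two lines.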

Checklist

  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #6725

@JorTurFer JorTurFer requested a review from a team as a code owner April 27, 2025 17:48
Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Copy link
Member Author

JorTurFer commented Apr 27, 2025

/run-e2e rabbit
Update: You can check the progress here

@JorTurFer JorTurFer changed the title fix: ScalerCache holds the lock before operate the scalers fix: ScalerCache gets the lock before operate the scalers Apr 27, 2025
@JorTurFer JorTurFer enabled auto-merge (squash) April 27, 2025 18:54
@rickbrouwer
Copy link
Contributor

I think this fix deserves a mention in the Changelog as well, if only for traceability 🙂 Wdyt?

@JorTurFer JorTurFer disabled auto-merge April 28, 2025 07:25
@JorTurFer
Copy link
Member Author

I'll add it later on, yeah you're right
Do you think that this could be the fix for the panic?

@rickbrouwer
Copy link
Contributor

I'll add it later on, yeah you're right Do you think that this could be the fix for the panic?

Yes, I think it is. I can imagine the following:

  1. Thread A checks if an index is valid in c.Scalers (without holding a lock)
  2. Before accessing the slice, Thread B acquires a write lock and sets c.Scalers = nil
  3. Thread A then acquires a read lock and tries to access c.Scalers[index], but c.Scalers is now nil

This causes the panic (with "index out of range" error)

Signed-off-by: Jorge Turrado <[email protected]>
@JorTurFer
Copy link
Member Author

JorTurFer commented Apr 28, 2025

/run-e2e rabbit
Update: You can check the progress here

Successfully merging this pull request may close these issues.

A slow ScaledObject's SQL query results in keda-operator in CrashLoopBackOff