-
Notifications
You must be signed in to change notification settings - Fork 228
fix: eliminate data race in backgroundPing() #933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi, I’m Jit, a friendly security platform designed to help developers build secure applications from day zero with an MVS (Minimal viable security) mindset. In case there are security findings, they will be communicated to you as a comment inside the PR. Hope you’ll enjoy using Jit. Questions? Comments? Want to learn more? Get in touch with us. |
|
Should we re-run the tests? |
|
Hi @heynemann, Thanks for the PR. First, I do see the Second, are you sure that there are races on the |
|
Running my tests with race detection shows that race. I'm not 100% sure that it will happen in reality but since it's such a simple fix I thought about giving it a go and contributing. Happy to change the implementation if there's a better way. |
|
I tried both. Good news! Moving the invocation aroundDoesn't make the race condition go away. Test results: The explanation being that the racing _background() goroutine is launched dynamically from inside the ping callback (line 673), not from the initial p.background() call in _newPipe(). prev atomicThis one you got 100% right! The race detector never reported a race on Thanks for the amazing review!!! Hopefully this is better now! |
|
Hi @heynemann,
Oh, that makes sense, and I missed that the background goroutine will only be started at initialization on the condition However, what we need to do here isn't simply wrap the timer with an atomic variable, but instead, we want to make sure the timer is initialized before further concurrent accesses. In other words, we don't want the timer to be possibly if t := p.pingTimer.Load(); t != nil {
t.Reset(p.pinggap) // we don't want this to be possibly missed
}If timer could be So,
func (p *pipe) backgroundPing() {
var prev, recv int32
+ var mu sync.Mutex
+ mu.Lock()
+ defer mu.Unlock()
....
p.pingTimer = time.AfterFunc(p.pinggap, func() {
+ mu.Lock()
+ defer mu.Unlock()
....
})
}With these, we should be able to make sure that the |
|
Once again thanks for the great timely review. Hopefully this new version is better. But don't hesitate to ask if I'm missing something! |
Fixes a critical data race on the p.pingTimer field that occurs when _background() goroutines are launched dynamically from ping callbacks. Root Cause: The pingTimer field is accessed concurrently without synchronization: - Write: backgroundPing() initializes p.pingTimer - Read: _background() accesses p.pingTimer during cleanup - Write: Timer callbacks call p.pingTimer.Reset() to reschedule The race occurs because _background() goroutines can be created dynamically from inside the ping timer callback. When the callback invokes p.Do() to send a PING command, Do() may call p.background() which launches a new _background() goroutine that races with concurrent timer accesses. Solution: - Added pingTimerMu sync.Mutex to protect pingTimer access - Mutex is held during timer initialization in backgroundPing() - Mutex is held during each timer callback execution - Reordered p.backgroundPing() before p.background() in _newPipe() The mutex ensures: 1. Timer is fully initialized before any concurrent access 2. Timer callbacks execute sequentially (no concurrent callbacks) 3. Reset() calls are properly synchronized 4. No nil timer checks needed - guaranteed non-nil after init No deadlock occurs because the mutex locks are sequential in time: - Lock redis#1: Acquired during backgroundPing() initialization, released when backgroundPing() returns (before callback can fire) - Lock redis#2: Acquired when timer fires (after p.pinggap delay), released when callback completes - These locks are separated by the timer delay, so they never nest Testing: - Added 3 comprehensive regression tests to detect this race - All tests pass with -race detector enabled - Verified with full Docker test suite (Redis Cluster, Sentinel, etc.) - 99.5% code coverage maintained - Zero regressions, fully backward compatible
Signed-off-by: Rueian <[email protected]>
Fixes two critical data races in the backgroundPing() function (pipe.go:653-680)
that could cause unpredictable behavior in high-concurrency production deployments,
particularly with Redis Cluster and rapid client lifecycle scenarios.
Root Causes:
Race on 'prev' variable: Shared int32 accessed concurrently by multiple
timer callbacks without synchronization. Multiple callbacks could read/write
'prev' simultaneously when timers fired in rapid succession.
Race on 'p.pingTimer' field: Timer pointer written during initialization
and read during cleanup (_background(), Close()) with no synchronization,
causing concurrent access violations.
Solution:
Testing:
Sentinel, KeyDB, DragonflyDB, Kvrocks, and RedisSearch
Impact: