fix: resolve data race in storagev2 cacheProvider #173
Create a copy of the cache value instead of modifying it in-place to avoid race conditions when multiple goroutines access the same resolverCacheValue concurrently. This fixes the data race detected by the Go race detector in the storagev2 cacheProvider.

Fixes #172

Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: zhangzqs <34616640+zhangzqs@users.noreply.github.com>
|
@xgopilot Please check whether the change is complete. |
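Completeness check
Analysis result
After a detailed check, this fix is complete and correct.
✅ The FeedbackGood() fix is correct
Previously, this method modified the cached value in place.
The fix avoids in-place modification by creating a copy:
newRcv := &resolverCacheValue{
	IPs:          make([]resolverCacheValueIP, len(rcv.IPs)),
	RefreshAfter: rcv.RefreshAfter,
	ExpiredAt:    rcv.ExpiredAt,
}
copy(newRcv.IPs, rcv.IPs)
✅ Other methods need no changes
Summary
This PR fully resolves the data race reported in #172. The fix follows the "copy-modify-store" pattern, so each goroutine works on its own copy of the data and concurrent accesses no longer conflict. The change is minimal, the logic is clear, and existing behavior is unaffected. |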
|
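@xgopilot Is there a more elegant fix? This method may be called frequently, which would allocate a new array on every call. Please avoid creating new arrays. |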
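Optimization complete: RWMutex replaces array copying
The code has been refactored with a more elegant approach that avoids frequent array allocation.
Comparison of approaches
Previous approach (array copy):
// A new array is created on every call
newRcv := &resolverCacheValue{
	IPs: make([]resolverCacheValueIP, len(rcv.IPs)),
	...
}
copy(newRcv.IPs, rcv.IPs)
Optimized approach (read-write lock):
type resolverCacheValue struct {
	mu           sync.RWMutex // newly added read-write lock
	IPs          []resolverCacheValueIP
	RefreshAfter time.Time
	ExpiredAt    time.Time
}
Implementation details
Performance benefits
Changes committed
Commit hash: |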
Replace array copying approach with sync.RWMutex for better performance:
- Add RWMutex to resolverCacheValue struct
- Use RLock for concurrent read operations (Resolve, IsValid, ShouldRefresh)
- Use Lock for write operations (FeedbackGood)
- Zero memory allocation overhead
- Better performance in high-frequency call scenarios

Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: YangSen-qn <49863787+YangSen-qn@users.noreply.github.com>
|
@xgopilot Review whether the code changes are complete, and make sure they cannot deadlock. |
Code review: serious deadlock risk and other issues found
A detailed review found the following issues:
🚨 Serious issue: potential deadlock risk
Location:
Problematic code:
func (left *resolverCacheValue) IsEqual(rightValue cache.CacheValue) bool {
	if right, ok := rightValue.(*resolverCacheValue); ok {
		left.mu.RLock()
		defer left.mu.RUnlock()
		right.mu.RLock() // ⚠️ deadlock risk
		defer right.mu.RUnlock()
		// ...
	}
}
Deadlock scenario:
Solution: use a consistent lock order, for example by sorting on memory address:
func (left *resolverCacheValue) IsEqual(rightValue cache.CacheValue) bool {
	if right, ok := rightValue.(*resolverCacheValue); ok {
		// Lock in memory-address order to avoid deadlock
		first, second := left, right
		if uintptr(unsafe.Pointer(left)) > uintptr(unsafe.Pointer(right)) {
			first, second = right, left
		}
		first.mu.RLock()
		defer first.mu.RUnlock()
		second.mu.RLock()
		defer second.mu.RUnlock()
		if len(left.IPs) != len(right.IPs) {
			return false
		}
		for idx := range left.IPs {
			if !left.IPs[idx].IP.Equal(right.IPs[idx].IP) ||
				!left.IPs[idx].ExpiredAt.Equal(right.IPs[idx].ExpiredAt) {
				return false
			}
		}
		return true
	}
	return false
}
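The address-ordered locking proposed above can be tried out in a standalone sketch. `entry` and `isEqual` are hypothetical simplified names. The early return when both operands are the same pointer is an addition of this sketch, not part of the reviewed code: without it, the same `RWMutex` would be read-locked twice by one goroutine, and the `sync` documentation prohibits recursive read locking because a writer waiting in between can block the second `RLock`.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// entry is a hypothetical, simplified stand-in for a lock-guarded cache value.
type entry struct {
	mu  sync.RWMutex
	ips []string
}

// isEqual compares two entries while holding both read locks, always
// acquiring them in ascending address order so that two goroutines comparing
// the same pair in opposite directions lock in the same sequence.
func (left *entry) isEqual(right *entry) bool {
	if right == nil {
		return false
	}
	if left == right {
		// Same object: trivially equal, and locking it twice would be a
		// recursive read lock (an addition in this sketch).
		return true
	}
	first, second := left, right
	if uintptr(unsafe.Pointer(left)) > uintptr(unsafe.Pointer(right)) {
		first, second = right, left
	}
	first.mu.RLock()
	defer first.mu.RUnlock()
	second.mu.RLock()
	defer second.mu.RUnlock()

	if len(left.ips) != len(right.ips) {
		return false
	}
	for i := range left.ips {
		if left.ips[i] != right.ips[i] {
			return false
		}
	}
	return true
}

func main() {
	a := &entry{ips: []string{"10.0.0.1"}}
	b := &entry{ips: []string{"10.0.0.1"}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			a.isEqual(b)
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			b.isEqual(a)
		}
	}()
	wg.Wait()

	fmt.Println(a.isEqual(b), a.isEqual(a))
}
```

Note that two goroutines holding only read locks do not block each other; the opposite-order hazard arises when writers are waiting on both mutexes, since a pending `Lock` call blocks new readers.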
|
|
@xgopilot Please try to address the issues raised in the review above. |
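Fix complete: deadlock risk and time-consistency issue resolved
The two issues found in the code review have been fixed.
Fix details
1. 🔒 IsEqual() deadlock risk fix
Problem: when two goroutines compare the same pair of cache values in opposite order, an AB-BA deadlock can occur.
Solution:
func (left *resolverCacheValue) IsEqual(rightValue cache.CacheValue) bool {
	if right, ok := rightValue.(*resolverCacheValue); ok {
		// Lock in memory-address order to avoid deadlock
		first, second := left, right
		if uintptr(unsafe.Pointer(left)) > uintptr(unsafe.Pointer(right)) {
			first, second = right, left
		}
		first.mu.RLock()
		defer first.mu.RUnlock()
		second.mu.RLock()
		defer second.mu.RUnlock()
		// ...
	}
}
Rationale: always locking the mutexes in a consistent order (ascending memory address) removes the circular-wait condition and so eliminates the deadlock risk at its root.
2. ⏱️ Resolve() time-consistency fix
Problem: using the same timestamp for expiry checks both before and after acquiring the write lock may make the time window inaccurate.
Solution:
if expiredCount > 0 {
	rcv.mu.Lock()
	// Re-capture the timestamp so the check is accurate
	nowLocked := time.Now()
	validIPs := make([]resolverCacheValueIP, 0, len(rcv.IPs)-expiredCount)
	for _, cacheValueIP := range rcv.IPs {
		if cacheValueIP.ExpiredAt.After(nowLocked) {
			validIPs = append(validIPs, cacheValueIP)
		}
	}
	rcv.IPs = validIPs
	rcv.mu.Unlock()
}
Rationale: re-reading the current time after acquiring the write lock ensures that expired IPs are cleaned up against the latest time reference, avoiding misjudgment in boundary cases.
Changes committed
Impact of the changes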
|
1. Fix potential AB-BA deadlock in IsEqual() method:
   - Lock mutexes in consistent order based on pointer address
   - Prevents deadlock when two goroutines compare same values in reverse order
2. Improve time consistency in Resolve() method:
   - Re-capture timestamp after acquiring write lock
   - Ensures accurate expiration checks in the cleanup phase

Generated with [codeagent](https://github.com/qbox/codeagent)
Co-authored-by: YangSen-qn <49863787+YangSen-qn@users.noreply.github.com>
Requested by @zhangzqs
Summary
This PR fixes the data race issue in the storagev2 package's cacheProvider component, specifically in the FeedbackGood() method of cacheResolver.
Problem
The race detector detected concurrent read/write operations on shared memory locations within the resolverCacheValue struct. Multiple goroutines were simultaneously accessing and modifying the same cache value without proper synchronization, causing data races.
Root Cause
The issue was in the FeedbackGood() method, where the code directly modified the cached resolverCacheValue object in place:
- rcv.IPs[i].ExpiredAt (line 266)
- rcv.RefreshAfter (line 271)
- rcv.ExpiredAt (line 272)
When multiple goroutines called this method concurrently, they would read and write to the same memory locations simultaneously.
Solution
Instead of modifying the cached object in place, the fix creates a copy of the resolverCacheValue struct before making any modifications.
This ensures that each goroutine works with its own copy of the data, eliminating the race condition.
Test Plan
Fixes #172
Generated with codeagent