Description
I believe this is a bug related to inconsistent state between region-cache and store-cache when a TiKV store updates its address or labels.
client-go/internal/locate/store_cache.go
Lines 494 to 521 in 0175881
```go
if s.addr != addr || !s.IsSameLabels(store.GetLabels()) {
	newStore := newStore(
		s.storeID,
		addr,
		store.GetPeerAddress(),
		store.GetStatusAddress(),
		storeType,
		resolved,
		store.GetLabels(),
	)
	newStore.livenessState = atomic.LoadUint32(&s.livenessState)
	if newStore.getLivenessState() != reachable {
		newStore.unreachableSince = s.unreachableSince
		startHealthCheckLoop(scheduler, c, newStore, newStore.getLivenessState(), storeReResolveInterval)
	}
	if s.addr == addr {
		newStore.healthStatus = s.healthStatus
	}
	c.put(newStore)
	s.setResolveState(deleted)
	logutil.BgLogger().Info("store address or labels changed, add new store and mark old store deleted",
		zap.Uint64("store", s.storeID),
		zap.String("old-addr", s.addr),
		zap.Any("old-labels", s.labels),
		zap.String("old-liveness", s.getLivenessState().String()),
		zap.String("new-addr", newStore.addr),
		zap.Any("new-labels", newStore.labels),
		zap.String("new-liveness", newStore.getLivenessState().String()))
```
From the code above, when the address or labels of a TiKV instance are updated, a new store object is created and replaces the old one in the store-cache. We can confirm this from the log:

`store address or labels changed, add new store and mark old store deleted...`

However, the store pointer held by the region-cache is not replaced. For a cached region whose leader is on this TiKV, requests keep going through the old store object, so its liveness state never changes and the leader stays unavailable:
client-go/internal/locate/region_request.go
Lines 804 to 811 in 0175881
```go
if isSamePeer(replica.peer, leader) {
	// If hibernate region is enabled and the leader is not reachable, the raft group
	// will not be wakened up and re-elect the leader until the follower receives
	// a request. So, before the new leader is elected, we should not send requests
	// to the unreachable old leader to avoid unnecessary timeout.
	if replica.store.getLivenessState() != reachable {
		return -1
	}
```
When accessing new regions that were not previously cached, the new store pointer is used, so the leader may become available again.
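To make the failure mode concrete, here is a minimal, self-contained Go sketch. The types and the map-based store-cache are hypothetical stand-ins, not client-go's actual structures; it only illustrates how a cached pointer to the deleted store object never observes the recovery of the new one.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Hypothetical stand-ins for the store and replica types; the real structs
// live in internal/locate and are more involved.
type store struct {
	addr          string
	livenessState uint32 // 0 = reachable, 1 = unreachable
}

type replica struct {
	store *store // the region-cache keeps a raw pointer to the store object
}

func main() {
	// storeCache maps store ID -> current store object.
	storeCache := map[uint64]*store{
		1: {addr: "tikv-old:20160", livenessState: 1}, // currently unreachable
	}

	// A cached region replica captured the old pointer earlier.
	rep := replica{store: storeCache[1]}

	// The store's address changes: the store-cache replaces the object with
	// a new one, and the new address is reachable.
	storeCache[1] = &store{addr: "tikv-new:20160", livenessState: 0}

	// The cached replica still points at the deleted object, so its liveness
	// never becomes reachable again, while fresh lookups see the new store.
	fmt.Println(atomic.LoadUint32(&rep.store.livenessState))     // 1 (stale)
	fmt.Println(atomic.LoadUint32(&storeCache[1].livenessState)) // 0 (fresh)
}
```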
There is a related issue #1401 and a related fix #1402. However, that fix only stops the health check for the old store object; it still does not replace the store pointer in the region-cache.
My question is: why not reuse the old store object directly instead of creating a new one?
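As an illustration of what reusing the object could look like, here is a hypothetical sketch (the field names and the updateInPlace helper are invented for this example and are not client-go code): mutating the existing store under a lock would let every *store pointer already held by the region-cache observe the new address and labels without being swapped out.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical store type; the fields only loosely mirror client-go's Store.
type store struct {
	mu     sync.RWMutex
	addr   string
	labels map[string]string
}

// updateInPlace is an invented helper: it mutates the existing object instead
// of allocating a replacement, so pointers held elsewhere see the change.
func (s *store) updateInPlace(addr string, labels map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.addr = addr
	s.labels = labels
}

func main() {
	s := &store{addr: "tikv-old:20160"}
	cached := s // e.g. a replica in the region-cache holding the same object

	s.updateInPlace("tikv-new:20160", map[string]string{"zone": "z2"})

	s.mu.RLock()
	fmt.Println(cached.addr) // prints tikv-new:20160, no pointer swap needed
	s.mu.RUnlock()
}
```

This only shows the pointer-sharing idea; the real Store also carries atomic liveness state, health status, and a resolve state machine that would need careful handling.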
Workaround: restart the TiDB instance