
Inconsistent store pointers between the region-cache and store-cache cause stale regions to become inaccessible. #1823

@AndreMouche

Description

I believe this is a bug related to inconsistent state between region-cache and store-cache when a TiKV store updates its address or labels.

```go
if s.addr != addr || !s.IsSameLabels(store.GetLabels()) {
	newStore := newStore(
		s.storeID,
		addr,
		store.GetPeerAddress(),
		store.GetStatusAddress(),
		storeType,
		resolved,
		store.GetLabels(),
	)
	newStore.livenessState = atomic.LoadUint32(&s.livenessState)
	if newStore.getLivenessState() != reachable {
		newStore.unreachableSince = s.unreachableSince
		startHealthCheckLoop(scheduler, c, newStore, newStore.getLivenessState(), storeReResolveInterval)
	}
	if s.addr == addr {
		newStore.healthStatus = s.healthStatus
	}
	c.put(newStore)
	s.setResolveState(deleted)
	logutil.BgLogger().Info("store address or labels changed, add new store and mark old store deleted",
		zap.Uint64("store", s.storeID),
		zap.String("old-addr", s.addr),
		zap.Any("old-labels", s.labels),
		zap.String("old-liveness", s.getLivenessState().String()),
		zap.String("new-addr", newStore.addr),
		zap.Any("new-labels", newStore.labels),
		zap.String("new-liveness", newStore.getLivenessState().String()))
```

From the above code, when the address or labels of a TiKV instance change, a new store object is created and replaces the old one in the store-cache. We can confirm this from the log:

 store address or labels changed, add new store and mark old store deleted...

However, since we do not replace the store pointer in the region-cache, any cached region whose leader is on this TiKV keeps pointing to the old store object, whose liveness state never changes, so the region stays unavailable:

```go
if isSamePeer(replica.peer, leader) {
	// If hibernate region is enabled and the leader is not reachable, the raft group
	// will not be wakened up and re-elect the leader until the follower receives
	// a request. So, before the new leader is elected, we should not send requests
	// to the unreachable old leader to avoid unnecessary timeout.
	if replica.store.getLivenessState() != reachable {
		return -1
	}
```

When accessing regions that were not previously cached, the new store pointer is used and the leader may become available.
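
To make the stale-pointer situation concrete, here is a minimal sketch with simplified `Store` and `Region` types (the names and fields are illustrative, not client-go's actual structs): a cached region keeps the `*Store` pointer it captured at load time, so it never observes the replacement object that the store-cache now holds.

```go
// Illustrative sketch only: a region-cache entry that captured a *Store pointer
// keeps seeing the old, now-deleted object after the store-cache swaps in a
// replacement. Types and field names are hypothetical.
package main

import "fmt"

type livenessState int

const (
	reachable livenessState = iota
	unreachable
)

type Store struct {
	storeID  uint64
	addr     string
	liveness livenessState
}

type Region struct {
	id          uint64
	leaderStore *Store // pointer captured when the region was cached
}

func main() {
	storeCache := map[uint64]*Store{}
	regionCache := map[uint64]*Region{}

	oldStore := &Store{storeID: 1, addr: "tikv-0:20160", liveness: unreachable}
	storeCache[1] = oldStore
	regionCache[100] = &Region{id: 100, leaderStore: oldStore}

	// The store address changes: a brand-new Store object replaces the old one
	// in the store-cache, mirroring what the quoted code above does.
	newStore := &Store{storeID: 1, addr: "tikv-0-new:20160", liveness: reachable}
	storeCache[1] = newStore

	// The cached region still points at the old object, so it stays unreachable,
	// while a fresh lookup through the store-cache sees the new, reachable one.
	fmt.Println(regionCache[100].leaderStore.liveness == reachable) // false
	fmt.Println(storeCache[1].liveness == reachable)                // true
}
```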

There is a related issue #1401 and a related fix #1402.
However, that fix only stops the health check for the old store object; it still does not replace the store pointer in the region-cache.

Here is my question: why don't we reuse the old store object directly instead of creating a new one?
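
For illustration of what "reusing" the old store object could mean: if the existing object were mutated in place, every region-cache entry holding its pointer would see the new address automatically. This is only a sketch with made-up types, not a proposed patch; the real store object is read concurrently (atomics, health-check loops, as in the quoted code), which is presumably part of why a fresh object is created instead.

```go
// Hypothetical sketch: updating the existing store object in place keeps
// region-cache pointers valid. Types are illustrative, not client-go's.
package main

import "fmt"

type Store struct {
	storeID uint64
	addr    string
}

type Region struct {
	leaderStore *Store
}

func main() {
	s := &Store{storeID: 1, addr: "tikv-0:20160"}
	r := &Region{leaderStore: s}

	// Update the existing object in place: the cached region's pointer now
	// reflects the new address without touching the region-cache at all.
	s.addr = "tikv-0-new:20160"
	fmt.Println(r.leaderStore.addr) // tikv-0-new:20160
}
```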

Workaround: restart the TiDB instance
