
Inconsistent store pointers between the region-cache and store-cache cause stale regions to become inaccessible. #1823

@AndreMouche

Description

I believe this is a bug related to inconsistent state between region-cache and store-cache when a TiKV store updates its address or labels.

```go
if s.addr != addr || !s.IsSameLabels(store.GetLabels()) {
	newStore := newStore(
		s.storeID,
		addr,
		store.GetPeerAddress(),
		store.GetStatusAddress(),
		storeType,
		resolved,
		store.GetLabels(),
	)
	newStore.livenessState = atomic.LoadUint32(&s.livenessState)
	if newStore.getLivenessState() != reachable {
		newStore.unreachableSince = s.unreachableSince
		startHealthCheckLoop(scheduler, c, newStore, newStore.getLivenessState(), storeReResolveInterval)
	}
	if s.addr == addr {
		newStore.healthStatus = s.healthStatus
	}
	c.put(newStore)
	s.setResolveState(deleted)
	logutil.BgLogger().Info("store address or labels changed, add new store and mark old store deleted",
		zap.Uint64("store", s.storeID),
		zap.String("old-addr", s.addr),
		zap.Any("old-labels", s.labels),
		zap.String("old-liveness", s.getLivenessState().String()),
		zap.String("new-addr", newStore.addr),
		zap.Any("new-labels", newStore.labels),
		zap.String("new-liveness", newStore.getLivenessState().String()))
```

From the above code, when the address or labels of a TiKV instance change, a new store object is created and replaces the old one in the store-cache. We can confirm this from the log:

 store address or labels changed, add new store and mark old store deleted...

However, since we do not replace the store pointer in the region-cache, any cached region whose leader is on this TiKV keeps pointing to the old store object, whose liveness state never changes, so the region stays unavailable:

```go
if isSamePeer(replica.peer, leader) {
	// If hibernate region is enabled and the leader is not reachable, the raft group
	// will not be wakened up and re-elect the leader until the follower receives
	// a request. So, before the new leader is elected, we should not send requests
	// to the unreachable old leader to avoid unnecessary timeout.
	if replica.store.getLivenessState() != reachable {
		return -1
	}
```

When accessing regions that were not previously cached, the new store pointer is used and the leader may become available.
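
To make the stale-pointer situation concrete, here is a minimal sketch with simplified `Store` and `Region` types (the names and fields are illustrative, not client-go's actual structs): a cached region keeps the `*Store` pointer it captured at load time, so it never observes the replacement object that the store-cache now holds.

```go
// Illustrative sketch only: a region-cache entry that captured a *Store pointer
// keeps seeing the old, now-deleted object after the store-cache swaps in a
// replacement. Types and field names are hypothetical.
package main

import "fmt"

type livenessState int

const (
	reachable livenessState = iota
	unreachable
)

type Store struct {
	storeID  uint64
	addr     string
	liveness livenessState
}

type Region struct {
	id          uint64
	leaderStore *Store // pointer captured when the region was cached
}

func main() {
	storeCache := map[uint64]*Store{}
	regionCache := map[uint64]*Region{}

	oldStore := &Store{storeID: 1, addr: "tikv-0:20160", liveness: unreachable}
	storeCache[1] = oldStore
	regionCache[100] = &Region{id: 100, leaderStore: oldStore}

	// The store address changes: a brand-new Store object replaces the old one
	// in the store-cache, mirroring what the quoted code above does.
	newStore := &Store{storeID: 1, addr: "tikv-0-new:20160", liveness: reachable}
	storeCache[1] = newStore

	// The cached region still points at the old object, so it stays unreachable,
	// while a fresh lookup through the store-cache sees the new, reachable one.
	fmt.Println(regionCache[100].leaderStore.liveness == reachable) // false
	fmt.Println(storeCache[1].liveness == reachable)                // true
}
```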

There is a related issue #1401 and a related fix #1402.
However, that fix only stops the health check for the old store object; it still does not replace the store pointer in the region-cache.

Here is my question: why don't we reuse the old store object directly instead of creating a new one?
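
For illustration of what "reusing" the old store object could mean: if the existing object were mutated in place, every region-cache entry holding its pointer would see the new address automatically. This is only a sketch with made-up types, not a proposed patch; the real store object is read concurrently (atomics, health-check loops, as in the quoted code), which is presumably part of why a fresh object is created instead.

```go
// Hypothetical sketch: updating the existing store object in place keeps
// region-cache pointers valid. Types are illustrative, not client-go's.
package main

import "fmt"

type Store struct {
	storeID uint64
	addr    string
}

type Region struct {
	leaderStore *Store
}

func main() {
	s := &Store{storeID: 1, addr: "tikv-0:20160"}
	r := &Region{leaderStore: s}

	// Update the existing object in place: the cached region's pointer now
	// reflects the new address without touching the region-cache at all.
	s.addr = "tikv-0-new:20160"
	fmt.Println(r.leaderStore.addr) // tikv-0-new:20160
}
```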

Workaround: restart the TiDB instance
