Flaky: TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery times out with “failed to find the keyspace group” #10551
Description
Flaky Test
The CI job was flaky on 2026-04-02 in PR #10549.
Failing job: PD Test → Microservice Integration(TSO)
Actions job URL:
https://github.com/tikv/pd/actions/runs/23891157334/job/69664606237?pr=10549
Symptom
Integration test fails with timeout:
- Test: TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery
- Failure: Condition never satisfied
- Location: tests/integrations/mcs/keyspace/tso_keyspace_group_test.go:745
Logs repeatedly show:
- [tso] failed to find the keyspace group(keyspace-id-in-request=1)
- occasional connection refused
- and tso stream is not ready
Likely cause
Race during recovery: after restarting a TSO node, the client sees TSO URLs again but keyspace group (ID=1) membership/metadata isn’t observable/consistent yet. The client loops on “failed to find the keyspace group” until the test’s Eventually times out.
Proposed fix (test hardening)
In TestUpdateMemberWhenRecovery, after restarting the TSO node (Step 6) and waiting for primary serving, add an explicit wait until PD API reports keyspace group 1 has members again (optionally that the restarted node is included), before asserting GetTS returns a newer timestamp.
Suggested code sketch:
testutil.Eventually(re, func() bool {
	kg, code := suite.tryGetKeyspaceGroup(re, 1)
	if code != http.StatusOK || kg == nil || len(kg.Members) == 0 {
		return false
	}
	for _, m := range kg.Members {
		if m.Address == newNode.GetAddr() {
			return true
		}
	}
	return false
}, testutil.WithWaitFor(30*time.Second), testutil.WithTickInterval(200*time.Millisecond))
Notes
This keeps the “no legacy fallback” verification intact (failpoint assertNotReachLegacyPath) while removing the recovery race.