Flaky: TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery times out with “failed to find the keyspace group” #10551

@wuhuizuo

Flaky Test

This CI job failed intermittently on 2026-04-02 in PR #10549.

Failing job: PD Test → Microservice Integration(TSO)
Actions job URL:

https://github.com/tikv/pd/actions/runs/23891157334/job/69664606237?pr=10549

Symptom

The integration test fails with a timeout:

  • Test: TestKeyspaceGroupTestSuite/TestUpdateMemberWhenRecovery
  • Failure: Condition never satisfied
  • Location: tests/integrations/mcs/keyspace/tso_keyspace_group_test.go:745

The logs repeatedly show:

  • [tso] failed to find the keyspace group (keyspace-id-in-request=1)
  • connection refused (occasionally)
  • tso stream is not ready

Likely cause

Race during recovery: after a TSO node restarts, the client can see the TSO URLs again before the membership and metadata of keyspace group 1 are observable and consistent. The client then loops on “failed to find the keyspace group” until the test’s Eventually call times out.

Proposed fix (test hardening)

In TestUpdateMemberWhenRecovery, after restarting the TSO node (Step 6) and waiting for the primary to start serving, add an explicit wait until the PD API reports that keyspace group 1 has members again (optionally, that the restarted node is among them) before asserting that GetTS returns a newer timestamp.

Suggested code sketch:

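// Wait until PD reports that keyspace group 1 lists the restarted node as a member.
testutil.Eventually(re, func() bool {
    kg, code := suite.tryGetKeyspaceGroup(re, 1)
    // Keep polling while the group is missing or has no members yet.
    if code != http.StatusOK || kg == nil || len(kg.Members) == 0 {
        return false
    }
    // newNode stands for the TSO node restarted in Step 6.
    for _, m := range kg.Members {
        if m.Address == newNode.GetAddr() {
            return true
        }
    }
    return false
}, testutil.WithWaitFor(30*time.Second), testutil.WithTickInterval(200*time.Millisecond))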

Notes

This keeps the “no legacy fallback” verification intact (failpoint assertNotReachLegacyPath) while removing the recovery race.

Labels: contribution, type/ci
