tests: stabilize TestResourceGroupRUConsumption by okJiang · Pull Request #10462 · tikv/pd

okJiang · 2026-03-19T12:47:03Z

What problem does this PR solve?

Issue Number: close #8739

TestResourceGroupRUConsumption is unstable because it assumes RU consumption
statistics are visible immediately after AcquireTokenBuckets returns.

The test currently sleeps for 10ms and then asserts GetResourceGroup(..., pd.WithRUStats) already reflects the reported consumption. That assumption is
not guaranteed: AcquireTokenBuckets only enqueues consumption into
Manager.consumptionDispatcher, and the actual RUStats update happens later
in backgroundMetricsFlush.

To make that race deterministic, this patch adds a test-only failpoint
delayConsume in backgroundMetricsFlush and enables it in the flaky test.
With delayConsume=100ms, the old fixed-sleep assertion fails reproducibly:
g.RUStats is still zero while testConsumption already contains the expected
values.

What is changed and how does it work?

add a test-only delayConsume failpoint before the background consumption
worker applies RU stats
update TestResourceGroupRUConsumption to wait on the observable condition
instead of sleeping for a fixed 10ms
reuse the same condition-based check after leader transfer so the test waits
for persisted stats to become visible again

This keeps production behavior unchanged. The new failpoint only activates when
a test explicitly enables it.

Root-cause evidence

pkg/mcs/resourcemanager/server/manager.go:730+: consumption is processed
asynchronously from consumptionDispatcher
tests/integrations/mcs/resourcemanager/resource_manager_test.go previously
slept 10ms before asserting RU stats had already updated
red test with delayConsume=100ms:
- GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
- failed before the fix because g.RUStats remained zero

Historical analog

Playbook Pattern 5: Flaky and Unstable Test Stabilization
Representative merged PRs:
- tests: fix some unstable tests #10134 tests: fix some unstable tests
- *: fix flaky test TestPreparingProgress and TestRemovingProgress #9465 *: fix flaky test TestPreparingProgress and TestRemovingProgress
Prior issue-specific attempt:
- tests: fix multiple flaky tests, panics, deadlocks, and goroutine leaks #10303 tests: fix multiple flaky tests, panics, deadlocks, and goroutine leaks

Risk

scoped to one flaky integration test plus a failpoint-only hook in the
asynchronous RU stats path
no production behavior, API, or config changes unless the failpoint is
explicitly enabled in tests

Verification

red before the fix:
- GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
- failed deterministically with delayConsume=100ms
green after the fix:
- GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
- passed
green in standalone RM discovery mode:
- GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/standalone-resource-manager-with-client-discovery/TestResourceGroupRUConsumption -count=1 -vet=off'
- passed
touched manager path:
- GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='./pkg/mcs/resourcemanager/server -run TestCleanUpTicker -count=1'
- passed
repo baseline:
- GOCACHE=/tmp/pd-go-cache make basic-test
- not clean in this worktree due an unrelated existing failure in
  pkg/mcs/resourcemanager/server TestSetServiceLimit

Check List

Release note

None.

Summary by CodeRabbit

Tests
- Enhanced test robustness for resource consumption metrics by implementing polling-based validation instead of fixed delays, and adding test infrastructure for simulating consumption processing delays.

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot · 2026-03-19T12:47:06Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot · 2026-03-19T12:47:10Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign husharp for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-03-19T12:47:15Z

📝 Walkthrough

Walkthrough

Added a failpoint-driven delay mechanism to consumption metrics handling in the resource manager and replaced fixed sleep delays in tests with a polling helper function to verify resource group statistics eventually match expected values.

Changes

Cohort / File(s)	Summary
Manager failpoint injection `pkg/mcs/resourcemanager/server/manager.go`	Inserted `delayConsume` failpoint after validating `consumptionInfo`, allowing optional delay before resolving keyspace name and recording metrics.
Test improvements `tests/integrations/mcs/resourcemanager/resource_manager_test.go`	Added failpoint for RU consumption delay; replaced fixed 10ms sleep with polling helper `checkRUStats()` that repeatedly asserts `RUStats` matches expected values across multiple test checkpoints.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

type/development, lgtm, approved

Suggested reviewers

nolouch

Poem

🐰 With failpoint precision, delays gently bend,
Tests now poll with patience, round curve's end,
No rushing through metrics, just graceful waits,
ResourceManager's grace through time's open gates! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'tests: stabilize TestResourceGroupRUConsumption' directly and specifically identifies the main change: stabilizing a flaky test. It is concise, clear, and follows the repository convention.
Description check	✅ Passed	The description includes the issue number, comprehensive explanation of the problem (race condition in RU consumption visibility), the solution (failpoint-driven delay), verification steps, and release notes. All key template sections are addressed.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-03-19T12:58:30Z

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.93%. Comparing base (eb9079e) to head (79b1811).
⚠️ Report is 21 commits behind head on master.

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10462      +/-   ##
==========================================
+ Coverage   78.89%   78.93%   +0.03%     
==========================================
  Files         529      530       +1     
  Lines       71198    71551     +353     
==========================================
+ Hits        56175    56476     +301     
- Misses      11008    11048      +40     
- Partials     4015     4027      +12

Flag	Coverage Δ
unittests	`78.93% <0.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: okjiang <819421878@qq.com>

okJiang · 2026-04-03T02:35:15Z

 			if consumptionInfo == nil || consumptionInfo.Consumption == nil {
 				continue
 			}
+			failpoint.Inject("github.com/tikv/pd/pkg/mcs/resourcemanager/server/delayConsume", func(val failpoint.Value) {


This is a test-only failpoint used to reproduce unstable tests. After the review, this failpoint can be removed.

coderabbitai

🧹 Nitpick comments (1)

tests/integrations/mcs/resourcemanager/resource_manager_test.go (1)

1539-1555: Consider adding explicit timeout options to testutil.Eventually.

The polling helper relies on default timeout. Given the 100ms failpoint delay and potential leader transfer scenarios, consider adding explicit timeout/tick options for clearer test behavior and to guard against CI variability.

💡 Suggested improvement

 checkRUStats := func() {
     testutil.Eventually(re, func() bool {
         g, err = cli.GetResourceGroup(suite.ctx, group.Name, pd.WithRUStats)
         if err != nil || g == nil || g.RUStats == nil {
             return false
         }
         return g.RUStats.RRU == testConsumption.RRU &&
             g.RUStats.WRU == testConsumption.WRU &&
             g.RUStats.ReadBytes == testConsumption.ReadBytes &&
             g.RUStats.WriteBytes == testConsumption.WriteBytes &&
             g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
             g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
             g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
             g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
-    })
+    }, testutil.WithTickInterval(50*time.Millisecond))
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/resourcemanager/resource_manager_test.go` around lines
1539 - 1555, The test uses testutil.Eventually inside checkRUStats without
explicit timeout/tick options which can flake under the 100ms failpoint and
leader transfer delays; update the call to testutil.Eventually in checkRUStats
to pass explicit options (e.g., a longer timeout and an appropriate
tick/interval) so the polling waits long enough for cli.GetResourceGroup(...,
pd.WithRUStats) to observe the expected RUStats; ensure the options are applied
where testutil.Eventually is invoked in the checkRUStats closure.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/integrations/mcs/resourcemanager/resource_manager_test.go`:
- Around line 1539-1555: The test uses testutil.Eventually inside checkRUStats
without explicit timeout/tick options which can flake under the 100ms failpoint
and leader transfer delays; update the call to testutil.Eventually in
checkRUStats to pass explicit options (e.g., a longer timeout and an appropriate
tick/interval) so the polling waits long enough for cli.GetResourceGroup(...,
pd.WithRUStats) to observe the expected RUStats; ensure the options are applied
where testutil.Eventually is invoked in the checkRUStats closure.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e370915e-a37c-4a39-bd81-0d0d1b856c4f

📥 Commits

Reviewing files that changed from the base of the PR and between da8b62d and 79b1811.

📒 Files selected for processing (2)

pkg/mcs/resourcemanager/server/manager.go
tests/integrations/mcs/resourcemanager/resource_manager_test.go

ti-chi-bot · 2026-04-03T02:40:11Z

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-next-gen-2	`79b1811`	link	true	`/test pull-unit-test-next-gen-2`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

lhy1024 · 2026-04-03T02:53:55Z

+			return g.RUStats.RRU == testConsumption.RRU &&
+				g.RUStats.WRU == testConsumption.WRU &&
+				g.RUStats.ReadBytes == testConsumption.ReadBytes &&
+				g.RUStats.WriteBytes == testConsumption.WriteBytes &&
+				g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
+				g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
+				g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
+				g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
+		})


Will we miss some fields? Or will we be unable to cover new fields added in the future?

If you add a field and want to test it, it should be added to testConsumption and set to a specific value. At that point, the modifier should add the corresponding field again.

tests: stabilize TestResourceGroupRUConsumption

97d244b

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 19, 2026

okJiang added 2 commits March 20, 2026 12:03

resourcemanager: fix delayConsume failpoint path

cf1aec6

Signed-off-by: okjiang <819421878@qq.com>

fix failpoint

09db82a

Signed-off-by: okjiang <819421878@qq.com>

ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 24, 2026

resourcemanager: fix delayConsume duration cast

79b1811

Signed-off-by: okjiang <819421878@qq.com>

okJiang marked this pull request as ready for review April 3, 2026 02:29

ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2026

okJiang commented Apr 3, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 3, 2026

View reviewed changes

lhy1024 reviewed Apr 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tests: stabilize TestResourceGroupRUConsumption#10462

tests: stabilize TestResourceGroupRUConsumption#10462
okJiang wants to merge 4 commits intotikv:masterfrom
okJiang:flaky-8739-testresourcegroupruconsumption

okJiang commented Mar 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

ti-chi-bot Bot commented Mar 19, 2026

Uh oh!

ti-chi-bot Bot commented Mar 19, 2026

Uh oh!

coderabbitai Bot commented Mar 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

codecov Bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

okJiang Apr 3, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ti-chi-bot Bot commented Apr 3, 2026

Uh oh!

lhy1024 Apr 3, 2026

Uh oh!

okJiang Apr 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

okJiang commented Mar 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

What is changed and how does it work?

Check List

Release note

Summary by CodeRabbit

Uh oh!

ti-chi-bot Bot commented Mar 19, 2026

Uh oh!

ti-chi-bot Bot commented Mar 19, 2026

Uh oh!

coderabbitai Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

codecov Bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

okJiang Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ti-chi-bot Bot commented Apr 3, 2026

Uh oh!

lhy1024 Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

okJiang Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

okJiang commented Mar 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 19, 2026 •

edited

Loading

codecov Bot commented Mar 19, 2026 •

edited

Loading