Skip to content

tests: stabilize TestResourceGroupRUConsumption#10462

Open
okJiang wants to merge 4 commits intotikv:masterfrom
okJiang:flaky-8739-testresourcegroupruconsumption
Open

tests: stabilize TestResourceGroupRUConsumption#10462
okJiang wants to merge 4 commits intotikv:masterfrom
okJiang:flaky-8739-testresourcegroupruconsumption

Conversation

@okJiang
Copy link
Copy Markdown
Member

@okJiang okJiang commented Mar 19, 2026

What problem does this PR solve?

Issue Number: close #8739

TestResourceGroupRUConsumption is unstable because it assumes RU consumption
statistics are visible immediately after AcquireTokenBuckets returns.

The test currently sleeps for 10ms and then asserts GetResourceGroup(..., pd.WithRUStats) already reflects the reported consumption. That assumption is
not guaranteed: AcquireTokenBuckets only enqueues consumption into
Manager.consumptionDispatcher, and the actual RUStats update happens later
in backgroundMetricsFlush.

To make that race deterministic, this patch adds a test-only failpoint
delayConsume in backgroundMetricsFlush and enables it in the flaky test.
With delayConsume=100ms, the old fixed-sleep assertion fails reproducibly:
g.RUStats is still zero while testConsumption already contains the expected
values.

What is changed and how does it work?

  • add a test-only delayConsume failpoint before the background consumption
    worker applies RU stats
  • update TestResourceGroupRUConsumption to wait on the observable condition
    instead of sleeping for a fixed 10ms
  • reuse the same condition-based check after leader transfer so the test waits
    for persisted stats to become visible again

This keeps production behavior unchanged. The new failpoint only activates when
a test explicitly enables it.

Root-cause evidence

  • pkg/mcs/resourcemanager/server/manager.go:730+: consumption is processed
    asynchronously from consumptionDispatcher
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go previously
    slept 10ms before asserting RU stats had already updated
  • red test with delayConsume=100ms:
    • GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
    • failed before the fix because g.RUStats remained zero

Historical analog

Risk

  • scoped to one flaky integration test plus a failpoint-only hook in the
    asynchronous RU stats path
  • no production behavior, API, or config changes unless the failpoint is
    explicitly enabled in tests

Verification

  • red before the fix:
    • GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
    • failed deterministically with delayConsume=100ms
  • green after the fix:
    • GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
    • passed
  • green in standalone RM discovery mode:
    • GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/standalone-resource-manager-with-client-discovery/TestResourceGroupRUConsumption -count=1 -vet=off'
    • passed
  • touched manager path:
    • GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='./pkg/mcs/resourcemanager/server -run TestCleanUpTicker -count=1'
    • passed
  • repo baseline:
    • GOCACHE=/tmp/pd-go-cache make basic-test
    • not clean in this worktree due an unrelated existing failure in
      pkg/mcs/resourcemanager/server TestSetServiceLimit

Check List

Release note

None.

Summary by CodeRabbit

  • Tests
    • Enhanced test robustness for resource consumption metrics by implementing polling-based validation instead of fixed delays, and adding test infrastructure for simulating consumption processing delays.

Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Mar 19, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 19, 2026
@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Mar 19, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign husharp for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 19, 2026

📝 Walkthrough

Walkthrough

Added a failpoint-driven delay mechanism to consumption metrics handling in the resource manager and replaced fixed sleep delays in tests with a polling helper function to verify resource group statistics eventually match expected values.

Changes

Cohort / File(s) Summary
Manager failpoint injection
pkg/mcs/resourcemanager/server/manager.go
Inserted delayConsume failpoint after validating consumptionInfo, allowing optional delay before resolving keyspace name and recording metrics.
Test improvements
tests/integrations/mcs/resourcemanager/resource_manager_test.go
Added failpoint for RU consumption delay; replaced fixed 10ms sleep with polling helper checkRUStats() that repeatedly asserts RUStats matches expected values across multiple test checkpoints.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Suggested labels

type/development, lgtm, approved

Suggested reviewers

  • nolouch

Poem

🐰 With failpoint precision, delays gently bend,
Tests now poll with patience, round curve's end,
No rushing through metrics, just graceful waits,
ResourceManager's grace through time's open gates! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'tests: stabilize TestResourceGroupRUConsumption' directly and specifically identifies the main change: stabilizing a flaky test. It is concise, clear, and follows the repository convention.
Description check ✅ Passed The description includes the issue number, comprehensive explanation of the problem (race condition in RU consumption visibility), the solution (failpoint-driven delay), verification steps, and release notes. All key template sections are addressed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 19, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.93%. Comparing base (eb9079e) to head (79b1811).
⚠️ Report is 21 commits behind head on master.

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10462      +/-   ##
==========================================
+ Coverage   78.89%   78.93%   +0.03%     
==========================================
  Files         529      530       +1     
  Lines       71198    71551     +353     
==========================================
+ Hits        56175    56476     +301     
- Misses      11008    11048      +40     
- Partials     4015     4027      +12     
Flag Coverage Δ
unittests 78.93% <0.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

okJiang added 2 commits March 20, 2026 12:03
Signed-off-by: okjiang <819421878@qq.com>
Signed-off-by: okjiang <819421878@qq.com>
@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 24, 2026
Signed-off-by: okjiang <819421878@qq.com>
@okJiang okJiang marked this pull request as ready for review April 3, 2026 02:29
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 3, 2026
if consumptionInfo == nil || consumptionInfo.Consumption == nil {
continue
}
failpoint.Inject("github.com/tikv/pd/pkg/mcs/resourcemanager/server/delayConsume", func(val failpoint.Value) {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a test-only failpoint used to reproduce unstable tests. After the review, this failpoint can be removed.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/integrations/mcs/resourcemanager/resource_manager_test.go (1)

1539-1555: Consider adding explicit timeout options to testutil.Eventually.

The polling helper relies on default timeout. Given the 100ms failpoint delay and potential leader transfer scenarios, consider adding explicit timeout/tick options for clearer test behavior and to guard against CI variability.

💡 Suggested improvement
 checkRUStats := func() {
     testutil.Eventually(re, func() bool {
         g, err = cli.GetResourceGroup(suite.ctx, group.Name, pd.WithRUStats)
         if err != nil || g == nil || g.RUStats == nil {
             return false
         }
         return g.RUStats.RRU == testConsumption.RRU &&
             g.RUStats.WRU == testConsumption.WRU &&
             g.RUStats.ReadBytes == testConsumption.ReadBytes &&
             g.RUStats.WriteBytes == testConsumption.WriteBytes &&
             g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
             g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
             g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
             g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
-    })
+    }, testutil.WithTickInterval(50*time.Millisecond))
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/resourcemanager/resource_manager_test.go` around lines
1539 - 1555, The test uses testutil.Eventually inside checkRUStats without
explicit timeout/tick options which can flake under the 100ms failpoint and
leader transfer delays; update the call to testutil.Eventually in checkRUStats
to pass explicit options (e.g., a longer timeout and an appropriate
tick/interval) so the polling waits long enough for cli.GetResourceGroup(...,
pd.WithRUStats) to observe the expected RUStats; ensure the options are applied
where testutil.Eventually is invoked in the checkRUStats closure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@tests/integrations/mcs/resourcemanager/resource_manager_test.go`:
- Around line 1539-1555: The test uses testutil.Eventually inside checkRUStats
without explicit timeout/tick options which can flake under the 100ms failpoint
and leader transfer delays; update the call to testutil.Eventually in
checkRUStats to pass explicit options (e.g., a longer timeout and an appropriate
tick/interval) so the polling waits long enough for cli.GetResourceGroup(...,
pd.WithRUStats) to observe the expected RUStats; ensure the options are applied
where testutil.Eventually is invoked in the checkRUStats closure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e370915e-a37c-4a39-bd81-0d0d1b856c4f

📥 Commits

Reviewing files that changed from the base of the PR and between da8b62d and 79b1811.

📒 Files selected for processing (2)
  • pkg/mcs/resourcemanager/server/manager.go
  • tests/integrations/mcs/resourcemanager/resource_manager_test.go

@ti-chi-bot
Copy link
Copy Markdown
Contributor

ti-chi-bot Bot commented Apr 3, 2026

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-unit-test-next-gen-2 79b1811 link true /test pull-unit-test-next-gen-2

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +1545 to +1553
return g.RUStats.RRU == testConsumption.RRU &&
g.RUStats.WRU == testConsumption.WRU &&
g.RUStats.ReadBytes == testConsumption.ReadBytes &&
g.RUStats.WriteBytes == testConsumption.WriteBytes &&
g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we miss some fields? Or will we be unable to cover new fields added in the future?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add a field and want to test it, it should be added to testConsumption and set to a specific value. At that point, the modifier should add the corresponding field again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestResourceGroupRUConsumption is unstable

2 participants