tests: stabilize TestResourceGroupRUConsumption #10462
okJiang wants to merge 4 commits into tikv:master
Signed-off-by: okjiang <819421878@qq.com>
Walkthrough
Added a failpoint-driven delay mechanism to consumption metrics handling in the resource manager and replaced fixed sleep delays in tests with a polling helper function to verify resource group statistics eventually match expected values.
Codecov Report
The patch check has failed because the patch coverage (0.00%) is below the target coverage (74.00%).
@@ Coverage Diff @@
## master #10462 +/- ##
==========================================
+ Coverage 78.89% 78.93% +0.03%
==========================================
Files 529 530 +1
Lines 71198 71551 +353
==========================================
+ Hits 56175 56476 +301
- Misses 11008 11048 +40
- Partials 4015 4027 +12
    if consumptionInfo == nil || consumptionInfo.Consumption == nil {
        continue
    }
    failpoint.Inject("github.com/tikv/pd/pkg/mcs/resourcemanager/server/delayConsume", func(val failpoint.Value) {
This is a test-only failpoint used to reproduce unstable tests. After the review, this failpoint can be removed.
Nitpick comments (1)
tests/integrations/mcs/resourcemanager/resource_manager_test.go (1)
1539-1555: Consider adding explicit timeout options to testutil.Eventually.
The polling helper relies on default timeout. Given the 100ms failpoint delay and potential leader transfer scenarios, consider adding explicit timeout/tick options for clearer test behavior and to guard against CI variability.
Suggested improvement
     checkRUStats := func() {
         testutil.Eventually(re, func() bool {
             g, err = cli.GetResourceGroup(suite.ctx, group.Name, pd.WithRUStats)
             if err != nil || g == nil || g.RUStats == nil {
                 return false
             }
             return g.RUStats.RRU == testConsumption.RRU &&
                 g.RUStats.WRU == testConsumption.WRU &&
                 g.RUStats.ReadBytes == testConsumption.ReadBytes &&
                 g.RUStats.WriteBytes == testConsumption.WriteBytes &&
                 g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
                 g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
                 g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
                 g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
    -    })
    +    }, testutil.WithTickInterval(50*time.Millisecond))
     }
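A polling helper with explicit timeout and tick options can be sketched with the standard library alone. This is only an illustration of the pattern: `eventually`, `withWaitFor`, `withTickInterval`, and the default durations are hypothetical names and values modeled on the suggestion above, not the actual `testutil` implementation.

```go
package main

import (
	"fmt"
	"time"
)

// waitOpts holds the polling parameters. The defaults set in eventually
// are assumptions chosen for this sketch.
type waitOpts struct {
	waitFor time.Duration // total time to keep polling
	tick    time.Duration // pause between two checks
}

// waitOption mutates waitOpts (functional-options pattern).
type waitOption func(*waitOpts)

// withWaitFor overrides the total timeout.
func withWaitFor(d time.Duration) waitOption {
	return func(o *waitOpts) { o.waitFor = d }
}

// withTickInterval overrides the polling interval.
func withTickInterval(d time.Duration) waitOption {
	return func(o *waitOpts) { o.tick = d }
}

// eventually calls cond until it returns true or the timeout expires.
// It reports whether cond became true in time.
func eventually(cond func() bool, opts ...waitOption) bool {
	o := &waitOpts{waitFor: 5 * time.Second, tick: 100 * time.Millisecond}
	for _, apply := range opts {
		apply(o)
	}
	deadline := time.Now().Add(o.waitFor)
	for {
		if cond() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(o.tick)
	}
}

func main() {
	calls := 0
	ok := eventually(
		func() bool { calls++; return calls >= 3 },
		withWaitFor(time.Second),
		withTickInterval(10*time.Millisecond),
	)
	fmt.Println(ok, calls)
}
```

Passing the durations at the call site makes the worst-case wait visible next to the assertion, which is the point of the nitpick.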
Files selected for processing (2)
pkg/mcs/resourcemanager/server/manager.go
tests/integrations/mcs/resourcemanager/resource_manager_test.go
    return g.RUStats.RRU == testConsumption.RRU &&
        g.RUStats.WRU == testConsumption.WRU &&
        g.RUStats.ReadBytes == testConsumption.ReadBytes &&
        g.RUStats.WriteBytes == testConsumption.WriteBytes &&
        g.RUStats.TotalCpuTimeMs == testConsumption.TotalCpuTimeMs &&
        g.RUStats.SqlLayerCpuTimeMs == testConsumption.SqlLayerCpuTimeMs &&
        g.RUStats.KvReadRpcCount == testConsumption.KvReadRpcCount &&
        g.RUStats.KvWriteRpcCount == testConsumption.KvWriteRpcCount
    })
Will we miss some fields? Or will we be unable to cover new fields added in the future?
If you add a field and want to test it, you have to add it to testConsumption and set it to a specific value anyway. At that point, whoever makes the change should also add the corresponding comparison here.
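One possible guard against the concern raised in this thread is a reflection check that fails when a fixture field is left at its zero value, combined with a whole-struct comparison. The sketch below uses a hypothetical `consumption` struct and a hypothetical helper `unsetFields`. Generated protobuf types carry extra bookkeeping fields, so a real version would need to filter those out.

```go
package main

import (
	"fmt"
	"reflect"
)

// consumption is a hypothetical stand-in for the test fixture type.
type consumption struct {
	RRU        float64
	WRU        float64
	ReadBytes  float64
	WriteBytes float64
}

// unsetFields returns the names of exported fields that still hold their
// zero value. v must be a struct or a non-nil pointer to a struct.
func unsetFields(v interface{}) []string {
	rv := reflect.Indirect(reflect.ValueOf(v))
	rt := rv.Type()
	var names []string
	for i := 0; i < rt.NumField(); i++ {
		f := rt.Field(i)
		// An empty PkgPath means the field is exported.
		if f.PkgPath == "" && rv.Field(i).IsZero() {
			names = append(names, f.Name)
		}
	}
	return names
}

func main() {
	// WriteBytes is deliberately left unset to show what the guard reports.
	want := consumption{RRU: 200, WRU: 100, ReadBytes: 1}
	fmt.Println(unsetFields(want))

	// For a struct made only of comparable fields, == covers every field,
	// including ones added later.
	got := want
	fmt.Println(got == want)
}
```

The trade-off is that a zero value can be a legitimate expectation, so such a guard only fits fixtures where every field is meant to be non-zero.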
What problem does this PR solve?
Issue Number: close #8739
TestResourceGroupRUConsumption is unstable because it assumes RU consumption statistics are visible immediately after AcquireTokenBuckets returns.
The test currently sleeps for 10ms and then asserts GetResourceGroup(..., pd.WithRUStats) already reflects the reported consumption. That assumption is not guaranteed: AcquireTokenBuckets only enqueues consumption into Manager.consumptionDispatcher, and the actual RUStats update happens later in backgroundMetricsFlush.
To make that race deterministic, this patch adds a test-only failpoint delayConsume in backgroundMetricsFlush and enables it in the flaky test.
With delayConsume=100ms, the old fixed-sleep assertion fails reproducibly: g.RUStats is still zero while testConsumption already contains the expected values.
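The race described above can be reproduced in miniature with the standard library. In this sketch `manager`, `report`, `stats`, and `waitFor` are hypothetical stand-ins for the enqueue-then-apply-later shape of the real code, and the `time.Sleep` in the worker plays the role of the delayConsume failpoint.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// stats is a stand-in for the per-group RU statistics.
type stats struct {
	mu  sync.Mutex
	rru float64
}

func (s *stats) add(v float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rru += v
}

func (s *stats) get() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.rru
}

// manager only enqueues in report; a background goroutine applies later.
type manager struct {
	dispatcher chan float64
	stats      *stats
}

// newManager starts the background worker. delay simulates a slow worker.
func newManager(delay time.Duration) *manager {
	m := &manager{dispatcher: make(chan float64, 16), stats: &stats{}}
	go func() {
		for v := range m.dispatcher {
			time.Sleep(delay) // stand-in for the injected delay
			m.stats.add(v)
		}
	}()
	return m
}

// report returns once the value is queued, not once it is applied.
func (m *manager) report(v float64) { m.dispatcher <- v }

// waitFor polls cond every tick until it holds or timeout expires.
func waitFor(cond func() bool, timeout, tick time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if !time.Now().Before(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}

func main() {
	m := newManager(100 * time.Millisecond)
	m.report(2.5)

	// Old approach: a fixed sleep shorter than the worker delay.
	time.Sleep(10 * time.Millisecond)
	fmt.Println("after fixed sleep:", m.stats.get())

	// New approach: wait on the observable condition.
	ok := waitFor(func() bool { return m.stats.get() == 2.5 },
		2*time.Second, 10*time.Millisecond)
	fmt.Println("after polling:", ok, m.stats.get())
}
```

With the delay longer than the fixed sleep, the first read would normally still see zero, while the polling read waits until the worker has applied the value.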
What is changed and how does it work?
Add a delayConsume failpoint before the background consumption worker applies RU stats
Change TestResourceGroupRUConsumption to wait on the observable condition instead of sleeping for a fixed 10ms
Wait for persisted stats to become visible again
This keeps production behavior unchanged. The new failpoint only activates when
a test explicitly enables it.
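Why a disabled failpoint leaves production behavior unchanged can be illustrated with a simplified hook registry. The names `registry`, `enable`, `disable`, `inject`, and `consume` are hypothetical and only imitate the idea. The PR itself uses the failpoint package, whose API is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// registry maps a hook name to a delay in milliseconds. A name that is
// absent from the map means the hook is disabled.
type registry struct {
	mu      sync.RWMutex
	enabled map[string]int
}

func newRegistry() *registry {
	return &registry{enabled: map[string]int{}}
}

// enable turns the named hook on with the given delay.
func (r *registry) enable(name string, ms int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.enabled[name] = ms
}

// disable turns the named hook off again.
func (r *registry) disable(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.enabled, name)
}

// inject runs fn only when the named hook is enabled; otherwise it is a
// map lookup and nothing else.
func (r *registry) inject(name string, fn func(ms int)) {
	r.mu.RLock()
	ms, ok := r.enabled[name]
	r.mu.RUnlock()
	if ok {
		fn(ms)
	}
}

// consume applies one record to total, optionally delayed by the hook.
func consume(r *registry, total *int, v int) {
	r.inject("delayConsume", func(ms int) {
		time.Sleep(time.Duration(ms) * time.Millisecond)
	})
	*total += v
}

func main() {
	r := newRegistry()
	total := 0

	consume(r, &total, 3) // hook disabled: no delay

	r.enable("delayConsume", 100)
	start := time.Now()
	consume(r, &total, 4) // hook enabled: delayed by about 100ms
	fmt.Println(total, time.Since(start) >= 100*time.Millisecond)
	r.disable("delayConsume")
}
```

A test would enable the hook before the scenario and disable it afterwards, so other tests keep the undelayed path.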
Root-cause evidence
pkg/mcs/resourcemanager/server/manager.go:730+: consumption is processed asynchronously from consumptionDispatcher
tests/integrations/mcs/resourcemanager/resource_manager_test.go previously slept 10ms before asserting RU stats had already updated
delayConsume=100ms:
GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
g.RUStats remained zero
Historical analog
tests: fix some unstable tests TestPreparingProgress and TestRemovingProgress #9465
*: fix flaky test TestPreparingProgress and TestRemovingProgress
tests: fix multiple flaky tests, panics, deadlocks, and goroutine leaks
Risk
asynchronous RU stats path
explicitly enabled in tests
Verification
GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
delayConsume=100ms
GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/pd-resource-manager/TestResourceGroupRUConsumption -count=1 -vet=off'
GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='-tags=without_dashboard ./mcs/resourcemanager -run TestResourceManagerClientTestSuite/standalone-resource-manager-with-client-discovery/TestResourceGroupRUConsumption -count=1 -vet=off'
GOCACHE=/tmp/pd-go-cache make gotest GOTEST_ARGS='./pkg/mcs/resourcemanager/server -run TestCleanUpTicker -count=1'
GOCACHE=/tmp/pd-go-cache make basic-test
pkg/mcs/resourcemanager/server TestSetServiceLimit
Check List
Release note