[Store] fix: reset MasterMetricManager on MasterService recreation in HA mode#2231
00fish0 wants to merge 1 commit into
Code Review
This pull request introduces a ResetTermScopedMetrics method to MasterMetricManager to reset term-scoped metrics (gauges and counters) when a new MasterService is constructed in HA mode, preventing double-counting and silent accumulation across leader terms. The review feedback correctly identifies that the dfs_capacity_unlimited_ atomic flag is not reset in this new method, which could lead to persistent stale configurations across terms, and suggests resetting it to ensure a clean slate.
// Drop the cached summary baseline; otherwise rate calculations would
// mix the previous term's counter deltas with the new term's zero base.
{
    std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
    summary_snapshot_ = SummarySnapshot{};
}
The dfs_capacity_unlimited_ atomic flag is not reset in ResetTermScopedMetrics(). If a previous MasterService instance set this flag to true (e.g., when global_file_segment_size_ was unlimited), the flag will persist as true in subsequent terms even if the new MasterService instance is configured with a limited capacity.
To ensure a completely clean slate across terms, please reset dfs_capacity_unlimited_ to false during the metrics reset.
// Drop the cached summary baseline; otherwise rate calculations would
// mix the previous term's counter deltas with the new term's zero base.
{
std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
summary_snapshot_ = SummarySnapshot{};
}
dfs_capacity_unlimited_ = false;
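The concern raised in this comment can be illustrated with a small standalone sketch. It is not the Mooncake implementation: `ToyMetricManager`, its setter and getter, the `put_requests` field, and the `reset_flag` parameter are hypothetical names introduced only for illustration. The sketch also assumes, as the comment implies, that a later instance configured with limited capacity never writes the flag itself.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the cached summary baseline.
struct SummarySnapshot {
    int64_t put_requests = 0;
};

// Toy model of a process-level metric singleton (NOT the real MasterMetricManager).
class ToyMetricManager {
public:
    static ToyMetricManager& instance() {
        static ToyMetricManager mgr;
        return mgr;
    }

    void set_dfs_capacity_unlimited(bool v) { dfs_capacity_unlimited_.store(v); }
    bool is_dfs_capacity_unlimited() const { return dfs_capacity_unlimited_.load(); }

    void set_snapshot(int64_t puts) {
        std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
        summary_snapshot_.put_requests = puts;
    }
    int64_t snapshot_puts() {
        std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
        return summary_snapshot_.put_requests;
    }

    // reset_flag == false models the reset as reviewed;
    // reset_flag == true models the suggested change.
    void ResetTermScopedMetrics(bool reset_flag) {
        {
            std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
            summary_snapshot_ = SummarySnapshot{};
        }
        if (reset_flag) dfs_capacity_unlimited_.store(false);
    }

private:
    ToyMetricManager() = default;
    std::mutex summary_snapshot_mutex_;
    SummarySnapshot summary_snapshot_;
    std::atomic<bool> dfs_capacity_unlimited_{false};
};

// Simulates two terms: term 1 sets the flag (unlimited capacity), term 2 is
// "limited" and, by assumption, never touches the flag. Returns the flag
// value observed during term 2.
bool flag_seen_in_second_term(bool reset_flag) {
    auto& m = ToyMetricManager::instance();
    m.ResetTermScopedMetrics(true);        // start each simulation clean
    m.set_dfs_capacity_unlimited(true);    // term 1: unlimited capacity
    m.set_snapshot(42);                    // term 1: some cached baseline
    m.ResetTermScopedMetrics(reset_flag);  // term 2: reset at construction
    return m.is_dfs_capacity_unlimited();
}
```

Under these assumptions, `flag_seen_in_second_term(false)` returns `true` (the stale value leaks into the second term), while `flag_seen_in_second_term(true)` returns `false`. The snapshot baseline is cleared in both cases.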
Description
Fixes #2210.
In HA mode, MasterService is destroyed and recreated on every etcd reconnect / leader transition, but MasterMetricManager is a process-level singleton that outlives those transitions. The destructor of the old MasterService does not clear the singleton, so values registered during the previous term remain. When the new MasterService runs its restore path and calls inc_* on top of those stale values, gauges double-count and per-term counters silently accumulate across leader terms.
This PR adds MasterMetricManager::ResetTermScopedMetrics() and calls it at the very top of the MasterService constructor, before RestoreState() or any inc_* runs. The new instance starts from a clean baseline and restore re-populates gauges with the actual current state.
Non-HA mode is unaffected: MasterService is constructed exactly once, metrics are already 0, so the reset call is a safe no-op.
Relation to #1186
#1186 raises the same singleton-lifecycle root cause but in more general terms, and it leaves open whether counters should be cleared or synced across leader terms. This PR's scope is just the gauge-class symptom (active_clients, mem/file cache nums, capacities, soft-pin count). The reset helper also clears per-term counters as the simpler default, but that choice is not load-bearing for the bug being fixed here and can be revisited if the project prefers the sync-to-follower alternative.
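The singleton-lifecycle problem described in this section can be reproduced in a few lines. The following is a minimal sketch under stated assumptions, not the project's code: `ToyMetrics`, `ToyMasterService`, the `reset_on_construct` switch, and the `persisted_clients` argument are hypothetical, and the restore path is reduced to a single `inc_active_clients` call on a gauge.

```cpp
#include <atomic>
#include <cstdint>

// Toy process-level singleton; it outlives every ToyMasterService instance.
class ToyMetrics {
public:
    static ToyMetrics& instance() {
        static ToyMetrics m;
        return m;
    }
    void inc_active_clients(int64_t n = 1) { active_clients_.fetch_add(n); }
    int64_t active_clients() const { return active_clients_.load(); }
    void ResetTermScopedMetrics() { active_clients_.store(0); }

private:
    ToyMetrics() = default;
    std::atomic<int64_t> active_clients_{0};
};

// Toy service that is destroyed and recreated once per leader term.
class ToyMasterService {
public:
    ToyMasterService(int64_t persisted_clients, bool reset_on_construct) {
        // The fix: reset before the restore path or any inc_* call runs.
        if (reset_on_construct) ToyMetrics::instance().ResetTermScopedMetrics();
        RestoreState(persisted_clients);
    }
    // Note: the destructor deliberately does not touch the singleton.

private:
    void RestoreState(int64_t persisted_clients) {
        // Restore re-registers the current state on top of whatever is there.
        ToyMetrics::instance().inc_active_clients(persisted_clients);
    }
};

// Runs two consecutive terms that each restore 3 clients and returns the
// final gauge value.
int64_t gauge_after_two_terms(bool reset_on_construct) {
    ToyMetrics::instance().ResetTermScopedMetrics();  // isolate each simulation
    { ToyMasterService term1(3, reset_on_construct); }
    { ToyMasterService term2(3, reset_on_construct); }
    return ToyMetrics::instance().active_clients();
}
```

Without the reset, `gauge_after_two_terms(false)` yields 6 even though only 3 clients exist; with the reset at the top of the constructor, `gauge_after_two_terms(true)` yields 3. In the single-construction (non-HA) case the reset runs once against zeroed metrics, which matches the "safe no-op" claim above.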
Module
mooncake-transfer-engine
mooncake-store
mooncake-ep
mooncake-integration
mooncake-p2p-store
mooncake-wheel
mooncake-pg
mooncake-rl
Type of Change
How Has This Been Tested?
Checklist
./scripts/code_format.sh before submitting.