[Store] fix: reset MasterMetricManager on MasterService recreation in HA mode by 00fish0 · Pull Request #2231 · kvcache-ai/Mooncake

00fish0 · 2026-05-26T08:58:42Z

Description

In HA mode, MasterService is destroyed and recreated on every etcd reconnect / leader transition, but MasterMetricManager is a process-level singleton that outlives those transitions. The destructor of the old MasterService does not clear the singleton, so values registered during the previous term remain. When the new MasterService runs its restore path and calls inc_* on top of those stale values, gauges double-count and per-term counters silently accumulate across leader terms.
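The double-counting described above can be reproduced in a toy model. The following is a minimal sketch, not Mooncake code: `ToyMetricManager`, `ToyService`, and `gauge_growth_after_two_terms` are hypothetical names, and the restore path is assumed to simply add the restored client count to a gauge.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a process-level metrics singleton.
class ToyMetricManager {
public:
    static ToyMetricManager& instance() {
        static ToyMetricManager manager;  // lives until process exit
        return manager;
    }
    void inc_active_clients(std::int64_t n = 1) { active_clients_ += n; }
    std::int64_t active_clients() const { return active_clients_.load(); }

private:
    ToyMetricManager() = default;
    std::atomic<std::int64_t> active_clients_{0};
};

// Hypothetical service whose restore path re-registers every known client.
class ToyService {
public:
    explicit ToyService(std::int64_t restored_clients) {
        assert(restored_clients >= 0);
        ToyMetricManager::instance().inc_active_clients(restored_clients);
    }
    // The destructor deliberately leaves the singleton untouched,
    // mirroring the lifecycle mismatch described in the PR.
};

// Simulates two leader terms that both restore the same 3 clients and
// returns how much the gauge grew. The true client count is 3.
inline std::int64_t gauge_growth_after_two_terms() {
    auto& metrics = ToyMetricManager::instance();
    const std::int64_t before = metrics.active_clients();
    { ToyService first_term(3); }  // destroyed on "leader transition"
    ToyService second_term(3);     // recreated, restores the same state
    return metrics.active_clients() - before;
}
```

In this model the gauge grows by 6 after two terms even though only 3 clients exist, because the second term increments on top of the first term's leftover value.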

This PR adds MasterMetricManager::ResetTermScopedMetrics() and calls it at the very top of the MasterService constructor, before RestoreState() or any inc_* runs. The new instance starts from a clean baseline and restore re-populates gauges with the actual current state.
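The ordering matters: the reset has to run before anything in the constructor touches the metrics. A minimal sketch of that ordering follows, assuming a simplified manager with one gauge and one counter. Only the names `ResetTermScopedMetrics` and `RestoreState` come from the PR; `SketchMetricManager`, `SketchService`, the member names, and the method bodies are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified metrics singleton (not the real MasterMetricManager).
class SketchMetricManager {
public:
    static SketchMetricManager& instance() {
        static SketchMetricManager manager;
        return manager;
    }
    void inc_active_clients(std::int64_t n = 1) { active_clients_ += n; }
    void inc_put_requests(std::int64_t n = 1) { put_requests_ += n; }
    std::int64_t active_clients() const { return active_clients_.load(); }
    std::int64_t put_requests() const { return put_requests_.load(); }

    // Assumed behaviour: zero every gauge and per-term counter.
    void ResetTermScopedMetrics() {
        active_clients_.store(0);
        put_requests_.store(0);
    }

private:
    SketchMetricManager() = default;
    std::atomic<std::int64_t> active_clients_{0};
    std::atomic<std::int64_t> put_requests_{0};
};

// Hypothetical service that resets first, then restores.
class SketchService {
public:
    explicit SketchService(std::int64_t restored_clients) {
        assert(restored_clients >= 0);
        // Reset at the very top, before any inc_* call can run.
        SketchMetricManager::instance().ResetTermScopedMetrics();
        RestoreState(restored_clients);
    }

private:
    void RestoreState(std::int64_t restored_clients) {
        SketchMetricManager::instance().inc_active_clients(restored_clients);
    }
};

// Two leader terms restoring the same 3 clients; returns the final gauge.
inline std::int64_t gauge_after_two_terms_with_reset() {
    { SketchService first_term(3); }
    SketchService second_term(3);
    return SketchMetricManager::instance().active_clients();
}
```

With the reset in place, the gauge in this model reads 3 after the second term. On the first construction in a process the reset writes zeros over zeros, which is the no-op case that applies to non-HA mode.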

Non-HA mode is unaffected: MasterService is constructed exactly once and the metrics are already 0 at that point, so the reset call is a safe no-op.

Relation to #1186

#1186 raises the same singleton-lifecycle root cause but in more general terms, and it leaves open whether counters should be cleared or synced across leader terms. This PR's scope is just the gauge-class symptom (active_clients, mem/file cache nums, capacities, soft-pin count). The reset helper also clears per-term counters as the simpler default, but that choice is not load-bearing for the bug being fixed here and can be revisited if the project prefers the sync-to-follower alternative.

Module

Type of Change

How Has This Been Tested?

Checklist

I have performed a self-review of my own code.
I have formatted my own code using ./scripts/code_format.sh before submitting.
I have updated the documentation.
I have added tests to prove my changes are effective.

gemini-code-assist

Code Review

This pull request introduces a ResetTermScopedMetrics method to MasterMetricManager to reset term-scoped metrics (gauges and counters) when a new MasterService is constructed in HA mode, preventing double-counting and silent accumulation across leader terms. The review feedback correctly identifies that the dfs_capacity_unlimited_ atomic flag is not reset in this new method, which could lead to persistent stale configurations across terms, and suggests resetting it to ensure a clean slate.

gemini-code-assist · 2026-05-26T08:59:53Z

+    // Drop the cached summary baseline; otherwise rate calculations would
+    // mix the previous term's counter deltas with the new term's zero base.
+    {
+        std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
+        summary_snapshot_ = SummarySnapshot{};
+    }


The dfs_capacity_unlimited_ atomic flag is not reset in ResetTermScopedMetrics(). If a previous MasterService instance set this flag to true (e.g., when global_file_segment_size_ was unlimited), the flag will persist as true in subsequent terms even if the new MasterService instance is configured with a limited capacity.

To ensure a completely clean slate across terms, please reset dfs_capacity_unlimited_ to false during the metrics reset.

    // Drop the cached summary baseline; otherwise rate calculations would
    // mix the previous term's counter deltas with the new term's zero base.
    {
        std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
        summary_snapshot_ = SummarySnapshot{};
    }
    dfs_capacity_unlimited_ = false;
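The reviewer's concern can be illustrated with a small model. This is a sketch under stated assumptions, not the actual Mooncake implementation: `FlagSketch`, `configure`, `reset`, and `flag_after_limited_term` are hypothetical, a negative size stands in for "unlimited", and the setter is assumed to only ever raise the flag, never lower it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a metrics object holding a capacity and an
// "unlimited" flag that outlives a single leader term.
struct FlagSketch {
    std::atomic<bool> dfs_capacity_unlimited{false};
    std::atomic<std::int64_t> dfs_capacity{0};

    // Assumed setter: a negative size means unlimited and raises the flag;
    // a positive size records the capacity but does not lower the flag.
    void configure(std::int64_t size_bytes) {
        assert(size_bytes != 0 && "zero capacity is not modelled in this sketch");
        if (size_bytes < 0) {
            dfs_capacity_unlimited.store(true);
        } else {
            dfs_capacity.store(size_bytes);
        }
    }

    // Term reset; clear_flag selects between the behaviour before and
    // after the reviewer's suggestion.
    void reset(bool clear_flag) {
        dfs_capacity.store(0);
        if (clear_flag) {
            dfs_capacity_unlimited.store(false);
        }
    }
};

// Term 1 is unlimited, term 2 is limited. Returns the flag seen in term 2.
inline bool flag_after_limited_term(bool clear_flag) {
    FlagSketch metrics;
    metrics.configure(-1);    // term 1: unlimited capacity
    metrics.reset(clear_flag);
    metrics.configure(1024);  // term 2: limited capacity
    return metrics.dfs_capacity_unlimited.load();
}
```

In this model, skipping the flag in the reset leaves it `true` during the limited second term, while clearing it yields `false`. Whether the real setter behaves this way depends on code not shown in this thread.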

fix(ha): reset metric

dee6faa

github-actions (bot) added the run-ci and Store labels on May 26, 2026

00fish0 wants to merge 1 commit into kvcache-ai:main from 00fish0:fix-ha-metric-reset
