Skip to content

[Store] fix: reset MasterMetricManager on MasterService recreation in HA mode#2231

Draft
00fish0 wants to merge 1 commit into
kvcache-ai:mainfrom
00fish0:fix-ha-metric-reset
Draft

[Store] fix: reset MasterMetricManager on MasterService recreation in HA mode#2231
00fish0 wants to merge 1 commit into
kvcache-ai:mainfrom
00fish0:fix-ha-metric-reset

Conversation

@00fish0
Copy link
Copy Markdown
Collaborator

@00fish0 00fish0 commented May 26, 2026

Description

Fixes #2210.

In HA mode, MasterService is destroyed and recreated on every etcd reconnect / leader transition, but MasterMetricManager is a process-level singleton that outlives those transitions. The destructor of the old MasterService does not clear the singleton, so values registered during the previous term remain. When the new MasterService runs its restore path and calls inc_* on top of those stale values, gauges double-count and per-term counters silently accumulate across leader terms.

This PR adds MasterMetricManager::ResetTermScopedMetrics() and calls it at the very top of the MasterService constructor, before RestoreState() or any inc_* runs. The new instance starts from a clean baseline and restore re-populates gauges with the actual current state.

Non-HA mode is unaffected: MasterService is constructed exactly once, metrics are already 0, so the reset call is a safe no-op.

Relation to #1186

#1186 raises the same singleton-lifecycle root cause but in more general terms, and it leaves open whether counters should be cleared or synced across leader terms. This PR's scope is just the gauge-class symptom (active_clients, mem/file cache nums, capacities, soft-pin count). The reset helper also clears per-term counters as the simpler default, but that choice is not load-bearing for the bug being fixed here and can be revisited if the project prefers the sync-to-follower alternative.

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a ResetTermScopedMetrics method to MasterMetricManager to reset term-scoped metrics (gauges and counters) when a new MasterService is constructed in HA mode, preventing double-counting and silent accumulation across leader terms. The review feedback correctly identifies that the dfs_capacity_unlimited_ atomic flag is not reset in this new method, which could lead to persistent stale configurations across terms, and suggests resetting it to ensure a clean slate.

Comment on lines +696 to +701
// Drop the cached summary baseline; otherwise rate calculations would
// mix the previous term's counter deltas with the new term's zero base.
{
std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
summary_snapshot_ = SummarySnapshot{};
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The dfs_capacity_unlimited_ atomic flag is not reset in ResetTermScopedMetrics(). If a previous MasterService instance set this flag to true (e.g., when global_file_segment_size_ was unlimited), the flag will persist as true in subsequent terms even if the new MasterService instance is configured with a limited capacity.

To ensure a completely clean slate across terms, please reset dfs_capacity_unlimited_ to false during the metrics reset.

    // Drop the cached summary baseline; otherwise rate calculations would
    // mix the previous term's counter deltas with the new term's zero base.
    {
        std::lock_guard<std::mutex> lock(summary_snapshot_mutex_);
        summary_snapshot_ = SummarySnapshot{};
    }
    dfs_capacity_unlimited_ = false;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

1 participant