vllm: support mooncake_kv_store + expose mooncake_master metrics by esmeetu · Pull Request #157 · NVIDIA/srt-slurm

esmeetu · 2026-05-14T06:18:41Z

Summary

Extends mooncake_master orchestration (previously SGLang-only) to the vLLM backend so vLLM workers can use the in-process MooncakeStoreConnector for cross-process KV sharing. Also wires the master's admin HTTP server (/metrics, /health, etc.) on a configurable port so ops can scrape Prometheus metrics.

What's new

VLLMMooncakeKVStoreConfig (backends/vllm.py): mirrors the SGLang mooncake_kv_store: block and adds a vLLM-specific store_config: section. srtslurm renders that section into the MOONCAKE_CONFIG_PATH JSON file vLLM's MooncakeStoreConnector reads on startup (MooncakeStoreConfig.load_from_env()); the env: map is injected on every vLLM worker for in-process MC_* knobs (e.g. MC_ENABLE_DEST_DEVICE_AFFINITY, MC_STORE_CLIENT_METRIC, MC_TE_METRIC).
Shared launch constants moved to a new backends/mooncake.py; the SGLang module re-exports them so existing imports keep working.
start_mooncake_master now fires for vLLM as well as SGLang, writes the per-job store JSON next to the other log artifacts, and stamps the master RPC + HTTP-metadata endpoints on each worker.
Admin HTTP server: the master srun now passes --enable_metric_reporting=true --metrics_port=9003 (the upstream default in master.cpp), and start_mooncake_master waits on the metrics port just as it already waits on the RPC and HTTP-metadata ports. The flag only toggles a periodic stdout log thread; MasterAdminServer listens unconditionally, and the comment on the new MOONCAKE_METRICS_PORT constant documents this to prevent future reordering. The server exposes /metrics, /metrics/summary, /health, /role, /ha_status, /leader, and /query_key.
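
The port wait mentioned in the last item can be pictured with a small polling helper. This is a minimal sketch, not the srtctl implementation: the name `wait_for_port`, its timeout values, and the use of a plain TCP connect as the readiness check are all assumptions for illustration.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout_s: float = 60.0, interval_s: float = 0.5
) -> bool:
    """Return True once a TCP connect to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: give up after the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A launcher would presumably call a helper like this once per endpoint (RPC, HTTP-metadata, metrics) and fail the job if any call returns False.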

Test plan

make check (ruff + pytest, 732 passed / 2 skipped)
New cases in tests/test_e2e.py cover vLLM mooncake env injection and JSON rendering
New cases in tests/test_dry_run.py cover the new config surface in srtctl dry-run output

Extends mooncake_master orchestration (previously SGLang-only) to the vLLM backend so vLLM workers can use the in-process MooncakeStore connector for cross-process KV sharing.

What's new
----------

* `VLLMMooncakeKVStoreConfig` schema (`backends/vllm.py`): mirrors the SGLang `mooncake_kv_store:` block, plus a vLLM-specific `store_config:` section. srtslurm renders that section into the `MOONCAKE_CONFIG_PATH` JSON file vLLM's `MooncakeStoreConnector` reads on startup (`MooncakeStoreConfig.load_from_env()`); the `env:` map is injected on every vLLM worker for the in-process `MC_*` knobs (e.g. `MC_ENABLE_DEST_DEVICE_AFFINITY`, `MC_STORE_CLIENT_METRIC`, `MC_TE_METRIC`).
* `start_mooncake_master` in `cli/do_sweep.py` now fires for vLLM as well as SGLang, writes the per-job store JSON next to the other log artifacts, and stamps the master RPC + HTTP-metadata endpoints on each worker. Shared launch constants moved to a new `backends/mooncake.py` (the SGLang module re-exports them so downstream imports keep working).
* Mooncake admin HTTP server (`/metrics`, `/health`, `/role`, `/ha_status`, `/leader`, `/query_key`) is now wired up explicitly: the master srun passes `--enable_metric_reporting=true --metrics_port=9003` (upstream default in master.cpp), and `start_mooncake_master` waits on the metrics port like it already waits on RPC + HTTP-metadata. The flag only toggles a periodic stdout log thread; `MasterAdminServer` listens unconditionally, and the new constant's comment documents that to prevent future reordering.
* Docs (`docs/mooncake-kv-store.md`): adds a vLLM quick start, a vLLM-specific configuration reference (including `store_config`), expands the ownership table for the auto-stamped vars, and documents the new master metrics endpoint.

Validation
----------

* Manual: launched a 1P/1D disagg job; master came up with `enable_metric_reporting=1, metrics_port=9003`; vLLM workers picked up `MOONCAKE_CONFIG_PATH` and registered RDMA segments. `curl :9003/metrics` returned a full Prometheus dump.
* Tests: `make check` (ruff + pytest). New cases in `tests/test_e2e.py` cover the vLLM mooncake env injection and JSON rendering; `tests/test_dry_run.py` cases cover the new config surface in `srtctl dry-run` output.

Signed-off-by: inf-yasong <yasong.wang@inferact.ai>
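
For anyone who prefers to check the metrics endpoint from Python rather than curl, the sketch below parses Prometheus text exposition output into a dict. The sample payload and its metric names are made up for illustration and are not real Mooncake metrics. The parser handles only simple `name value` and `name{labels} value` samples and ignores any trailing timestamp.

```python
SAMPLE = """\
# HELP example_requests_total Hypothetical counter, not a real Mooncake metric.
# TYPE example_requests_total counter
example_requests_total 42
example_bytes{kind="put"} 1024
"""


def parse_prometheus_text(body: str) -> dict[str, float]:
    """Map each sample (name plus optional label set) to its float value."""
    samples: dict[str, float] = {}
    for raw in body.splitlines():
        line = raw.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            head, rest = line.rsplit("}", 1)
            key = head + "}"
        else:
            key, rest = line.split(None, 1)
        # The first token after the name is the value; a timestamp may follow.
        samples[key] = float(rest.split()[0])
    return samples


metrics = parse_prometheus_text(SAMPLE)
```

In a real check, the body would come from an HTTP GET against the master's `/metrics` path instead of `SAMPLE`.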

esmeetu · 2026-05-15T02:07:28Z

Thanks for the review @qiching — addressed all three in 8306191:

endpoints_to_processes signature — added port_allocator: NodePortAllocator | None = None to match BackendProtocol, and threaded it through to topology.endpoints_to_processes(...) (both the standard-TP and DP-mode paths) so per-job port jitter from NodePortAllocator.from_job_id applies uniformly to vLLM jobs as well.
Unrelated NIXL env var — removed VLLM_NIXL_SIDE_CHANNEL_HOST and the get_hostname_ip import; that change isn't mooncake-related and shouldn't have been in this PR.
store_config typing — widened to dict[str, Any] (both the field on VLLMMooncakeKVStoreConfig and the return / local in build_mooncake_store_config). Added a short comment explaining that vLLM's MooncakeStoreConfig is a mix of str, int, and human-readable size strings, so dict[str, str] would force users to quote numeric values unnecessarily.

make check clean (732 passed, 2 skipped).

- Restore BackendProtocol.endpoints_to_processes signature by adding port_allocator parameter and threading it through to the topology helper so per-job port jitter applies uniformly.
- Drop unrelated VLLM_NIXL_SIDE_CHANNEL_HOST / get_hostname_ip import that was not mooncake-related.
- Widen store_config to dict[str, Any] (matches MooncakeStoreConfig's mix of str/int/size-string fields).

Resolve conflicts from PR NVIDIA#156 (centralized port allocation) which moved port constants out of per-backend modules.

- Delete src/srtctl/backends/mooncake.py (constants now live in ports.py)
- Add MOONCAKE_METRICS_PORT (8702) to srtctl/ports.py alongside the existing MOONCAKE_MASTER_PORT (8700) and MOONCAKE_HTTP_METADATA_PORT (8701) so all three live in the same 8700 range
- Repoint imports in sglang.py, vllm.py, do_sweep.py, submit.py
- Refresh docs/mooncake-kv-store.md and CLAUDE.md to use the new ports
- Update test_dry_run.py and test_e2e.py to import port constants from srtctl.ports instead of hardcoding 50051

Workers now see MOONCAKE_MASTER=<infra>:8700 (was 50051) and MOONCAKE_TE_META_DATA_SERVER=http://<infra>:8701/metadata. The mooncake_master srun is launched with explicit --port, --http_metadata_server_port, and --metrics_port flags so we don't rely on mooncake's compile-time defaults.
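
The port layout and worker-facing variables from this commit can be summarized in a short sketch. The constant names and values, the two environment variable names, and the three master flags come from the commit message above; the helpers `worker_env` and `master_args` are hypothetical, the `--flag=value` spelling is assumed, and any other flags the real srun passes are omitted.

```python
# Values as stated in the commit message (srtctl/ports.py).
MOONCAKE_MASTER_PORT = 8700
MOONCAKE_HTTP_METADATA_PORT = 8701
MOONCAKE_METRICS_PORT = 8702


def worker_env(infra_host: str) -> dict[str, str]:
    """Endpoints stamped on each worker (hypothetical helper)."""
    return {
        "MOONCAKE_MASTER": f"{infra_host}:{MOONCAKE_MASTER_PORT}",
        "MOONCAKE_TE_META_DATA_SERVER": (
            f"http://{infra_host}:{MOONCAKE_HTTP_METADATA_PORT}/metadata"
        ),
    }


def master_args() -> list[str]:
    """Explicit port flags so compile-time defaults are never relied on."""
    return [
        "mooncake_master",
        f"--port={MOONCAKE_MASTER_PORT}",
        f"--http_metadata_server_port={MOONCAKE_HTTP_METADATA_PORT}",
        f"--metrics_port={MOONCAKE_METRICS_PORT}",
    ]
```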

Addresses two review comments on PR NVIDIA#157:

- build_mooncake_store_config now merges defaults with the user dict (`{**defaults, **user_cfg}`) and only force-overrides master_server_address. New fields vLLM adds to MooncakeStoreConfig upstream pass through automatically; no code change is needed here.
- CLAUDE.md mooncake_kv_store section drops the "SGLang only" qualifier and notes the vLLM behavior (env injection + MOONCAKE_CONFIG_PATH JSON).
- Add a unit test verifying unknown user keys propagate to the JSON.
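
The merge order described in the first item can be sketched as follows. This is not the code from the PR: the function name `merge_store_config` and the sample keys and values are assumptions. Only the `{**defaults, **user_cfg}` precedence and the forced `master_server_address` come from the commit message.

```python
from typing import Any


def merge_store_config(
    defaults: dict[str, Any], user_cfg: dict[str, Any], master_addr: str
) -> dict[str, Any]:
    # Later keys win in a dict merge, so user values override defaults and
    # keys this code has never heard of pass through untouched.
    merged = {**defaults, **user_cfg}
    # The master address is only known at launch time, so it always wins,
    # even over a value the user supplied.
    merged["master_server_address"] = master_addr
    return merged


example = merge_store_config(
    defaults={"protocol": "tcp"},  # placeholder default
    user_cfg={"protocol": "rdma", "some_future_field": 7},  # made-up keys
    master_addr="node01:8700",
)
```

Because the values are typed as `Any`, the integer `7` stays an integer when the dict is later serialized to JSON, which is the point of widening the field from `dict[str, str]`.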

…examples

- Default `global_segment_size` was 4GB (copied from the SGLang quickstart), which is too small for real workloads; bump it to 100GB to match the production docs example. Updated test and docs accordingly.
- vLLM example YAML no longer lists `MOONCAKE_PROTOCOL` / `MOONCAKE_DEVICE` / `MOONCAKE_GLOBAL_SEGMENT_SIZE` under `env:`. Those are SGLang-only knobs; for vLLM they go in `store_config` (vLLM reads them from the JSON file). Replaced with the `MC_*` knobs that vLLM's in-process Mooncake C++ libs actually read.

…icitly

`build_mooncake_store_config` no longer pre-fills `metadata_server`, `global_segment_size`, `local_buffer_size`, `protocol`, `device_name`. These are hardware-specific knobs (HBM size, NIC layout, RDMA vs TCP). Silently using a srtslurm-picked default mis-sizes the KV cache or picks a transport the cluster doesn't support, and the failure mode is slow/wrong instead of loud. Pass-through-only forces users to think about the values explicitly; vLLM throws a clear missing-field error if they're absent. `master_server_address` is still force-overridden (the user can't know the infra IP at config time).

- Updated docs and the existing test to assert pass-through semantics
- Quickstart examples already list all the fields, so DX is unchanged
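
Pass-through semantics combined with the JSON rendering step might look like the sketch below. The function `render_store_config`, the output file name, and the sample values are hypothetical; the field names, `MOONCAKE_CONFIG_PATH`, and the forced `master_server_address` are taken from the text above.

```python
import json
from pathlib import Path
from typing import Any


def render_store_config(
    user_cfg: dict[str, Any], master_addr: str, out_dir: Path
) -> dict[str, str]:
    """Write the store config JSON and return the env var pointing at it."""
    # No defaults are injected: whatever the user wrote is what vLLM sees.
    cfg = dict(user_cfg)
    # Only the master address is set here, since it is unknown at config time.
    cfg["master_server_address"] = master_addr
    path = out_dir / "mooncake_store_config.json"  # hypothetical file name
    path.write_text(json.dumps(cfg, indent=2))
    return {"MOONCAKE_CONFIG_PATH": str(path)}
```

If the user omits a field such as `protocol`, it is simply absent from the file, and per the commit message vLLM is expected to report the missing field itself.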

The previous example used a standalone MooncakeStoreConnector with kv_role kv_producer/kv_consumer, which doesn't reflect how this is actually deployed. The real production form is a MultiConnector wrapping NixlConnector (P2P transfer) + MooncakeStoreConnector (shared store), both with kv_role: kv_both so prefill and decode run identical connector stacks. The validator already accepts this form (see schema.py:1363, test_e2e.py:797).
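
A connector stack of that shape could be expressed as the kv-transfer config below. This is a sketch that assumes vLLM's MultiConnector convention of listing child connectors under `kv_connector_extra_config.connectors`; the exact keys accepted may differ between vLLM versions, and any per-connector extra settings are omitted. The helper names are hypothetical.

```python
import json
from typing import Any


def child_connector(name: str) -> dict[str, Any]:
    # Every connector uses kv_both so prefill and decode share one stack.
    return {"kv_connector": name, "kv_role": "kv_both"}


def multi_connector_config() -> dict[str, Any]:
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "connectors": [
                child_connector("NixlConnector"),  # P2P transfer
                child_connector("MooncakeStoreConnector"),  # shared store
            ],
        },
    }


# Serialized form, e.g. for passing as a single JSON argument.
config_json = json.dumps(multi_connector_config())
```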

esmeetu · 2026-05-16T00:55:17Z

@qiching It should be clean now.

qiching

LGTM. Good job. Thank you! @esmeetu

esmeetu requested review from alec-flowers, csahithi, hjjq, ishandhanani, kedarpotdar-nv, kyleliang-nv, nlevin-ui and qiching as code owners May 14, 2026 06:18

esmeetu force-pushed the mooncake-store-vllm branch from 8baf3b0 to f7a1016 Compare May 14, 2026 06:22

qiching reviewed May 14, 2026

Comment thread src/srtctl/backends/vllm.py Outdated

qiching reviewed May 14, 2026

Comment thread src/srtctl/backends/vllm.py Outdated

qiching reviewed May 14, 2026

Comment thread src/srtctl/backends/vllm.py Outdated

esmeetu force-pushed the mooncake-store-vllm branch from 8306191 to 407e19d Compare May 15, 2026 02:12

esmeetu force-pushed the mooncake-store-vllm branch from 407e19d to ae711b3 Compare May 15, 2026 02:15

qiching reviewed May 15, 2026

Comment thread src/srtctl/backends/vllm.py Outdated

Comment thread CLAUDE.md Outdated

esmeetu added 4 commits May 16, 2026 00:16

qiching approved these changes May 18, 2026

esmeetu wants to merge 7 commits into NVIDIA:main from esmeetu:mooncake-store-vllm
