vllm: support mooncake_kv_store + expose mooncake_master metrics #157

Open: esmeetu wants to merge 7 commits into NVIDIA:main from esmeetu:mooncake-store-vllm

@esmeetu esmeetu commented May 14, 2026

Summary

Extends mooncake_master orchestration (previously SGLang-only) to the vLLM backend so vLLM workers can use the in-process MooncakeStoreConnector for cross-process KV sharing. Also wires the master's admin HTTP server (/metrics, /health, etc.) on a configurable port so ops can scrape Prometheus metrics.

What's new

  • VLLMMooncakeKVStoreConfig (backends/vllm.py): mirrors the SGLang mooncake_kv_store: block and adds a vLLM-specific store_config: section. srtslurm renders that section into the MOONCAKE_CONFIG_PATH JSON file vLLM's MooncakeStoreConnector reads on startup (MooncakeStoreConfig.load_from_env()); the env: map is injected on every vLLM worker for in-process MC_* knobs (e.g. MC_ENABLE_DEST_DEVICE_AFFINITY, MC_STORE_CLIENT_METRIC, MC_TE_METRIC).
  • Shared launch constants moved to a new backends/mooncake.py; the SGLang module re-exports them so existing imports keep working.
  • start_mooncake_master now fires for vLLM as well as SGLang, writes the per-job store JSON next to the other log artifacts, and stamps the master RPC + HTTP-metadata endpoints on each worker.
  • Admin HTTP server: master srun now passes --enable_metric_reporting=true --metrics_port=9003 (upstream default in master.cpp), and start_mooncake_master waits on the metrics port like it already waits on RPC + HTTP-metadata. The --enable_metric_reporting flag only toggles a periodic stdout log thread; MasterAdminServer listens unconditionally, and the comment on the new MOONCAKE_METRICS_PORT constant documents this to prevent future reordering. Exposes /metrics, /metrics/summary, /health, /role, /ha_status, /leader, /query_key.
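The rendering and env-injection step described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the helper names `render_store_config` and `build_worker_env`, the file name `mooncake_store_config.json`, and the sample values are assumptions; only `MOONCAKE_CONFIG_PATH`, the `MC_*` naming, and the `store_config` / `env` split come from the description above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def render_store_config(store_config: dict[str, Any], log_dir: Path) -> Path:
    """Write the store_config section as JSON and return the file path."""
    # Hypothetical file name: the PR only says the JSON is written next to
    # the other log artifacts.
    path = log_dir / "mooncake_store_config.json"
    path.write_text(json.dumps(store_config, indent=2))
    return path


def build_worker_env(config_path: Path, user_env: dict[str, str]) -> dict[str, str]:
    """Combine the user's env: map with the path the connector reads on startup."""
    env = dict(user_env)  # copy so the caller's mapping is left untouched
    env["MOONCAKE_CONFIG_PATH"] = str(config_path)
    return env


with tempfile.TemporaryDirectory() as tmp:
    # Sample values are illustrative only.
    cfg_path = render_store_config(
        {"protocol": "rdma", "global_segment_size": "100GB"}, Path(tmp)
    )
    worker_env = build_worker_env(cfg_path, {"MC_TE_METRIC": "1"})
    rendered = json.loads(cfg_path.read_text())
```

Keeping the JSON file and the env map separate mirrors the split in the description: the file carries what the connector loads from `MOONCAKE_CONFIG_PATH`, while `env:` carries the in-process `MC_*` knobs.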

Test plan

  • make check (ruff + pytest, 732 passed / 2 skipped)
  • New cases in tests/test_e2e.py cover vLLM mooncake env injection and JSON rendering
  • New cases in tests/test_dry_run.py cover the new config surface in srtctl dry-run output

Extends mooncake_master orchestration (previously SGLang-only) to the
vLLM backend so vLLM workers can use the in-process MooncakeStore
connector for cross-process KV sharing.

What's new
----------

* `VLLMMooncakeKVStoreConfig` schema (`backends/vllm.py`): mirrors the
  SGLang `mooncake_kv_store:` block, plus a vLLM-specific
  `store_config:` section. srtslurm renders that section into the
  `MOONCAKE_CONFIG_PATH` JSON file vLLM's `MooncakeStoreConnector`
  reads on startup (`MooncakeStoreConfig.load_from_env()`); the
  `env:` map is injected on every vLLM worker for the in-process
  `MC_*` knobs (e.g. `MC_ENABLE_DEST_DEVICE_AFFINITY`,
  `MC_STORE_CLIENT_METRIC`, `MC_TE_METRIC`).

* `start_mooncake_master` in `cli/do_sweep.py` now fires for vLLM as
  well as SGLang, writes the per-job store JSON next to the other
  log artifacts, and stamps the master RPC + HTTP-metadata endpoints
  on each worker. Shared launch constants moved to a new
  `backends/mooncake.py` (the SGLang module re-exports them so
  downstream imports keep working).

* Mooncake admin HTTP server (`/metrics`, `/health`, `/role`,
  `/ha_status`, `/leader`, `/query_key`) is now wired up explicitly:
  the master srun passes `--enable_metric_reporting=true
  --metrics_port=9003` (upstream default in master.cpp), and
  `start_mooncake_master` waits on the metrics port like it already
  waits on RPC + HTTP-metadata. `--enable_metric_reporting` only
  toggles a periodic stdout log thread; `MasterAdminServer` listens
  unconditionally. The new constant's comment documents this to
  prevent future reordering.

* Docs (`docs/mooncake-kv-store.md`): adds a vLLM quick start, a
  vLLM-specific configuration reference (including `store_config`),
  expands the ownership table for the auto-stamped vars, and
  documents the new master metrics endpoint.
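Waiting on the RPC, HTTP-metadata, and metrics ports before launching workers could look something like the sketch below. It is an assumption about the general technique (polling a TCP connect until it succeeds), not the actual body of `start_mooncake_master`; `wait_for_port` and `wait_for_all` are hypothetical names.

```python
import socket
import time
from collections.abc import Iterable


def wait_for_port(host: str, port: int, timeout_s: float = 30.0,
                  interval_s: float = 0.5) -> bool:
    """Poll until a TCP connect to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not up yet.
            time.sleep(interval_s)
    return False


def wait_for_all(host: str, ports: Iterable[int], timeout_s: float = 30.0) -> bool:
    """Return True only if every port becomes reachable."""
    return all(wait_for_port(host, p, timeout_s) for p in ports)
```

A caller would pass the three master ports in one call, for example `wait_for_all(infra_host, [rpc_port, metadata_port, metrics_port])`, and abort the job if it returns `False`.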

Validation
----------

* Manual: launched a 1P/1D disagg job; master came up with
  `enable_metric_reporting=1, metrics_port=9003`; vLLM workers
  picked up `MOONCAKE_CONFIG_PATH` and registered RDMA segments.
  `curl :9003/metrics` returned a full Prometheus dump.

* Tests: `make check` (ruff + pytest). New cases in
  `tests/test_e2e.py` cover the vLLM mooncake env injection and JSON
  rendering; `tests/test_dry_run.py` cases cover the new config
  surface in `srtctl dry-run` output.
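The manual `curl :9003/metrics` check can be scripted. The sketch below fetches the endpoint and parses the Prometheus text format into a dict. It is deliberately simplified: it assumes samples carry no trailing timestamp, and the metric names used in the usage note are made up for illustration, not taken from mooncake.

```python
import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes "name{labels} value" with no trailing timestamp.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def scrape(host: str, port: int) -> dict[str, float]:
    """Fetch /metrics from the master's admin server and parse it."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode())
```

For instance, `parse_metrics('# TYPE example_total counter\nexample_total 3\n')` yields a single entry keyed `example_total`.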

Signed-off-by: inf-yasong <yasong.wang@inferact.ai>
@esmeetu esmeetu force-pushed the mooncake-store-vllm branch from 8baf3b0 to f7a1016 on May 14, 2026 06:22
Comment thread src/srtctl/backends/vllm.py Outdated
Comment thread src/srtctl/backends/vllm.py Outdated
Comment thread src/srtctl/backends/vllm.py Outdated

esmeetu commented May 15, 2026

Thanks for the review @qiching — addressed all three in 8306191:

  1. endpoints_to_processes signature — added port_allocator: NodePortAllocator | None = None to match BackendProtocol, and threaded it through to topology.endpoints_to_processes(...) (both the standard-TP and DP-mode paths) so per-job port jitter from NodePortAllocator.from_job_id applies uniformly to vLLM jobs as well.
  2. Unrelated NIXL env var — removed VLLM_NIXL_SIDE_CHANNEL_HOST and the get_hostname_ip import; that change isn't mooncake-related and shouldn't have been in this PR.
  3. store_config typing — widened to dict[str, Any] (both the field on VLLMMooncakeKVStoreConfig and the return / local in build_mooncake_store_config). Added a short comment explaining that vLLM's MooncakeStoreConfig is a mix of str, int, and human-readable size strings, so dict[str, str] would force users to quote numeric values unnecessarily.

make check clean (732 passed, 2 skipped).
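The typing point in item 3 can be illustrated with a short sketch. The key `example_int_field` is hypothetical (the thread does not name which fields are integers); `global_segment_size` is taken from the PR.

```python
import json
from typing import Any


def to_json(store_config: dict[str, Any]) -> str:
    """Serialize the config as-is, so each value keeps its native JSON type."""
    return json.dumps(store_config, sort_keys=True)


def to_json_str_only(store_config: dict[str, Any]) -> str:
    """What a dict[str, str] schema would effectively force: every value quoted."""
    return json.dumps({k: str(v) for k, v in store_config.items()}, sort_keys=True)


cfg: dict[str, Any] = {"global_segment_size": "100GB", "example_int_field": 8}
native = json.loads(to_json(cfg))
quoted = json.loads(to_json_str_only(cfg))
```

With `dict[str, Any]`, the integer reaches the JSON file as a number; under a string-only schema the consumer would receive `"8"` and have to coerce it.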

@esmeetu esmeetu force-pushed the mooncake-store-vllm branch from 8306191 to 407e19d on May 15, 2026 02:12
- Restore BackendProtocol.endpoints_to_processes signature by adding
  port_allocator parameter and threading it through to the topology
  helper so per-job port jitter applies uniformly.
- Drop unrelated VLLM_NIXL_SIDE_CHANNEL_HOST / get_hostname_ip import
  that was not mooncake-related.
- Widen store_config to dict[str, Any] (matches MooncakeStoreConfig's
  mix of str/int/size-string fields).
@esmeetu esmeetu force-pushed the mooncake-store-vllm branch from 407e19d to ae711b3 on May 15, 2026 02:15
Resolve conflicts from PR NVIDIA#156 (centralized port allocation) which
moved port constants out of per-backend modules.

- Delete src/srtctl/backends/mooncake.py (constants now live in ports.py)
- Add MOONCAKE_METRICS_PORT (8702) to srtctl/ports.py alongside the
  existing MOONCAKE_MASTER_PORT (8700) and MOONCAKE_HTTP_METADATA_PORT
  (8701) so all three live in the same 8700-range
- Repoint imports in sglang.py, vllm.py, do_sweep.py, submit.py
- Refresh docs/mooncake-kv-store.md and CLAUDE.md to use the new ports
- Update test_dry_run.py and test_e2e.py to import port constants from
  srtctl.ports instead of hardcoding 50051

Workers now see MOONCAKE_MASTER=<infra>:8700 (was 50051) and
MOONCAKE_TE_META_DATA_SERVER=http://<infra>:8701/metadata. The
mooncake_master srun is launched with explicit --port, --http_metadata_server_port,
and --metrics_port flags so we don't rely on mooncake's compile-time defaults.
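As a rough sketch of what this commit describes, the three constants, the explicit master flags, and the endpoints stamped on workers fit together as below. The constant names, port values, flag names, and env var names are from this PR; the helper functions `master_args` and `worker_endpoints` are hypothetical, and including `--enable_metric_reporting=true` here assumes it is still passed as stated in the PR description.

```python
MOONCAKE_MASTER_PORT = 8700
MOONCAKE_HTTP_METADATA_PORT = 8701
MOONCAKE_METRICS_PORT = 8702


def master_args() -> list[str]:
    """Command line for the master with every port set explicitly."""
    return [
        "mooncake_master",
        f"--port={MOONCAKE_MASTER_PORT}",
        f"--http_metadata_server_port={MOONCAKE_HTTP_METADATA_PORT}",
        "--enable_metric_reporting=true",
        f"--metrics_port={MOONCAKE_METRICS_PORT}",
    ]


def worker_endpoints(infra_host: str) -> dict[str, str]:
    """Env vars pointing each worker at the master running on the infra node."""
    return {
        "MOONCAKE_MASTER": f"{infra_host}:{MOONCAKE_MASTER_PORT}",
        "MOONCAKE_TE_META_DATA_SERVER": (
            f"http://{infra_host}:{MOONCAKE_HTTP_METADATA_PORT}/metadata"
        ),
    }
```

Deriving both the launch flags and the worker env from the same constants is what keeps the two sides from drifting when a port changes.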
Comment thread src/srtctl/backends/vllm.py Outdated
Comment thread CLAUDE.md Outdated
esmeetu added 4 commits May 16, 2026 00:16
Addresses two review comments on PR NVIDIA#157:

- build_mooncake_store_config now merges defaults with the user dict
  (`{**defaults, **user_cfg}`) and only force-overrides
  master_server_address. New fields vLLM adds to MooncakeStoreConfig
  upstream pass through automatically — no code change here needed.
- CLAUDE.md mooncake_kv_store section drops the "SGLang only" qualifier
  and notes the vLLM behavior (env injection + MOONCAKE_CONFIG_PATH JSON).
- Add a unit test verifying unknown user keys propagate to the JSON.
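The merge semantics in the first bullet can be sketched as below. This is an illustration of the `{**defaults, **user_cfg}` pattern only; the function signature, the default value, and the key `some_future_field` are assumptions, not the PR's `build_mooncake_store_config`.

```python
from typing import Any


def merge_store_config(defaults: dict[str, Any], user_cfg: dict[str, Any],
                       master_addr: str) -> dict[str, Any]:
    """User keys win over defaults; master_server_address is always overridden."""
    merged = {**defaults, **user_cfg}  # later mapping takes precedence
    merged["master_server_address"] = master_addr
    return merged


merged = merge_store_config(
    defaults={"protocol": "rdma"},
    user_cfg={
        "protocol": "tcp",
        "some_future_field": 1,            # unknown key, passes through
        "master_server_address": "stale",  # replaced below
    },
    master_addr="10.0.0.1:8700",
)
```

Because the result starts from the user's dict rather than a fixed field list, keys the code has never heard of reach the JSON unchanged.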
…examples

- Default `global_segment_size` was 4GB (copied from SGLang quickstart),
  too small for real workloads — bump to 100GB to match the production
  docs example. Updated test and docs accordingly.
- vLLM example YAML no longer lists `MOONCAKE_PROTOCOL` / `MOONCAKE_DEVICE`
  / `MOONCAKE_GLOBAL_SEGMENT_SIZE` under `env:` — those are SGLang-only
  knobs; for vLLM these go in `store_config` (it reads from JSON). Replaced
  with the `MC_*` knobs that vLLM's in-process Mooncake C++ libs actually
  read.
…icitly

`build_mooncake_store_config` no longer pre-fills `metadata_server`,
`global_segment_size`, `local_buffer_size`, `protocol`, `device_name`.
These are hardware-specific knobs (HBM size, NIC layout, RDMA vs TCP) —
silently using a srtslurm-picked default mis-sizes the KV cache or
picks a transport the cluster doesn't support, and the failure mode is
slow/wrong instead of loud. Pass-through-only forces users to think
about the values explicitly; vLLM throws a clear missing-field error
if they're absent.

`master_server_address` is still force-overridden (the user can't know
the infra IP at config time).

- Updated docs and the existing test to assert pass-through semantics
- Quickstart examples already list all the fields, so DX is unchanged
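Pass-through-only behavior might be sketched like this. The five field names come from the commit message above; `passthrough_store_config` and `missing_hardware_fields` are hypothetical helpers, and the pre-flight check is an illustration of how a caller could surface gaps early, not something this PR is stated to add.

```python
from typing import Any

# The fields this commit stops pre-filling.
HARDWARE_FIELDS = (
    "metadata_server",
    "global_segment_size",
    "local_buffer_size",
    "protocol",
    "device_name",
)


def passthrough_store_config(user_cfg: dict[str, Any], master_addr: str) -> dict[str, Any]:
    """Copy the user dict verbatim; only master_server_address is set here."""
    return {**user_cfg, "master_server_address": master_addr}


def missing_hardware_fields(user_cfg: dict[str, Any]) -> list[str]:
    """List hardware-specific fields the user has not supplied."""
    return [name for name in HARDWARE_FIELDS if name not in user_cfg]
```

With no defaults in play, a config that omits `protocol` produces JSON that omits `protocol`, leaving the missing-field error to the consumer.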
The previous example used standalone MooncakeStoreConnector with
kv_role kv_producer/kv_consumer, which doesn't reflect how this is
actually deployed. Real production form is MultiConnector wrapping
NixlConnector (P2P transfer) + MooncakeStoreConnector (shared store),
both with kv_role: kv_both so prefill and decode run identical
connector stacks. The validator already accepts this form (see
schema.py:1363, test_e2e.py:797).
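The connector stack described here might be expressed as the following transfer config. The nesting under `kv_connector_extra_config` / `connectors` is an assumption based on the format vLLM documents for `MultiConnector` and should be checked against the vLLM version in use; the helper name is hypothetical.

```python
import json
from typing import Any


def multi_connector_config() -> dict[str, Any]:
    """MultiConnector wrapping NixlConnector and MooncakeStoreConnector, all kv_both."""
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            "connectors": [
                # P2P transfer between prefill and decode workers.
                {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
                # Shared store backed by the mooncake master.
                {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
            ]
        },
    }


# The same JSON string can be handed to prefill and decode workers alike.
kv_transfer_json = json.dumps(multi_connector_config())
```

Since every role is `kv_both`, prefill and decode workers receive the identical string, which is the point the commit message makes about running the same connector stack on both sides.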

esmeetu commented May 16, 2026

@qiching It should be clean now.

@qiching qiching (Collaborator) left a comment

LGTM. Good job. Thank you! @esmeetu
