vllm: support mooncake_kv_store + expose mooncake_master metrics #157
Open
esmeetu wants to merge 7 commits into
Extends mooncake_master orchestration (previously SGLang-only) to the vLLM backend so vLLM workers can use the in-process MooncakeStore connector for cross-process KV sharing.

What's new
----------

* `VLLMMooncakeKVStoreConfig` schema (`backends/vllm.py`): mirrors the SGLang `mooncake_kv_store:` block, plus a vLLM-specific `store_config:` section. srtslurm renders that section into the `MOONCAKE_CONFIG_PATH` JSON file that vLLM's `MooncakeStoreConnector` reads on startup (`MooncakeStoreConfig.load_from_env()`); the `env:` map is injected on every vLLM worker for the in-process `MC_*` knobs (e.g. `MC_ENABLE_DEST_DEVICE_AFFINITY`, `MC_STORE_CLIENT_METRIC`, `MC_TE_METRIC`).
* `start_mooncake_master` in `cli/do_sweep.py` now fires for vLLM as well as SGLang, writes the per-job store JSON next to the other log artifacts, and stamps the master RPC + HTTP-metadata endpoints on each worker. Shared launch constants moved to a new `backends/mooncake.py` (the SGLang module re-exports them so downstream imports keep working).
* Mooncake admin HTTP server (`/metrics`, `/health`, `/role`, `/ha_status`, `/leader`, `/query_key`) is now wired up explicitly: the master srun passes `--enable_metric_reporting=true --metrics_port=9003` (upstream default in master.cpp), and `start_mooncake_master` waits on the metrics port like it already waits on RPC + HTTP-metadata. The flag only toggles a periodic stdout log thread; `MasterAdminServer` listens unconditionally, and the new constant's comment documents that to prevent future reordering.
* Docs (`docs/mooncake-kv-store.md`): adds a vLLM quick start, a vLLM-specific configuration reference (including `store_config`), expands the ownership table for the auto-stamped vars, and documents the new master metrics endpoint.

Validation
----------

* Manual: launched a 1P/1D disagg job; master came up with `enable_metric_reporting=1, metrics_port=9003`; vLLM workers picked up `MOONCAKE_CONFIG_PATH` and registered RDMA segments. `curl :9003/metrics` returned a full Prometheus dump.
* Tests: `make check` (ruff + pytest). New cases in `tests/test_e2e.py` cover the vLLM mooncake env injection and JSON rendering; `tests/test_dry_run.py` cases cover the new config surface in `srtctl dry-run` output.

Signed-off-by: inf-yasong <yasong.wang@inferact.ai>
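The readiness wait mentioned above (RPC, HTTP-metadata, and now the metrics port) could look roughly like the sketch below. This is an illustration only, not the code in `cli/do_sweep.py`; the helper name `wait_for_port` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connect to host:port succeeds, False if timeout_s elapses first.

    Hypothetical helper: the real start_mooncake_master may probe differently.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A launcher would call this once per port and fail the job early if any call returns False. A successful connect only shows that the listener is up, not that `/metrics` is serving useful data.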
8baf3b0 to
f7a1016
qiching
reviewed
May 14, 2026
qiching
reviewed
May 14, 2026
qiching
reviewed
May 14, 2026
Author
Thanks for the review, @qiching. Addressed all three in 8306191:
8306191 to
407e19d
- Restore BackendProtocol.endpoints_to_processes signature by adding a port_allocator parameter and threading it through to the topology helper so per-job port jitter applies uniformly.
- Drop the unrelated VLLM_NIXL_SIDE_CHANNEL_HOST / get_hostname_ip import, which was not mooncake-related.
- Widen store_config to dict[str, Any] (matches MooncakeStoreConfig's mix of str/int/size-string fields).
407e19d to
ae711b3
Resolve conflicts from PR NVIDIA#156 (centralized port allocation), which moved port constants out of per-backend modules.

- Delete src/srtctl/backends/mooncake.py (constants now live in ports.py)
- Add MOONCAKE_METRICS_PORT (8702) to srtctl/ports.py alongside the existing MOONCAKE_MASTER_PORT (8700) and MOONCAKE_HTTP_METADATA_PORT (8701) so all three live in the same 8700-range
- Repoint imports in sglang.py, vllm.py, do_sweep.py, submit.py
- Refresh docs/mooncake-kv-store.md and CLAUDE.md to use the new ports
- Update test_dry_run.py and test_e2e.py to import port constants from srtctl.ports instead of hardcoding 50051

Workers now see MOONCAKE_MASTER=<infra>:8700 (was 50051) and MOONCAKE_TE_META_DATA_SERVER=http://<infra>:8701/metadata. The mooncake_master srun is launched with explicit --port, --http_metadata_server_port, and --metrics_port flags so we don't rely on mooncake's compile-time defaults.
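As a rough illustration of the port layout described in this commit, the sketch below builds the master command line and the two worker variables from the three constants. The constant names, port values, flag names, and variable names come from the commit message; the functions `master_argv` and `worker_env` are hypothetical, and the `--flag=value` form plus the continued presence of `--enable_metric_reporting=true` are assumptions carried over from the PR description.

```python
# Values as stated in the commit message (srtctl/ports.py).
MOONCAKE_MASTER_PORT = 8700
MOONCAKE_HTTP_METADATA_PORT = 8701
MOONCAKE_METRICS_PORT = 8702


def master_argv() -> list[str]:
    """Hypothetical: argv for the mooncake_master process with every port made explicit."""
    return [
        "mooncake_master",
        f"--port={MOONCAKE_MASTER_PORT}",
        f"--http_metadata_server_port={MOONCAKE_HTTP_METADATA_PORT}",
        "--enable_metric_reporting=true",  # assumed unchanged from the PR description
        f"--metrics_port={MOONCAKE_METRICS_PORT}",
    ]


def worker_env(infra_host: str) -> dict[str, str]:
    """Hypothetical: endpoint variables stamped on each worker."""
    return {
        "MOONCAKE_MASTER": f"{infra_host}:{MOONCAKE_MASTER_PORT}",
        "MOONCAKE_TE_META_DATA_SERVER": (
            f"http://{infra_host}:{MOONCAKE_HTTP_METADATA_PORT}/metadata"
        ),
    }
```

Deriving both the flags and the worker variables from the same constants keeps the master and its clients in agreement if the port range changes again.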
qiching
reviewed
May 15, 2026
Addresses two review comments on PR NVIDIA#157:

- build_mooncake_store_config now merges defaults with the user dict (`{**defaults, **user_cfg}`) and only force-overrides master_server_address. New fields vLLM adds to MooncakeStoreConfig upstream pass through automatically; no code change is needed here.
- CLAUDE.md mooncake_kv_store section drops the "SGLang only" qualifier and notes the vLLM behavior (env injection + MOONCAKE_CONFIG_PATH JSON).
- Add a unit test verifying unknown user keys propagate to the JSON.
…examples

- Default `global_segment_size` was 4GB (copied from the SGLang quickstart), which is too small for real workloads; bump it to 100GB to match the production docs example. Updated test and docs accordingly.
- vLLM example YAML no longer lists `MOONCAKE_PROTOCOL` / `MOONCAKE_DEVICE` / `MOONCAKE_GLOBAL_SEGMENT_SIZE` under `env:`. Those are SGLang-only knobs; for vLLM these go in `store_config` (it reads from JSON). Replaced with the `MC_*` knobs that vLLM's in-process Mooncake C++ libs actually read.
…icitly

`build_mooncake_store_config` no longer pre-fills `metadata_server`, `global_segment_size`, `local_buffer_size`, `protocol`, `device_name`. These are hardware-specific knobs (HBM size, NIC layout, RDMA vs TCP): silently using a srtslurm-picked default mis-sizes the KV cache or picks a transport the cluster doesn't support, and the failure mode is slow/wrong instead of loud. Pass-through-only forces users to think about the values explicitly; vLLM throws a clear missing-field error if they're absent. `master_server_address` is still force-overridden (the user can't know the infra IP at config time).

- Updated docs and the existing test to assert pass-through semantics
- Quickstart examples already list all the fields, so DX is unchanged
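The pass-through semantics described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual `build_mooncake_store_config`: the function names, signatures, and file handling here are hypothetical, and only `master_server_address` and `MOONCAKE_CONFIG_PATH` are taken from the PR text.

```python
import json
from pathlib import Path
from typing import Any


def build_store_config_sketch(user_cfg: dict[str, Any], master_addr: str) -> dict[str, Any]:
    """Hypothetical: copy the user's store_config verbatim, with no defaults pre-filled."""
    cfg = dict(user_cfg)  # shallow copy so the caller's dict is not mutated
    # The only key forced by the launcher: users cannot know the infra IP at config time.
    cfg["master_server_address"] = master_addr
    return cfg


def write_store_config(cfg: dict[str, Any], path: Path) -> dict[str, str]:
    """Hypothetical: write the JSON file and return the env var pointing vLLM at it."""
    path.write_text(json.dumps(cfg, indent=2))
    return {"MOONCAKE_CONFIG_PATH": str(path)}
```

Unknown keys survive the copy untouched, so fields added to `MooncakeStoreConfig` upstream would reach the JSON without a change here. Missing hardware-specific fields are left for vLLM to reject.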
The previous example used standalone MooncakeStoreConnector with kv_role kv_producer/kv_consumer, which doesn't reflect how this is actually deployed. Real production form is MultiConnector wrapping NixlConnector (P2P transfer) + MooncakeStoreConnector (shared store), both with kv_role: kv_both so prefill and decode run identical connector stacks. The validator already accepts this form (see schema.py:1363, test_e2e.py:797).
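For readers unfamiliar with the stacked form, the connector layout described above might be expressed as the KV-transfer config below. The connector names and `kv_role: kv_both` come from the commit message; the nesting under `kv_connector_extra_config` / `connectors` is an assumption about how vLLM's MultiConnector is configured and should be checked against the vLLM version in use. The helper name is hypothetical.

```python
import json
from typing import Any


def multi_connector_config() -> dict[str, Any]:
    """Hypothetical: NixlConnector (P2P transfer) + MooncakeStoreConnector (shared store)."""
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        # Assumed layout: child connectors listed under kv_connector_extra_config.
        "kv_connector_extra_config": {
            "connectors": [
                {"kv_connector": "NixlConnector", "kv_role": "kv_both"},
                {"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"},
            ]
        },
    }


# Prefill and decode workers would receive the same serialized config.
kv_transfer_json = json.dumps(multi_connector_config())
```

Because every role is `kv_both`, the same JSON string can be given to prefill and decode workers (for example through vLLM's `--kv-transfer-config` option, assuming that is how the launcher passes it), which is what keeps the two connector stacks identical.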
Author
@qiching It should be clean now.
Summary

Extends `mooncake_master` orchestration (previously SGLang-only) to the vLLM backend so vLLM workers can use the in-process `MooncakeStoreConnector` for cross-process KV sharing. Also wires the master's admin HTTP server (`/metrics`, `/health`, etc.) on a configurable port so ops can scrape Prometheus metrics.

What's new

* `VLLMMooncakeKVStoreConfig` (`backends/vllm.py`): mirrors the SGLang `mooncake_kv_store:` block and adds a vLLM-specific `store_config:` section. srtslurm renders that section into the `MOONCAKE_CONFIG_PATH` JSON file that vLLM's `MooncakeStoreConnector` reads on startup (`MooncakeStoreConfig.load_from_env()`); the `env:` map is injected on every vLLM worker for in-process `MC_*` knobs (e.g. `MC_ENABLE_DEST_DEVICE_AFFINITY`, `MC_STORE_CLIENT_METRIC`, `MC_TE_METRIC`).
* Shared launch constants moved to `backends/mooncake.py`; the SGLang module re-exports them so existing imports keep working.
* `start_mooncake_master` now fires for vLLM as well as SGLang, writes the per-job store JSON next to the other log artifacts, and stamps the master RPC + HTTP-metadata endpoints on each worker.
* The Mooncake admin HTTP server is now wired up explicitly: the master srun passes `--enable_metric_reporting=true --metrics_port=9003` (upstream default in `master.cpp`), and `start_mooncake_master` waits on the metrics port like it already waits on RPC + HTTP-metadata. The flag only toggles a periodic stdout log thread; `MasterAdminServer` listens unconditionally, and the new `MOONCAKE_METRICS_PORT` constant's comment documents that to prevent future reordering. Exposes `/metrics`, `/metrics/summary`, `/health`, `/role`, `/ha_status`, `/leader`, `/query_key`.

Test plan

* `make check` (ruff + pytest, 732 passed / 2 skipped)
* New cases in `tests/test_e2e.py` cover vLLM mooncake env injection and JSON rendering
* `tests/test_dry_run.py` cases cover the new config surface in `srtctl dry-run` output