Commit e1f430d

Merge remote-tracking branch 'upstream/main' into resolve-github-conflict
Signed-off-by: Moersity <lixiang0417.cq@gmail.com>
2 parents 9e48e8a + a13ab18 commit e1f430d

21 files changed

Lines changed: 888 additions & 211 deletions

.github/CODEOWNERS

Lines changed: 3 additions & 3 deletions
@@ -6,7 +6,7 @@
 
 # Python bindings
 /bindings/python @CatherineSue @key4ng @gongwei-130 @slin1237
-/bindings/golang @whybeyoung @slin1237
+/bindings/golang @slin1237
 
 # E2E tests
 /e2e_test @CatherineSue @key4ng @XinyueZhang369 @slin1237
@@ -26,12 +26,12 @@
 
 # Workspace crates
 /crates/auth @slin1237
-/crates/data_connector @key4ng @zhoug9127 @zhaowenzi
+/crates/data_connector @key4ng @zhoug9127
 /crates/grpc_client @CatherineSue @slin1237
 /crates/grpc_client/proto/vllm_engine.proto @njhill @CatherineSue @slin1237
 /crates/grpc_client/proto/common.proto @njhill @CatherineSue @slin1237
 /crates/kv_index @slin1237
-/crates/mcp @key4ng @CatherineSue @slin1237 @zhoug9127 @zhaowenzi
+/crates/mcp @key4ng @CatherineSue @slin1237 @zhoug9127
 /crates/mesh @tonyluj @llfl @slin1237
 /crates/multimodal @slin1237 @CatherineSue
 /crates/protocols @CatherineSue @key4ng

bindings/python/src/lib.rs

Lines changed: 1 addition & 0 deletions
@@ -1198,6 +1198,7 @@ impl Router {
             host: self.host.clone(),
             port: self.port,
             health_check_port: self.health_check_port,
+            runtime_worker_threads: None,
             router_config,
             max_payload_size: self.max_payload_size,
             log_dir: self.log_dir.clone(),

crates/grpc_client/python/README.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ This package provides pre-compiled Python gRPC stubs for:
 pip install smg-grpc-proto
 ```
 
-Requires `grpcio>=1.78.0` and `protobuf>=5.26.0`.
+Requires `grpcio>=1.81.1` and `protobuf>=5.26.0`.
 
 ## Usage
 

crates/grpc_client/python/pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -1,5 +1,5 @@
 [build-system]
-requires = ["setuptools>=68.0", "grpcio-tools>=1.78.0"]
+requires = ["setuptools>=68.0", "grpcio-tools>=1.81.1"]
 build-backend = "setuptools.build_meta"
 
 [project]
@@ -8,7 +8,7 @@ version = "0.4.11"
 description = "SMG gRPC proto definitions for vLLM, TRT-LLM, MLX, TokenSpeed, and SGLang"
 requires-python = ">=3.10"
 dependencies = [
-    "grpcio>=1.78.0",
+    "grpcio>=1.81.1",
     "protobuf>=5.26.0",
 ]
 readme = "README.md"

docs/getting-started/kv-events-cache-aware.md

Lines changed: 28 additions & 0 deletions
@@ -117,6 +117,31 @@ SMG learns the block size from the `BlockStored` events themselves, so you needn
 
 Everything downstream — SMG flags, block-size learning, and the verification logs — is unchanged; `KvEventMonitor` consumes the events the same way for any gRPC worker.
 
+### Alternative: launch a TokenSpeed worker
+
+TokenSpeed's scheduler publishes KV cache events on a ZMQ socket; enable them with `--kv-events-config`. The TokenSpeed gRPC server *is* the SMG gRPC entrypoint, so there is no separate `--grpc` flag:
+
+```bash
+# TokenSpeed is installed from source (engine + kernel + scheduler); see
+# scripts/ci_install_tokenspeed.sh. Install the bridge's extra deps:
+pip install "smg-grpc-servicer[tokenspeed]"
+
+# --kv-events-config turns on KV-event publishing in the scheduler:
+python -m smg_grpc_servicer.tokenspeed \
+  --model meta-llama/Llama-3.1-8B-Instruct \
+  --host 0.0.0.0 \
+  --port 50051 \
+  --kv-events-config '{"enable_kv_cache_events": true, "publisher": "zmq", "endpoint": "tcp://*:5557", "topic": "kv-events"}'
+```
+
+| Field | Why |
+|---|---|
+| `enable_kv_cache_events: true` | TokenSpeed master switch. Without it the scheduler records no events even if a publisher is set. |
+| `publisher: "zmq"` | Selects the ZMQ publisher the servicer bridges. Unset defaults to `"zmq"` when events are enabled; `"null"` (or any other value) disables bridging. |
+| `endpoint` / `topic` | ZMQ `PUB` address and topic prefix. Use a **bind-style** endpoint (`tcp://*:PORT`) — TokenSpeed only *binds* when the endpoint contains `*`/`::`/`ipc://`/`inproc://`, so a concrete address like `tcp://127.0.0.1:PORT` makes it *connect* instead, leaving nothing bound and the stream idle. For data-parallel the port is `endpoint_port + dp_rank`, and SMG currently consumes rank 0. |
+
+`--kv-events-config` is parsed by TokenSpeed's `KVEventsConfig.from_cli`. SMG learns the block size from the `BlockStored` events themselves, so you needn't set it; pass TokenSpeed's `--page-size N` only to pin a non-default value. Everything downstream is identical to the SGLang and vLLM paths.
+
 ---
 
 ## Step 2 — Launch SMG
@@ -240,4 +265,7 @@ If events never arrive, the policy keeps working — it falls back to the approx
 - Event subscription manager: `model_gateway/src/worker/kv_event_monitor.rs`
 - KV event proto: `crates/grpc_client/proto/common.proto` (messages `KvEventBatch`, `KvCacheEvent`, `KvBlocksStored`, `KvBlocksRemoved`)
 - Servicer bridge: `grpc_servicer/smg_grpc_servicer/sglang/servicer.py` (`SubscribeKvEvents`)
+- Shared ZMQ→proto conversion: `grpc_servicer/smg_grpc_servicer/kv_events.py` (engine-neutral; used by the vLLM and TokenSpeed bridges)
+- TokenSpeed servicer bridge: `grpc_servicer/smg_grpc_servicer/tokenspeed/servicer.py` (`SubscribeKvEvents`) + config resolver `grpc_servicer/smg_grpc_servicer/tokenspeed/kv_events.py`
 - SGLang upstream config: `python/sglang/srt/disaggregation/kv_events.py` (class `KVEventsConfig`)
+- TokenSpeed upstream config: `tokenspeed/runtime/pd/kv_events.py` (class `KVEventsConfig`)
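
Reviewer note: the bind-versus-connect rule stated in the `endpoint` / `topic` row of this doc can be checked before a worker is launched. The following is a minimal sketch in Python, not TokenSpeed's implementation. `would_bind` and `build_kv_events_config` are hypothetical helpers, and the marker tuple only restates the rule as the doc words it.

```python
import json

# Markers that, per the doc text, make TokenSpeed bind rather than connect.
# Assumption: this tuple restates the documented rule; it is not TokenSpeed code.
BIND_MARKERS = ("*", "::", "ipc://", "inproc://")


def would_bind(endpoint: str) -> bool:
    """Return True if the endpoint looks bind-style under the documented rule."""
    return any(marker in endpoint for marker in BIND_MARKERS)


def build_kv_events_config(port: int, topic: str = "kv-events") -> str:
    """Build a --kv-events-config JSON string with a bind-style TCP endpoint."""
    endpoint = f"tcp://*:{port}"
    if not would_bind(endpoint):
        raise ValueError(f"endpoint {endpoint!r} would connect instead of bind")
    return json.dumps(
        {
            "enable_kv_cache_events": True,
            "publisher": "zmq",
            "endpoint": endpoint,
            "topic": topic,
        }
    )


if __name__ == "__main__":
    print(would_bind("tcp://*:5557"))          # bind-style wildcard
    print(would_bind("tcp://127.0.0.1:5557"))  # concrete address
    print(build_kv_events_config(5557))
```

A check like this could run in a launch script so that a concrete address such as `tcp://127.0.0.1:5557` is rejected early instead of producing an idle stream.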

docs/reference/configuration.md

Lines changed: 29 additions & 0 deletions
@@ -517,6 +517,35 @@ headers can otherwise spoof storage hook request context values.
 
 ---
 
+## Runtime Configuration
+
+Controls the tokio async runtime that backs request handling.
+
+By default the runtime is **container-aware**. tokio sizes its worker pool to
+`std::thread::available_parallelism()`, which on Rust 1.95+ already reads the
+cgroup CPU quota — so under a Kubernetes `limits.cpu` the worker count matches
+the pod's quota, not the host's core count. No extra configuration is needed for
+the default to be right under a CPU limit.
+
+Do **not** set an inflated `TOKIO_WORKER_THREADS` (for example a fixed `32`).
+That overrides the container-aware default and oversubscribes worker threads
+against the cores the scheduler actually grants, causing scheduler thrash,
+tail-latency spikes, and `/health` starvation. Leaving it unset is the correct
+production configuration.
+
+### Worker Threads
+
+Explicit async runtime worker-thread count. Leave unset to use tokio's
+container-aware default above; set it only to pin an explicit count (overriding
+the cgroup-quota-derived default).
+
+| Option | `--runtime-worker-threads` |
+|--------|----------------------------|
+| Environment | - |
+| Default | tokio default (`available_parallelism()`, cgroup-quota-aware) |
+
+---
+
 ## Rate Limiting Configuration
 
 ### Concurrent Request Limit
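
Reviewer note: the oversubscription warning in the new Runtime Configuration section can be made concrete with a little arithmetic. The sketch below parses the contents of a cgroup v2 `cpu.max` file (`<quota> <period>` in microseconds, or `max <period>` when unlimited) and compares the result with a host core count. It is an illustration in Python, not tokio's or the Rust standard library's code. `parse_cpu_max` and `effective_workers` are hypothetical names, and flooring the ratio with a minimum of one is an assumption of this sketch.

```python
from __future__ import annotations


def parse_cpu_max(text: str) -> int | None:
    """Parse cgroup v2 cpu.max contents into a whole CPU count.

    Returns None when no quota is set ("max") or the text is malformed.
    """
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None
    try:
        quota, period = int(fields[0]), int(fields[1])
    except ValueError:
        return None
    if quota <= 0 or period <= 0:
        return None
    # Assumption: floor the ratio and never report fewer than one CPU.
    return max(1, quota // period)


def effective_workers(host_cores: int, cpu_max_text: str) -> int:
    """Worker count a quota-aware default would pick, under this sketch's rounding."""
    limit = parse_cpu_max(cpu_max_text)
    return host_cores if limit is None else min(host_cores, limit)


if __name__ == "__main__":
    # A 4-CPU quota with a 100 ms period on a 64-core host.
    print(effective_workers(64, "400000 100000"))
    # No quota: fall back to the host core count.
    print(effective_workers(64, "max 100000"))
```

With a 4-CPU quota, pinning 32 worker threads would leave eight threads competing for each granted core, which is the situation the section warns against.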

grpc_servicer/pyproject.toml

Lines changed: 6 additions & 3 deletions
@@ -9,9 +9,9 @@ description = "SMG gRPC servicer implementations for LLM inference engines (vLLM
 requires-python = ">=3.10"
 dependencies = [
     "smg-grpc-proto>=0.4.11",
-    "grpcio>=1.78.0",
-    "grpcio-reflection>=1.78.0",
-    "grpcio-health-checking>=1.78.0",
+    "grpcio>=1.81.1",
+    "grpcio-reflection>=1.81.1",
+    "grpcio-health-checking>=1.81.1",
 ]
 readme = "README.md"
 license = { text = "Apache-2.0" }
@@ -36,6 +36,9 @@ sglang = ["sglang>=0.5.10"]
 # without this floor, installing [mlx] against an older proto build would
 # crash at import time when smg_grpc_servicer.mlx.server runs.
 mlx = ["smg-grpc-proto>=0.4.7", "mlx>=0.22.0", "mlx-lm>=0.22.0"]
+# TokenSpeed itself is installed from source (no PyPI release); these are the
+# extra runtime deps the KV-event bridge needs on top of a TokenSpeed install.
+tokenspeed = ["pyzmq>=25.0.0", "msgspec>=0.18.0"]
 
 [project.urls]
 Homepage = "https://github.com/lightseekorg/smg"
Lines changed: 164 additions & 0 deletions
@@ -0,0 +1,164 @@
+"""Engine-neutral KV-cache-event → proto conversion and ZMQ streaming.
+
+Shared by every engine bridge (vLLM, TokenSpeed, ...). Imports only stdlib +
+the generated proto, and dispatches engine events by class name (BlockStored /
+BlockRemoved / AllBlocksCleared), so it needs no engine import and is
+unit-testable without any engine installed.
+
+Each engine package keeps its own ``resolve_kv_events_config`` (the only
+engine-specific seam); everything here is wire-format-only.
+"""
+
+import logging
+from collections.abc import AsyncIterator, Awaitable, Callable
+
+from smg_grpc_proto.generated import common_pb2
+
+logger = logging.getLogger(__name__)
+
+_U64_MASK = 0xFFFFFFFFFFFFFFFF
+_I64_SIGN_BIT = 0x8000000000000000
+_U64_MODULUS = 0x10000000000000000
+
+
+def to_int64(value: int | bytes) -> int:
+    """Reduce an engine block hash to a signed int64 for the proto block_hash field.
+
+    An engine's block hash may be ``int | bytes`` (sha256 bytes when int hashes
+    are disabled); bytes are read big-endian. SMG uses the hash only as a node
+    identity, so the 64-bit reduction is safe as long as it stays deterministic.
+    """
+    if isinstance(value, (bytes, bytearray)):
+        value = int.from_bytes(value, "big")
+    masked = value & _U64_MASK
+    if masked >= _I64_SIGN_BIT:
+        masked -= _U64_MODULUS
+    return masked
+
+
+def endpoint_for_rank(endpoint: str, dp_rank: int) -> str:
+    """Resolve a KV-events PUB endpoint to a connectable SUB address.
+
+    Bind wildcards (``*``, ``0.0.0.0``) are rewritten to ``127.0.0.1`` (the
+    latter is not connectable on macOS/Windows). For data-parallel deployments
+    each rank publishes on ``base_port + dp_rank``; non-tcp endpoints (ipc://,
+    inproc://) get the wildcard substituted but no port arithmetic.
+    """
+    resolved = endpoint.replace("*", "127.0.0.1").replace("0.0.0.0", "127.0.0.1")
+    if resolved.startswith("tcp://") and dp_rank:
+        host, sep, port = resolved.rpartition(":")
+        if sep and port.isdigit():
+            return f"{host}:{int(port) + dp_rank}"
+    return resolved
+
+
+def convert_event(event: object, event_id: int) -> common_pb2.KvCacheEvent | None:
+    """Convert one decoded engine event to a proto KvCacheEvent (or None if unknown)."""
+    name = type(event).__name__
+
+    if name == "BlockStored":
+        block_size = int(event.block_size)
+        blocks = []
+        for i, block_hash in enumerate(event.block_hashes):
+            start = i * block_size
+            end = start + block_size
+            block = common_pb2.KvBlock(
+                block_hash=to_int64(block_hash),
+                token_ids=list(event.token_ids[start:end]),
+                block_size=block_size,
+            )
+            lora_id = getattr(event, "lora_id", None)
+            if lora_id is not None:
+                block.lora_id = to_int64(lora_id)
+            blocks.append(block)
+        stored = common_pb2.KvBlocksStored(blocks=blocks)
+        parent = getattr(event, "parent_block_hash", None)
+        if parent is not None:
+            stored.parent_block_hash = to_int64(parent)
+        return common_pb2.KvCacheEvent(event_id=event_id, stored=stored)
+
+    if name == "BlockRemoved":
+        return common_pb2.KvCacheEvent(
+            event_id=event_id,
+            removed=common_pb2.KvBlocksRemoved(
+                block_hashes=[to_int64(h) for h in event.block_hashes]
+            ),
+        )
+
+    if name == "AllBlocksCleared":
+        return common_pb2.KvCacheEvent(event_id=event_id, cleared=common_pb2.KvCacheCleared())
+
+    logger.debug("Unknown KV event type %r, skipping", name)
+    return None
+
+
+def convert_batch(
+    raw_batch: object, seq_num: int, event_id_start: int
+) -> tuple[common_pb2.KvEventBatch, int]:
+    """Convert a decoded engine KVEventBatch to a proto KvEventBatch.
+
+    Returns the proto batch and the new event-id counter. The counter advances
+    once per input event (even if unconvertible) so ids stay monotonic.
+
+    The DP rank is read from ``data_parallel_rank`` (vLLM) or ``attn_dp_rank``
+    (TokenSpeed); engines that carry neither leave the proto field unset.
+    """
+    proto = common_pb2.KvEventBatch(sequence_number=seq_num, timestamp=raw_batch.ts)
+    dp_rank = getattr(raw_batch, "data_parallel_rank", None)
+    if dp_rank is None:
+        dp_rank = getattr(raw_batch, "attn_dp_rank", None)
+    if dp_rank is not None:
+        proto.dp_rank = dp_rank
+
+    event_id = event_id_start
+    for event in raw_batch.events:
+        event_id += 1
+        proto_event = convert_event(event, event_id)
+        if proto_event is not None:
+            proto.events.append(proto_event)
+    return proto, event_id
+
+
+async def stream_kv_events(
+    sub_socket: object,
+    decode: Callable[[bytes], object],
+    send_initial_metadata: Callable[[], Awaitable[None]],
+    is_cancelled: Callable[[], bool],
+    *,
+    recv_timeout: float = 1.0,
+) -> AsyncIterator[common_pb2.KvEventBatch]:
+    """Core ZMQ→proto streaming loop, decoupled from any engine and gRPC types.
+
+    Args:
+        sub_socket: a connected ``zmq.asyncio`` SUB socket (duck-typed; only
+            ``poll()`` and ``recv_multipart()`` are used). The caller owns the
+            socket lifecycle (this function never closes it).
+        decode: bytes → decoded engine batch (e.g. ``msgspec.msgpack.Decoder(KVEventBatch).decode``).
+        send_initial_metadata: awaitable called once before the first recv so the
+            gRPC client's ``subscribe_kv_events().await`` resolves promptly.
+        is_cancelled: returns True when the RPC is cancelled; loop then exits.
+        recv_timeout: poll timeout so cancellation is observed even when idle.
+
+    Yields proto KvEventBatch using the ZMQ publisher's native sequence numbers.
+    """
+    await send_initial_metadata()
+    event_id = 0
+    while not is_cancelled():
+        # poll() before recv: cancelling a zmq.asyncio recv future does not
+        # cancel the in-flight ZMQ recv and can drop an already-dequeued message.
+        if not await sub_socket.poll(timeout=int(recv_timeout * 1000)):
+            continue
+        frames = await sub_socket.recv_multipart()
+
+        # ZMQ multipart: [topic, 8-byte big-endian seq, msgpack payload].
+        if len(frames) < 3:
+            continue
+        zmq_seq = int.from_bytes(frames[1], "big")
+        try:
+            raw_batch = decode(frames[2])
+        except Exception as e:  # noqa: BLE001 - one bad frame must not kill the stream
+            logger.warning("Failed to decode KV event batch: %s", e)
+            continue
+
+        proto_batch, event_id = convert_batch(raw_batch, zmq_seq, event_id)
+        yield proto_batch
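
Reviewer note: the two pure helpers in this file, `to_int64` and `endpoint_for_rank`, can be exercised without an engine, ZMQ, or the generated proto. The sketch below copies their bodies from the diff so it runs standalone. The sample inputs are illustrative and do not come from the commit's test suite.

```python
from __future__ import annotations

_U64_MASK = 0xFFFFFFFFFFFFFFFF
_I64_SIGN_BIT = 0x8000000000000000
_U64_MODULUS = 0x10000000000000000


def to_int64(value: int | bytes) -> int:
    """Copied from the diff: reduce a block hash to a signed 64-bit integer."""
    if isinstance(value, (bytes, bytearray)):
        value = int.from_bytes(value, "big")
    masked = value & _U64_MASK
    if masked >= _I64_SIGN_BIT:
        masked -= _U64_MODULUS
    return masked


def endpoint_for_rank(endpoint: str, dp_rank: int) -> str:
    """Copied from the diff: turn a PUB bind endpoint into a SUB connect address."""
    resolved = endpoint.replace("*", "127.0.0.1").replace("0.0.0.0", "127.0.0.1")
    if resolved.startswith("tcp://") and dp_rank:
        host, sep, port = resolved.rpartition(":")
        if sep and port.isdigit():
            return f"{host}:{int(port) + dp_rank}"
    return resolved


if __name__ == "__main__":
    # Values at or above 2**63 wrap into the negative int64 range.
    print(to_int64(2**63))        # -9223372036854775808
    # Only the low 64 bits of a wider hash survive.
    print(to_int64(2**64 + 5))    # 5
    # Bytes are read big-endian before the reduction.
    print(to_int64(b"\xff" * 8))  # -1
    # Wildcard rewritten, port offset by the data-parallel rank.
    print(endpoint_for_rank("tcp://*:5557", 2))          # tcp://127.0.0.1:5559
    # Non-tcp endpoints get no port arithmetic.
    print(endpoint_for_rank("ipc:///tmp/kv-events", 3))  # ipc:///tmp/kv-events
```

Because the 64-bit reduction keeps only the low bits, two distinct sha256 hashes can map to the same proto `block_hash`. The docstring in the diff treats this as acceptable since SMG uses the value only as a node identity.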
