[ROCm] MoRI connector telemetry by simondanielsson · Pull Request #43218 · vllm-project/vllm

simondanielsson · 2026-05-20T14:14:18Z

Purpose

Add telemetry to MoRI KV connector.

Logs metrics similar to those of the NIXL connector:

(APIServer pid=48125) INFO 05-20 15:38:40 [metrics.py:103] KV Transfer metrics: Num successful transfers=128, Avg xfer time (ms)=63.654, P90 xfer time (ms)=79.441, Avg post time (ms)=2.083, P90 post time (ms)=2.754, Avg MB per transfer=46.266, Throughput (MB/s)=726.829, Avg number of descriptors=188.0
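
For offline analysis, a line like the one above can be split into a dictionary of floats. This is a minimal sketch, assuming the log format stays exactly as shown (`key=value` pairs separated by `, ` after the `KV Transfer metrics: ` prefix). `parse_kv_transfer_metrics` is a hypothetical helper written for illustration, not part of vLLM or of this PR.

```python
def parse_kv_transfer_metrics(line: str) -> dict[str, float]:
    """Parse one 'KV Transfer metrics' log line into {metric name: value}.

    Hypothetical helper: assumes the exact format shown in the PR description.
    """
    prefix = "KV Transfer metrics: "
    _, sep, payload = line.partition(prefix)
    if not sep:
        raise ValueError("not a KV Transfer metrics line")
    metrics: dict[str, float] = {}
    for field in payload.strip().split(", "):
        # Split on the last '=' so names such as 'Avg xfer time (ms)' stay intact.
        name, _, value = field.rpartition("=")
        metrics[name] = float(value)
    return metrics


sample = (
    "(APIServer pid=48125) INFO 05-20 15:38:40 [metrics.py:103] "
    "KV Transfer metrics: Num successful transfers=128, "
    "Avg xfer time (ms)=63.654, P90 xfer time (ms)=79.441, "
    "Avg post time (ms)=2.083, P90 post time (ms)=2.754, "
    "Avg MB per transfer=46.266, Throughput (MB/s)=726.829, "
    "Avg number of descriptors=188.0"
)
metrics = parse_kv_transfer_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Splitting on the last `=` rather than a fixed regex keeps the sketch tolerant of new metric names, at the cost of breaking if a metric name ever contains `, `.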

Test Plan

Patch vLLM 0.21.0 with this branch's changes:

# docker/Dockerfile.rocm_dev
ARG BASE_IMAGE=vllm/vllm-openai-rocm:v0.21.0
FROM ${BASE_IMAGE}

# BNXT RDMA userspace libraries required by MoRI-IO on my MI300 cluster.
RUN apt-get update -q -y && apt-get install -q -y \
        librdmacm1 \
        libibverbs1 \
        ibverbs-providers \
        ibverbs-utils \
        libibverbs-dev \
        autoconf \
        libtool \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
RUN wget -q \
        https://docs.broadcom.com/docs-and-downloads/ethernet-network-adapters/NXE/Thor2/GCA1/bcm5760x_230.2.52.0a.zip \
    && unzip -q bcm5760x_230.2.52.0a.zip \
    && cd bcm5760x_230.2.52.0a/drivers_linux/bnxt_rocelib/ \
    && tar -xf "$(find . -name 'libbnxt*.tar.gz' | head -n 1)" \
    && cd "$(find . -maxdepth 1 -type d -name 'libbnxt*' ! -name '*.tar.gz' | head -n 1)" \
    && sh autogen.sh \
    && ./configure \
    && make \
    && find /usr/lib64/ /usr/lib -name "libbnxt_re-rdmav*.so" \
         -exec mv {} {}.inbox \; 2>/dev/null || true \
    && make install all \
    && echo /usr/local/lib >> /etc/ld.so.conf \
    && ldconfig \
    && cp -f bnxt_re.driver /etc/libibverbs.d/ \
    && cd / \
    && rm -rf /bcm5760x_230.2.52.0a /bcm5760x_230.2.52.0a.zip

RUN pip install --no-cache-dir msgpack

# Apply this branch's patch
RUN VLLM_SITE=$(python3 -c "import vllm; import os; print(os.path.dirname(vllm.__file__))") && \
    echo "vLLM installed at: ${VLLM_SITE}" && \
    curl -fsSL https://github.com/vllm-project/vllm/pull/43218.patch -o /tmp/vllm_43218.patch && \
    cd "${VLLM_SITE}/.." && \
    patch -p1 --forward --no-backup-if-mismatch < /tmp/vllm_43218.patch || \
    echo "WARN: vLLM #43218 patch partially applied (may already contain some changes)" && \
    rm /tmp/vllm_43218.patch

and build:

docker build \
    -f docker/Dockerfile.rocm_dev \
    --build-arg BASE_IMAGE=vllm/vllm-openai-rocm:v0.21.0  \
    -t ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
    .

Run MoRI on a single node and inspect the output (closely following the vLLM MoRI blog post):

# if you didn't build image yourself, you can docker pull ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry

# Prefill
docker run \
  --rm \
  --pid host \
  --name moriio-prefill \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
  Qwen/Qwen3-235B-A22B-FP8 \
    -tp 4 \
    --enable-expert-parallel \
    --port 20005 \
    --gpu_memory_utilization 0.8 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_producer",
      "kv_connector_extra_config": {
        "proxy_ip": "127.0.0.1",
        "proxy_ping_port": "36367",
        "http_port": "20005",
        "handshake_port": "6301",
        "notify_port": "6105"
      }
    }'

# Decode
docker run \
  --rm \
  --pid host \
  --name moriio-decode \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e CUDA_VISIBLE_DEVICES=4,5,6,7 \
  -e HIP_VISIBLE_DEVICES=4,5,6,7 \
  ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
  Qwen/Qwen3-235B-A22B-FP8 \
    -tp 4 \
    --enable-expert-parallel \
    --port 40005 \
    --max-num-batched-tokens 4096 \
    --gpu_memory_utilization 0.8 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_consumer",
      "kv_connector_extra_config": {
        "proxy_ip": "127.0.0.1",
        "http_port": "40005",
        "proxy_ping_port": "36367",
        "handshake_port": "7301",
        "notify_port": "7501"
      }
    }'

# Router
docker run \
  --name vllm-router \
  --network host \
  --rm \
  vllm/vllm-router:nightly \
  vllm-router \
  --vllm-pd-disaggregation \
  --kv-connector moriio \
  --vllm-discovery-address "0.0.0.0:36367" \
  --policy consistent_hash \
  --prefill-policy consistent_hash \
  --decode-policy consistent_hash

# Bench to get some interesting logs: 1k/1k at 64 concurrency
docker exec moriio-prefill \
  vllm bench serve \
    --base-url http://localhost:30000 \
    --backend vllm \
    --model Qwen/Qwen3-235B-A22B-FP8 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --max-concurrency 64 \
    --num-warmups 128 \
    --num-prompts 320 \
    --seed 1234

Test Result

Decode instance logs:

(APIServer pid=48125) INFO 05-20 15:38:00 [loggers.py:271] Engine 000: Avg prompt throughput: 4.5 tokens/s, Avg generation throughput: 1370.3 tokens/s, Running: 45 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 99.9%
(APIServer pid=48125) INFO 05-20 15:38:00 [metrics.py:103] KV Transfer metrics: Num successful transfers=180, Avg xfer time (ms)=40.375, P90 xfer time (ms)=65.157, Avg post time (ms)=2.476, P90 post time (ms)=2.877, Avg MB per transfer=46.266, Throughput (MB/s)=1145.9, Avg number of descriptors=188.0
(APIServer pid=48125) INFO:     45.63.76.253:57410 - "POST /v1/completions HTTP/1.1" 200 OK
...
(APIServer pid=48125) INFO 05-20 15:38:10 [metrics.py:103] KV Transfer metrics: Num successful transfers=76, Avg xfer time (ms)=60.748, P90 xfer time (ms)=79.615, Avg post time (ms)=2.533, P90 post time (ms)=2.92, Avg MB per transfer=46.266, Throughput (MB/s)=761.593, Avg number of descriptors=188.0
...
(APIServer pid=48125) INFO 05-20 15:38:40 [metrics.py:103] KV Transfer metrics: Num successful transfers=128, Avg xfer time (ms)=63.654, P90 xfer time (ms)=79.441, Avg post time (ms)=2.083, P90 post time (ms)=2.754, Avg MB per transfer=46.266, Throughput (MB/s)=726.829, Avg number of descriptors=188.0
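
The three metric lines above are numerically consistent with the throughput being the average transfer size divided by the average transfer time. This relation is inferred from the logged numbers only, not confirmed against the implementation, so treat the sketch below as a sanity check rather than a definition. `implied_throughput_mb_s` is a hypothetical helper.

```python
# (avg MB per transfer, avg xfer time in ms, logged throughput in MB/s),
# copied from the three decode log lines above.
samples = [
    (46.266, 40.375, 1145.9),
    (46.266, 60.748, 761.593),
    (46.266, 63.654, 726.829),
]


def implied_throughput_mb_s(avg_mb: float, avg_xfer_ms: float) -> float:
    """Assumed relation: MB per transfer divided by seconds per transfer."""
    return avg_mb / (avg_xfer_ms / 1000.0)


for avg_mb, avg_ms, logged in samples:
    implied = implied_throughput_mb_s(avg_mb, avg_ms)
    rel_err = abs(implied - logged) / logged
    print(f"implied={implied:.1f} MB/s  logged={logged:.1f} MB/s  rel_err={rel_err:.1e}")
```

If the relation holds, the drop in throughput between the first and later windows reflects longer average transfer times at an unchanged transfer size, not smaller transfers.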

Bench output:

============ Serving Benchmark Result ============
Successful requests:                     320
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  157.39
Total input tokens:                      320000
Total generated tokens:                  320000
Request throughput (req/s):              2.03
Output token throughput (tok/s):         2033.20
Peak output token throughput (tok/s):    2240.00
Peak concurrent requests:                82.00
Total token throughput (tok/s):          4066.39
---------------Time to First Token----------------
Mean TTFT (ms):                          931.64
Median TTFT (ms):                        529.85
P99 TTFT (ms):                           4383.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.05
Median TPOT (ms):                        30.15
P99 TPOT (ms):                           30.34
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.05
Median ITL (ms):                         30.03
P99 ITL (ms):                            36.97
==================================================
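
The headline throughput figures can be recomputed from the counts and duration in the table above. The sketch below also adds a rough steady-state estimate of request throughput from the latency numbers (concurrency divided by per-request latency). That estimate is an assumption layered on top of the results: it ignores ramp-up, tail effects, and queueing, so only approximate agreement should be expected.

```python
# Values copied from the benchmark output above.
successful_requests = 320
duration_s = 157.39
total_input_tokens = 320_000
total_output_tokens = 320_000
max_concurrency = 64
mean_ttft_s = 0.93164
mean_tpot_s = 0.03005
output_len = 1000  # --random-output-len

# Direct recomputation of the reported throughputs.
request_throughput = successful_requests / duration_s
output_tok_throughput = total_output_tokens / duration_s
total_tok_throughput = (total_input_tokens + total_output_tokens) / duration_s

# Rough estimate: each request takes TTFT plus (output_len - 1) decode steps,
# and max_concurrency requests are in flight at any time.
est_latency_s = mean_ttft_s + (output_len - 1) * mean_tpot_s
est_request_throughput = max_concurrency / est_latency_s

print(f"request throughput:   {request_throughput:.2f} req/s")
print(f"output throughput:    {output_tok_throughput:.1f} tok/s")
print(f"total throughput:     {total_tok_throughput:.1f} tok/s")
print(f"estimated from TPOT:  {est_request_throughput:.2f} req/s")
```

Small differences from the reported values are expected because the printed duration is rounded to two decimals.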

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

gemini-code-assist

Code Review

This pull request introduces telemetry and Prometheus metrics for the MoRI IO connector, adding a new stats.py module to track transfer performance, byte counts, and failures. The implementation updates the connector and engine to capture metrics during KV transfers. Review feedback identifies several high-severity issues, including potential KeyError exceptions when accessing callback addresses, an incorrect port calculation for notifications that could prevent block freeing on the producer side, and thread-safety risks when updating shared dictionaries across multiple threads.
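
Two of the issues named in the review (a `KeyError` when looking up callback addresses, and unsynchronized updates to dictionaries shared between threads) can be illustrated together. The sketch below is not the code from this PR: `CallbackAddrTable` and its methods are hypothetical names, and the real connector may structure this differently. It shows a lookup that returns `None` for a missing key and a lock guarding every access.

```python
import threading
from typing import Optional


class CallbackAddrTable:
    """Hypothetical request_id -> (host, port) map shared across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._addrs: dict[str, tuple[str, str]] = {}

    def set(self, request_id: str, host: str, port: str) -> None:
        with self._lock:
            self._addrs[request_id] = (host, port)

    def pop(self, request_id: str) -> Optional[tuple[str, str]]:
        # pop with a default never raises KeyError for unknown request IDs.
        with self._lock:
            return self._addrs.pop(request_id, None)


table = CallbackAddrTable()
table.set("req-1", "127.0.0.1", "6105")
first = table.pop("req-1")   # the stored (host, port) tuple
second = table.pop("req-1")  # entry already removed, so None is returned
print(first, second)
```

Callers then branch on `None` explicitly (for example, by logging and skipping the notification) instead of relying on exception handling.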

gemini-code-assist · 2026-05-20T14:17:03Z

                self._recving_transfers[request_id].append(transfer_status)
                self._recving_transfers_callback_addr[request_id] = (
                    remote_host,
                    str(remote_notify_port + self.tp_rank),


The calculation of the notification port by adding self.tp_rank to remote_notify_port appears incorrect. Since remote_notify_port is extracted from the peer's ZMQ address, which already includes the appropriate port offset (calculated via get_port_offset during the peer's initialization), adding the local tp_rank again will result in an incorrect port number. This will cause completion notifications to be sent to the wrong port, potentially leading to memory leaks on the producer side as blocks are never freed.

Suggested change

- str(remote_notify_port + self.tp_rank),
+ str(remote_notify_port),

simondanielsson · May 20, 2026

This is actually fine when running with DP1TPx or DPxTP1: only TP rank 0 runs _ping, so remote_notify_port = base_address + tp_rank = base_address (with tp_rank = 0). Hence remote_notify_port + tp_rank is the correct port for the local TP rank to notify. This will have to change once DP+TP support is added, though.

DP+TP support for MoRI is (partly) covered in #32291.
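
The argument in the reply can be made concrete with a small model of the port layout. This is a simplified sketch under stated assumptions: each TP rank is taken to listen on the base notify port plus its TP rank, and `notify_port` is a hypothetical helper. The real offset computation (`get_port_offset`, per the review comment) may include other terms, which is why the reply notes that DP+TP would need a different scheme.

```python
def notify_port(base_port: int, tp_rank: int) -> int:
    """Assumed layout: each TP rank listens on base_port + tp_rank."""
    return base_port + tp_rank


BASE_NOTIFY_PORT = 6105  # producer "notify_port" from the test plan
TP_SIZE = 4              # -tp 4 in the test plan

# Only TP rank 0 pings, so the port it advertises equals the base port.
advertised = notify_port(BASE_NOTIFY_PORT, tp_rank=0)

# Each consumer rank adds its own tp_rank to the advertised port...
targets = [advertised + consumer_rank for consumer_rank in range(TP_SIZE)]
# ...which matches the port the same-numbered producer rank listens on.
listening = [notify_port(BASE_NOTIFY_PORT, rank) for rank in range(TP_SIZE)]

print(targets)
print(listening)
```

Under this model, dropping the `+ self.tp_rank` term as suggested would send every consumer rank's notification to the rank 0 port.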

simondanielsson added 2 commits May 20, 2026 16:02

feat: initial implementation

9e8957b

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Merge remote-tracking branch 'upstream/main' into feature/moriio_tele…

6190feb

…metry Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

gemini-code-assist Bot reviewed May 20, 2026

mergify Bot added the rocm (Related to AMD ROCm) and kv-connector labels May 20, 2026

github-project-automation Bot added this to AMD May 20, 2026

github-project-automation Bot moved this to Todo in AMD May 20, 2026

fix: safely .get the callback addresses

dc3709e

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

simondanielsson wants to merge 3 commits into vllm-project:main from simondanielsson:feature/moriio_telemetry
