
[ROCm] MoRI connector telemetry #43218

Draft
simondanielsson wants to merge 3 commits into vllm-project:main from simondanielsson:feature/moriio_telemetry

@simondanielsson simondanielsson commented May 20, 2026

Purpose

Add telemetry to the MoRI KV connector.

It logs metrics similar to those of the NIXL connector:

(APIServer pid=48125) INFO 05-20 15:38:40 [metrics.py:103] KV Transfer metrics: Num successful transfers=128, Avg xfer time (ms)=63.654, P90 xfer time (ms)=79.441, Avg post time (ms)=2.083, P90 post time (ms)=2.754, Avg MB per transfer=46.266, Throughput (MB/s)=726.829, Avg number of descriptors=188.0
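The logged throughput appears to equal the average MB per transfer divided by the average transfer time (equivalently, total MB moved over total transfer time). That relation is inferred from the numbers in the log line above, not from the connector source, and the helper name below is hypothetical. A minimal Python sketch of the check:

```python
def throughput_mb_s(avg_mb_per_transfer: float, avg_xfer_time_ms: float) -> float:
    """Estimate MB/s from the logged averages (assumed relation, not confirmed)."""
    return avg_mb_per_transfer / (avg_xfer_time_ms / 1000.0)


# Values copied from the log line above.
estimate = throughput_mb_s(avg_mb_per_transfer=46.266, avg_xfer_time_ms=63.654)
logged = 726.829

# The logged averages are rounded to three decimals, so compare with a tolerance.
relative_error = abs(estimate - logged) / logged
print(f"estimate={estimate:.3f} MB/s, logged={logged} MB/s, "
      f"relative error={relative_error:.2e}")
```

If the relation holds, the throughput figure measures per-transfer bandwidth rather than aggregate bandwidth across concurrent transfers.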

Test Plan

  1. Patch vllm 0.21.0 with this branch's patch
# docker/Dockerfile.rocm_dev
ARG BASE_IMAGE=vllm/vllm-openai-rocm:v0.21.0
FROM ${BASE_IMAGE}

# BNXT RDMA userspace libraries required by MoRI-IO on my MI300 cluster.
RUN apt-get update -q -y && apt-get install -q -y \
        librdmacm1 \
        libibverbs1 \
        ibverbs-providers \
        ibverbs-utils \
        libibverbs-dev \
        autoconf \
        libtool \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
RUN wget -q \
        https://docs.broadcom.com/docs-and-downloads/ethernet-network-adapters/NXE/Thor2/GCA1/bcm5760x_230.2.52.0a.zip \
    && unzip -q bcm5760x_230.2.52.0a.zip \
    && cd bcm5760x_230.2.52.0a/drivers_linux/bnxt_rocelib/ \
    && tar -xf "$(find . -name 'libbnxt*.tar.gz' | head -n 1)" \
    && cd "$(find . -maxdepth 1 -type d -name 'libbnxt*' ! -name '*.tar.gz' | head -n 1)" \
    && sh autogen.sh \
    && ./configure \
    && make \
    && find /usr/lib64/ /usr/lib -name "libbnxt_re-rdmav*.so" \
         -exec mv {} {}.inbox \; 2>/dev/null || true \
    && make install all \
    && echo /usr/local/lib >> /etc/ld.so.conf \
    && ldconfig \
    && cp -f bnxt_re.driver /etc/libibverbs.d/ \
    && cd / \
    && rm -rf /bcm5760x_230.2.52.0a /bcm5760x_230.2.52.0a.zip

RUN pip install --no-cache-dir msgpack

# Apply this branch's patch
RUN VLLM_SITE=$(python3 -c "import vllm; import os; print(os.path.dirname(vllm.__file__))") && \
    echo "vLLM installed at: ${VLLM_SITE}" && \
    curl -fsSL https://github.com/vllm-project/vllm/pull/43218.patch -o /tmp/vllm_43218.patch && \
    cd "${VLLM_SITE}/.." && \
    patch -p1 --forward --no-backup-if-mismatch < /tmp/vllm_43218.patch || \
    echo "WARN: vLLM #43218 patch partially applied (may already contain some changes)" && \
    rm /tmp/vllm_43218.patch

and build:

docker build \
    -f docker/Dockerfile.rocm_dev \
    --build-arg BASE_IMAGE=vllm/vllm-openai-rocm:v0.21.0  \
    -t ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
    .
  2. Run MoRI on a single node and inspect the output (closely following the vLLM MoRI blog post)
# if you didn't build image yourself, you can docker pull ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry

# Prefill
docker run \
  --rm \
  --pid host \
  --name moriio-prefill \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
  Qwen/Qwen3-235B-A22B-FP8 \
    -tp 4 \
    --enable-expert-parallel \
    --port 20005 \
    --gpu_memory_utilization 0.8 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_producer",
      "kv_connector_extra_config": {
        "proxy_ip": "127.0.0.1",
        "proxy_ping_port": "36367",
        "http_port": "20005",
        "handshake_port": "6301",
        "notify_port": "6105"
      }
    }'

# Decode
docker run \
  --rm \
  --pid host \
  --name moriio-decode \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e CUDA_VISIBLE_DEVICES=4,5,6,7 \
  -e HIP_VISIBLE_DEVICES=4,5,6,7 \
  ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \
  Qwen/Qwen3-235B-A22B-FP8 \
    -tp 4 \
    --enable-expert-parallel \
    --port 40005 \
    --max-num-batched-tokens 4096 \
    --gpu_memory_utilization 0.8 \
    --max-model-len 16384 \
    --no-enable-prefix-caching \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_consumer",
      "kv_connector_extra_config": {
        "proxy_ip": "127.0.0.1",
        "http_port": "40005",
        "proxy_ping_port": "36367",
        "handshake_port": "7301",
        "notify_port": "7501"
      }
    }'

# Router
docker run \
  --name vllm-router \
  --network host \
  --rm \
  vllm/vllm-router:nightly \
  vllm-router \
  --vllm-pd-disaggregation \
  --kv-connector moriio \
  --vllm-discovery-address "0.0.0.0:36367" \
  --policy consistent_hash \
  --prefill-policy consistent_hash \
  --decode-policy consistent_hash

# Bench to get some interesting logs: 1k/1k at 64 concurrency
docker exec moriio-prefill \
  vllm bench serve \
    --base-url http://localhost:30000 \
    --backend vllm \
    --model Qwen/Qwen3-235B-A22B-FP8 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --max-concurrency 64 \
    --num-warmups 128 \
    --num-prompts 320 \
    --seed 1234
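Since both instances share the host network, the two --kv-transfer-config payloads must not reuse each other's ports. The following Python sketch builds the same JSON as the commands above and checks that; the function names are hypothetical, the port values are copied from the commands, and any per-rank port offsets the connector applies are not modelled here.

```python
import json


def build_kv_transfer_config(role: str, http_port: int, handshake_port: int,
                             notify_port: int, proxy_ping_port: int = 36367) -> str:
    """Return a JSON string shaped like the --kv-transfer-config values above."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role}")
    config = {
        "kv_connector": "MoRIIOConnector",
        "kv_role": role,
        "kv_connector_extra_config": {
            "proxy_ip": "127.0.0.1",
            "proxy_ping_port": str(proxy_ping_port),
            "http_port": str(http_port),
            "handshake_port": str(handshake_port),
            "notify_port": str(notify_port),
        },
    }
    return json.dumps(config)


def instance_ports(config_json: str) -> set[int]:
    """Collect the ports an instance listens on (the proxy port is shared)."""
    extra = json.loads(config_json)["kv_connector_extra_config"]
    return {int(extra[key]) for key in ("http_port", "handshake_port", "notify_port")}


prefill = build_kv_transfer_config("kv_producer", 20005, 6301, 6105)
decode = build_kv_transfer_config("kv_consumer", 40005, 7301, 7501)

# On a single node the two instances must not listen on the same port.
assert instance_ports(prefill).isdisjoint(instance_ports(decode))
print(prefill)
print(decode)
```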

Test Result

  1. Decode instance logs:
(APIServer pid=48125) INFO 05-20 15:38:00 [loggers.py:271] Engine 000: Avg prompt throughput: 4.5 tokens/s, Avg generation throughput: 1370.3 tokens/s, Running: 45 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.4%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 99.9%
(APIServer pid=48125) INFO 05-20 15:38:00 [metrics.py:103] KV Transfer metrics: Num successful transfers=180, Avg xfer time (ms)=40.375, P90 xfer time (ms)=65.157, Avg post time (ms)=2.476, P90 post time (ms)=2.877, Avg MB per transfer=46.266, Throughput (MB/s)=1145.9, Avg number of descriptors=188.0
(APIServer pid=48125) INFO:     45.63.76.253:57410 - "POST /v1/completions HTTP/1.1" 200 OK
...
(APIServer pid=48125) INFO 05-20 15:38:10 [metrics.py:103] KV Transfer metrics: Num successful transfers=76, Avg xfer time (ms)=60.748, P90 xfer time (ms)=79.615, Avg post time (ms)=2.533, P90 post time (ms)=2.92, Avg MB per transfer=46.266, Throughput (MB/s)=761.593, Avg number of descriptors=188.0
...
(APIServer pid=48125) INFO 05-20 15:38:40 [metrics.py:103] KV Transfer metrics: Num successful transfers=128, Avg xfer time (ms)=63.654, P90 xfer time (ms)=79.441, Avg post time (ms)=2.083, P90 post time (ms)=2.754, Avg MB per transfer=46.266, Throughput (MB/s)=726.829, Avg number of descriptors=188.0

Bench output:

============ Serving Benchmark Result ============
Successful requests:                     320
Failed requests:                         0
Maximum request concurrency:             64
Benchmark duration (s):                  157.39
Total input tokens:                      320000
Total generated tokens:                  320000
Request throughput (req/s):              2.03
Output token throughput (tok/s):         2033.20
Peak output token throughput (tok/s):    2240.00
Peak concurrent requests:                82.00
Total token throughput (tok/s):          4066.39
---------------Time to First Token----------------
Mean TTFT (ms):                          931.64
Median TTFT (ms):                        529.85
P99 TTFT (ms):                           4383.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          30.05
Median TPOT (ms):                        30.15
P99 TPOT (ms):                           30.34
---------------Inter-token Latency----------------
Mean ITL (ms):                           30.05
Median ITL (ms):                         30.03
P99 ITL (ms):                            36.97
==================================================
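As a quick consistency check, the throughput rows can be recomputed from the request count, token counts, and duration reported above. This Python sketch uses only those reported values; because the duration is rounded to two decimals, the results agree with the table only to within rounding.

```python
# Values copied from the benchmark output above.
successful_requests = 320
duration_s = 157.39
total_input_tokens = 320_000
total_generated_tokens = 320_000

request_throughput = successful_requests / duration_s
output_token_throughput = total_generated_tokens / duration_s
total_token_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(f"Request throughput (req/s):      {request_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_token_throughput:.2f}")
print(f"Total token throughput (tok/s):  {total_token_throughput:.2f}")
```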

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…metry

Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces telemetry and Prometheus metrics for the MoRI IO connector, adding a new stats.py module to track transfer performance, byte counts, and failures. The implementation updates the connector and engine to capture metrics during KV transfers. Review feedback identifies several high-severity issues, including potential KeyError exceptions when accessing callback addresses, an incorrect port calculation for notifications that could prevent block freeing on the producer side, and thread-safety risks when updating shared dictionaries across multiple threads.
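One common way to address both the KeyError and the thread-safety concerns is to guard the shared mapping with a lock and to pop entries with a default. The Python sketch below is illustrative only: the class and method names are hypothetical and do not come from the PR, and the connector may resolve these issues differently.

```python
import threading
from typing import Optional


class CallbackAddressRegistry:
    """Hypothetical helper: maps request_id -> (host, port) across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._addrs: dict[str, tuple[str, str]] = {}

    def set(self, request_id: str, host: str, port: str) -> None:
        # Writers take the lock so concurrent updates cannot interleave.
        with self._lock:
            self._addrs[request_id] = (host, port)

    def pop(self, request_id: str) -> Optional[tuple[str, str]]:
        # Returns None for an unknown request_id instead of raising KeyError.
        with self._lock:
            return self._addrs.pop(request_id, None)


registry = CallbackAddressRegistry()
registry.set("req-1", "127.0.0.1", "6105")
first = registry.pop("req-1")   # the stored (host, port) tuple
second = registry.pop("req-1")  # already removed, so None
print(first, second)
```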

Comment thread vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py Outdated
self._recving_transfers[request_id].append(transfer_status)
self._recving_transfers_callback_addr[request_id] = (
remote_host,
str(remote_notify_port + self.tp_rank),

high

The calculation of the notification port by adding self.tp_rank to remote_notify_port appears incorrect. Since remote_notify_port is extracted from the peer's ZMQ address, which already includes the appropriate port offset (calculated via get_port_offset during the peer's initialization), adding the local tp_rank again will result in an incorrect port number. This will cause completion notifications to be sent to the wrong port, potentially leading to memory leaks on the producer side as blocks are never freed.

Suggested change
- str(remote_notify_port + self.tp_rank),
+ str(remote_notify_port),

Contributor Author

This is actually fine when running with DP1TPx or DPxTP1: only TP rank 0 runs _ping, so remote_notify_port = base_address + tp_rank = base_address. Hence remote_notify_port + tp_rank is the correct port to notify for the matching TP rank. This will have to change once DP+TP support is added, though.

DP+TP support for MoRI is (partly) covered in #32291.
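The port arithmetic in this exchange can be illustrated with a small Python sketch. It assumes a layout in which each TP rank listens on the base notify port plus its rank; the helper name is hypothetical and the real get_port_offset logic is not reproduced here. The base port 6105 is the producer notify_port from the test plan.

```python
def notify_port_for_rank(base_notify_port: int, tp_rank: int) -> int:
    """Hypothetical layout: each TP rank listens on base port + its rank."""
    return base_notify_port + tp_rank


base_notify_port = 6105  # producer notify_port from the test plan
tp_size = 4

# Only TP rank 0 advertises its address, so the advertised port equals the base.
advertised_port = notify_port_for_rank(base_notify_port, tp_rank=0)

# Each local rank adds its own tp_rank to reach the peer rank with the same index.
targets = [advertised_port + rank for rank in range(tp_size)]
expected = [notify_port_for_rank(base_notify_port, rank) for rank in range(tp_size)]
assert targets == expected

# If a non-zero rank advertised its port instead, adding tp_rank again would
# double-count the offset, which is the failure mode the review comment describes.
advertised_by_rank_1 = notify_port_for_rank(base_notify_port, tp_rank=1)
double_counted = advertised_by_rank_1 + 1
assert double_counted != notify_port_for_rank(base_notify_port, tp_rank=1)
```

Under this assumed layout, both comments are consistent: the addition is correct while rank 0 is the only advertiser, and becomes wrong as soon as the advertised port already carries an offset.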

@mergify mergify Bot added the rocm (Related to AMD ROCm) and kv-connector labels May 20, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 20, 2026
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>

Labels

kv-connector rocm Related to AMD ROCm

Projects

Status: Todo

1 participant