[ROCm] MoRI connector telemetry#43218
Conversation
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
…metry Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
There was a problem hiding this comment.
Code Review
This pull request introduces telemetry and Prometheus metrics for the MoRI IO connector, adding a new stats.py module to track transfer performance, byte counts, and failures. The implementation updates the connector and engine to capture metrics during KV transfers. Review feedback identifies several high-severity issues, including potential KeyError exceptions when accessing callback addresses, an incorrect port calculation for notifications that could prevent block freeing on the producer side, and thread-safety risks when updating shared dictionaries across multiple threads.
| self._recving_transfers[request_id].append(transfer_status) | ||
| self._recving_transfers_callback_addr[request_id] = ( | ||
| remote_host, | ||
| str(remote_notify_port + self.tp_rank), |
There was a problem hiding this comment.
The calculation of the notification port by adding self.tp_rank to remote_notify_port appears incorrect. Since remote_notify_port is extracted from the peer's ZMQ address, which already includes the appropriate port offset (calculated via get_port_offset during the peer's initialization), adding the local tp_rank again will result in an incorrect port number. This will cause completion notifications to be sent to the wrong port, potentially leading to memory leaks on the producer side as blocks are never freed.
| str(remote_notify_port + self.tp_rank), | |
| str(remote_notify_port), |
There was a problem hiding this comment.
This is actually fine as long as running with DP1TPx or DPxTP1, because only TP rank 0 runs _ping and hence the remote_notify_port= base_address + tp_rank = base_adress. Hence remote_notify_port+tp_rank is the correct tp rank to notify. When DP+TP support is added, this will have to be changed though.
DP+TP support for MoRI is (partly) covered in #32291.
Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com>
Purpose
Add telemetry to MoRI KV connector.
Logs similar metrics as NIXL:
Test Plan
Expand for build details
and build:
docker build \ -f docker/Dockerfile.rocm_dev \ --build-arg BASE_IMAGE=vllm/vllm-openai-rocm:v0.21.0 \ -t ghcr.io/simondanielsson/vllm/vllm-openai-rocm:moriio-telemetry \ .Test Result
Bench output:
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.