[Feature] metrics support #3534


Draft · wants to merge 10 commits into main

9 changes: 8 additions & 1 deletion lmdeploy/cli/serve.py
@@ -125,6 +125,11 @@ def add_parser_api_server():
'engine’s tasks once the maximum number of concurrent requests is '
'reached, regardless of any additional requests sent by clients '
'concurrently during that time. Default to None.')
# FIXME: change default value to False
parser.add_argument('--enable-metrics',
action='store_true',
default=True,
help='Whether to log stats to CLI / Prometheus')
# common args
ArgumentHelper.backend(parser)
ArgumentHelper.log_level(parser)
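
A note on the FIXME above: because the new flag combines action='store_true' with default=True, parsing can only ever yield True, so metrics cannot be switched off from the command line until the default is changed to False. A minimal argparse sketch (plain Python, nothing lmdeploy-specific assumed) showing the behaviour:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--enable-metrics',
                    action='store_true',
                    default=True,
                    help='Whether to log stats to CLI / Prometheus')
print(parser.parse_args([]).enable_metrics)                    # True
print(parser.parse_args(['--enable-metrics']).enable_metrics)  # True; the flag cannot disable it
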
@@ -272,7 +277,8 @@ def gradio(args):
device_type=args.device,
quant_policy=args.quant_policy,
eager_mode=args.eager_mode,
max_prefill_token_num=args.max_prefill_token_num)
max_prefill_token_num=args.max_prefill_token_num,
enable_metrics=args.enable_metrics)
else:
backend_config = TurbomindEngineConfig(dtype=args.dtype,
tp=args.tp,
@@ -369,6 +375,7 @@ def api_server(args):
max_log_len=args.max_log_len,
disable_fastapi_docs=args.disable_fastapi_docs,
max_concurrent_requests=args.max_concurrent_requests,
enable_metrics=args.enable_metrics,
reasoning_parser=args.reasoning_parser,
tool_call_parser=args.tool_call_parser)

53 changes: 53 additions & 0 deletions lmdeploy/messages.py
@@ -1,5 +1,6 @@
# Copyright (c) OpenMMLab. All rights reserved.
import enum
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Literal, Optional

@@ -9,6 +10,7 @@
from lmdeploy.pytorch.disagg.config import EngineRole, MigrationBackend
from lmdeploy.pytorch.disagg.request import MigrationRequest

from .metrics.stats import IterationStats, RequestStateStats, SchedulerStats
from .tokenizer import Tokenizer
from .utils import get_logger

Expand Down Expand Up @@ -310,6 +312,7 @@ class PytorchEngineConfig:
'Decode']. Default to `EngineRole.Hybrid`.
migration_backend: migration backend. options: ['DLSlime'].
Default to `MigrationBackend.DLSlime`.
enable_metrics (bool): Whether to log stats to CLI / Prometheus.
"""
dtype: str = 'auto'
tp: int = 1
@@ -338,6 +341,7 @@ class PytorchEngineConfig:

role: EngineRole = EngineRole.Hybrid
migration_backend: MigrationBackend = MigrationBackend.DLSlime
enable_metrics: bool = False

def __post_init__(self):
"""Check input validation."""
@@ -407,6 +411,34 @@ class Response:
last_hidden_state: torch.Tensor = None
index: int = 0

scheduler_stats: SchedulerStats = None
iteration_stats: IterationStats = None
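
With metrics enabled, a Response can now carry the latest scheduler and iteration statistics alongside the generated output. A small, hypothetical helper (the field names come from this diff; whether they are populated depends on the serving path):

def log_response_stats(resp: 'Response') -> None:
    # Both fields default to None, so guard before logging them.
    if resp.scheduler_stats is not None:
        print(resp.scheduler_stats)
    if resp.iteration_stats is not None:
        print(resp.iteration_stats)
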


# copy from https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/__init__.py
class EngineCoreEventType(enum.IntEnum):
"""The type of engine core request event."""
QUEUED = 1
SCHEDULED = 2
PREEMPTED = 3 # FIXME, currently ignored for simplicity


# copy from https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/__init__.py
@dataclass
class EngineCoreEvent():
"""A timestamped engine core event associated with a request.

The timestamp is a monotonic timestamp and is used by the engine frontend to calculate intervals between engine
core events. These timestamps should not be compared with timestamps from other processes.
"""
type: EngineCoreEventType
timestamp: float

@classmethod
def new_event(cls, event_type: EngineCoreEventType, timestamp: Optional[float] = None) -> 'EngineCoreEvent':
timestamp = time.monotonic() if timestamp is None else timestamp
return cls(event_type, timestamp)
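
The point of using monotonic timestamps is that the frontend can subtract two events recorded in the same engine process to obtain an interval. A short usage sketch (hypothetical call sites, using only the classes defined above):

queued = EngineCoreEvent.new_event(EngineCoreEventType.QUEUED)
# ... the request waits in the scheduler queue ...
scheduled = EngineCoreEvent.new_event(EngineCoreEventType.SCHEDULED)
queue_interval = scheduled.timestamp - queued.timestamp  # seconds spent queued
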


@dataclass
class EngineOutput:
@@ -431,6 +463,27 @@ class EngineOutput:

cache_block_ids: Optional[List[int]] = None

# engine-side timestamp, for logging
timestamp: float = 0.0
scheduler_stats: SchedulerStats = None
iteration_stats: IterationStats = None
events: List[EngineCoreEvent] = None

def __post_init__(self):
if self.timestamp == 0.0:
self.timestamp = time.monotonic()


@dataclass
class RequestState:
"""per request state."""

def __init__(self, arrival_time: float, prompt_len: int, is_prefilling: bool, enable_metrics: bool):

self.prompt_len: int = prompt_len
self.is_prefilling: bool = is_prefilling
self.stats = RequestStateStats(arrival_time=arrival_time) if enable_metrics else None
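
A sketch of how the frontend might construct per-request state when a request arrives (the call site and values are illustrative, not taken from this diff); with enable_metrics=False the stats object is simply skipped:

import time

state = RequestState(arrival_time=time.monotonic(),
                     prompt_len=32,
                     is_prefilling=True,
                     enable_metrics=True)
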


@dataclass
class VisionConfig:
1 change: 1 addition & 0 deletions lmdeploy/metrics/__init__.py
@@ -0,0 +1 @@
# Copyright (c) OpenMMLab. All rights reserved.