
Commit 7cf4101
Author: yanke.55
docs: add code structure documentation
1 parent: c28a7a2

File tree: 8 files changed, +119 -0 lines changed


docs/assets/APIService.png (64.8 KB)
docs/assets/Engine.png (49.2 KB)
docs/assets/KVCacheManager.png (151 KB)
docs/assets/Master.png (36.7 KB)
docs/assets/Scheduler.png (36.4 KB)
docs/assets/Worker.png (66.7 KB)
docs/assets/workflow.png (1.35 MB)
Lines changed: 119 additions & 0 deletions
# Technical Briefing on xLLM Core Modules and Architecture

This document provides a technical overview of the core modules within the xLLM inference serving system, detailing their functionality, architectural positioning, and key responsibilities.

## Core Module Functionality Overview

The following table summarizes the primary components of the xLLM architecture, their corresponding code locations, and their roles in the inference pipeline.

| Module Name | Code Path | Core Functionality | Key Responsibilities |
| :--- | :--- | :--- | :--- |
| **APIService** | `api_service/` | External Interface Service | Receives external requests (e.g., OpenAI-compatible API calls), translates them into the internal request format, and forwards them to the Master. |
| **Master** (LLM Master) | `core/runtime/llm_master.cpp` | Control Plane & Request Dispatcher | Receives requests and drives the main execution loop for the Scheduler and Engine. |
| **Scheduler** (Continuous Scheduler) | `core/scheduler/continuous_scheduler.cpp` | Request Scheduling & Batching | Implements dynamic scheduling policies, prepares batch data, and manages the logical allocation of the KV Cache. |
| **Engine** (LLM Engine) | `core/runtime/llm_engine.cpp` | Execution Layer Manager | Manages distributed Workers, initializes the model, orchestrates inference execution steps, and handles multi-layer pipeline optimization. |
| **Worker** (LLM Worker / Worker Impl) | `core/runtime/llm_worker_impl.cpp` | Actual Computation Unit | Executes the model's forward pass on the AI accelerator and manages the local physical KV Cache. |
| **KV Cache Manager** | `core/framework/block/` & `core/framework/xtensor/` | Global KV Cache Management | Provides unified management of the KV Cache, including optional support for contiguous KV Cache allocation. |

## Detailed Module Descriptions

### 1. APIService

The **APIService** module is the external interface layer of xLLM and the entry point for all client requests. It receives inference requests and exposes an **OpenAI-compatible API** (e.g., `/v1/chat/completions`).

**Code Entry Point**: `api_service/api_service.cpp`

* **Service Encapsulation**: The service implementation (`chat_service_impl`) is constructed in `api_service/api_service.cpp` and registered with the **brpc** server framework.
* **Request Forwarding**: Upon receiving a request, `ChatCompletionsImpl` invokes the `process_async` method of `ChatServiceImpl`.
* **Core Invocation**: In `chat_service_impl.cpp`, the request is ultimately forwarded to the **Master** module via `master_->handle_request` for scheduling and execution, as in the sketch below.
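
To make the hand-off concrete, here is a minimal sketch. Every type below (`Request`, `Master`, `ChatServiceImpl`) is a simplified stand-in rather than the real xLLM API, which is built on brpc-generated services; only the `process_async` to `master_->handle_request` flow mirrors the description above.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical internal request format (stand-in).
struct Request {
  std::string prompt;
  int max_tokens = 128;
};

// Stand-in for the real Master; see core/runtime/llm_master.cpp.
class Master {
 public:
  void handle_request(Request req, std::function<void(std::string)> on_done) {
    // The real code enqueues the request for scheduling; here we just echo.
    on_done("completion for: " + req.prompt);
  }
};

// Stand-in for chat_service_impl.
class ChatServiceImpl {
 public:
  explicit ChatServiceImpl(Master* master) : master_(master) {}

  // Corresponds to the process_async path: translate the external
  // OpenAI-style payload into the internal format and forward it.
  void process_async(const std::string& user_message) {
    Request req{user_message};
    master_->handle_request(std::move(req), [](std::string completion) {
      std::cout << completion << "\n";  // the HTTP response would go here
    });
  }

 private:
  Master* master_;
};

int main() {
  Master master;
  ChatServiceImpl service(&master);
  service.process_async("Hello, xLLM!");  // simulates /v1/chat/completions
}
```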
<div align="center">
<img src="../../assets/APIService.png" width="100%">
</div>

### 2. Master (LLM Master)

The **Master** module constitutes the **control plane** of the xLLM inference service. It accepts user requests from the APIService and acts as the central coordinator, managing the entire request lifecycle: queuing, scheduling, and execution.

**Code Entry Point**: `core/runtime/llm_master.cpp`

* **Component Integration**: The Master initializes core components such as the **Engine** and **Scheduler** in its constructor.
* **Request Dispatch**: Incoming requests are dispatched asynchronously for execution (`handle_request`) via an internal **thread pool** (`threadpool_`).
* **Driver Loop**: The Master maintains a primary execution loop (`run`) that continuously invokes the Scheduler's `step` method, driving the overall inference process forward; a sketch follows this list.
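
The driver relationship can be pictured as follows. This is a compressed sketch with toy `Request` and `Scheduler` types (assumptions, not the real classes); the actual loop in `llm_master.cpp` also coordinates the Engine and handles batching budgets and shutdown.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>

struct Request { std::string prompt; };  // toy request type

// Stand-in for ContinuousScheduler: a thread-safe queue plus a step().
class Scheduler {
 public:
  void enqueue(Request r) {
    std::lock_guard<std::mutex> lk(mu_);
    queue_.push(std::move(r));
  }
  // One scheduling round; the real step() batches pending requests and
  // hands the batch to the Engine. Draining the queue stands in for that.
  void step() {
    std::lock_guard<std::mutex> lk(mu_);
    while (!queue_.empty()) queue_.pop();
  }
 private:
  std::mutex mu_;
  std::queue<Request> queue_;
};

class Master {
 public:
  // handle_request: in the real Master this is dispatched via threadpool_.
  void handle_request(Request r) { scheduler_.enqueue(std::move(r)); }

  // run: the primary execution loop that keeps calling the Scheduler.
  void run(std::atomic<bool>& stop) {
    while (!stop.load()) {
      scheduler_.step();
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }
 private:
  Scheduler scheduler_;
};

int main() {
  Master master;
  std::atomic<bool> stop{false};
  std::thread loop([&] { master.run(stop); });
  master.handle_request({"hi"});
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  stop = true;
  loop.join();
}
```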
<div align="center">
<img src="../../assets/Master.png" width="70%">
</div>

### 3. Scheduler (Continuous Scheduler)

The **Scheduler** module manages all pending requests and applies **dynamic scheduling policies** to decide which requests can be coalesced into a single **Batch** for concurrent inference.

**Code Entry Point**: `core/scheduler/continuous_scheduler.cpp`

* **Request Enqueuing**: Requests received from the API service are written to the internal request queue (`request_queue_.write(request)`).
* **Scheduling Logic**: The `step` method executes the `schedule_request` logic, which prepares batch data (`prepare_batch`), processes both **Prefill** and **Decode** requests, and interacts with the **KV Cache Manager** to allocate and free cache blocks.
* **Execution Invocation**: Once scheduling completes, the prepared batch is passed to the **Engine**'s `step` method for execution. The sketch below illustrates one such scheduling round.
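
The following toy version shows only the shape of a round: all types are hypothetical stand-ins, and the real `continuous_scheduler.cpp` additionally handles priorities, preemption, and token budgets. One plausible policy, used here for illustration, is to keep running decode sequences going first and admit waiting prefill requests while KV blocks remain.

```cpp
#include <deque>
#include <utility>
#include <vector>

struct Sequence {
  int generated_tokens = 0;  // 0 => still waiting for prefill
  std::vector<int> blocks;   // logical KV block ids
};

// Stand-in for the KV Cache Manager's allocation interface.
class BlockAllocator {
 public:
  explicit BlockAllocator(int num_blocks) : free_(num_blocks) {}
  bool try_allocate(Sequence& seq) {  // grab one more block for seq
    if (free_ == 0) return false;
    --free_;
    seq.blocks.push_back(next_id_++);
    return true;
  }
 private:
  int free_;
  int next_id_ = 0;
};

struct Batch {
  std::deque<Sequence*> prefill;  // sequences running their prompt
  std::deque<Sequence*> decode;   // sequences generating tokens
};

// Roughly the shape of step()/prepare_batch(): keep already-running decode
// sequences going first, then admit waiting prefill requests while KV
// blocks remain. std::deque keeps pointers stable as `running` grows.
Batch prepare_batch(std::deque<Sequence>& waiting,
                    std::deque<Sequence>& running,
                    BlockAllocator& kv) {
  Batch batch;
  for (Sequence& seq : running)
    if (kv.try_allocate(seq)) batch.decode.push_back(&seq);
  while (!waiting.empty() && kv.try_allocate(waiting.front())) {
    running.push_back(std::move(waiting.front()));
    waiting.pop_front();
    batch.prefill.push_back(&running.back());
  }
  return batch;  // handed to the Engine's step
}

int main() {
  std::deque<Sequence> waiting(3), running;
  BlockAllocator kv(/*num_blocks=*/2);
  Batch b = prepare_batch(waiting, running, kv);
  // With only 2 free blocks, 2 requests enter prefill and 1 keeps waiting.
  return b.prefill.size() == 2 ? 0 : 1;
}
```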
<div align="center">
<img src="../../assets/Scheduler.png" width="70%">
</div>

### 4. Engine (LLM Engine)

The **Engine** module is the **execution layer manager**: it distributes the batch tasks prepared by the Scheduler to the underlying **Worker** computation units and manages the execution flow in a distributed environment.

**Code Entry Point**: `core/runtime/llm_engine.cpp`

* **Initialization**: The Engine initializes the distributed manager (`dist_manager`) during construction. Its `Init()` method handles model initialization and estimates and allocates the global KV Cache capacity.
* **Execution Orchestration**: In each inference step (`step`), the Engine updates global parameters related to **Data Parallelism (DP)** and asynchronously dispatches the batch task to each Worker client via `worker_clients_[worker_rank]->step_async`, as in the fan-out sketch below.
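
A minimal sketch of that fan-out, assuming hypothetical `WorkerClient`, `Batch`, and `StepOutput` types; in the real system the call crosses process boundaries as an RPC rather than a local thread.

```cpp
#include <future>
#include <iostream>
#include <memory>
#include <vector>

struct Batch { int num_sequences = 0; };  // toy batch descriptor
struct StepOutput { int rank; };          // toy per-worker result

// Stand-in for a remote worker handle; real calls are RPCs.
class WorkerClient {
 public:
  explicit WorkerClient(int rank) : rank_(rank) {}
  std::future<StepOutput> step_async(const Batch& batch) {
    // Simulate the remote forward pass on a background thread.
    return std::async(std::launch::async, [rank = rank_, batch] {
      (void)batch;  // a real worker would run the model here
      return StepOutput{rank};
    });
  }
 private:
  int rank_;
};

class Engine {
 public:
  explicit Engine(int world_size) {
    for (int r = 0; r < world_size; ++r)
      worker_clients_.push_back(std::make_unique<WorkerClient>(r));
  }
  // One inference step: fan the batch out to every rank, then join.
  void step(const Batch& batch) {
    std::vector<std::future<StepOutput>> futures;
    for (auto& client : worker_clients_)
      futures.push_back(client->step_async(batch));
    for (auto& f : futures)
      std::cout << "rank " << f.get().rank << " done\n";
  }
 private:
  std::vector<std::unique_ptr<WorkerClient>> worker_clients_;
};

int main() {
  Engine engine(/*world_size=*/4);
  engine.step(Batch{8});
}
```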
<div align="center">
<img src="../../assets/Engine.png" width="70%">
</div>

### 5. Worker (LLM Worker / Worker Impl)

The **Worker** module is the fundamental unit of model computation in xLLM, typically bound to one or more **AI accelerator devices**.

**Code Entry Point**: `core/runtime/llm_worker_impl.cpp`

* **Local Resource Management**: During initialization, the Worker allocates the local physical memory for the KV Cache (`allocate_kv_cache`).
* **Asynchronous Computation**: It receives asynchronous calls (`step_async`) from the Engine and schedules the actual computation task on an internal thread pool.
* **Model Forward Pass**: The core `step` method invokes `model_executor_->forward`, which executes the model's forward computation, taking the input **Tokens**, position information, and **KV Cache** as arguments. The sketch after this list shows the shape of that path.
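
All types below are illustrative stand-ins: `step_async` defers to an executor (`std::async` here, in place of the internal thread pool), and `step` calls into the model executor with tokens, positions, and the locally allocated cache.

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

struct ForwardInput {
  std::vector<int> tokens;     // flattened token ids for the batch
  std::vector<int> positions;  // per-token position information
};

// allocate_kv_cache would reserve accelerator memory; host memory here.
struct KVCache {
  std::vector<float> storage;
  explicit KVCache(std::size_t floats) : storage(floats) {}
};

class ModelExecutor {
 public:
  std::vector<int> forward(const ForwardInput& in, KVCache& kv) {
    (void)kv;  // the real forward pass reads and writes the cache
    return std::vector<int>(in.tokens.size(), 0);  // dummy next-token ids
  }
};

class WorkerImpl {
 public:
  WorkerImpl() : kv_cache_(1 << 20) {}  // stands in for allocate_kv_cache

  // step_async: the real worker schedules onto an internal thread pool;
  // std::async plays that role in this sketch.
  std::future<std::vector<int>> step_async(ForwardInput in) {
    return std::async(std::launch::async,
                      [this, in = std::move(in)] { return step(in); });
  }

 private:
  std::vector<int> step(const ForwardInput& in) {
    return model_executor_.forward(in, kv_cache_);
  }
  ModelExecutor model_executor_;
  KVCache kv_cache_;
};

int main() {
  WorkerImpl worker;
  auto next_tokens = worker.step_async({{1, 2, 3}, {0, 1, 2}}).get();
  return next_tokens.empty() ? 1 : 0;
}
```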
84+
85+
<div align="center">
86+
<img src="../../assets/Worker.png" width="50%">
87+
</div>
88+
89+
### 6. KV Cache Manager (Block Manager & xTensor Manager)
90+
91+
The **KV Cache Manager** is responsible for the unified management of the KV Cache, handling allocation when the Scheduler module dispatches requests.
92+
93+
**Code Entry Points**: `core/framework/block/` and `core/framework/xtensor/`
94+
95+
#### 6.1 Block Manager
96+
97+
The **Block Manager** is responsible for the allocation and sharing of KV Cache Blocks, implementing the **PagedAttention** mechanism.
98+
99+
* **Block Allocation**: It manages a pool of free blocks and retrieves a Block ID from this pool during the `allocate` method.
100+
* **Prefix Cache**: It supports the `allocate_shared` method, which enables the reuse of existing KV blocks by matching a Token ID hash (`prefix_cache_->match`), thereby enhancing memory utilization.
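
A toy version of the two allocation paths. The block-granular hashing, eviction, and reference counting of the real prefix cache are omitted, and all names besides `allocate`/`allocate_shared` are hypothetical.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <unordered_map>
#include <vector>

using BlockId = int;

class BlockManager {
 public:
  explicit BlockManager(int num_blocks) {
    for (BlockId id = 0; id < num_blocks; ++id) free_pool_.push_back(id);
  }

  // Plain path: pop a block id from the free pool.
  std::optional<BlockId> allocate() {
    if (free_pool_.empty()) return std::nullopt;
    BlockId id = free_pool_.front();
    free_pool_.pop_front();
    return id;
  }

  // Prefix-cache path: reuse the block of an identical token prefix.
  std::optional<BlockId> allocate_shared(const std::vector<int>& tokens) {
    std::uint64_t h = hash_tokens(tokens);
    auto it = prefix_cache_.find(h);
    if (it != prefix_cache_.end()) return it->second;  // hit: share it
    std::optional<BlockId> id = allocate();
    if (id) prefix_cache_.emplace(h, *id);  // publish for future reuse
    return id;
  }

 private:
  static std::uint64_t hash_tokens(const std::vector<int>& tokens) {
    std::uint64_t h = 1469598103934665603ull;  // FNV-1a over token ids
    for (int t : tokens) {
      h ^= static_cast<std::uint64_t>(t);
      h *= 1099511628211ull;
    }
    return h;
  }
  std::deque<BlockId> free_pool_;
  std::unordered_map<std::uint64_t, BlockId> prefix_cache_;
};

int main() {
  BlockManager mgr(/*num_blocks=*/4);
  std::vector<int> prefix = {1, 2, 3};
  auto a = mgr.allocate_shared(prefix);  // miss: consumes a free block
  auto b = mgr.allocate_shared(prefix);  // hit: same block id, no new block
  return (a && b && *a == *b) ? 0 : 1;
}
```

Sharing by content hash is what lets two requests with the same prompt prefix reference one physical block instead of two.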
101+
102+
#### 6.2 xTensor Manager
103+
104+
The **xTensor Manager**, in conjunction with the **PageManagerPool**, implements the decoupling of logical address and physical memory mapping based on a **Virtual Memory Management (VMM) API**. This design facilitates **continuous KV Cache storage** and **on-demand physical memory allocation**.
105+
106+
* **xTensor Manager**: In the `allocate` method, it requests physical page IDs from the `PageManagerPool` and attaches these physical pages to multi-layer K/V xTensors, managing the mapping from logical addresses to physical page IDs.
107+
* **PageManagerPool**: This component manages the pool of physical memory pages on the AI accelerator. It executes low-level operations such as `batch_map` to map physical pages to the virtual pointers of the xTensor, realizing a memory management model characterized by **"logically contiguous, physically discrete"** storage.
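
The idea can be modeled on the host with plain memory, as below. This is only an analogy: the real implementation maps device pages through the VMM API (e.g., via `batch_map`), whereas this sketch (with hypothetical names such as `map_pages`) merely translates logical offsets to independently allocated pages.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

constexpr std::size_t kPageSize = 4096;

// One physical page; on the device this would be a VMM physical handle.
struct PhysicalPage {
  std::unique_ptr<char[]> mem{new char[kPageSize]};
};

class PageManagerPool {  // owns the physical pages
 public:
  int allocate_page() {
    pages_.push_back(PhysicalPage{});
    return static_cast<int>(pages_.size()) - 1;
  }
  char* page_ptr(int page_id) { return pages_[page_id].mem.get(); }
 private:
  std::vector<PhysicalPage> pages_;
};

class XTensor {  // a logically contiguous view over discrete pages
 public:
  explicit XTensor(PageManagerPool* pool) : pool_(pool) {}

  // batch_map analogue: back the next logical pages with physical pages.
  void map_pages(const std::vector<int>& page_ids) {
    for (int id : page_ids) mapping_.emplace(next_logical_page_++, id);
  }

  // Translate a logical offset to a pointer inside the mapped page.
  char* at(std::size_t logical_offset) {
    auto it = mapping_.find(logical_offset / kPageSize);
    assert(it != mapping_.end() && "page not mapped yet");
    return pool_->page_ptr(it->second) + logical_offset % kPageSize;
  }

 private:
  PageManagerPool* pool_;
  std::size_t next_logical_page_ = 0;
  std::unordered_map<std::size_t, int> mapping_;  // logical -> physical
};

int main() {
  PageManagerPool pool;
  XTensor k_cache(&pool);
  k_cache.map_pages({pool.allocate_page(), pool.allocate_page()});
  *k_cache.at(0) = 1;              // the two writes land on separate
  *k_cache.at(kPageSize + 8) = 2;  // physical pages via one logical range
}
```

Keeping the tensor logically contiguous means kernels can address the cache with plain offsets, while physical pages are committed only as sequences actually grow.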
108+
109+
<div align="center">
110+
<img src="../../assets/KVCacheManager.png" width="80%">
111+
</div>
112+
113+
## Execution Flow Diagram
114+
115+
This diagram visually represents the interaction and data flow between the APIService, Master, Scheduler, Engine, and Worker modules, illustrating the control plane and data plane separation.
116+
117+
<div align="center">
118+
<img src="../../assets/workflow.png" width="80%">
119+
</div>
