# Technical Briefing on xLLM Core Modules and Architecture

This document provides a technical overview of the core modules within the xLLM inference serving system, detailing their functionality, architectural positioning, and key responsibilities.

## Core Module Functionality Overview

The following table summarizes the primary components of the xLLM architecture, their corresponding code locations, and their roles in the inference pipeline.

| Module Name | Code Path | Core Functionality | Key Responsibilities |
| :--- | :--- | :--- | :--- |
| **APIService** | `api_service/` | External Interface Service | Receives external requests (e.g., OpenAI-compatible API calls), translates them into the internal request format, and forwards them to the Master. |
| **Master** (LLM Master) | `core/runtime/llm_master.cpp` | Control Plane & Request Dispatcher | Receives requests and drives the main execution loop for the Scheduler and Engine. |
| **Scheduler** (Continuous Scheduler) | `core/scheduler/continuous_scheduler.cpp` | Request Scheduling & Batching | Implements dynamic scheduling policies, prepares batch data for processing, and manages the logical allocation of KV Cache. |
| **Engine** (LLM Engine) | `core/runtime/llm_engine.cpp` | Execution Layer Manager | Manages distributed Workers, initializes the model, orchestrates inference execution steps, and handles multi-layer pipeline optimization. |
| **Worker** (LLM Worker / Worker Impl) | `core/runtime/llm_worker_impl.cpp` | Actual Computation Unit | Executes the model's forward pass on the AI accelerator and manages the local physical KV Cache. |
| **KV Cache Manager** | `core/framework/block/` & `core/framework/xtensor/` | Global KV Cache Management | Provides unified management for KV Cache, including optional support for continuous KV Cache allocation. |

## Detailed Module Descriptions

### 1. APIService

The **APIService** module functions as the external interface layer for xLLM, serving as the initial entry point for all client requests. It is responsible for receiving inference requests and exposing an **OpenAI-compatible API** (e.g., `/v1/chat/completions`).

**Code Entry Point**: `api_service/api_service.cpp`

* **Service Encapsulation**: The service implementation (`chat_service_impl`) is constructed within `api_service/api_service.cpp` and integrated into the **brpc server** framework.
* **Request Forwarding**: Upon receiving a request, the `ChatCompletionsImpl` invokes the `process_async` method of the `ChatServiceImpl`.
* **Core Invocation**: In `chat_service_impl.cpp`, the request is ultimately forwarded to the **Master** module via `master_->handle_request` for subsequent scheduling and execution.

<div align="center">
  <img src="../../assets/APIService.png" width="100%">
</div>
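
The forwarding chain can be sketched in a few lines of C++. This is a hypothetical stand-in, not the real brpc service: `Request`, `Response`, the callback type, and the simplified `Master` here are placeholders for the actual types in `api_service/` and `core/runtime/`.

```cpp
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-ins for the real types: the actual ChatServiceImpl,
// Master, and request/response messages differ in detail.
struct Request { std::string prompt; };
struct Response { std::string text; };
using Callback = std::function<void(Response)>;

// Minimal Master facade: handle_request takes a request plus a completion
// callback, mirroring the asynchronous hand-off described above.
class Master {
 public:
  void handle_request(Request req, Callback cb) {
    // Placeholder for scheduling + execution; echoes the prompt back.
    cb(Response{"echo: " + req.prompt});
  }
};

// Minimal ChatServiceImpl: translates an external API call into the
// internal Request format and forwards it to the Master.
class ChatServiceImpl {
 public:
  explicit ChatServiceImpl(Master* master) : master_(master) {}
  void process_async(const std::string& user_prompt, Callback cb) {
    Request req{user_prompt};  // OpenAI-style payload -> internal format
    master_->handle_request(std::move(req), std::move(cb));
  }
 private:
  Master* master_;
};
```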

### 2. Master (LLM Master)

The **Master** module constitutes the **control plane** of the xLLM inference service. It accepts user requests from the APIService and acts as the central coordinator, managing the entire request lifecycle, including queuing, scheduling, and execution.

**Code Entry Point**: `core/runtime/llm_master.cpp`

* **Component Integration**: The Master initializes core components such as the **Engine** and **Scheduler** within its constructor.
* **Request Dispatch**: Incoming requests are asynchronously dispatched for execution (`handle_request`) via an internal **thread pool** (`threadpool_`).
* **Driver Loop**: The Master maintains a primary execution loop (`run`) that continuously invokes the Scheduler's `step` method, driving the overall inference process forward.

<div align="center">
  <img src="../../assets/Master.png" width="70%">
</div>
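
The driver-loop structure can be illustrated with a small sketch. The real `run` in `core/runtime/llm_master.cpp` loops for the lifetime of the service; here a `max_steps` bound replaces its stop condition so the example terminates, and the `Scheduler` is a stub that only counts invocations.

```cpp
#include <atomic>

struct Batch { int size = 0; };

// Stub Scheduler: step() would normally build a batch and hand it to the
// Engine; here it just records that it was driven.
class Scheduler {
 public:
  Batch step() { ++steps_; return Batch{steps_}; }
  int steps() const { return steps_; }
 private:
  int steps_ = 0;
};

// Hedged sketch of the Master's control loop: it owns no computation
// itself, it only keeps driving the Scheduler forward.
class Master {
 public:
  explicit Master(Scheduler* sched) : sched_(sched) {}
  // `max_steps` is an illustration-only bound, not part of the real API.
  void run(int max_steps) {
    for (int i = 0; i < max_steps && !stopped_; ++i) sched_->step();
  }
  void stop() { stopped_ = true; }
 private:
  Scheduler* sched_;
  std::atomic<bool> stopped_{false};
};
```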

### 3. Scheduler (Continuous Scheduler)

The **Scheduler** module is responsible for managing all pending requests and applying sophisticated **dynamic scheduling policies** to determine which requests can be coalesced into a single **Batch** for concurrent inference.

**Code Entry Point**: `core/scheduler/continuous_scheduler.cpp`

* **Request Enqueuing**: Requests received from the API service are written to the internal request queue (`request_queue_.write(request)`).
* **Scheduling Logic**: The `step` method executes the `schedule_request` logic, which involves preparing batch data (`prepare_batch`), processing both **Prefill** and **Decode** requests, and interacting with the **KV Cache Manager** for cache allocation and deallocation.
* **Execution Invocation**: Once scheduling is complete, the prepared batch data is passed to the **Engine's** `step` method for execution.

<div align="center">
  <img src="../../assets/Scheduler.png" width="70%">
</div>
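
The core idea of continuous batching can be sketched as follows. This is a deliberately simplified model under stated assumptions: decode requests cost one token per step, prefills cost their full prompt length, and a single token budget gates admission. The names `prepare_batch` and the prefill/decode split echo the text; the actual policy in `continuous_scheduler.cpp` is far richer (KV Cache availability checks, preemption, priorities).

```cpp
#include <deque>
#include <vector>

struct Request {
  int id;
  int prompt_tokens;       // cost of the prefill pass
  bool prefilled = false;  // becomes true once admitted
};

class ContinuousScheduler {
 public:
  void enqueue(Request r) { waiting_.push_back(r); }

  // One scheduling step: carry over all running (decode) requests, then
  // admit as many new prefills as fit in `token_budget`.
  std::vector<Request> prepare_batch(int token_budget) {
    std::vector<Request> batch = running_;                // decode: 1 token each
    int used = static_cast<int>(running_.size());
    while (!waiting_.empty() &&
           used + waiting_.front().prompt_tokens <= token_budget) {
      Request r = waiting_.front();
      waiting_.pop_front();
      used += r.prompt_tokens;
      r.prefilled = true;
      batch.push_back(r);
      running_.push_back(r);                              // decodes next step
    }
    return batch;
  }

 private:
  std::deque<Request> waiting_;   // stand-in for request_queue_
  std::vector<Request> running_;  // requests currently generating tokens
};
```

Note how a large prompt (request 2 below) simply waits until the budget allows it, while already-admitted requests keep decoding every step; that interleaving is what "continuous" batching refers to.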

### 4. Engine (LLM Engine)

The **Engine** module serves as the **execution layer manager**, responsible for distributing the batch tasks prepared by the Scheduler to the underlying **Worker** computation units and managing the execution flow in a distributed environment.

**Code Entry Point**: `core/runtime/llm_engine.cpp`

* **Initialization**: The Engine initializes the distributed manager (`dist_manager`) during construction. Its `Init()` method handles model initialization and the estimation and allocation of the global KV Cache capacity.
* **Execution Orchestration**: In each inference step (`Step`), the Engine updates global parameters related to **Data Parallelism (DP)** and asynchronously dispatches the batch task to each Worker client via `worker_clients_[worker_rank]->step_async`.

<div align="center">
  <img src="../../assets/Engine.png" width="70%">
</div>
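
The fan-out/gather pattern of a step can be sketched as below. This is a hypothetical in-process stand-in: the real `WorkerClient` is an RPC stub to a remote process, whereas here `step_async` just runs a stub computation on a background thread and the result type is a placeholder.

```cpp
#include <future>
#include <vector>

struct Batch { int num_tokens; };

// Stand-in worker client: step_async mirrors the asynchronous dispatch
// named in the text, returning a future instead of blocking the Engine.
class WorkerClient {
 public:
  std::future<int> step_async(const Batch& b) {
    // A real worker runs the model forward pass on an accelerator;
    // this stub just reports how many tokens it was asked to process.
    return std::async(std::launch::async, [b] { return b.num_tokens; });
  }
};

// Hedged sketch of the Engine's per-step orchestration: dispatch the
// batch to every worker, then gather all results before returning.
class LLMEngine {
 public:
  explicit LLMEngine(int num_workers) : worker_clients_(num_workers) {}

  int step(const Batch& batch) {
    std::vector<std::future<int>> futures;
    for (auto& client : worker_clients_)
      futures.push_back(client.step_async(batch));  // fan-out
    int total = 0;
    for (auto& f : futures) total += f.get();       // gather
    return total;
  }

 private:
  std::vector<WorkerClient> worker_clients_;
};
```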

### 5. Worker (LLM Worker / Worker Impl)

The **Worker** module is the fundamental unit for executing model computation within xLLM, typically bound to one or more **AI accelerator devices**.

**Code Entry Point**: `core/runtime/llm_worker_impl.cpp`

* **Local Resource Management**: The Worker is responsible for allocating the local physical memory for the KV Cache (`allocate_kv_cache`) during its initialization phase.
* **Asynchronous Computation**: It receives asynchronous calls (`step_async`) from the Engine and schedules the actual computation task via an internal thread pool.
* **Model Forward Pass**: The core `Step` method invokes `model_executor_->forward`, which executes the model's forward computation, taking the input **Tokens**, position information, and **KV Cache** as arguments.

<div align="center">
  <img src="../../assets/Worker.png" width="50%">
</div>

### 6. KV Cache Manager (Block Manager & xTensor Manager)

The **KV Cache Manager** is responsible for the unified management of the KV Cache, handling allocation when the Scheduler module dispatches requests.

**Code Entry Points**: `core/framework/block/` and `core/framework/xtensor/`

#### 6.1 Block Manager

The **Block Manager** is responsible for the allocation and sharing of KV Cache Blocks, implementing the **PagedAttention** mechanism.

* **Block Allocation**: It manages a pool of free blocks and retrieves a Block ID from this pool during the `allocate` method.
* **Prefix Cache**: It supports the `allocate_shared` method, which enables the reuse of existing KV blocks by matching a Token ID hash (`prefix_cache_->match`), thereby enhancing memory utilization.
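
The two allocation paths can be sketched together. This is a minimal model under stated assumptions: one block per prefix, a toy token hash, and no reference counting or eviction, all of which the real `core/framework/block/` implementation handles.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <vector>

// Hedged sketch of paged KV-cache block allocation with prefix reuse:
// allocate() pops from a free pool; allocate_shared() first checks a
// token-hash index, echoing the prefix_cache_->match idea in the text.
class BlockManager {
 public:
  explicit BlockManager(int num_blocks) {
    for (int i = 0; i < num_blocks; ++i) free_blocks_.push_back(i);
  }

  // Plain allocation: hand out the next free Block ID.
  int allocate() {
    int id = free_blocks_.front();
    free_blocks_.pop_front();
    return id;
  }

  // Shared allocation: if a block already holds this token prefix,
  // reuse it; otherwise allocate fresh and remember the mapping.
  int allocate_shared(const std::vector<int>& prefix_tokens) {
    size_t h = hash_tokens(prefix_tokens);
    auto it = prefix_cache_.find(h);
    if (it != prefix_cache_.end()) return it->second;  // cache hit: share
    int id = allocate();
    prefix_cache_[h] = id;
    return id;
  }

  size_t num_free() const { return free_blocks_.size(); }

 private:
  static size_t hash_tokens(const std::vector<int>& t) {
    size_t h = 0;
    for (int x : t) h = h * 1000003u + std::hash<int>{}(x);
    return h;
  }
  std::deque<int> free_blocks_;
  std::map<size_t, int> prefix_cache_;  // token-prefix hash -> Block ID
};
```

Two requests sharing a prompt prefix thus consume one physical block instead of two, which is exactly the memory-utilization win the prefix cache targets.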

#### 6.2 xTensor Manager

The **xTensor Manager**, in conjunction with the **PageManagerPool**, implements the decoupling of logical address and physical memory mapping based on a **Virtual Memory Management (VMM) API**. This design facilitates **continuous KV Cache storage** and **on-demand physical memory allocation**.

* **xTensor Manager**: In the `allocate` method, it requests physical page IDs from the `PageManagerPool` and attaches these physical pages to multi-layer K/V xTensors, managing the mapping from logical addresses to physical page IDs.
* **PageManagerPool**: This component manages the pool of physical memory pages on the AI accelerator. It executes low-level operations such as `batch_map` to map physical pages to the virtual pointers of the xTensor, realizing a memory management model characterized by **"logically contiguous, physically discrete"** storage.

<div align="center">
  <img src="../../assets/KVCacheManager.png" width="80%">
</div>
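
The "logically contiguous, physically discrete" idea can be illustrated with a toy page table. This sketch only models the bookkeeping: the tensor sees a contiguous range of logical pages, while the pool hands out whatever physical pages are free and records the mapping. The actual `batch_map` operates on device virtual addresses through the VMM API, not on integer IDs.

```cpp
#include <map>
#include <vector>

// Hypothetical page-mapping sketch: logical pages are contiguous from the
// tensor's point of view; physical pages can be anything in the pool.
class PageManagerPool {
 public:
  explicit PageManagerPool(int num_physical_pages) {
    for (int i = num_physical_pages - 1; i >= 0; --i)
      free_pages_.push_back(i);
  }

  // Map `count` consecutive logical pages starting at `logical_base`
  // onto arbitrary free physical pages; returns the physical page IDs.
  std::vector<int> batch_map(int logical_base, int count) {
    std::vector<int> physical;
    for (int i = 0; i < count; ++i) {
      int phys = free_pages_.back();
      free_pages_.pop_back();
      mapping_[logical_base + i] = phys;
      physical.push_back(phys);
    }
    return physical;
  }

  // Resolve a logical page to the physical page backing it.
  int translate(int logical_page) const { return mapping_.at(logical_page); }

 private:
  std::vector<int> free_pages_;
  std::map<int, int> mapping_;  // logical page -> physical page
};
```

Because growth happens by mapping additional pages rather than reallocating, the KV Cache can stay logically contiguous while physical memory is committed strictly on demand.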

## Execution Flow Diagram

This diagram visually represents the interaction and data flow between the APIService, Master, Scheduler, Engine, and Worker modules, illustrating the control plane and data plane separation.

<div align="center">
  <img src="../../assets/workflow.png" width="80%">
</div>