
Structure of Mini-SGLang

System Architecture

Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together.

Key Components

  • API Server: The entry point for users. It provides an OpenAI-compatible API (e.g., /v1/chat/completions) to receive prompts and return generated text.
  • Tokenizer Worker: Converts input text into token IDs that the model can process.
  • Detokenizer Worker: Converts the token IDs produced by the model back into human-readable text.
  • Scheduler Worker: The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU.

Data Flow

The components communicate using ZeroMQ (ZMQ) for control messages and torch.distributed for tensor data exchange between TP ranks.
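The control-message flow between these workers can be sketched as below. This is a minimal single-process stand-in, not the real implementation: Mini-SGLang connects separate processes with ZMQ sockets, while here stdlib queues connect threads, and the message shapes and toy "tokenization" are invented for illustration.

```python
import queue
import threading

# Stand-ins for the ZMQ sockets between worker processes.
to_scheduler: "queue.Queue" = queue.Queue()
to_detokenizer: "queue.Queue" = queue.Queue()
results: "queue.Queue" = queue.Queue()

def tokenizer_worker(text: str) -> None:
    # Text -> token IDs (toy tokenization: one ID per word).
    tokens = [sum(ord(c) for c in w) % 1000 for w in text.split()]
    to_scheduler.put(("req-0", tokens))

def scheduler_worker() -> None:
    rid, tokens = to_scheduler.get()
    # The Engine would run a model forward pass here; we just echo
    # the last input token as the "generated" one.
    next_token = tokens[-1]
    to_detokenizer.put((rid, next_token))

def detokenizer_worker() -> None:
    rid, token = to_detokenizer.get()
    results.put((rid, f"<tok:{token}>"))

for target, args in [(tokenizer_worker, ("hello world",)),
                     (scheduler_worker, ()),
                     (detokenizer_worker, ())]:
    threading.Thread(target=target, args=args).start()

rid, text = results.get(timeout=5)
print(rid, text)
```

The point of the sketch is the topology: each worker only knows its inbound and outbound channel, so the stages can live in separate processes (or machines) without changing the logic.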

[Diagram: process overview]

Request Lifecycle:

  1. User sends a request to the API Server.
  2. API Server forwards it to the Tokenizer.
  3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0).
  4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs).
  5. All Schedulers admit the request into a batch and trigger their local Engine to compute the next token.
  6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer.
  7. Detokenizer converts the token to text and sends it back to the API Server.
  8. API Server streams the result back to the User.
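Steps 4 through 6 of the lifecycle can be sketched as follows. This is a hedged illustration, not Mini-SGLang's code: the real system uses torch.distributed for the broadcast and runs each rank in its own process with a model shard; here plain Python lists simulate the ranks, and `engine_step` is a toy stand-in for the forward pass.

```python
WORLD_SIZE = 4  # e.g. four GPUs, one Scheduler worker per TP rank

def broadcast(obj, src_rank: int = 0):
    # Stand-in for a torch.distributed object broadcast:
    # every rank receives a copy of rank 0's request.
    return [obj for _ in range(WORLD_SIZE)]

def engine_step(rank: int, tokens: list) -> int:
    # Toy "forward pass". In reality each rank holds a shard of the
    # model and ranks combine partial results, so all ranks end up
    # with the same next token.
    return (sum(tokens) + 1) % 50257

request = [101, 2023, 2003]        # token IDs from the Tokenizer
per_rank = broadcast(request)      # step 4: rank 0 -> all ranks
outputs = [engine_step(r, toks)    # step 5: every rank computes
           for r, toks in enumerate(per_rank)]
assert len(set(outputs)) == 1      # all ranks agree on the token
print(outputs[0])                  # step 6: only rank 0 reports it
```

The design choice worth noting is that every rank runs the same scheduling decisions on the same broadcast request, so no extra coordination is needed during the decode step itself.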

Code Organization (minisgl Package)

The source code is located in python/minisgl. Here is a breakdown of the modules for developers:

  • minisgl.core: Core dataclasses for request/batch state (Req, Batch) and sampling config (SamplingParams).
  • minisgl.distributed: TP metadata/state helpers (DistributedInfo, set/get/try_get_tp_info) in distributed/tp.py.
  • minisgl.engine: Per-TP worker runtime (Engine, config, graph runner, sampling glue).
  • minisgl.scheduler: Scheduling pipeline (prefill/decode/table/cache managers + event loop + I/O mixins).
  • minisgl.kvcache: KV cache manager interfaces and implementations (NaiveCacheManager, RadixCacheManager).
  • minisgl.neuron: Neuron model loading and input-building adapters (model_loader.py, inputs.py).
  • minisgl.message: Message schemas/serialization used between frontend, scheduler, and tokenizer workers.
  • minisgl.server: CLI parsing, process launch orchestration, and FastAPI frontend server.
  • minisgl.tokenizer: Tokenize/detokenize worker implementations.
  • minisgl.llm: Offline/local Python interface (LLM) built on top of scheduler flow.
  • minisgl.kernel: Low-level tvm-ffi kernels; currently used by radix cache via CPU kernel radix.cpp (fast_compare_key).
  • minisgl.benchmark: Benchmark client helpers and result processing utilities.
  • minisgl.utils: Shared utilities (logger, HF config loading, registries, ZMQ wrappers, torch helpers, misc).
  • minisgl.env: Environment-variable backed runtime knobs.
  • minisgl.shell: Interactive shell frontend.
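To make the role of minisgl.message concrete, here is a hypothetical sketch of message schemas passed between the frontend, tokenizer, and scheduler workers. The class names, fields, and use of pickle are assumptions for illustration; the actual schemas and wire format in minisgl.message may differ.

```python
import pickle
from dataclasses import dataclass

# Hypothetical message shapes (field names invented for this sketch).
@dataclass
class TokenizedRequest:
    rid: str             # request ID assigned by the frontend
    token_ids: list      # output of the Tokenizer worker
    max_new_tokens: int  # sampling budget

@dataclass
class DecodedOutput:
    rid: str
    text: str            # output of the Detokenizer worker
    finished: bool

def send(msg) -> bytes:
    # Serialize for transport; a real system would push these bytes
    # over a ZMQ socket between worker processes.
    return pickle.dumps(msg)

def recv(payload: bytes):
    return pickle.loads(payload)

msg = TokenizedRequest(rid="req-0", token_ids=[101, 2023], max_new_tokens=16)
assert recv(send(msg)) == msg
print("round-trip ok")
```

Keeping schemas as plain dataclasses makes the inter-process contract explicit and easy to version, which is presumably why the project isolates them in their own module.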