
Structure of Mini-SGLang

System Architecture

Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together.

Key Components

  • API Server: The entry point for users. It provides an OpenAI-compatible API (e.g., /v1/chat/completions) to receive prompts and return generated text.
  • Tokenizer Worker: Converts input text into token IDs that the model can process.
  • Detokenizer Worker: Converts the token IDs produced by the model back into human-readable text.
  • Scheduler Worker: The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU.

Data Flow

The components communicate using ZeroMQ (ZMQ) for control messages and torch.distributed for tensor data exchange between TP ranks.
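The control-message flow between these workers can be sketched as below. This is a minimal single-process stand-in, not the real implementation: Mini-SGLang connects separate processes with ZMQ sockets, while here stdlib queues connect threads, and the message shapes and toy "tokenization" are invented for illustration.

```python
import queue
import threading

# Stand-ins for the ZMQ sockets between worker processes.
to_scheduler: "queue.Queue" = queue.Queue()
to_detokenizer: "queue.Queue" = queue.Queue()
results: "queue.Queue" = queue.Queue()

def tokenizer_worker(text: str) -> None:
    # Text -> token IDs (toy tokenization: one ID per word).
    tokens = [sum(ord(c) for c in w) % 1000 for w in text.split()]
    to_scheduler.put(("req-0", tokens))

def scheduler_worker() -> None:
    rid, tokens = to_scheduler.get()
    # The Engine would run a model forward pass here; we just echo
    # the last input token as the "generated" one.
    next_token = tokens[-1]
    to_detokenizer.put((rid, next_token))

def detokenizer_worker() -> None:
    rid, token = to_detokenizer.get()
    results.put((rid, f"<tok:{token}>"))

for target, args in [(tokenizer_worker, ("hello world",)),
                     (scheduler_worker, ()),
                     (detokenizer_worker, ())]:
    threading.Thread(target=target, args=args).start()

rid, text = results.get(timeout=5)
print(rid, text)
```

The point of the sketch is the topology: each worker only knows its inbound and outbound channel, so the stages can live in separate processes (or machines) without changing the logic.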

[Diagram: process overview]

Request Lifecycle:

  1. User sends a request to the API Server.
  2. API Server forwards it to the Tokenizer.
  3. Tokenizer converts text to tokens and sends them to the Scheduler (Rank 0).
  4. Scheduler (Rank 0) broadcasts the request to all other Schedulers (if using multiple GPUs).
  5. All Schedulers admit the request into a batch and trigger their local Engine to compute the next token.
  6. Scheduler (Rank 0) collects the output token and sends it to the Detokenizer.
  7. Detokenizer converts the token to text and sends it back to the API Server.
  8. API Server streams the result back to the User.
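Steps 4 through 6 of the lifecycle can be sketched as follows. This is a hedged illustration, not Mini-SGLang's code: the real system uses torch.distributed for the broadcast and runs each rank in its own process with a model shard; here plain Python lists simulate the ranks, and `engine_step` is a toy stand-in for the forward pass.

```python
WORLD_SIZE = 4  # e.g. four GPUs, one Scheduler worker per TP rank

def broadcast(obj, src_rank: int = 0):
    # Stand-in for a torch.distributed object broadcast:
    # every rank receives a copy of rank 0's request.
    return [obj for _ in range(WORLD_SIZE)]

def engine_step(rank: int, tokens: list) -> int:
    # Toy "forward pass". In reality each rank holds a shard of the
    # model and ranks combine partial results, so all ranks end up
    # with the same next token.
    return (sum(tokens) + 1) % 50257

request = [101, 2023, 2003]        # token IDs from the Tokenizer
per_rank = broadcast(request)      # step 4: rank 0 -> all ranks
outputs = [engine_step(r, toks)    # step 5: every rank computes
           for r, toks in enumerate(per_rank)]
assert len(set(outputs)) == 1      # all ranks agree on the token
print(outputs[0])                  # step 6: only rank 0 reports it
```

The design choice worth noting is that every rank runs the same scheduling decisions on the same broadcast request, so no extra coordination is needed during the decode step itself.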

Code Organization (minisgl Package)

The source code is located in python/minisgl. Here is a breakdown of the modules for developers:

  • minisgl.core: Core dataclasses for request/batch state (Req, Batch) and sampling config (SamplingParams).
  • minisgl.distributed: TP metadata/state helpers (DistributedInfo, set/get/try_get_tp_info) in distributed/tp.py.
  • minisgl.engine: Per-TP worker runtime (Engine, config, graph runner, sampling glue).
  • minisgl.scheduler: Scheduling pipeline (prefill/decode/table/cache managers + event loop + I/O mixins).
  • minisgl.kvcache: KV cache manager interfaces and implementations (NaiveCacheManager, RadixCacheManager).
  • minisgl.neuron: Neuron model loading and input-building adapters (model_loader.py, inputs.py).
  • minisgl.message: Message schemas/serialization used between frontend, scheduler, and tokenizer workers.
  • minisgl.server: CLI parsing, process launch orchestration, and FastAPI frontend server.
  • minisgl.tokenizer: Tokenize/detokenize worker implementations.
  • minisgl.llm: Offline/local Python interface (LLM) built on top of scheduler flow.
  • minisgl.kernel: Low-level tvm-ffi kernels; currently used by radix cache via CPU kernel radix.cpp (fast_compare_key).
  • minisgl.benchmark: Benchmark client helpers and result processing utilities.
  • minisgl.utils: Shared utilities (logger, HF config loading, registries, ZMQ wrappers, torch helpers, misc).
  • minisgl.env: Environment-variable backed runtime knobs.
  • minisgl.shell: Interactive shell frontend.
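To make the role of minisgl.message concrete, here is a hypothetical sketch of message schemas passed between the frontend, tokenizer, and scheduler workers. The class names, fields, and use of pickle are assumptions for illustration; the actual schemas and wire format in minisgl.message may differ.

```python
import pickle
from dataclasses import dataclass

# Hypothetical message shapes (field names invented for this sketch).
@dataclass
class TokenizedRequest:
    rid: str             # request ID assigned by the frontend
    token_ids: list      # output of the Tokenizer worker
    max_new_tokens: int  # sampling budget

@dataclass
class DecodedOutput:
    rid: str
    text: str            # output of the Detokenizer worker
    finished: bool

def send(msg) -> bytes:
    # Serialize for transport; a real system would push these bytes
    # over a ZMQ socket between worker processes.
    return pickle.dumps(msg)

def recv(payload: bytes):
    return pickle.loads(payload)

msg = TokenizedRequest(rid="req-0", token_ids=[101, 2023], max_new_tokens=16)
assert recv(send(msg)) == msg
print("round-trip ok")
```

Keeping schemas as plain dataclasses makes the inter-process contract explicit and easy to version, which is presumably why the project isolates them in their own module.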