Mini-SGLang is designed as a distributed system to handle Large Language Model (LLM) inference efficiently. It consists of several independent processes working together.
- API Server: The entry point for users. It provides an OpenAI-compatible API (e.g., `/v1/chat/completions`) to receive prompts and return generated text.
- Tokenizer Worker: Converts input text into numbers (tokens) that the model can understand.
- Detokenizer Worker: Converts the numbers (tokens) generated by the model back into human-readable text.
- Scheduler Worker: The core worker process. In a multi-GPU setup, there is one Scheduler Worker for each GPU (referred to as a TP Rank). It manages the computation and resource allocation for that specific GPU.
The components communicate using ZeroMQ (ZMQ) for control messages and `torch.distributed` for tensor data exchange between TP ranks.
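For illustration, here is a minimal sketch of the two communication patterns. The socket address, message shape, and tensor contents below are assumptions for illustration; Mini-SGLang's actual wire format lives in `minisgl.message`.

```python
# Minimal sketch of the control and data planes; the address, message
# shape, and tensor contents are illustrative assumptions.
import zmq
import torch
import torch.distributed as dist

# Control plane: a ZMQ socket carrying a small Python-object message
# (e.g., a tokenized request) from one process to another.
ctx = zmq.Context()
to_scheduler = ctx.socket(zmq.PUSH)
to_scheduler.connect("ipc:///tmp/minisgl_scheduler")  # hypothetical address
to_scheduler.send_pyobj({"rid": "req-0", "input_ids": [1, 2, 3]})

# Data plane: a torch.distributed broadcast so every TP rank sees the
# same tensor; assumes a process group has already been initialized.
if dist.is_initialized():
    input_ids = torch.tensor([1, 2, 3])
    dist.broadcast(input_ids, src=0)
```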
Request Lifecycle:
1. The user sends a request to the API Server.
2. The API Server forwards it to the Tokenizer.
3. The Tokenizer converts the text into tokens and sends them to the Scheduler (Rank 0).
4. The Scheduler (Rank 0) broadcasts the request to all other Schedulers (when running on multiple GPUs).
5. All Schedulers schedule the request and trigger their local Engine to compute the next token.
6. The Scheduler (Rank 0) collects the output token and sends it to the Detokenizer.
7. The Detokenizer converts the token back into text and sends it to the API Server.
8. The API Server streams the result back to the user (see the client sketch below).
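To exercise this lifecycle end to end, a client only needs to call the OpenAI-compatible endpoint. A minimal sketch follows; the host, port, and model name are assumptions, not values fixed by Mini-SGLang.

```python
# Minimal client for the OpenAI-compatible endpoint; the host, port,
# and model name are assumptions for illustration.
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": False,
    },
)
# The response follows the OpenAI chat-completions schema.
print(resp.json()["choices"][0]["message"]["content"])
```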
The source code is located in `python/minisgl`. Here is a breakdown of the modules for developers:
- `minisgl.core`: Core dataclasses for request/batch state (`Req`, `Batch`) and sampling config (`SamplingParams`).
- `minisgl.distributed`: TP metadata/state helpers (`DistributedInfo`, `set/get/try_get_tp_info`) in `distributed/tp.py`.
- `minisgl.engine`: Per-TP worker runtime (`Engine`, config, graph runner, sampling glue).
- `minisgl.scheduler`: Scheduling pipeline (prefill/decode/table/cache managers + event loop + I/O mixins).
- `minisgl.kvcache`: KV cache manager interfaces and implementations (`NaiveCacheManager`, `RadixCacheManager`).
- `minisgl.neuron`: Neuron model loading and input-building adapters (`model_loader.py`, `inputs.py`).
- `minisgl.message`: Message schemas/serialization used between frontend, scheduler, and tokenizer workers.
- `minisgl.server`: CLI parsing, process launch orchestration, and FastAPI frontend server.
- `minisgl.tokenizer`: Tokenize/detokenize worker implementations.
- `minisgl.llm`: Offline/local Python interface (`LLM`) built on top of the scheduler flow (see the sketch after this list).
- `minisgl.kernel`: Low-level `tvm-ffi` kernels; currently used by the radix cache via the CPU kernel `radix.cpp` (`fast_compare_key`).
- `minisgl.benchmark`: Benchmark client helpers and result processing utilities.
- `minisgl.utils`: Shared utilities (logger, HF config loading, registries, ZMQ wrappers, torch helpers, misc).
- `minisgl.env`: Environment-variable backed runtime knobs.
- `minisgl.shell`: Interactive shell frontend.
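As a hypothetical usage sketch of the offline interface: the `LLM` class name comes from the module list above, but the constructor argument and `generate` method below are assumptions for illustration, not the confirmed `minisgl.llm` API.

```python
# Hypothetical sketch of the offline interface; the constructor argument
# and generate() method are assumptions, not the confirmed minisgl API.
from minisgl.llm import LLM

llm = LLM(model_path="meta-llama/Llama-3.1-8B-Instruct")  # assumed signature
outputs = llm.generate(["What is tensor parallelism?"])   # assumed method
print(outputs[0])
```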
