A lightweight inference framework for Large Language Models.
Mini-SGLang-Neuron is a compact implementation of SGLang, designed to demystify modern LLM serving systems. This repository currently focuses on a Neuron/XLA-oriented runtime while keeping the codebase readable and modular.
- High Performance: Uses practical serving optimizations for high throughput and low latency.
- Lightweight & Readable: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.
- Advanced Optimizations:
  - Radix Cache: Reuses the KV cache for shared prefixes across requests (see the sketch after this list).
  - Chunked Prefill: Reduces peak memory usage for long-context serving.
  - Tensor Parallelism: Scales inference across TP ranks.
  - Kernel Acceleration: Uses low-level kernels where needed (e.g., radix cache key comparison).
  - ...
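To make the radix-cache idea concrete, here is a minimal, hypothetical Python sketch of prefix reuse over a token trie. The `PrefixCache` class, its method names, and the bump-allocator page scheme are illustrative assumptions, not the actual implementation (which uses a radix tree with kernel-accelerated key comparison).

```python
# Hypothetical sketch of KV-cache prefix reuse, NOT the actual implementation.
# Requests that share a token prefix reuse the KV pages cached for that prefix.

class TrieNode:
    def __init__(self) -> None:
        self.children: dict[int, "TrieNode"] = {}  # token id -> child node
        self.kv_page: int = -1                     # index of the KV page for this token

class PrefixCache:
    def __init__(self) -> None:
        self.root = TrieNode()
        self.next_page = 0  # naive bump allocator, for illustration only

    def match_and_insert(self, tokens: list[int]) -> tuple[int, list[int]]:
        """Return (tokens reused from cache, KV pages covering the full sequence)."""
        node, pages, reused = self.root, [], 0
        for t in tokens:
            child = node.children.get(t)
            if child is None:                # miss: allocate a fresh KV page
                child = TrieNode()
                child.kv_page = self.next_page
                self.next_page += 1
                node.children[t] = child
            else:                            # hit: this token's KV is already cached
                reused += 1
            node = child
            pages.append(node.kv_page)
        return reused, pages

cache = PrefixCache()
cache.match_and_insert([1, 2, 3, 4])         # cold start: nothing reused
reused, _ = cache.match_and_insert([1, 2, 3, 9])
print(reused)                                # -> 3: only the KV for token 9 is computed
```

In the second call, the first three tokens hit the cached prefix, so only the KV for the final token needs to be computed.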
⚠️ Platform Support: Mini-SGLang-Neuron currently supports AWS Trainium (Trn) and Inferentia (Inf) only.
You can choose any cloud service that offers AWS Trn/Inf instances. We use Yotta Labs as an example for setting up a Trn1 instance.
- Go to https://www.yottalabs.ai/.
- Click `LAUNCH CONSOLE` at the top right and log in.
- On the left sidebar, click `Compute -> Virtual Machines`.
- Select the region `us-west-2` and the provider `AWS`.
- Choose `Trainium1` and any other desired configuration, then click `Launch` to start the virtual machine.
We recommend using Docker to spin up a container from the official pytorch-inference-neuronx image for fast setup. Below is an example:
```bash
docker run --pull=missing -it --rm \
  --privileged \
  --network host \
  --shm-size=32g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  public.ecr.aws/neuron/pytorch-inference-neuronx:2.9.0-neuronx-py312-sdk2.27.1-ubuntu24.04 \
  bash
```
Install Mini-SGLang directly from source:
```bash
git clone https://github.com/yottalabsai/mini-sglang-neuron.git
cd mini-sglang-neuron && bash init_setup.sh
```

Chat with your model directly in the terminal by adding the `--shell-mode` flag.
```bash
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
python -m minisgl \
  --model-path "Qwen/Qwen3-0.6B" \
  --dtype bfloat16 \
  --tp-size "$TP_SIZE" \
  --max-running-requests 6 \
  --max-seq-len-override 4096 \
  --num-pages 10192 \
  --port 1919 \
  --shell-mode
```

You can also use `/reset` to clear the chat history.
Launch an OpenAI-compatible API server with a single command.
```bash
# Deploy Qwen/Qwen3-0.6B
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
python -m minisgl \
  --model-path "Qwen/Qwen3-0.6B" \
  --dtype bfloat16 \
  --tp-size "$TP_SIZE" \
  --max-running-requests 6 \
  --max-prefill-length 8192 \
  --max-seq-len-override 2048 \
  --num-pages 16384 \
  --port 1919
```

Once the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client.
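For example, with the official `openai` Python client (assuming the server exposes the standard `/v1` chat-completions route; the API key below is a placeholder, since a local server typically does not validate it):

```python
# Query the local OpenAI-compatible server with the openai Python client.
# base_url and the /v1 route are assumptions about the server's API surface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1919/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "hello"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```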
```bash
python3 benchmark/online/simple_call.py --prompt "hello"
```

We use vllm-neuron as the comparison baseline for the profiling tests below.
Installation commands for the vllm-neuron baseline:
```bash
git clone https://github.com/vllm-project/vllm-neuron.git
cd vllm-neuron
pip install --extra-index-url=https://pip.repos.neuron.amazonaws.com -e .
```

See bench.py for more details on mini-sglang-neuron profiling, and bench_vllm_neuron.py for vllm-neuron profiling.
Test Configuration:
- Hardware: trn1.xlarge, two Neuron Cores.
- Model: Qwen3-0.6B
- Total Requests: 256 sequences
- Input Length: Randomly sampled between 100 and 1024 tokens
- Output Length: Randomly sampled between 100 and 1024 tokens
See bench_qwen.py for more details; a simplified sketch of this workload follows.
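The sketch below is hypothetical, with illustrative field names; the actual logic lives in bench_qwen.py.

```python
# Hypothetical sketch of the randomized workload described above:
# 256 requests with input/output lengths drawn uniformly from [100, 1024].
import random

random.seed(0)  # fixed seed so runs are comparable

requests = [
    {
        "input_len": random.randint(100, 1024),   # prompt length in tokens
        "output_len": random.randint(100, 1024),  # generation length in tokens
    }
    for _ in range(256)
]
print(len(requests), requests[0])
```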
Test Configuration:
- Hardware: trn1.xlarge, two Neuron Cores.
- Model: Qwen3-0.6B
- Dataset: Qwen trace, replaying the first 500 requests whose input length is at most 1024 tokens (a simplified replay sketch follows).
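A simplified, hypothetical view of the replay filter is sketched below; the trace path, JSONL format, and field names are assumptions, and the real logic lives in the benchmark script.

```python
# Hypothetical sketch of the trace replay filter: keep the first 500 requests
# whose input is at most 1024 tokens. Trace path, format, and field names are
# assumptions, not the actual bench_qwen.py logic.
import json

MAX_INPUT_LEN = 1024
NUM_REQUESTS = 500

replay = []
with open("qwen_trace.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record["input_len"] <= MAX_INPUT_LEN:
            replay.append(record)
        if len(replay) == NUM_REQUESTS:
            break
print(f"replaying {len(replay)} requests")
```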
Server startup commands for mini-sglang-neuron (see server.sh):
```bash
# mini-sglang-neuron (no radix)
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
python -m minisgl \
  --model-path Qwen/Qwen3-0.6B \
  --dtype bfloat16 \
  --tp-size "$TP_SIZE" \
  --max-running-requests 6 \
  --max-prefill-length 256 \
  --max-seq-len-override 2048 \
  --num-pages 16384 \
  --port 1919 \
  --cache-type naive  # Change to "radix"
```
```bash
# mini-sglang-neuron (radix)
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
python -m minisgl \
  --model-path Qwen/Qwen3-0.6B \
  --dtype bfloat16 \
  --tp-size "$TP_SIZE" \
  --max-running-requests 6 \
  --max-prefill-length 256 \
  --max-seq-len-override 2048 \
  --num-pages 16384 \
  --port 1919 \
  --cache-type radix
```

Server startup commands for vllm-neuron:
```bash
# vllm-neuron (no prefix caching)
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
vllm serve Qwen/Qwen3-0.6B \
  --dtype bfloat16 \
  --tensor-parallel-size "${TP_SIZE}" \
  --max-num-seqs 6 \
  --max-model-len 2048 \
  --port 1919 \
  --max-num-batched-tokens 256 \
  --block-size 128 \
  --num-gpu-blocks-override 6 \
  --no-enable-prefix-caching
```
```bash
# vllm-neuron (prefix caching)
export TP_SIZE=2
export NEURON_RT_NUM_CORES="${TP_SIZE}"
vllm serve Qwen/Qwen3-0.6B \
  --dtype bfloat16 \
  --tensor-parallel-size "${TP_SIZE}" \
  --max-num-seqs 6 \
  --max-model-len 2048 \
  --port 1919 \
  --max-num-batched-tokens 256 \
  --block-size 128 \
  --num-gpu-blocks-override 128
```

Client command:
```bash
python benchmark/online/bench_qwen.py
```

- Detailed Features: Explore all available features and command-line arguments.
- System Architecture: Dive deep into the design and data flow of Mini-SGLang.

