Faster-MoA-PD is an experiment framework for latency-aware multi-agent LLM inference built on top of SGLang PD (prefill/decode) disaggregation.
It combines:
- a patched SGLang server/runtime (src/sglang_ext/),
- a shell router that composes multi-agent dependency graphs (shell_router.py, shell_router_dmv.py),
- experiment drivers for real benchmark datasets (tb_real_dataset_agent_tree_structure*.py),
- and reusable JSON configs under cfgs/.
This repo is focused on comparing two execution styles for tree-structured agent graphs:
- Baseline: conventional orchestration (blocking layer-by-layer orchestration).
- Proposed Faster-MoA path: dependency-aware prompt splicing with PD disaggregation and optional dynamic early-exit logic.
The experiment drivers run both modes (or only one, depending on flags/config) and write summary JSON results for latency and quality-style metrics.
Faster-MoA uses the native SGLang PD disaggregation capabilities to separate prefill and decode servers. Custom routing and orchestration logic lives in the shell router layer: it lets prefill-only requests fetch intermediate dependency outputs and splice them into downstream prompts without waiting for full decode completion.
Each model is typically launched as one prefill server and one decode server, using launch_server.py, which calls the patched HTTP server/runtime in src/sglang_ext/.
Native SGLang router instances sit in front of prefill/decode server pairs.
The shell router:
- parses dependency placeholders like <|answer:agent_x|>,
- streams and caches dependency outputs/token logprobs,
- pre-fills downstream prompts with dependency chunks,
- optionally performs dynamic early-exit evaluation (q_eval) via embedding-based confidence/similarity.
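The placeholder handling listed above can be sketched in a few lines. This is a minimal illustration, not the code in shell_router.py: the regex, the function names (find_dependencies, ready_prefix), and the idea of sending the longest resolvable prefix are assumptions made for the example.

```python
import re

# Matches placeholders of the form <|answer:agent_x|>.
# The allowed character set for agent names is an assumption.
PLACEHOLDER_RE = re.compile(r"<\|answer:([A-Za-z0-9_\-]+)\|>")


def find_dependencies(template: str) -> list[str]:
    """Return the agent names a template refers to, in order, without duplicates."""
    seen: list[str] = []
    for name in PLACEHOLDER_RE.findall(template):
        if name not in seen:
            seen.append(name)
    return seen


def ready_prefix(template: str, outputs: dict[str, str]) -> str:
    """Splice known outputs and stop at the first unresolved placeholder.

    The returned string is the part of the downstream prompt that could
    already be pre-filled while later dependencies are still decoding.
    """
    parts: list[str] = []
    pos = 0
    for match in PLACEHOLDER_RE.finditer(template):
        name = match.group(1)
        parts.append(template[pos:match.start()])
        if name not in outputs:
            return "".join(parts)
        parts.append(outputs[name])
        pos = match.end()
    return "".join(parts) + template[pos:]


if __name__ == "__main__":
    tpl = "Q: 2+2?\nA1: <|answer:agent_a|>\nA2: <|answer:agent_b|>\nFinal:"
    print(find_dependencies(tpl))
    print(repr(ready_prefix(tpl, {"agent_a": "4"})))
```

Stopping at the first unresolved placeholder keeps the pre-filled text a strict prefix of the final prompt, which is what a prefix-based KV cache would need; whether the shell router applies exactly this rule is not stated in this README.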
tb_real_dataset_agent_tree_structure.py and tb_real_dataset_agent_tree_structure_dmv.py:
- load HF datasets,
- build prompts from config templates,
- issue batched requests via shell router,
- compute dataset-specific correctness/extraction,
- write per-run summary JSON artifacts.
.
├── launch_server.py # entrypoint for patched SGLang server
├── shell_router.py # shell router (dynamic early-exit capable)
├── shell_router_dmv.py # shell router variant used by dynamic majority vote (DMV) experiments
├── tb_real_dataset_agent_tree_structure.py # experiment runner
├── tb_real_dataset_agent_tree_structure_dmv.py
├── run_router.sh
├── run_servers_*.sh # launcher scripts for different topologies
├── cfgs/
│ ├── hetero/ # heterogeneous multi-model configs
│ ├── hetero_dmv/ # heterogeneous configs + dynamic mode variants
│ ├── homo/ # homogeneous model configs by dataset
│ └── .isolation/
└── src/
├── sglang_ext/ # patched SGLang runtime/server internals
└── tb_utils/utils.py # embedding + Q evaluator utilities
- Single-model minimal flow (run_servers_all_pd.sh): usually 2 GPUs (prefill + decode).
- Heterogeneous 3-model flow (run_servers_hetero*_pd*.sh): typically 6 GPUs (3 prefill + 3 decode).
python >= 3.10
Create and activate a virtual environment, then install:
python -m venv .new-venv
source .new-venv/bin/activate
pip install -U pip uv
uv pip install "transformers==4.57.0"
uv pip install nixl "sglang==0.5.3" sglang-router datasets --prerelease=allow
(Optional) Download the three Qwen models used in the experiments in advance:
hf download --repo-type model Qwen/Qwen3-VL-4B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-8B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-32B-Instruct
The runners and configs already include flows for:
- GSM8K
- MMLU
- AIME2025
- GPQA-Diamond
- HMMT Feb 2025
- MATH-500
- MMLU-ProX-Lite
- IFBench
Dataset loading is handled through datasets.load_dataset(...), with per-dataset prompt and answer-extraction logic in the runner scripts.
Each config defines:
- dataset
- run_baseline, run_proposed, warmup
- model_dict:
  - shell URL
  - router URL
  - optional prefill URL
- prompt_templates
- layers:
  - list of agents per layer,
  - dependency graph via depend_on,
  - model assignment via model_path,
  - sampling parameters.
A typical tree is 3 layers:
- leaf agents,
- middle aggregators (optionally dynamic EE-enabled),
- final synthesizer.
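The config shape and the three-layer tree described above can be sketched as a Python dict plus a small consistency check. Only the top-level names (dataset, run_baseline, run_proposed, layers, depend_on, model_path) come from this README; the per-agent "name" key, the list-of-lists layout, and the agent names are assumptions for illustration, and warmup, model_dict, prompt_templates, and sampling parameters are omitted because their exact structure is not described here.

```python
# Hypothetical, trimmed-down config for a 3-layer agent tree.
example_cfg = {
    "dataset": "gsm8k",
    "run_baseline": True,
    "run_proposed": True,
    "layers": [
        [  # layer 0: leaf agents
            {"name": "leaf_a", "model_path": "Qwen/Qwen3-VL-4B-Instruct", "depend_on": []},
            {"name": "leaf_b", "model_path": "Qwen/Qwen3-VL-8B-Instruct", "depend_on": []},
        ],
        [  # layer 1: middle aggregator
            {"name": "mid", "model_path": "Qwen/Qwen3-VL-8B-Instruct",
             "depend_on": ["leaf_a", "leaf_b"]},
        ],
        [  # layer 2: final synthesizer
            {"name": "final", "model_path": "Qwen/Qwen3-VL-32B-Instruct",
             "depend_on": ["mid"]},
        ],
    ],
}


def check_layers(layers: list[list[dict]]) -> list[str]:
    """Return a list of problems; an empty list means the tree is consistent.

    Every agent name must be unique, and every dependency must point to an
    agent defined in an earlier layer.
    """
    errors: list[str] = []
    earlier: set[str] = set()
    for depth, layer in enumerate(layers):
        current: set[str] = set()
        for agent in layer:
            name = agent["name"]
            if name in earlier or name in current:
                errors.append(f"duplicate agent name: {name}")
            current.add(name)
            for dep in agent.get("depend_on", []):
                if dep not in earlier:
                    errors.append(f"{name} (layer {depth}) depends on unknown or later agent: {dep}")
        earlier |= current
    return errors


if __name__ == "__main__":
    print(check_layers(example_cfg["layers"]))
```

Running a check like this before launching servers catches misspelled depend_on entries early, which is cheaper than discovering them after a multi-GPU startup.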
./run_servers_all_pd.sh
This script:
- launches prefill/decode servers,
- waits on /health_generate,
- starts sglang_router,
- starts shell_router.py.
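The "wait on /health_generate" step above amounts to polling an HTTP endpoint until it answers. A minimal Python sketch of that pattern follows; the launch scripts themselves are shell, and the function names, timeout values, and the host/port in the usage note are assumptions.

```python
import http.client
import time
import urllib.request


def http_probe(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, malformed reply.
        return False


def wait_until_healthy(url, deadline_s=600.0, interval_s=2.0,
                       probe=http_probe, sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll url until the probe succeeds or deadline_s seconds have elapsed."""
    start = clock()
    while True:
        if probe(url):
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A call such as wait_until_healthy("http://127.0.0.1:30000/health_generate") would block until the server responds; the address is a placeholder and should match whatever port the launch script assigns. The probe, sleep, and clock parameters are injectable only to make the loop easy to test.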
CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh
This flow launches servers and routers, runs the baseline and proposed experiments, then kills the processes.
CONFIG_PATH=./cfgs/hetero/hetero_math500_wo_dmv.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_pd_w_experiment.sh
(Adjust the above parameters as needed for different configs/datasets.)
CONFIG_PATH=./cfgs/hetero_dmv/hetero_math500_dmv_all.json \
NUM_SAMPLES=60 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_dmv_pd_w_experiment.sh
(Adjust the above parameters as needed for different configs/datasets.)
- run_router.sh
  - lightweight command to start sglang_router in PD mode.
- run_servers_all_pd.sh
  - starts one model pair (prefill/decode) and shell router.
- run_servers_all_pd_w_experiment.sh
  - same as above + executes baseline/proposed experiment runs.
- run_servers_hetero_pd.sh
  - starts 3 model pairs (4B / 8B / 32B) and associated routers.
- run_servers_hetero_pd_w_experiment.sh
  - heterogeneous launch + experiment execution.
- run_servers_hetero_dmv_pd.sh / run_servers_hetero_dmv_pd_w_experiment.sh
  - Early Exit/Majority Voting (DMV) enabled variants, including explicit prefill bootstrap ports.
- run_servers_*_runs.sh
  - simple loop wrappers for running multiple configs sequentially.
Commonly used variables across launch scripts and router code:
- CONFIG_PATH (JSON config path)
- NUM_SAMPLES
- QUESTION_BATCH_SIZE
- RUN_BASELINE, RUN_PROPOSED
- MAX_STREAMING_TOKENS, MIN_STREAMING_TOKENS
- CHUNK_IDLE_S, MAX_DEP_WAIT_S
- PREFILL_TRACE
- FORWARD_USE_TEXT
- USE_DYNAMIC_EE, EVAL_FN (DMV script flow)
- Runtime logs are placed under timestamped folders in logs/test_logs_<timestamp>/.
- Experiment summaries are written as JSON files in the repository root (ignored by .gitignore) with names like:
  - *_tb_summary_agent_tree_structure_results.json
  - *_tb_summary_agent_tree_structure_baseline_results.json
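To collect the summary files after a batch of runs, a short helper along these lines can be used. It relies only on the file-name patterns listed above; the helper names are hypothetical, and no assumption is made about the keys inside each JSON file.

```python
import json
from pathlib import Path

# Covers both *_results.json and *_baseline_results.json summaries.
SUMMARY_GLOB = "*_tb_summary_agent_tree_structure*_results.json"


def is_baseline(filename: str) -> bool:
    """True for baseline summaries, False for proposed-path summaries."""
    return filename.endswith("_baseline_results.json")


def load_summaries(root: str = ".") -> dict[str, object]:
    """Map each summary file name under root to its parsed JSON content."""
    summaries: dict[str, object] = {}
    for path in sorted(Path(root).glob(SUMMARY_GLOB)):
        with path.open("r", encoding="utf-8") as f:
            summaries[path.name] = json.load(f)
    return summaries


if __name__ == "__main__":
    for name, data in load_summaries(".").items():
        kind = "baseline" if is_baseline(name) else "proposed"
        top = sorted(data) if isinstance(data, dict) else type(data).__name__
        print(f"{name} [{kind}]: {top}")
```

Printing only the top-level keys is a deliberate first step: it shows what each run recorded before any latency or accuracy comparison is written against specific field names.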
- Scripts use pkill -f python in some flows; on shared machines this can kill unrelated Python processes.
- Some scripts reference cfgs/homo/*; ensure those config folders exist in your checkout before batch runs.
- Several scripts default to fixed GPU IDs (CUDA_VISIBLE_DEVICES=0..5) and ports; adapt if your node layout differs.
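On shared machines, one alternative to pkill -f python is to track the processes you start and stop only those. The sketch below shows that pattern on POSIX systems; it is not what the repository scripts do, and launch and stop_all are hypothetical helpers.

```python
import os
import signal
import subprocess
import sys


def launch(cmd: list[str]) -> subprocess.Popen:
    """Start cmd in its own session so it leads a separate process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_all(procs: list[subprocess.Popen], grace_s: float = 10.0) -> None:
    """Send SIGTERM to each tracked process group, then SIGKILL any that linger."""
    for proc in procs:
        if proc.poll() is None:
            try:
                # With start_new_session=True the group id equals the child's pid.
                os.killpg(proc.pid, signal.SIGTERM)
            except ProcessLookupError:
                pass  # already gone
    for proc in procs:
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass
            proc.wait()


if __name__ == "__main__":
    # Stand-in workload; a real run would launch the server and router commands.
    procs = [launch([sys.executable, "-c", "import time; time.sleep(60)"]) for _ in range(2)]
    stop_all(procs, grace_s=5.0)
    print([p.returncode for p in procs])
```

Signalling the process group rather than a single PID also reaches children spawned by the launched command, while leaving other users' Python processes untouched.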
- Create Python env and install dependencies.
- Verify GPUs are visible (nvidia-smi).
- Pick one config under cfgs/.
- Run matching launch script (run_servers_all_* or run_servers_hetero_*).
- Inspect logs/test_logs_<timestamp>/ for server/router/experiment logs.
- Inspect generated summary JSON for metrics.
