Faster-MoA-PD

Faster-MoA-PD is an experiment framework for latency-aware multi-agent LLM inference built on top of SGLang PD (prefill/decode) disaggregation.

It combines:

a patched SGLang server/runtime (src/sglang_ext/),
a shell router that composes multi-agent dependency graphs (shell_router.py, shell_router_dmv.py),
experiment drivers for real benchmark datasets (tb_real_dataset_agent_tree_structure*.py),
and reusable JSON configs under cfgs/.

This repo is focused on comparing two execution styles for tree-structured agent graphs:

Baseline: conventional blocking, layer-by-layer orchestration.
Proposed Faster-MoA path: dependency-aware prompt splicing with PD disaggregation and optional dynamic early-exit logic.

The experiment drivers run both modes (or only one, depending on flags/config) and write summary JSON results for latency and quality-style metrics.

High-level architecture

The PD disaggregation workflow in Faster-MoA is outlined above. It uses SGLang's native PD disaggregation to separate prefill and decode servers, and implements custom routing and orchestration logic in the shell router layer. This lets prefill-only requests fetch intermediate dependency outputs and splice them into downstream prompts without waiting for full decode completion.

1) Inference servers (PD disaggregation)

Each model is typically launched as one prefill server and one decode server, using launch_server.py, which calls the patched HTTP server/runtime in src/sglang_ext/.

2) `sglang_router`

Native SGLang router instances sit in front of prefill/decode server pairs.

3) Shell router (`shell_router.py` / `shell_router_dmv.py`)

The shell router:

parses dependency placeholders like <|answer:agent_x|>,
streams and caches dependency outputs/token logprobs,
pre-fills downstream prompts with dependency chunks,
optionally performs dynamic early-exit evaluation (q_eval) via embedding-based confidence/similarity.
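
As a rough illustration of the placeholder handling described above, the sketch below finds `<|answer:agent_x|>` placeholders in a prompt and splices in whichever dependency outputs are already available. The regex, the function names (`find_dependencies`, `splice_outputs`), and the sample template are hypothetical and are not taken from `shell_router.py`; the real router works on streamed chunks rather than complete strings.

```python
import re

# Matches placeholders such as <|answer:agent_1|> and captures the agent name.
# The allowed character set for agent names is an assumption.
PLACEHOLDER_RE = re.compile(r"<\|answer:([A-Za-z0-9_\-]+)\|>")


def find_dependencies(prompt: str) -> list:
    """Return the agent names referenced in a prompt, in order, without duplicates."""
    seen = {}
    for name in PLACEHOLDER_RE.findall(prompt):
        seen.setdefault(name, None)
    return list(seen)


def splice_outputs(prompt: str, outputs: dict) -> str:
    """Replace each placeholder whose output is known; leave the others untouched."""

    def _substitute(match):
        return outputs.get(match.group(1), match.group(0))

    return PLACEHOLDER_RE.sub(_substitute, prompt)


template = "Combine these drafts.\nA: <|answer:agent_1|>\nB: <|answer:agent_2|>"
print(find_dependencies(template))
print(splice_outputs(template, {"agent_1": "x = 4"}))
```

Leaving unresolved placeholders in place means the same function can be called again once later dependency outputs arrive.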

4) Experiment runner

tb_real_dataset_agent_tree_structure.py and tb_real_dataset_agent_tree_structure_dmv.py:

load HF datasets,
build prompts from config templates,
issue batched requests via shell router,
compute dataset-specific correctness/extraction,
write per-run summary JSON artifacts.

Repository layout

.
├── launch_server.py                          # entrypoint for patched SGLang server
├── shell_router.py                           # shell router (dynamic early-exit capable)
├── shell_router_dmv.py                       # shell router variant used by dynamic majority vote (DMV) experiments
├── tb_real_dataset_agent_tree_structure.py   # experiment runner
├── tb_real_dataset_agent_tree_structure_dmv.py
├── run_router.sh
├── run_servers_*.sh                          # launcher scripts for different topologies
├── cfgs/
│   ├── hetero/                               # heterogeneous multi-model configs
│   ├── hetero_dmv/                           # heterogeneous configs + dynamic mode variants
│   ├── homo/                                 # homogeneous model configs by dataset
│   └── .isolation/
└── src/
    ├── sglang_ext/                           # patched SGLang runtime/server internals
    └── tb_utils/utils.py                     # embedding + Q evaluator utilities

Requirements

Hardware

Single-model minimal flow (run_servers_all_pd.sh): usually 2 GPUs (prefill + decode).
Heterogeneous 3-model flow (run_servers_hetero*_pd*.sh): typically 6 GPUs (3 prefill + 3 decode).

Software dependencies

python >= 3.10

Create and activate a virtual environment, then install:

python -m venv .new-venv
source .new-venv/bin/activate
pip install -U pip uv

uv pip install "transformers==4.57.0"
uv pip install nixl "sglang==0.5.3" sglang-router datasets --prerelease=allow

(Optional) Download the three Qwen models used in the experiments in advance:

hf download --repo-type model Qwen/Qwen3-VL-4B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-8B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-32B-Instruct

Dataset/Test configurations

The runners and configs already include flows for:

GSM8K
MMLU
AIME2025
GPQA-Diamond
HMMT Feb 2025
MATH-500
MMLU-ProX-Lite
IFBench

Dataset loading is handled through datasets.load_dataset(...), with per-dataset prompt and answer-extraction logic in the runner scripts.
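
To make the idea of per-dataset answer extraction concrete, here is a minimal sketch for GSM8K-style numeric answers. It relies on the fact that GSM8K reference solutions end with `#### <final answer>`; the "take the last number in the model output" heuristic and all function names are assumptions for illustration, not the logic used in the runner scripts.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_gsm8k_reference(answer_field: str) -> str:
    """Return the final answer that follows '####' in a GSM8K reference solution."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def extract_last_number(model_output: str) -> Optional[str]:
    """Heuristic (assumption): treat the last number in the output as the prediction."""
    matches = NUMBER_RE.findall(model_output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(model_output: str, answer_field: str) -> bool:
    """Compare prediction and reference numerically."""
    predicted = extract_last_number(model_output)
    if predicted is None:
        return False
    try:
        return float(predicted) == float(extract_gsm8k_reference(answer_field))
    except ValueError:
        return False


# With the `datasets` package installed, rows could come from something like
# datasets.load_dataset("openai/gsm8k", "main", split="test"); this call is
# shown only as a comment so the sketch runs offline.
print(is_correct("So the answer is 18.", "9 * 2 = 18\n#### 18"))
```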

Custom configuration files (`cfgs/*.json`)

Each config defines:

dataset
run_baseline, run_proposed, warmup
model_dict:
- shell URL
- router URL
- optional prefill URL
prompt_templates
layers:
- list of agents per layer,
- dependency graph via depend_on,
- model assignment via model_path,
- sampling parameters.

A typical tree is 3 layers:

leaf agents,
middle aggregators (optionally dynamic EE-enabled),
final synthesizer.
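
The sketch below shows what a 3-layer tree might look like when written as a Python dict, plus a small check that every `depend_on` entry points to an agent in an earlier layer. Only the top-level field names come from the list above; the agent `name` key, the URL key names, the ports, the template text, and the `validate_layers` helper are assumptions and may not match the actual JSON schema in `cfgs/`.

```python
MODEL = "Qwen/Qwen3-VL-4B-Instruct"

# Hypothetical config sketch; nesting and key names below the top level are assumptions.
config = {
    "dataset": "gsm8k",
    "run_baseline": True,
    "run_proposed": True,
    "warmup": 1,
    "model_dict": {
        MODEL: {
            "shell_url": "http://127.0.0.1:9000",   # placeholder port
            "router_url": "http://127.0.0.1:8000",  # placeholder port
        },
    },
    "prompt_templates": {
        "leaf": "Solve: {question}",
        "aggregate": "Merge: <|answer:leaf_1|> and <|answer:leaf_2|>",
        "final": "Give the final answer based on: <|answer:mid_1|>",
    },
    "layers": [
        [
            {"name": "leaf_1", "depend_on": [], "model_path": MODEL},
            {"name": "leaf_2", "depend_on": [], "model_path": MODEL},
        ],
        [{"name": "mid_1", "depend_on": ["leaf_1", "leaf_2"], "model_path": MODEL}],
        [{"name": "final", "depend_on": ["mid_1"], "model_path": MODEL}],
    ],
}


def validate_layers(layers: list) -> list:
    """Check that each agent depends only on agents from earlier layers.

    Returns the agent names in layer order, or raises ValueError.
    """
    seen = []
    for index, layer in enumerate(layers):
        for agent in layer:
            missing = [dep for dep in agent["depend_on"] if dep not in seen]
            if missing:
                raise ValueError(
                    f"layer {index}: agent {agent['name']!r} depends on unknown {missing}"
                )
        # Agents become visible only to later layers, not to their own layer.
        seen.extend(agent["name"] for agent in layer)
    return seen


print(validate_layers(config["layers"]))
```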

Quick start

A) 2-GPU single-model PD run (interactive shell router)

./run_servers_all_pd.sh

This script:

launches prefill/decode servers,
waits on /health_generate,
starts sglang_router,
starts shell_router.py.
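
The "wait on `/health_generate`" step can be sketched in Python as a polling loop. This is not the launch script's implementation (which is a shell script); the function names, the timeout values, and the example port are assumptions. The probe is injectable so the loop can be exercised without a running server.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and HTTP errors all count as "not ready".
        return False


def wait_until_healthy(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    max_wait_s: float = 600.0,
    interval_s: float = 2.0,
) -> bool:
    """Poll url until the probe succeeds or max_wait_s elapses."""
    deadline = time.monotonic() + max_wait_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example usage (the port is a placeholder, adjust to your launch script):
# ready = wait_until_healthy("http://127.0.0.1:30000/health_generate")
```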

B) Single-model experiment run

CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh

This flow launches servers + routers, runs baseline and proposed experiments, then kills processes.

C) Heterogeneous 3-model experiment run

CONFIG_PATH=./cfgs/hetero/hetero_math500_wo_dmv.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_pd_w_experiment.sh

(Adjust the above parameters as needed for different configs/datasets.)

D) Dynamic Early-Exit/Majority Voting (DMV)-enabled heterogeneous run

CONFIG_PATH=./cfgs/hetero_dmv/hetero_math500_dmv_all.json \
NUM_SAMPLES=60 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_dmv_pd_w_experiment.sh

(Adjust the above parameters as needed for different configs/datasets.)

Important scripts

run_router.sh
- lightweight command to start sglang_router in PD mode.
run_servers_all_pd.sh
- starts one model pair (prefill/decode) and shell router.
run_servers_all_pd_w_experiment.sh
- same as above + executes baseline/proposed experiment runs.
run_servers_hetero_pd.sh
- starts 3 model pairs (4B / 8B / 32B) and associated routers.
run_servers_hetero_pd_w_experiment.sh
- heterogeneous launch + experiment execution.
run_servers_hetero_dmv_pd.sh / run_servers_hetero_dmv_pd_w_experiment.sh
- Dynamic Early-Exit/Majority Voting (DMV)-enabled variants, including explicit prefill bootstrap ports.
run_servers_*_runs.sh
- simple loop wrappers for running multiple configs sequentially.

Key environment variables

Commonly used variables across launch scripts and router code:

CONFIG_PATH (JSON config path)
NUM_SAMPLES
QUESTION_BATCH_SIZE
RUN_BASELINE, RUN_PROPOSED
MAX_STREAMING_TOKENS, MIN_STREAMING_TOKENS
CHUNK_IDLE_S, MAX_DEP_WAIT_S
PREFILL_TRACE
FORWARD_USE_TEXT
USE_DYNAMIC_EE, EVAL_FN (DMV script flow)
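
As an illustration of how a few of these variables could be read on the Python side, the sketch below collects them into a small settings object. The default values (borrowed from the quick-start commands), the accepted boolean spellings, and the `RunSettings` / `load_settings` names are assumptions; the actual scripts may parse these variables differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Which strings count as "true" is an assumption, not the repo's rule.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Parse a boolean-like environment variable, falling back to default."""
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in _TRUE_VALUES


@dataclass(frozen=True)
class RunSettings:
    config_path: str
    num_samples: int
    question_batch_size: int
    run_baseline: bool
    run_proposed: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> RunSettings:
    """Build settings from the environment (or from any mapping, for testing)."""
    env = os.environ if env is None else env
    return RunSettings(
        config_path=env.get("CONFIG_PATH", "./cfgs/.isolation/real_dataset_tb_configs.json"),
        num_samples=int(env.get("NUM_SAMPLES", "85")),          # placeholder default
        question_batch_size=int(env.get("QUESTION_BATCH_SIZE", "1")),
        run_baseline=env_bool(env, "RUN_BASELINE", True),
        run_proposed=env_bool(env, "RUN_PROPOSED", True),
    )


print(load_settings({"NUM_SAMPLES": "60", "RUN_BASELINE": "0"}))
```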

Outputs and logs

Runtime logs are placed under timestamped folders in logs/test_logs_<timestamp>/.
Experiment summaries are written as JSON files in the repository root (ignored by .gitignore), with names like:
- *_tb_summary_agent_tree_structure_results.json
- *_tb_summary_agent_tree_structure_baseline_results.json
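
A small helper along the following lines can gather the summary files for inspection. It only assumes the file name patterns listed above; the structure of the JSON content is not specified here, so the sketch simply returns the parsed objects and makes no claim about which metric keys they contain. The function names are hypothetical.

```python
import json
from pathlib import Path

# Covers both *_tb_summary_agent_tree_structure_results.json and
# *_tb_summary_agent_tree_structure_baseline_results.json.
SUMMARY_PATTERN = "*_tb_summary_agent_tree_structure_*results.json"


def load_summaries(root: str = ".") -> dict:
    """Map each summary file name under root to its parsed JSON content."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(root).glob(SUMMARY_PATTERN))
    }


def is_baseline(file_name: str) -> bool:
    """Baseline summaries are distinguished by their file name suffix."""
    return file_name.endswith("_baseline_results.json")


if __name__ == "__main__":
    for name, summary in load_summaries(".").items():
        mode = "baseline" if is_baseline(name) else "proposed"
        print(f"{name} [{mode}]: {type(summary).__name__}")
```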

Notes / caveats

Some flows call pkill -f python; on shared machines this can kill unrelated Python processes.
Some scripts reference cfgs/homo/*; ensure those config folders exist in your checkout before batch runs.
Several scripts default to fixed GPU IDs (CUDA_VISIBLE_DEVICES=0..5) and ports; adapt if your node layout differs.
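
On shared machines, one way to avoid the broad `pkill -f python` is to keep handles to the processes you launched and stop only those. The sketch below is an alternative approach, not what the repo's scripts do; `stop_all` is a hypothetical helper. It also does not track worker processes that a server might spawn on its own, so those may need separate handling.

```python
import subprocess
import sys


def stop_all(processes: list, grace_s: float = 10.0) -> None:
    """Terminate only the given child processes, escalating to kill if needed."""
    for proc in processes:
        if proc.poll() is None:  # still running
            proc.terminate()
    for proc in processes:
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-in for a server process; a real flow would launch the servers here.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    stop_all([child], grace_s=5.0)
    print("exit code:", child.returncode)
```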

Minimal reproducible run checklist

Create Python env and install dependencies.
Verify GPUs are visible (nvidia-smi).
Pick one config under cfgs/.
Run matching launch script (run_servers_all_* or run_servers_hetero_*).
Inspect logs/test_logs_<timestamp>/ for server/router/experiment logs.
Inspect generated summary JSON for metrics.
