Faster-MoA-PD

Faster-MoA-PD is an experiment framework for latency-aware multi-agent LLM inference built on top of SGLang PD (prefill/decode) disaggregation.

It combines:

  • a patched SGLang server/runtime (src/sglang_ext/),
  • a shell router that composes multi-agent dependency graphs (shell_router.py, shell_router_dmv.py),
  • experiment drivers for real benchmark datasets (tb_real_dataset_agent_tree_structure*.py),
  • and reusable JSON configs under cfgs/.

This repo is focused on comparing two execution styles for tree-structured agent graphs:

  1. Baseline: conventional blocking orchestration, executed layer by layer.
  2. Proposed Faster-MoA path: dependency-aware prompt splicing with PD disaggregation and optional dynamic early-exit logic.

The experiment drivers run both modes (or only one, depending on flags/config) and write summary JSON results for latency and quality-style metrics.
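To make the difference between the two execution styles concrete, here is a toy latency model. It is only a sketch: the agent names and durations are invented, and it captures just the removal of the layer barrier, not the prefill/decode overlap or early exit that the proposed path also relies on.

```python
from typing import Dict, List, Tuple

# Hypothetical tree: agent name -> (run time in seconds, dependencies).
AGENTS: Dict[str, Tuple[float, List[str]]] = {
    "leaf_a": (2.0, []),
    "leaf_b": (5.0, []),
    "leaf_c": (1.0, []),
    "leaf_d": (1.5, []),
    "mid_1": (1.0, ["leaf_a", "leaf_b"]),
    "mid_2": (4.0, ["leaf_c", "leaf_d"]),
    "final": (2.0, ["mid_1", "mid_2"]),
}
LAYERS: List[List[str]] = [
    ["leaf_a", "leaf_b", "leaf_c", "leaf_d"],
    ["mid_1", "mid_2"],
    ["final"],
]


def layer_barrier_latency(layers, agents) -> float:
    """Baseline style: a layer starts only after the whole previous layer is done."""
    return sum(max(agents[name][0] for name in layer) for layer in layers)


def dependency_aware_latency(agents) -> float:
    """An agent starts as soon as its own dependencies have finished."""
    finish: Dict[str, float] = {}

    def finish_time(name: str) -> float:
        if name not in finish:
            duration, deps = agents[name]
            start = max((finish_time(dep) for dep in deps), default=0.0)
            finish[name] = start + duration
        return finish[name]

    return max(finish_time(name) for name in agents)


print(layer_barrier_latency(LAYERS, AGENTS))   # 11.0
print(dependency_aware_latency(AGENTS))        # 8.0
```

With these made-up durations, the barrier schedule waits for the slowest leaf before starting mid_2, even though mid_2 does not depend on it. That wait accounts for the 3-second gap.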

High-level architecture

Implementation Diagram

The implementation diagram above shows the PD disaggregation workflow in Faster-MoA. The framework uses SGLang's native PD disaggregation to separate prefill and decode servers, and implements custom routing and orchestration logic in the shell router layer. This lets prefill-only requests fetch intermediate dependency outputs and splice them into downstream prompts without waiting for the upstream decode to complete.

1) Inference servers (PD disaggregation)

Each model is typically launched as one prefill server and one decode server, using launch_server.py, which calls the patched HTTP server/runtime in src/sglang_ext/.

2) sglang_router

Native SGLang router instances sit in front of prefill/decode server pairs.

3) Shell router (shell_router.py / shell_router_dmv.py)

The shell router:

  • parses dependency placeholders like <|answer:agent_x|>,
  • streams and caches dependency outputs/token logprobs,
  • pre-fills downstream prompts with dependency chunks,
  • optionally performs dynamic early-exit evaluation (q_eval) via embedding-based confidence/similarity.
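The placeholder handling listed above can be sketched as follows. This is a minimal illustration, not the code in shell_router.py: the function names, the set of characters allowed in an agent name, and the idea of a "ready prefix" are all assumptions made for the example.

```python
import re
from typing import Dict, List

# Matches placeholders such as <|answer:agent_x|>.
# The allowed name characters are an assumption for this sketch.
PLACEHOLDER = re.compile(r"<\|answer:([A-Za-z0-9_\-]+)\|>")


def find_dependencies(prompt: str) -> List[str]:
    """Return referenced agent names in order of first appearance."""
    seen: List[str] = []
    for name in PLACEHOLDER.findall(prompt):
        if name not in seen:
            seen.append(name)
    return seen


def splice(prompt: str, outputs: Dict[str, str]) -> str:
    """Replace resolved placeholders; leave unresolved ones untouched."""
    return PLACEHOLDER.sub(
        lambda m: outputs.get(m.group(1), m.group(0)), prompt
    )


def ready_prefix(prompt: str, outputs: Dict[str, str]) -> str:
    """Return the part of the prompt that could be prefilled right now,
    i.e. everything before the first placeholder with no cached output."""
    for match in PLACEHOLDER.finditer(prompt):
        if match.group(1) not in outputs:
            return splice(prompt[: match.start()], outputs)
    return splice(prompt, outputs)


template = "A: <|answer:agent_1|>\nB: <|answer:agent_2|>\nCombine."
print(find_dependencies(template))
print(repr(ready_prefix(template, {"agent_1": "4"})))
```

The `ready_prefix` helper illustrates why splicing helps latency: the text up to the first missing dependency can already be sent for prefill while the remaining dependencies are still decoding.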

4) Experiment runner

tb_real_dataset_agent_tree_structure.py and tb_real_dataset_agent_tree_structure_dmv.py:

  • load HF datasets,
  • build prompts from config templates,
  • issue batched requests via shell router,
  • compute dataset-specific correctness/extraction,
  • write per-run summary JSON artifacts.

Repository layout

.
├── launch_server.py                          # entrypoint for patched SGLang server
├── shell_router.py                           # shell router (dynamic early-exit capable)
├── shell_router_dmv.py                       # shell router variant used by dynamic majority vote (DMV) experiments
├── tb_real_dataset_agent_tree_structure.py   # experiment runner
├── tb_real_dataset_agent_tree_structure_dmv.py
├── run_router.sh
├── run_servers_*.sh                          # launcher scripts for different topologies
├── cfgs/
│   ├── hetero/                               # heterogeneous multi-model configs
│   ├── hetero_dmv/                           # heterogeneous configs + dynamic mode variants
│   ├── homo/                                 # homogeneous model configs by dataset
│   └── .isolation/
└── src/
    ├── sglang_ext/                           # patched SGLang runtime/server internals
    └── tb_utils/utils.py                     # embedding + Q evaluator utilities

Requirements

Hardware

  • Single-model minimal flow (run_servers_all_pd.sh): usually 2 GPUs (prefill + decode).
  • Heterogeneous 3-model flow (run_servers_hetero*_pd*.sh): typically 6 GPUs (3 prefill + 3 decode).

Software dependencies

python >= 3.10

Create and activate a virtual environment, then install:

python -m venv .new-venv
source .new-venv/bin/activate
pip install -U pip uv

uv pip install "transformers==4.57.0"
uv pip install nixl "sglang==0.5.3" sglang-router datasets --prerelease=allow

(Optional) Download the three Qwen models used in the experiments in advance:

hf download --repo-type model Qwen/Qwen3-VL-4B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-8B-Instruct
hf download --repo-type model Qwen/Qwen3-VL-32B-Instruct

Dataset/Test configurations

The runners and configs already include flows for:

  • GSM8K
  • MMLU
  • AIME2025
  • GPQA-Diamond
  • HMMT Feb 2025
  • MATH-500
  • MMLU-ProX-Lite
  • IFBench

Dataset loading is handled through datasets.load_dataset(...), with per-dataset prompt and answer-extraction logic in the runner scripts.
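As an illustration of what per-dataset answer extraction can look like for math-style datasets, the sketch below pulls the last `\boxed{...}` expression from a model response and falls back to the last number in the text. The function names and the fallback rule are assumptions for this example; the runner scripts may use different logic per dataset.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected).strip()
        collected.append(ch)
    return None  # braces never closed


def extract_last_number(text: str) -> Optional[str]:
    """Fallback: the last integer or decimal that appears in the text."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    return numbers[-1] if numbers else None


def is_correct(response: str, gold: str) -> bool:
    """Compare numerically when possible, otherwise as stripped strings."""
    predicted = extract_boxed(response) or extract_last_number(response)
    if predicted is None:
        return False
    try:
        return float(predicted) == float(gold)
    except ValueError:
        return predicted.strip() == gold.strip()
```

Multiple-choice datasets such as MMLU or GPQA-Diamond would need a different extractor (for example, matching a single option letter), so this sketch only covers the numeric case.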

Custom configuration files (cfgs/*.json)

Each config defines:

  • dataset
  • run_baseline, run_proposed, warmup
  • model_dict:
    • shell URL
    • router URL
    • optional prefill URL
  • prompt_templates
  • layers:
    • list of agents per layer,
    • dependency graph via depend_on,
    • model assignment via model_path,
    • sampling parameters.

A typical tree is 3 layers:

  1. leaf agents,
  2. middle aggregators (optionally dynamic EE-enabled),
  3. final synthesizer.
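The sketch below builds a 3-layer tree of this shape as a Python dictionary and checks that every dependency points to an agent in an earlier layer. Only the keys named above (dataset, run_baseline, run_proposed, warmup, model_dict, prompt_templates, layers, depend_on, model_path) come from this README. Every other key, all values, the URLs, and the list-of-lists layout for layers are hypothetical, so compare against a real file under cfgs/ before reusing it.

```python
import json
from typing import Dict, List, Sequence, Set

MODEL = "Qwen/Qwen3-VL-4B-Instruct"


def agent(name: str, depend_on: Sequence[str] = ()) -> Dict:
    """Build one agent entry; 'name' and 'sampling_params' are assumed keys."""
    return {
        "name": name,
        "model_path": MODEL,
        "depend_on": list(depend_on),
        "sampling_params": {"temperature": 0.7},
    }


config = {
    "dataset": "gsm8k",            # hypothetical dataset identifier
    "run_baseline": True,
    "run_proposed": True,
    "warmup": True,                # value type is an assumption
    "model_dict": {
        MODEL: {                   # hypothetical key names and ports
            "shell_url": "http://127.0.0.1:9000",
            "router_url": "http://127.0.0.1:8000",
        }
    },
    "prompt_templates": {
        "aggregate": "Candidates:\n<|answer:leaf_1|>\n<|answer:leaf_2|>",
    },
    "layers": [
        [agent("leaf_1"), agent("leaf_2"), agent("leaf_3"), agent("leaf_4")],
        [agent("mid_1", ["leaf_1", "leaf_2"]),
         agent("mid_2", ["leaf_3", "leaf_4"])],
        [agent("final", ["mid_1", "mid_2"])],
    ],
}


def validate_layers(layers: List[List[Dict]]) -> Set[str]:
    """Ensure each depend_on entry names an agent from an earlier layer."""
    seen: Set[str] = set()
    for depth, layer in enumerate(layers):
        for entry in layer:
            for dep in entry.get("depend_on", []):
                if dep not in seen:
                    raise ValueError(
                        f"layer {depth}: {entry['name']} depends on "
                        f"unknown or later agent {dep!r}"
                    )
        seen.update(entry["name"] for entry in layer)
    return seen


validate_layers(config["layers"])
print(json.dumps(config, indent=2)[:200])
```

Running a check like this before launching servers catches a misspelled agent name in seconds rather than after model loading.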

Quick start

A) 2-GPU single-model PD run (interactive shell router)

./run_servers_all_pd.sh

This script:

  1. launches prefill/decode servers,
  2. waits on /health_generate,
  3. starts sglang_router,
  4. starts shell_router.py.
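Step 2, waiting on /health_generate, amounts to polling each server until it answers successfully. The launch script does this in shell; the following Python sketch shows the same idea for anyone driving the servers from a script. The function names, the timeout values, and the port in the usage comment are assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(
    base_url: str,
    probe: Callable[[str], bool] = http_ok,
    deadline_s: float = 600.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll <base_url>/health_generate until it succeeds or the deadline passes."""
    url = base_url.rstrip("/") + "/health_generate"
    end = clock() + deadline_s
    while clock() < end:
        if probe(url):
            return True
        sleep(interval_s)
    return False


# Example (hypothetical port):
# wait_until_healthy("http://127.0.0.1:30000")
```

The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a running server.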

B) Single-model experiment run

CONFIG_PATH=./cfgs/.isolation/real_dataset_tb_configs.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_all_pd_w_experiment.sh

This flow launches servers + routers, runs baseline and proposed experiments, then kills processes.

C) Heterogeneous 3-model experiment run

CONFIG_PATH=./cfgs/hetero/hetero_math500_wo_dmv.json \
NUM_SAMPLES=85 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_pd_w_experiment.sh

(Adjust the above parameters as needed for different configs/datasets.)

D) Dynamic Early-Exit/Majority Voting (DMV)-enabled heterogeneous run

CONFIG_PATH=./cfgs/hetero_dmv/hetero_math500_dmv_all.json \
NUM_SAMPLES=60 \
QUESTION_BATCH_SIZE=1 \
./run_servers_hetero_dmv_pd_w_experiment.sh

(Adjust the above parameters as needed for different configs/datasets.)
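For intuition about the dynamic early-exit decision this mode enables, the sketch below checks whether the answers collected so far agree closely enough in embedding space to stop waiting for the rest. It is a simplified stand-in, not the repository's q_eval: the mean pairwise cosine similarity rule, the threshold, and the function names are all assumptions, and the embeddings are assumed to come from some external embedding model.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def should_exit_early(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
    min_answers: int = 2,
) -> bool:
    """Exit once enough answers are in and they agree on average."""
    if len(embeddings) < max(2, min_answers):
        return False
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return sum(sims) / len(sims) >= threshold


# Two near-identical answer embeddings agree; orthogonal ones do not.
print(should_exit_early([[1.0, 0.0], [1.0, 0.01]]))  # True
print(should_exit_early([[1.0, 0.0], [0.0, 1.0]]))   # False
```

A stricter threshold trades latency for safety: fewer early exits, but less risk of stopping on answers that merely look alike.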

Important scripts

  • run_router.sh

    • lightweight command to start sglang_router in PD mode.
  • run_servers_all_pd.sh

    • starts one model pair (prefill/decode) and shell router.
  • run_servers_all_pd_w_experiment.sh

    • same as above + executes baseline/proposed experiment runs.
  • run_servers_hetero_pd.sh

    • starts 3 model pairs (4B / 8B / 32B) and associated routers.
  • run_servers_hetero_pd_w_experiment.sh

    • heterogeneous launch + experiment execution.
  • run_servers_hetero_dmv_pd.sh / run_servers_hetero_dmv_pd_w_experiment.sh

    • Dynamic Early-Exit/Majority Voting (DMV)-enabled variants, including explicit prefill bootstrap ports.
  • run_servers_*_runs.sh

    • simple loop wrappers for running multiple configs sequentially.

Key environment variables

Commonly used variables across launch scripts and router code:

  • CONFIG_PATH (JSON config path)
  • NUM_SAMPLES
  • QUESTION_BATCH_SIZE
  • RUN_BASELINE, RUN_PROPOSED
  • MAX_STREAMING_TOKENS, MIN_STREAMING_TOKENS
  • CHUNK_IDLE_S, MAX_DEP_WAIT_S
  • PREFILL_TRACE
  • FORWARD_USE_TEXT
  • USE_DYNAMIC_EE, EVAL_FN (DMV script flow)
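When wrapping the launch scripts from Python, it can help to read these variables into one typed object. The sketch below covers only a few of them. The defaults, the accepted boolean spellings, and the `RunSettings` class are assumptions for illustration; the scripts themselves may parse and default these values differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def parse_flag(value: str) -> bool:
    """Interpret common truthy spellings; the accepted set is an assumption."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RunSettings:
    config_path: str
    num_samples: int
    question_batch_size: int
    run_baseline: bool
    run_proposed: bool


def load_settings(env: Mapping[str, str] = os.environ) -> RunSettings:
    """Build settings from an environment mapping.

    CONFIG_PATH is required here; the other defaults are placeholders that
    mirror the quick-start examples, not values taken from the scripts.
    """
    return RunSettings(
        config_path=env["CONFIG_PATH"],
        num_samples=int(env.get("NUM_SAMPLES", "85")),
        question_batch_size=int(env.get("QUESTION_BATCH_SIZE", "1")),
        run_baseline=parse_flag(env.get("RUN_BASELINE", "1")),
        run_proposed=parse_flag(env.get("RUN_PROPOSED", "1")),
    )


settings = load_settings({"CONFIG_PATH": "./cfgs/example.json",
                          "RUN_BASELINE": "0"})
print(settings)
```

Passing a plain dictionary instead of `os.environ`, as in the last lines, makes it easy to check a combination of settings before exporting anything.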

Outputs and logs

  • Runtime logs are placed under timestamped folders in logs/test_logs_<timestamp>/.
  • Experiment summaries are written as JSON files in the repository root (ignored via .gitignore), with names like:
    • *_tb_summary_agent_tree_structure_results.json
    • *_tb_summary_agent_tree_structure_baseline_results.json

Notes / caveats

  • Scripts use pkill -f python in some flows; on shared machines this can kill unrelated Python processes.
  • Some scripts reference cfgs/homo/*; ensure those config folders exist in your checkout before batch runs.
  • Several scripts default to fixed GPU IDs (CUDA_VISIBLE_DEVICES=0..5) and ports; adapt if your node layout differs.
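On a shared machine, one way to avoid the pkill -f python problem is to keep handles to the processes you start and stop only those. The sketch below shows that pattern with the standard library. It is an alternative for your own wrappers, not what the repository scripts do, and the helper names are made up for this example.

```python
import subprocess
import sys
from typing import List, Sequence


def launch(commands: Sequence[Sequence[str]]) -> List[subprocess.Popen]:
    """Start each command and keep its handle for later cleanup."""
    return [subprocess.Popen(list(command)) for command in commands]


def stop(processes: Sequence[subprocess.Popen], grace_s: float = 10.0) -> None:
    """Terminate only the tracked processes; kill any that ignore SIGTERM."""
    for process in processes:
        if process.poll() is None:
            process.terminate()
    for process in processes:
        try:
            process.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()


if __name__ == "__main__":
    # Stand-in for a server process; replace with real launch commands.
    workers = launch([[sys.executable, "-c", "import time; time.sleep(60)"]])
    try:
        pass  # run experiments here
    finally:
        stop(workers)
```

Servers that spawn their own worker processes may need extra care, for example starting each one in its own process group and signalling the group, since terminating the parent does not necessarily stop its children.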

Minimal reproducible run checklist

  1. Create Python env and install dependencies.
  2. Verify GPUs are visible (nvidia-smi).
  3. Pick one config under cfgs/.
  4. Run the matching launch script (run_servers_all_* or run_servers_hetero_*).
  5. Inspect logs/test_logs_<timestamp>/ for server/router/experiment logs.
  6. Inspect generated summary JSON for metrics.