Skip to content

[multi_agent] rollout generation stalls for a long time before any /generate fires #979

@GuanxingLu

Description

@GuanxingLu

Description

Running the qwen3-30B-A3B multi-agent example, after resume_memory_occupation returns, the rollout sits at Rollout generation: 0/256 with GPUs at 0% util. The sglang engine only sees periodic GET /health, where no POST /generate shows up for tens of seconds (sometimes indefinitely).

Looks like the event loop is blocked on the client side: examples/multi_agent/rollout_with_multi_agents.py calls load_tokenizer(args.hf_checkpoint, ...) on every sample, and examples/multi_agent/agent_system.py then does deepcopy(args) with the tokenizer attached. Under the GIL, deepcopying a PreTrainedTokenizer once per sample basically freezes everything.

Reproduction

bash examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh

Model Qwen3-30B-A3B-Instruct-2507, rollout_batch_size=32, n_samples_per_prompt=8, 8×L20X, hf checkpoint on a shared FS.

Traceback

No exception. During the stall:

INFO: .../resume_memory_occupation HTTP/1.1 200 OK
Rollout generation:   0%|          | 0/256 [00:00<?, ?it/s]
INFO: .../health HTTP/1.1 200 OK
INFO: .../health HTTP/1.1 200 OK
...

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions