Description
Running the qwen3-30B-A3B multi-agent example, after resume_memory_occupation returns, the rollout sits at Rollout generation: 0/256 with GPUs at 0% util. The sglang engine only sees periodic GET /health, where no POST /generate shows up for tens of seconds (sometimes indefinitely).
Looks like the event loop is blocked on the client side: examples/multi_agent/rollout_with_multi_agents.py calls load_tokenizer(args.hf_checkpoint, ...) on every sample, and examples/multi_agent/agent_system.py then does deepcopy(args) with the tokenizer attached. Under the GIL, deepcopying a PreTrainedTokenizer once per sample basically freezes everything.
Reproduction
bash examples/multi_agent/run-qwen3-30B-A3B-multi-agent.sh
Model Qwen3-30B-A3B-Instruct-2507, rollout_batch_size=32, n_samples_per_prompt=8, 8×L20X, hf checkpoint on a shared FS.
Traceback
No exception. During the stall:
INFO: .../resume_memory_occupation HTTP/1.1 200 OK
Rollout generation: 0%| | 0/256 [00:00<?, ?it/s]
INFO: .../health HTTP/1.1 200 OK
INFO: .../health HTTP/1.1 200 OK
...
Description
Running the qwen3-30B-A3B multi-agent example, after
resume_memory_occupationreturns, the rollout sits atRollout generation: 0/256with GPUs at 0% util. The sglang engine only sees periodicGET /health, where noPOST /generateshows up for tens of seconds (sometimes indefinitely).Looks like the event loop is blocked on the client side:
examples/multi_agent/rollout_with_multi_agents.pycallsload_tokenizer(args.hf_checkpoint, ...)on every sample, andexamples/multi_agent/agent_system.pythen doesdeepcopy(args)with the tokenizer attached. Under the GIL, deepcopying aPreTrainedTokenizeronce per sample basically freezes everything.Reproduction
Model
Qwen3-30B-A3B-Instruct-2507, rollout_batch_size=32, n_samples_per_prompt=8, 8×L20X, hf checkpoint on a shared FS.Traceback
No exception. During the stall: