Training sglang rollout timeout

My training run raises an exception in the middle of rollout generation:
```
Rollout generation:  62%|██████▎   | 40/64 [04:51<02:52,  7.19s/it]
Traceback (most recent call last):
  File "/export/home/pan/slime/train.py", line 100, in <module>
    train(args)
  File "/export/home/pan/slime/train.py", line 69, in train
    rollout_data_ref = ray.get(rollout_manager.generate.remote(rollout_id))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 2967, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 1015, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TimeoutError): ray::RolloutManager.generate() (pid=1694029, ip=134.100.9.210, actor_id=d937ea7a2c84d86b99661af502000000, repr=<slime.ray.rollout.RolloutManager object at 0x7fa42ba703b0>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/ray/rollout.py", line 135, in generate
    data, metrics = self._get_rollout_data(rollout_id=rollout_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/ray/rollout.py", line 229, in _get_rollout_data
    data = call_rollout_fn(self.generate_rollout, self.args, rollout_id, self.data_source, evaluation=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/rollout/base_types.py", line 20, in call_rollout_fn
    output = fn(*args, **kwargs, evaluation=evaluation)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/rollout/sglang_rollout.py", line 565, in generate_rollout
    output, aborted_samples = run(generate_rollout_async(args, rollout_id, data_source.get_samples))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/utils/async_utils.py", line 36, in run
    return get_async_loop().run(coro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/utils/async_utils.py", line 20, in run
    return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
TimeoutError
(SGLangEngine pid=1694629) [2026-02-07 14:34:05 TP0] Decode batch, #running-req: 13, #token: 46912, token usage: 0.03, cuda graph: False, gen throughput (token/s): 60.81, #queue-req: 0,
```

How do you handle a timeout like this, which is probably caused by overlong token generation? Should the rollout be retried or aborted automatically rather than raising this exception? Is that already implemented?
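
To make the question concrete, here is a minimal sketch of the retry-or-abort behavior I have in mind. It is not slime's actual code: `run_with_retry` and `make_flaky_generate` are hypothetical names, and the flaky coroutine only stands in for a generation request. Each attempt gets its own timeout via `asyncio.wait_for`; if every attempt times out, the helper returns `None` so the caller can drop that sample instead of failing the whole rollout.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def run_with_retry(
    make_coro: Callable[[], Awaitable[T]],
    timeout_s: float,
    max_attempts: int = 3,
) -> Optional[T]:
    """Await make_coro() with a per-attempt timeout; return None if all attempts time out."""
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the inner awaitable when the timeout expires.
            return await asyncio.wait_for(make_coro(), timeout=timeout_s)
        except asyncio.TimeoutError:
            print(f"attempt {attempt}/{max_attempts} timed out after {timeout_s}s")
    return None  # caller decides whether to skip (abort) this sample


def make_flaky_generate(slow_calls: int, delay_s: float = 0.2) -> Callable[[], Awaitable[str]]:
    """Hypothetical stand-in for a generation request: the first slow_calls calls are slow."""
    state = {"calls": 0}

    async def generate() -> str:
        state["calls"] += 1
        if state["calls"] <= slow_calls:
            await asyncio.sleep(delay_s)  # simulates an overlong generation
        return f"sample from call {state['calls']}"

    return generate


if __name__ == "__main__":
    # First call is slow and gets retried; the second call succeeds.
    print(asyncio.run(run_with_retry(make_flaky_generate(1), timeout_s=0.05)))
    # Every call is slow, so the sample is given up on and None is returned.
    print(asyncio.run(run_with_retry(make_flaky_generate(5), timeout_s=0.05)))
```

Whether a retry makes sense probably depends on the cause: if the sample simply generates too many tokens, retrying with the same settings may time out again, so aborting it (or capping the response length) might be the better default.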

Training sglang rollout timeout #1562

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Training sglang rollout timeout #1562

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions