Training sglang rollout timeout #1562

@p1k0pan

My training run raises an exception in the middle of rollout generation:

Rollout generation:  62%|██████▎   | 40/64 [04:51<02:52,  7.19s/it]
Traceback (most recent call last):
  File "/export/home/pan/slime/train.py", line 100, in <module>
    train(args)
  File "/export/home/pan/slime/train.py", line 69, in train
    rollout_data_ref = ray.get(rollout_manager.generate.remote(rollout_id))
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 2967, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 1015, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TimeoutError): ray::RolloutManager.generate() (pid=1694029, ip=134.100.9.210, actor_id=d937ea7a2c84d86b99661af502000000, repr=<slime.ray.rollout.RolloutManager object at 0x7fa42ba703b0>)
  File "/export/home/pan/slime/slime/ray/rollout.py", line 135, in generate
    data, metrics = self._get_rollout_data(rollout_id=rollout_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/ray/rollout.py", line 229, in _get_rollout_data
    data = call_rollout_fn(self.generate_rollout, self.args, rollout_id, self.data_source, evaluation=False)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/rollout/base_types.py", line 20, in call_rollout_fn
    output = fn(*args, **kwargs, evaluation=evaluation)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/rollout/sglang_rollout.py", line 565, in generate_rollout
    output, aborted_samples = run(generate_rollout_async(args, rollout_id, data_source.get_samples))
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/utils/async_utils.py", line 36, in run
    return get_async_loop().run(coro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/slime/slime/utils/async_utils.py", line 20, in run
    return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
TimeoutError
(SGLangEngine pid=1694629) [2026-02-07 14:34:05 TP0] Decode batch, #running-req: 13, #token: 46912, token usage: 0.03, cuda graph: False, gen throughput (token/s): 60.81, #queue-req: 0,

How do you handle a timeout like this, which is probably caused by overlong token generation? Should the rollout be retried or aborted automatically rather than raising this exception? Is that already implemented?
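
To make the question concrete, here is a minimal sketch of the behavior I have in mind: retry a single generation a few times when it times out, then give up on that sample instead of crashing the whole training run. This is not slime's actual API. `generate_with_retry`, `generate_one`, `timeout_s`, and `max_retries` are hypothetical names I made up for illustration, and I am assuming the per-sample generation call is an async function.

```python
import asyncio


async def generate_with_retry(generate_one, sample, timeout_s=600.0, max_retries=2):
    """Run one generation with a time limit, retrying on timeout.

    `generate_one` is a hypothetical async callable (not part of slime)
    that takes a sample and returns the generated output.
    Returns None if every attempt times out, so the caller can drop
    (abort) the sample instead of failing the whole rollout.
    """
    for _attempt in range(max_retries + 1):
        try:
            # wait_for cancels the pending call when the timeout expires.
            return await asyncio.wait_for(generate_one(sample), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Timed out: fall through and try again, if attempts remain.
            continue
    return None  # all attempts timed out; treat the sample as aborted
```

With something like this, the caller could filter out the `None` results and continue with the samples that finished, rather than propagating `TimeoutError` up through `ray.get`.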
