Open
Description
My training run raises an exception in the middle of rollout generation:
Rollout generation: 62%|██████▎ | 40/64 [04:51<02:52, 7.19s/it]
Traceback (most recent call last):
File "/export/home/pan/slime/train.py", line 100, in <module>
train(args)
File "/export/home/pan/slime/train.py", line 69, in train
rollout_data_ref = ray.get(rollout_manager.generate.remote(rollout_id))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 2967, in get
values, debugger_breakpoint = worker.get_objects(
^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/site-packages/ray/_private/worker.py", line 1015, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TimeoutError): ray::RolloutManager.generate() (pid=1694029, ip=134.100.9.210, actor_id=d937ea7a2c84d86b99661af502000000, repr=<slime.ray.rollout.RolloutManager object at 0x7fa42ba703b0>)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/ray/rollout.py", line 135, in generate
data, metrics = self._get_rollout_data(rollout_id=rollout_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/ray/rollout.py", line 229, in _get_rollout_data
data = call_rollout_fn(self.generate_rollout, self.args, rollout_id, self.data_source, evaluation=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/rollout/base_types.py", line 20, in call_rollout_fn
output = fn(*args, **kwargs, evaluation=evaluation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/rollout/sglang_rollout.py", line 565, in generate_rollout
output, aborted_samples = run(generate_rollout_async(args, rollout_id, data_source.get_samples))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/utils/async_utils.py", line 36, in run
return get_async_loop().run(coro)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/slime/slime/utils/async_utils.py", line 20, in run
return asyncio.run_coroutine_threadsafe(coro, self.loop).result()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/export/home/pan/micromamba/envs/slime/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
TimeoutError
(SGLangEngine pid=1694629) [2026-02-07 14:34:05 TP0] Decode batch, #running-req: 13, #token: 46912, token usage: 0.03, cuda graph: False, gen throughput (token/s): 60.81, #queue-req: 0,
How do you handle this kind of timeout, which is probably caused by overlong token generation? Should the rollout be retried or aborted automatically rather than raising this exception? Is that already implemented?
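
To make the question concrete, here is a minimal sketch of the behavior I have in mind, using only the standard `asyncio` library. It is not slime's actual code: `generate_one`, `generate_with_timeout`, `generate_batch`, `timeout_s`, and `max_retries` are hypothetical names I made up for illustration. Each request gets its own timeout, is retried a limited number of times, and is then reported as aborted instead of failing the whole rollout.

```python
import asyncio


async def generate_with_timeout(generate_one, prompt, timeout_s, max_retries=1):
    """Run one generation with a per-request timeout.

    `generate_one` is a hypothetical async callable (prompt -> output).
    Returns (output, aborted); aborted is True if every attempt timed out.
    """
    for _ in range(max_retries + 1):
        try:
            # wait_for cancels the inner coroutine when the timeout expires.
            output = await asyncio.wait_for(generate_one(prompt), timeout=timeout_s)
            return output, False
        except asyncio.TimeoutError:
            continue  # retry until the attempts are used up
    return None, True


async def generate_batch(generate_one, prompts, timeout_s, max_retries=1):
    """Generate all prompts concurrently; collect aborted ones instead of raising."""
    results = await asyncio.gather(
        *(generate_with_timeout(generate_one, p, timeout_s, max_retries) for p in prompts)
    )
    completed = [out for out, was_aborted in results if not was_aborted]
    aborted = [p for p, (_, was_aborted) in zip(prompts, results) if was_aborted]
    return completed, aborted


async def fake_generate(prompt):
    """Stand-in for a real engine call; the prompt "slow" simulates an overlong generation."""
    await asyncio.sleep(1.0 if prompt == "slow" else 0.01)
    return prompt.upper()


if __name__ == "__main__":
    done, dropped = asyncio.run(
        generate_batch(fake_generate, ["a", "slow", "b"], timeout_s=0.1)
    )
    print("completed:", done)
    print("aborted:", dropped)
```

With something like this, the aborted prompts could be put back into the data source or simply dropped for the current rollout, depending on what is appropriate for training.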