Description
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) "description": "The model's answer to the GSM8K math problem, must be a digits" [repeated 1310x across cluster]
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) } [repeated 5240x across cluster]
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) }, [repeated 1310x across cluster]
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) "required": [ [repeated 1310x across cluster]
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) "answer" [repeated 1310x across cluster]
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) ] [repeated 1310x across cluster]
[2026-03-12 11:50:31] (TaskRunner pid=3483799) ("Initial validation metrics: {'val-aux/openai/gsm8k/reward/mean@1': "
[2026-03-12 11:50:31] (TaskRunner pid=3483799) "0.8021228203184231, 'val-core/openai/gsm8k/acc/mean@1': 0.8021228203184231, "
[2026-03-12 11:50:31] (TaskRunner pid=3483799) "'val-aux/num_turns/min': 2, 'val-aux/num_turns/max': 2, "
[2026-03-12 11:50:31] (TaskRunner pid=3483799) "'val-aux/num_turns/mean': 2.0}")
[2026-03-12 11:50:31] (TaskRunner pid=3483799) step:0 - val-aux/openai/gsm8k/reward/mean@1:0.8021228203184231 - val-core/openai/gsm8k/acc/mean@1:0.8021228203184231 - val-aux/num_turns/min:2 - val-aux/num_turns/max:2 - val-aux/num_turns/mean:2.0
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498640)
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642)
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498643)
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498661)
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498639)
[2026-03-12 11:50:31] (AgentLoopWorker pid=3498642) { [repeated 1315x across cluster]
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) [2026-03-12 11:50:36,021 C 3491208 3491628] task_receiver.cc:181: An unexpected system state has occurred. You have likely discovered a bug in Ray. Please report this issue at https://github.com/ray-project/ray/issues and we'll work with you to fix it. Check failed: objects_valid
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) *** StackTrace Information ***
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0x152da9a) [0x7f47b476fa9a] ray::operator<<()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x481) [0x7f47b4771e81] ray::RayLog::~RayLog()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa22851) [0x7f47b3c64851] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa3ec92) [0x7f47b3c80c92] ray::core::InboundRequest::Accept()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core30OutOfOrderActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x11c) [0x7f47b3c75a9c] ray::core::OutOfOrderActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa344fb) [0x7f47b3c764fb] std::_Function_handler<>::_M_invoke()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa31717) [0x7f47b3c73717] std::_Function_handler<>::_M_invoke()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa361c5) [0x7f47b3c781c5] boost::fibers::worker_context<>::run()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa35e10) [0x7f47b3c77e10] boost::context::detail::fiber_entry<>()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa3e8ff) [0x7f47b3c808ff] make_fcontext
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491208)
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491210) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa22851) [0x7f170ed4a851] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491210)
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491209) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa22851) [0x7f4ecb71c851] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491209)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498650, ip=10.21.175.69, actor_id=1e3c375da50951efd38d986a01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7efa1c0803e0>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 449, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491207) /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa22851) [0x7f5199096851] ray::core::TaskReceiver::HandleTask()::{lambda()#1}::operator()()
[2026-03-12 11:50:36] (SGLangHttpServer pid=3491207)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498643, ip=10.21.175.69, actor_id=f3ab02a72c1e4c10c9548e4c01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7f5affbb8080>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498641, ip=10.21.175.69, actor_id=64eaea39079f0c5e36947c3001000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7ed383f01f70>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498639, ip=10.21.175.69, actor_id=fbc885410117aa86270c752b01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7eed19474530>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498661, ip=10.21.175.69, actor_id=2565191637a99b1e9f0a924601000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7f6c389a7f50>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498644, ip=10.21.175.69, actor_id=e1d295fd4461038128d1e75e01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7f1e3374b5c0>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 449, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await server.generate.remote(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) class_name: SGLangHttpServer
[2026-03-12 11:50:36] (TaskRunner pid=3483799) actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] (TaskRunner pid=3483799) pid: 3491208
[2026-03-12 11:50:36] (TaskRunner pid=3483799) name: sglang_server_1_0
[2026-03-12 11:50:36] (TaskRunner pid=3483799) namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ip: 10.21.175.69
[2026-03-12 11:50:36] (TaskRunner pid=3483799) The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] (TaskRunner pid=3483799) Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::AgentLoopWorker.generate_sequences() (pid=3498642, ip=10.21.175.69, actor_id=96fc92d0a5f305f7c543b96a01000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7f50e22dbe00>)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) return self.__get_result()
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:36] (TaskRunner pid=3483799) raise self._exception
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await func(*args, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:36] (TaskRunner pid=3483799) outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] (TaskRunner pid=3483799) File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:36] (TaskRunner pid=3483799) output = await self.server_manager.generate(
[2026-03-12 11:50:36] (TaskRunner pid=3483799) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m return await func(self, *args, **kwargs)
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m output = await server.generate.remote(
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m class_name: SGLangHttpServer
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m pid: 3491208
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m name: sglang_server_1_0
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m ip: 10.21.175.69
[2026-03-12 11:50:36] �[36m(TaskRunner pid=3483799)�[0m The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:36] �[36m(SGLangHttpServer pid=3491208)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
[2026-03-12 11:50:36] �[36m(SGLangHttpServer pid=3491208)�[0m warnings.warn('resource_tracker: There appear to be %d '
[2026-03-12 11:50:36] �[36m(SGLangHttpServer pid=3491208)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
[2026-03-12 11:50:36] �[36m(SGLangHttpServer pid=3491208)�[0m warnings.warn('resource_tracker: There appear to be %d '
[2026-03-12 11:50:37] �[36m(TaskRunner pid=3483799)�[0m wandb: uploading console lines 6-12; updating run metadata; uploading output.log; uploading wandb-summary.json
[2026-03-12 11:50:38] �[36m(TaskRunner pid=3483799)�[0m wandb: uploading config.yaml
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: uploading history steps 0-0, summary, console lines 12-45
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "type": "function",�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "function": {�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "name": "calc_gsm8k_reward",�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "description": "A tool for calculating the reward of gsm8k. (1.0 if parsed answer is correct, 0.0 if parsed answer is incorrect or not correctly parsed)",�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "parameters": {�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "type": "object",�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "properties": {�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "answer": {�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "type": "string",�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "description": "The model's answer to the GSM8K math problem, must be a digits"�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m }�[32m [repeated 5260x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m },�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "required": [�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m "answer"�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498642)�[0m ]�[32m [repeated 1315x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(AgentLoopWorker pid=3498641)�[0m
[2026-03-12 11:50:39] �[33m(raylet)�[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffff977b53d3f398f5ce457569ec01000000 Worker ID: 4622ada769736ea4442d85feb83b188afea4ae081c8ce5ebbdb87d50 Node ID: d0841e29917e2068d0ca53ca3999d0c231079e28b58729f09903867e Worker IP address: 10.21.175.69 Worker port: 35505 Worker PID: 3491208 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:39] Error executing job with overrides: ['algorithm.adv_estimator=grpo', 'data.train_batch_size=256', 'data.max_prompt_length=1024', 'data.max_response_length=1024', 'data.filter_overlong_prompts=True', 'data.truncation=error', 'data.return_raw_chat=True', 'actor_rollout_ref.model.path=/appdata/zhangkailin/models/Qwen/Qwen3-4B', 'actor_rollout_ref.actor.optim.lr=1e-6', 'actor_rollout_ref.model.use_remove_padding=True', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32', 'actor_rollout_ref.actor.use_kl_loss=True', 'actor_rollout_ref.actor.kl_loss_coef=0.001', 'actor_rollout_ref.actor.kl_loss_type=low_var_kl', 'actor_rollout_ref.actor.entropy_coeff=0', 'actor_rollout_ref.model.enable_gradient_checkpointing=True', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=sglang', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.rollout.n=16', 'actor_rollout_ref.rollout.over_sample_rate=0.1', 'actor_rollout_ref.rollout.mode=async', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'algorithm.use_kl_in_reward=False', 'trainer.critic_warmup=0', 'trainer.logger=["console","wandb"]', 'trainer.project_name=gsm8k_async_rl', 'trainer.experiment_name=qwen3-4b_function_rm-gsm8k-sgl-multi-w-tool-verify-n16', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=20', 'data.train_files=/appdata/zhangkailin/verl/examples/data_preprocess/save_data/gsm8k/train.parquet', 'data.val_files=/appdata/zhangkailin/verl/examples/data_preprocess/save_data/gsm8k/test.parquet', 
'actor_rollout_ref.rollout.multi_turn.tool_config_path=/appdata/zhangkailin/verl-new/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml', 'trainer.total_epochs=15']
[2026-03-12 11:50:39] Traceback (most recent call last):
[2026-03-12 11:50:39] File "<frozen runpy>", line 198, in _run_module_as_main
[2026-03-12 11:50:39] File "<frozen runpy>", line 88, in _run_code
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/trainer/main_ppo.py", line 448, in <module>
[2026-03-12 11:50:39] main()
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/main.py", line 94, in decorated_main
[2026-03-12 11:50:39] _run_hydra(
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[2026-03-12 11:50:39] _run_app(
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[2026-03-12 11:50:39] run_and_report(
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[2026-03-12 11:50:39] raise ex
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[2026-03-12 11:50:39] return func()
[2026-03-12 11:50:39] ^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[2026-03-12 11:50:39] lambda: hydra.run(
[2026-03-12 11:50:39] ^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/_internal/hydra.py", line 132, in run
[2026-03-12 11:50:39] _ = ret.return_value
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/core/utils.py", line 260, in return_value
[2026-03-12 11:50:39] raise self._return_value
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/hydra/core/utils.py", line 186, in run_job
[2026-03-12 11:50:39] ret.return_value = task_function(task_cfg)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/trainer/main_ppo.py", line 45, in main
[2026-03-12 11:50:39] run_ppo(config)
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/trainer/main_ppo.py", line 99, in run_ppo
[2026-03-12 11:50:39] ray.get(runner.run.remote(config))
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
[2026-03-12 11:50:39] return fn(*args, **kwargs)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
[2026-03-12 11:50:39] return func(*args, **kwargs)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_private/worker.py", line 2882, in get
[2026-03-12 11:50:39] values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_private/worker.py", line 968, in get_objects
[2026-03-12 11:50:39] raise value.as_instanceof_cause()
[2026-03-12 11:50:39] ray.exceptions.RayTaskError(ActorDiedError): �[36mray::TaskRunner.run()�[39m (pid=3483799, ip=10.21.175.69, actor_id=56aa6240a7f227ebf2ea2b0f01000000, repr=<main_ppo.TaskRunner object at 0x7f96468feed0>)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/trainer/main_ppo.py", line 367, in run
[2026-03-12 11:50:39] trainer.fit()
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/trainer/ppo/ray_trainer.py", line 1439, in fit
[2026-03-12 11:50:39] gen_batch_output = self.async_rollout_manager.generate_sequences(gen_batch_output)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 944, in generate_sequences
[2026-03-12 11:50:39] outputs = ray.get(
[2026-03-12 11:50:39] ^^^^^^^^
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ray.exceptions.RayTaskError(ActorDiedError): �[36mray::AgentLoopWorker.generate_sequences()�[39m (pid=3498640, ip=10.21.175.69, actor_id=5e4e19c6ffb03addf2c4e77601000000, repr=<verl.experimental.agent_loop.agent_loop.AgentLoopWorker object at 0x7fc57a7c0920>)
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 456, in result
[2026-03-12 11:50:39] return self.__get_result()
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/home/zhangkailin/.conda/envs/verl2/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
[2026-03-12 11:50:39] raise self._exception
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/utils/transferqueue_utils.py", line 319, in dummy_async_inner
[2026-03-12 11:50:39] output = await func(*args, **kwargs)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 479, in generate_sequences
[2026-03-12 11:50:39] outputs = await asyncio.gather(*tasks)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 516, in _run_agent_loop
[2026-03-12 11:50:39] output: AgentLoopOutput = await agent_loop.run(sampling_params, **kwargs)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/single_turn_agent_loop.py", line 59, in run
[2026-03-12 11:50:39] output = await self.server_manager.generate(
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/utils/rollout_trace.py", line 191, in async_wrapper
[2026-03-12 11:50:39] return await func(self, *args, **kwargs)
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] File "/appdata/zhangkailin/verl-new/verl/experimental/agent_loop/agent_loop.py", line 115, in generate
[2026-03-12 11:50:39] output = await server.generate.remote(
[2026-03-12 11:50:39] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-12 11:50:39] ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
[2026-03-12 11:50:39] class_name: SGLangHttpServer
[2026-03-12 11:50:39] actor_id: 977b53d3f398f5ce457569ec01000000
[2026-03-12 11:50:39] pid: 3491208
[2026-03-12 11:50:39] name: sglang_server_1_0
[2026-03-12 11:50:39] namespace: 98d581c9-a558-40aa-89ac-c76103ef6fa4
[2026-03-12 11:50:39] ip: 10.21.175.69
[2026-03-12 11:50:39] The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb:
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: Run history:
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/max ▁
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/mean ▁
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/min ▁
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/openai/gsm8k/reward/mean@1 ▁
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-core/openai/gsm8k/acc/mean@1 ▁
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb:
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: Run summary:
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/max 2
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/mean 2
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/num_turns/min 2
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-aux/openai/gsm8k/reward/mean@1 0.80212
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: val-core/openai/gsm8k/acc/mean@1 0.80212
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb:
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: 🚀 View run qwen3-4b_function_rm-gsm8k-sgl-multi-w-tool-verify-n16 at: https://wandb.ai/zhangkailin1002-genn/gsm8k_async_rl/runs/6mf3v32s
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: ⭐️ View project at: https://wandb.ai/zhangkailin1002-genn/gsm8k_async_rl
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m wandb: Find logs at: ./wandb/run-20260312_114948-6mf3v32s/logs
[2026-03-12 11:50:39] �[36m(TaskRunner pid=3483799)�[0m
Training Progress: 0%| | 0/435 [00:12<?, ?it/s]
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m [2026-03-12 11:50:36,145 C 3491207 3491654] task_receiver.cc:181: An unexpected system state has occurred. You have likely discovered a bug in Ray. Please report this issue at https://github.com/ray-project/ray/issues and we'll work with you to fix it. Check failed: objects_valid �[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m *** StackTrace Information ***�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0x152da9a) [0x7f5199ba1a9a] ray::operator<<()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray6RayLogD1Ev+0x481) [0x7f5199ba3e81] ray::RayLog::~RayLog()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa3ec92) [0x7f51990b2c92] ray::core::InboundRequest::Accept()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(_ZN3ray4core30OutOfOrderActorSchedulingQueue31AcceptRequestOrRejectIfCanceledENS_6TaskIDERNS0_14InboundRequestE+0x11c) [0x7f51990a7a9c] ray::core::OutOfOrderActorSchedulingQueue::AcceptRequestOrRejectIfCanceled()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa31717) [0x7f51990a5717] std::_Function_handler<>::_M_invoke()�[32m [repeated 6x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa361c5) [0x7f51990aa1c5] boost::fibers::worker_context<>::run()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa35e10) [0x7f51990a9e10] boost::context::detail::fiber_entry<>()�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/site-packages/ray/_raylet.so(+0xa3e8ff) [0x7f51990b28ff] make_fcontext�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m warnings.warn('resource_tracker: There appear to be %d '�[32m [repeated 6x across cluster]�[0m
[2026-03-12 11:50:39] �[36m(SGLangHttpServer pid=3491207)�[0m /home/zhangkailin/.conda/envs/verl2/lib/python3.12/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown�[32m [repeated 3x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m {�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "type": "function",�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "function": {�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "name": "calc_gsm8k_reward",�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "description": "A tool for calculating the reward of gsm8k. (1.0 if parsed answer is correct, 0.0 if parsed answer is incorrect or not correctly parsed)",�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "parameters": {�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "type": "object",�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "properties": {�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "answer": {�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "type": "string",�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "description": "The model's answer to the GSM8K math problem, must be a digits"�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m }�[32m [repeated 11124x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m },�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "required": [�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m "answer"�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[36m(AgentLoopWorker pid=3498640)�[0m ]�[32m [repeated 2781x across cluster]�[0m
[2026-03-12 11:50:41] �[33m(raylet)�[0m A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffdac23f3311d8ca54abeea54101000000 Worker ID: 07f4d6d1b64ee555c2057c793b0ba9b1d01cbee160480cd12e8fc01c Node ID: d0841e29917e2068d0ca53ca3999d0c231079e28b58729f09903867e Worker IP address: 10.21.175.69 Worker port: 35325 Worker PID: 3491207 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.�[32m [repeated 3x across cluster]�[0m