
[Question] Online RL training fails to start with the following error #1065

@Awyshw

Description

(AReaL) root@ai:/ai/shenwei/workspace/codes/AReaL# python examples/openclaw/train.py --config examples/openclaw/config.yaml experiment_name=my-exp trial_name=trial-0 allocation_mode=sglang:d2t1+fsdp:d2t1 actor.path=Qwen/Qwen3-4B scheduler.type=local rollout.openai.admin_api_key=sk-test123456
Running the command above exits abnormally with an error saying the psutil module is missing. Even if I modify the proc.py code to catch that exception and ignore it, the run still fails.
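For reference, a defensive variant of the cleanup helper could fall back to plain process-group signalling when `psutil` misbehaves. This is a minimal sketch, not AReaL's actual implementation: the function name and signature are taken from the traceback below (`kill_process_tree` in `areal/infra/utils/proc.py`); the body is an assumption.

```python
import os
import signal


def kill_process_tree(parent_pid: int, timeout: float = 3, graceful: bool = True) -> None:
    """Terminate a process and its children.

    Tries psutil first; if psutil is unavailable or broken (as in this
    issue, where ray's vendored copy raises KeyError), falls back to
    signalling the whole process group with the stdlib only.
    """
    sig = signal.SIGTERM if graceful else signal.SIGKILL
    try:
        import psutil  # may resolve to ray's vendored, broken copy

        parent = psutil.Process(parent_pid)
        procs = parent.children(recursive=True) + [parent]
        for p in procs:
            try:
                p.send_signal(sig)
            except psutil.NoSuchProcess:
                pass
        # Escalate to SIGKILL for anything still alive after the grace period.
        _, alive = psutil.wait_procs(procs, timeout=timeout)
        for p in alive:
            p.kill()
    except Exception:
        # Stdlib fallback: signal the worker's process group directly.
        try:
            os.killpg(os.getpgid(parent_pid), sig)
        except ProcessLookupError:
            pass  # already gone
```

For the fallback to reach all children, workers would need to be spawned in their own session (e.g. `subprocess.Popen(..., start_new_session=True)`) so the group id matches the worker pid.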

(AReaL) 20260320-11:01:20.883 CLIArgs WARNING: behave_imp_weight_cap and behave_imp_weight_mode are configured but use_decoupled_loss=False. These settings will be ignored. Set use_decoupled_loss=True to enable decoupled loss with importance weight correction.
(AReaL) 20260320-11:01:28.732 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:01:28.742 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:01:28.857 NameResolve INFO: Removing name resolve path: /ai/shenwei/workspace/codes/areal-experiments/name_resolve/root/my-exp/trial-0
(AReaL) 20260320-11:01:30.011 LocalScheduler INFO: LocalScheduler initialized with GPU devices: [0, 1, 2, 3], log directory: /ai/shenwei/workspace/codes/areal-experiments/clawrl/logs/root/my-exp/trial-0
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
(AReaL) 20260320-11:02:36.185 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:02:36.452 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
[W320 11:02:43.881458139 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(AReaL) 20260320-11:02:43.020 TrainController INFO: Creating workers via scheduler...
(AReaL) 20260320-11:02:43.021 LocalScheduler INFO: Creating 2 workers for role 'actor' (strategy: SchedulingStrategyType.separation, colocate_with: None)
(AReaL) 20260320-11:02:43.022 LauncherUtils INFO: Auto-setting thread env vars to 8: OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS
(AReaL) 20260320-11:02:43.023 LocalScheduler INFO: Starting worker actor/0: python3 -m areal.infra.rpc.rpc_server --port 7851 --experiment-name my-exp --trial-name trial-0 --role actor --worker-index 0 --name-resolve-type nfs --nfs-record-root /ai/shenwei/workspace/codes/areal-experiments/name_resolve --etcd3-addr localhost:2379 --fileroot /ai/shenwei/workspace/codes/areal-experiments/clawrl
(AReaL) 20260320-11:02:43.136 LocalScheduler INFO: Worker actor/0 started (PID: 1487796, GPUs: [0], ports: [7851, 34213])
(AReaL) 20260320-11:02:43.138 LauncherUtils INFO: Auto-setting thread env vars to 8: OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS
(AReaL) 20260320-11:02:43.139 LocalScheduler INFO: Starting worker actor/1: python3 -m areal.infra.rpc.rpc_server --port 16368 --experiment-name my-exp --trial-name trial-0 --role actor --worker-index 1 --name-resolve-type nfs --nfs-record-root /ai/shenwei/workspace/codes/areal-experiments/name_resolve --etcd3-addr localhost:2379 --fileroot /ai/shenwei/workspace/codes/areal-experiments/clawrl
(AReaL) 20260320-11:02:43.255 LocalScheduler INFO: Worker actor/1 started (PID: 1487801, GPUs: [1], ports: [16368, 56139])
(AReaL) 20260320-11:02:43.256 LocalScheduler INFO: Successfully created 2 workers for role 'actor'

(AReaL) 20260320-11:17:44.504 SyncRPCServer INFO: Starting sync RPC server on 162.30.1.19:16368 for worker actor/1
(AReaL) 20260320-11:17:44.504 SyncRPCServer INFO: Werkzeug log level: WARNING
(AReaL) 20260320-11:17:44.590 SyncRPCServer INFO: Starting sync RPC server on 162.30.1.19:7851 for worker actor/0
(AReaL) 20260320-11:17:44.590 SyncRPCServer INFO: Werkzeug log level: WARNING
(AReaL) 20260320-11:17:44.606 SyncRPCServer INFO: Engine thread started
(AReaL) 20260320-11:17:44.607 SyncRPCServer INFO: Engine thread initialized
(AReaL) 20260320-11:17:45.067 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:17:45.067 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:17:45.070 LocalScheduler INFO: Configuration successfully on worker 'actor/0'
(AReaL) 20260320-11:17:45.087 SyncRPCServer INFO: Engine thread started
(AReaL) 20260320-11:17:45.088 SyncRPCServer INFO: Engine thread initialized
(AReaL) 20260320-11:17:45.187 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:17:45.187 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:17:45.190 LocalScheduler INFO: Configuration successfully on worker 'actor/1'
(AReaL) 20260320-11:17:45.192 TrainController INFO: Workers created: ['actor/0', 'actor/1']
(AReaL) 20260320-11:17:45.192 TrainController INFO: Waiting for workers to be ready...
(AReaL) 20260320-11:17:45.204 LocalScheduler INFO: All 2 workers for role 'actor' are ready
(AReaL) 20260320-11:17:45.205 TrainController INFO: Workers ready: ['actor/0', 'actor/1']
(AReaL) 20260320-11:17:45.206 TrainController INFO: Distributed training: MASTER_ADDR=162.30.1.19, MASTER_PORT=34213
(AReaL) 20260320-11:17:45.208 TrainController INFO: Creating engines on workers...
(AReaL) 20260320-11:17:45.213 SyncRPCServer INFO: Set RANK=0
(AReaL) 20260320-11:17:45.213 SyncRPCServer INFO: Set RANK=1
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set WORLD_SIZE=2
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_ADDR=162.30.1.19
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set WORLD_SIZE=2
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_PORT=34213
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set LOCAL_RANK=0
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_ADDR=162.30.1.19
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_PORT=34213
(AReaL) 20260320-11:17:45.215 SyncRPCServer INFO: Set LOCAL_RANK=0
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
(AReaL) 20260320-11:19:27.487 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:19:27.490 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:19:27.543 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
(AReaL) 20260320-11:19:27.544 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
(AReaL) 20260320-11:19:29.152 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: PPOActor Configuration
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Mode: Decoupled PPO (off-policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Training Parameters:
(AReaL) 20260320-11:19:29.154 PPOActor INFO: importance_sampling_level: token
(AReaL) 20260320-11:19:29.154 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: PPOActor Configuration
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Mode: Decoupled PPO (off-policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.155 PPOActor INFO: Training Parameters:
(AReaL) 20260320-11:19:29.155 PPOActor INFO: importance_sampling_level: token
(AReaL) 20260320-11:19:29.155 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.155 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.155 PPOActor INFO: eps_clip: 0.4
(AReaL) 20260320-11:19:29.155 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.155 SyncRPCServer INFO: Engine 'actor/0' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
(AReaL) 20260320-11:19:29.154 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: eps_clip: 0.4
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 SyncRPCServer INFO: Engine 'actor/1' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
(AReaL) 20260320-11:19:29.159 TrainController INFO: Engines created on all workers!
(AReaL) 20260320-11:19:29.162 TrainController INFO: Calling engine initialization...
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(AReaL) 20260320-11:19:29.489 [FSDPEngine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
(AReaL) 20260320-11:19:29.491 [FSDPEngine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
(AReaL) 20260320-11:19:29.558 [FSDPEngine Rank 0] INFO: Data parallel head 0 and rank 0
(AReaL) 20260320-11:19:29.597 [FSDPEngine Rank 1] INFO: Data parallel head 1 and rank 1
(AReaL) 20260320-11:19:42.296 IOStruct INFO: Memory-Usage before model creation/loading: memory allocated (GB): 0.00, memory reserved (GB): 0.00, device memory used/total (GB): 1.22/79.32
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.14s/it]
(AReaL) 20260320-11:19:53.528 IOStruct INFO: Memory-Usage after model creation/loading: memory allocated (GB): 7.52, memory reserved (GB): 7.54, device memory used/total (GB): 8.76/79.32
(AReaL) 20260320-11:19:53.532 [FSDPEngine Rank 0] INFO: Model creation and loading time: 7.77s
(AReaL) 20260320-11:19:54.305 [FSDPEngine Rank 0] INFO: Applying FSDP2 with N-D parallelism for 0.77 seconds
(AReaL) 20260320-11:19:54.310 [FSDPEngine Rank 0] INFO: Create optimizer time: 0.004012462683022022
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.84s/it]
(AReaL) 20260320-11:19:58.793 [FSDPEngine Rank 1] INFO: Model creation and loading time: 13.05s
(AReaL) 20260320-11:19:59.737 [FSDPEngine Rank 1] INFO: Applying FSDP2 with N-D parallelism for 0.94 seconds
(AReaL) 20260320-11:19:59.742 [FSDPEngine Rank 1] INFO: Create optimizer time: 0.004047151654958725
(AReaL) 20260320-11:19:59.745 TrainController INFO: All engines are initialized!
(AReaL) 20260320-11:19:59.747 TrainController INFO: Identifying DP head workers...
(AReaL) 20260320-11:20:01.125 LocalScheduler WARNING: Method 'is_data_parallel_head' failed on worker 'actor/0' (attempt 1/3): Connection error: Server disconnected. Retrying in 1.0s...
(AReaL) 20260320-11:20:01.774 LocalScheduler WARNING: Method 'is_data_parallel_head' failed on worker 'actor/1' (attempt 1/3): Connection error: Server disconnected. Retrying in 1.0s...
[rank0]: Traceback (most recent call last):
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/examples/openclaw/train.py", line 17, in <module>
[rank0]: main(sys.argv[1:])
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/examples/openclaw/train.py", line 12, in main
[rank0]: with PPOTrainer(config) as trainer:
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/trainer/rl_trainer.py", line 189, in __init__
[rank0]: self.actor.initialize(**engine_init_kwargs, role="actor")
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 252, in initialize
[rank0]: self._identify_dp_heads()
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 331, in _identify_dp_heads
[rank0]: self.workers_is_dp_head = run_async_task(_get_dp_head)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/concurrent.py", line 69, in run_async_task
[rank0]: return asyncio.run(func(*args, **kwargs))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.local/share/uv/python/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/root/.local/share/uv/python/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 329, in _get_dp_head
[rank0]: return await asyncio.gather(*tasks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 1365, in async_call_engine
[rank0]: raise WorkerFailedError(
[rank0]: areal.infra.scheduler.exceptions.WorkerFailedError: Worker 'actor/0' failed with exit code 0
[rank0]: Stderr output:
[rank0]: (AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Training Parameters:
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: importance_sampling_level: token
[rank0]: (AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: PPOActor Configuration
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Mode: Decoupled PPO (off-policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: Training Parameters:
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: importance_sampling_level: token
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: eps_clip: 0.4
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.155 SyncRPCServer INFO: Engine 'actor/0' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: eps_clip: 0.4
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 SyncRPCServer INFO: Engine 'actor/1' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
[rank0]: [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[rank0]: [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[rank0]: (AReaL) 20260320-11:19:29.489 [FSDPEngine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
[rank0]: (AReaL) 20260320-11:19:29.491 [FSDPEngine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
[rank0]: (AReaL) 20260320-11:19:29.558 [FSDPEngine Rank 0] INFO: Data parallel head 0 and rank 0
[rank0]: (AReaL) 20260320-11:19:29.597 [FSDPEngine Rank 1] INFO: Data parallel head 1 and rank 1
[rank0]: (AReaL) 20260320-11:19:42.296 IOStruct INFO: Memory-Usage before model creation/loading: memory allocated (GB): 0.00, memory reserved (GB): 0.00, device memory used/total (GB): 1.22/79.32

[rank0]: Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
[rank0]: Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
[rank0]: Loading checkpoint shards: 33%|███▎ | 1/3 [00:02<00:05, 2.78s/it]
[rank0]: Loading checkpoint shards: 33%|███▎ | 1/3 [00:05<00:10, 5.02s/it]
[rank0]: Loading checkpoint shards: 67%|██████▋ | 2/3 [00:06<00:03, 3.17s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 1.81s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.14s/it]
[rank0]: (AReaL) 20260320-11:19:53.528 IOStruct INFO: Memory-Usage after model creation/loading: memory allocated (GB): 7.52, memory reserved (GB): 7.54, device memory used/total (GB): 8.76/79.32
[rank0]: (AReaL) 20260320-11:19:53.532 [FSDPEngine Rank 0] INFO: Model creation and loading time: 7.77s
[rank0]: (AReaL) 20260320-11:19:54.305 [FSDPEngine Rank 0] INFO: Applying FSDP2 with N-D parallelism for 0.77 seconds
[rank0]: (AReaL) 20260320-11:19:54.310 [FSDPEngine Rank 0] INFO: Create optimizer time: 0.004012462683022022

[rank0]: Loading checkpoint shards: 67%|██████▋ | 2/3 [00:11<00:05, 5.75s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.23s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.84s/it]
[rank0]: (AReaL) 20260320-11:19:58.793 [FSDPEngine Rank 1] INFO: Model creation and loading time: 13.05s
[rank0]: (AReaL) 20260320-11:19:59.737 [FSDPEngine Rank 1] INFO: Applying FSDP2 with N-D parallelism for 0.94 seconds
[rank0]: (AReaL) 20260320-11:19:59.742 [FSDPEngine Rank 1] INFO: Create optimizer time: 0.004047151654958725

(AReaL) 20260320-11:20:06.243 LocalScheduler INFO: Deleting 2 workers for role 'actor'
(AReaL) 20260320-11:20:06.245 LocalScheduler ERROR: Error cleaning up worker actor/0: 'psutil'
Traceback (most recent call last):
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 978, in _cleanup_workers
kill_process_tree(worker_info.process.pid, timeout=3, graceful=True)
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/proc.py", line 148, in kill_process_tree
parent = psutil.Process(parent_pid)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 314, in __init__
self._init(pid)
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 341, in _init
self._proc = _psplatform.Process(pid)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1628, in __init__
self._procfs_path = get_procfs_path()
^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_common.py", line 754, in get_procfs_path
return sys.modules['psutil'].PROCFS_PATH
^^^^^^^^^^^^^^^^^^^^^
KeyError: 'psutil'
(AReaL) 20260320-11:20:06.560 LocalScheduler ERROR: Error cleaning up worker actor/1: 'psutil'
Traceback (most recent call last):
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 978, in _cleanup_workers
kill_process_tree(worker_info.process.pid, timeout=3, graceful=True)
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/proc.py", line 148, in kill_process_tree
parent = psutil.Process(parent_pid)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 314, in __init__
self._init(pid)
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 341, in _init
self._proc = _psplatform.Process(pid)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1628, in __init__
self._procfs_path = get_procfs_path()
^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_common.py", line 754, in get_procfs_path
return sys.modules['psutil'].PROCFS_PATH
^^^^^^^^^^^^^^^^^^^^^
KeyError: 'psutil'
(AReaL) 20260320-11:20:06.561 LocalScheduler INFO: Successfully deleted workers for role 'actor'
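The cleanup traceback shows `psutil` being loaded from ray's vendored copy (`ray/thirdparty_files/psutil`), and its `get_procfs_path` looks up `sys.modules['psutil']` — a key that is apparently never registered when the package is imported under the `ray.thirdparty_files` prefix, which would explain the `KeyError: 'psutil'`. A quick check of which `psutil` the interpreter resolves (installing the standalone `psutil` wheel into the environment should make the real package take precedence):

```python
import importlib.util

# Locate the psutil module the interpreter would import, without importing it.
spec = importlib.util.find_spec("psutil")
if spec is None:
    print("psutil is not importable; try `pip install psutil`")
else:
    # If this path contains "ray/thirdparty_files", ray's vendored copy is
    # standing in for a proper install.
    print(spec.origin)
```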

Labels: question (Further information is requested)