Description
(AReaL) root@ai:/ai/shenwei/workspace/codes/AReaL# python examples/openclaw/train.py --config examples/openclaw/config.yaml experiment_name=my-exp trial_name=trial-0 allocation_mode=sglang:d2t1+fsdp:d2t1 actor.path=Qwen/Qwen3-4B scheduler.type=local rollout.openai.admin_api_key=sk-test123456
Running the command above exits abnormally, complaining that the psutil module is missing. This exception is not caught; even if I modify the proc.py code to catch and ignore it, the run still fails.
(AReaL) 20260320-11:01:20.883 CLIArgs WARNING: behave_imp_weight_cap and behave_imp_weight_mode are configured but use_decoupled_loss=False. These settings will be ignored. Set use_decoupled_loss=True to enable decoupled loss with importance weight correction.
(AReaL) 20260320-11:01:28.732 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:01:28.742 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:01:28.857 NameResolve INFO: Removing name resolve path: /ai/shenwei/workspace/codes/areal-experiments/name_resolve/root/my-exp/trial-0
(AReaL) 20260320-11:01:30.011 LocalScheduler INFO: LocalScheduler initialized with GPU devices: [0, 1, 2, 3], log directory: /ai/shenwei/workspace/codes/areal-experiments/clawrl/logs/root/my-exp/trial-0
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
(AReaL) 20260320-11:02:36.185 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:02:36.452 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
[W320 11:02:43.881458139 socket.cpp:209] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(AReaL) 20260320-11:02:43.020 TrainController INFO: Creating workers via scheduler...
(AReaL) 20260320-11:02:43.021 LocalScheduler INFO: Creating 2 workers for role 'actor' (strategy: SchedulingStrategyType.separation, colocate_with: None)
(AReaL) 20260320-11:02:43.022 LauncherUtils INFO: Auto-setting thread env vars to 8: OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS
(AReaL) 20260320-11:02:43.023 LocalScheduler INFO: Starting worker actor/0: python3 -m areal.infra.rpc.rpc_server --port 7851 --experiment-name my-exp --trial-name trial-0 --role actor --worker-index 0 --name-resolve-type nfs --nfs-record-root /ai/shenwei/workspace/codes/areal-experiments/name_resolve --etcd3-addr localhost:2379 --fileroot /ai/shenwei/workspace/codes/areal-experiments/clawrl
(AReaL) 20260320-11:02:43.136 LocalScheduler INFO: Worker actor/0 started (PID: 1487796, GPUs: [0], ports: [7851, 34213])
(AReaL) 20260320-11:02:43.138 LauncherUtils INFO: Auto-setting thread env vars to 8: OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS
(AReaL) 20260320-11:02:43.139 LocalScheduler INFO: Starting worker actor/1: python3 -m areal.infra.rpc.rpc_server --port 16368 --experiment-name my-exp --trial-name trial-0 --role actor --worker-index 1 --name-resolve-type nfs --nfs-record-root /ai/shenwei/workspace/codes/areal-experiments/name_resolve --etcd3-addr localhost:2379 --fileroot /ai/shenwei/workspace/codes/areal-experiments/clawrl
(AReaL) 20260320-11:02:43.255 LocalScheduler INFO: Worker actor/1 started (PID: 1487801, GPUs: [1], ports: [16368, 56139])
(AReaL) 20260320-11:02:43.256 LocalScheduler INFO: Successfully created 2 workers for role 'actor'
(AReaL) 20260320-11:17:44.504 SyncRPCServer INFO: Starting sync RPC server on 162.30.1.19:16368 for worker actor/1
(AReaL) 20260320-11:17:44.504 SyncRPCServer INFO: Werkzeug log level: WARNING
(AReaL) 20260320-11:17:44.590 SyncRPCServer INFO: Starting sync RPC server on 162.30.1.19:7851 for worker actor/0
(AReaL) 20260320-11:17:44.590 SyncRPCServer INFO: Werkzeug log level: WARNING
(AReaL) 20260320-11:17:44.606 SyncRPCServer INFO: Engine thread started
(AReaL) 20260320-11:17:44.607 SyncRPCServer INFO: Engine thread initialized
(AReaL) 20260320-11:17:45.067 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:17:45.067 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:17:45.070 LocalScheduler INFO: Configuration successfully on worker 'actor/0'
(AReaL) 20260320-11:17:45.087 SyncRPCServer INFO: Engine thread started
(AReaL) 20260320-11:17:45.088 SyncRPCServer INFO: Engine thread initialized
(AReaL) 20260320-11:17:45.187 PlatformInit INFO: Detected CUDA device: NVIDIA A800 80GB PCIE
(AReaL) 20260320-11:17:45.187 PlatformInit INFO: Initializing CUDA platform (NVIDIA).
(AReaL) 20260320-11:17:45.190 LocalScheduler INFO: Configuration successfully on worker 'actor/1'
(AReaL) 20260320-11:17:45.192 TrainController INFO: Workers created: ['actor/0', 'actor/1']
(AReaL) 20260320-11:17:45.192 TrainController INFO: Waiting for workers to be ready...
(AReaL) 20260320-11:17:45.204 LocalScheduler INFO: All 2 workers for role 'actor' are ready
(AReaL) 20260320-11:17:45.205 TrainController INFO: Workers ready: ['actor/0', 'actor/1']
(AReaL) 20260320-11:17:45.206 TrainController INFO: Distributed training: MASTER_ADDR=162.30.1.19, MASTER_PORT=34213
(AReaL) 20260320-11:17:45.208 TrainController INFO: Creating engines on workers...
(AReaL) 20260320-11:17:45.213 SyncRPCServer INFO: Set RANK=0
(AReaL) 20260320-11:17:45.213 SyncRPCServer INFO: Set RANK=1
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set WORLD_SIZE=2
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_ADDR=162.30.1.19
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set WORLD_SIZE=2
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_PORT=34213
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set LOCAL_RANK=0
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_ADDR=162.30.1.19
(AReaL) 20260320-11:17:45.214 SyncRPCServer INFO: Set MASTER_PORT=34213
(AReaL) 20260320-11:17:45.215 SyncRPCServer INFO: Set LOCAL_RANK=0
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
/ai/shenwei/workspace/codes/AReaL/areal/engine/fsdp_utils/grad.py:35: UserWarning: Transformer Engine and Apex are not installed. Falling back to local implementations of multi_tensor_applier, multi_tensor_l2norm, and multi_tensor_scale
warnings.warn(
(AReaL) 20260320-11:19:27.487 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:19:27.490 TreeAttentionFSDP INFO: Compiled torch flex attention. Options: {'epilogue_fusion': True, 'max_autotune': False, 'shape_padding': True, 'trace.enabled': False, 'triton.cudagraphs': False}, dynamic: True
(AReaL) 20260320-11:19:27.543 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
(AReaL) 20260320-11:19:27.544 TreeAttentionFSDP INFO: Using block mask in flex attention, block size: 128
(AReaL) 20260320-11:19:29.152 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: PPOActor Configuration
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Mode: Decoupled PPO (off-policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Training Parameters:
(AReaL) 20260320-11:19:29.154 PPOActor INFO: importance_sampling_level: token
(AReaL) 20260320-11:19:29.154 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: PPOActor Configuration
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Mode: Decoupled PPO (off-policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.155 PPOActor INFO: Training Parameters:
(AReaL) 20260320-11:19:29.155 PPOActor INFO: importance_sampling_level: token
(AReaL) 20260320-11:19:29.155 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.155 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.155 PPOActor INFO: eps_clip: 0.4
(AReaL) 20260320-11:19:29.155 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.155 SyncRPCServer INFO: Engine 'actor/0' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
(AReaL) 20260320-11:19:29.154 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
(AReaL) 20260320-11:19:29.154 PPOActor INFO: eps_clip: 0.4
(AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
(AReaL) 20260320-11:19:29.154 SyncRPCServer INFO: Engine 'actor/1' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
(AReaL) 20260320-11:19:29.159 TrainController INFO: Engines created on all workers!
(AReaL) 20260320-11:19:29.162 TrainController INFO: Calling engine initialization...
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(AReaL) 20260320-11:19:29.489 [FSDPEngine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
(AReaL) 20260320-11:19:29.491 [FSDPEngine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
(AReaL) 20260320-11:19:29.558 [FSDPEngine Rank 0] INFO: Data parallel head 0 and rank 0
(AReaL) 20260320-11:19:29.597 [FSDPEngine Rank 1] INFO: Data parallel head 1 and rank 1
(AReaL) 20260320-11:19:42.296 IOStruct INFO: Memory-Usage before model creation/loading: memory allocated (GB): 0.00, memory reserved (GB): 0.00, device memory used/total (GB): 1.22/79.32
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.14s/it]
(AReaL) 20260320-11:19:53.528 IOStruct INFO: Memory-Usage after model creation/loading: memory allocated (GB): 7.52, memory reserved (GB): 7.54, device memory used/total (GB): 8.76/79.32
(AReaL) 20260320-11:19:53.532 [FSDPEngine Rank 0] INFO: Model creation and loading time: 7.77s
(AReaL) 20260320-11:19:54.305 [FSDPEngine Rank 0] INFO: Applying FSDP2 with N-D parallelism for 0.77 seconds
(AReaL) 20260320-11:19:54.310 [FSDPEngine Rank 0] INFO: Create optimizer time: 0.004012462683022022
Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.84s/it]
(AReaL) 20260320-11:19:58.793 [FSDPEngine Rank 1] INFO: Model creation and loading time: 13.05s
(AReaL) 20260320-11:19:59.737 [FSDPEngine Rank 1] INFO: Applying FSDP2 with N-D parallelism for 0.94 seconds
(AReaL) 20260320-11:19:59.742 [FSDPEngine Rank 1] INFO: Create optimizer time: 0.004047151654958725
(AReaL) 20260320-11:19:59.745 TrainController INFO: All engines are initialized!
(AReaL) 20260320-11:19:59.747 TrainController INFO: Identifying DP head workers...
(AReaL) 20260320-11:20:01.125 LocalScheduler WARNING: Method 'is_data_parallel_head' failed on worker 'actor/0' (attempt 1/3): Connection error: Server disconnected. Retrying in 1.0s...
(AReaL) 20260320-11:20:01.774 LocalScheduler WARNING: Method 'is_data_parallel_head' failed on worker 'actor/1' (attempt 1/3): Connection error: Server disconnected. Retrying in 1.0s...
[rank0]: Traceback (most recent call last):
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/examples/openclaw/train.py", line 17, in <module>
[rank0]: main(sys.argv[1:])
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/examples/openclaw/train.py", line 12, in main
[rank0]: with PPOTrainer(config) as trainer:
[rank0]: ^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/trainer/rl_trainer.py", line 189, in __init__
[rank0]: self.actor.initialize(**engine_init_kwargs, role="actor")
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 252, in initialize
[rank0]: self._identify_dp_heads()
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 331, in _identify_dp_heads
[rank0]: self.workers_is_dp_head = run_async_task(_get_dp_head)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/concurrent.py", line 69, in run_async_task
[rank0]: return asyncio.run(func(*args, **kwargs))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/root/.local/share/uv/python/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
[rank0]: return runner.run(main)
[rank0]: ^^^^^^^^^^^^^^^^
[rank0]: File "/root/.local/share/uv/python/cpython-3.11.15-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
[rank0]: return self._loop.run_until_complete(task)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/controller/train_controller.py", line 329, in _get_dp_head
[rank0]: return await asyncio.gather(*tasks)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 1365, in async_call_engine
[rank0]: raise WorkerFailedError(
[rank0]: areal.infra.scheduler.exceptions.WorkerFailedError: Worker 'actor/0' failed with exit code 0
[rank0]: Stderr output:
[rank0]: (AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Training Parameters:
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: importance_sampling_level: token
[rank0]: (AReaL) 20260320-11:19:29.153 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: PPOActor Configuration
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Mode: Decoupled PPO (off-policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_behave (π_behave): FROM INFERENCE (behavior policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Proximal policy (π_prox): RECOMPUTED via forward pass (standard decoupled PPO)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: log_p_theta (π_θ): TRAINING FORWARD PASS (current policy)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: Importance weight cap: 5.0 (filters out tokens with extreme weights)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: Training Parameters:
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: importance_sampling_level: token
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: eps_clip: 0.4
[rank0]: (AReaL) 20260320-11:19:29.155 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.155 SyncRPCServer INFO: Engine 'actor/0' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: adv_norm: NormConfig(mean_level='batch', mean_leave1out=False, std_level='batch', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: reward_norm: NormConfig(mean_level='group', mean_leave1out=False, std_level='group', std_unbiased=True, eps=1e-05, group_size=1)
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: eps_clip: 0.4
[rank0]: (AReaL) 20260320-11:19:29.154 PPOActor INFO: ======================================================================
[rank0]: (AReaL) 20260320-11:19:29.154 SyncRPCServer INFO: Engine 'actor/1' (class: areal.engine.fsdp_engine.FSDPPPOActor) instantiated successfully
[rank0]: [Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[rank0]: [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[rank0]: (AReaL) 20260320-11:19:29.489 [FSDPEngine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
[rank0]: (AReaL) 20260320-11:19:29.491 [FSDPEngine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
[rank0]: (AReaL) 20260320-11:19:29.558 [FSDPEngine Rank 0] INFO: Data parallel head 0 and rank 0
[rank0]: (AReaL) 20260320-11:19:29.597 [FSDPEngine Rank 1] INFO: Data parallel head 1 and rank 1
[rank0]: (AReaL) 20260320-11:19:42.296 IOStruct INFO: Memory-Usage before model creation/loading: memory allocated (GB): 0.00, memory reserved (GB): 0.00, device memory used/total (GB): 1.22/79.32
[rank0]: Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
[rank0]: Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
[rank0]: Loading checkpoint shards: 33%|███▎ | 1/3 [00:02<00:05, 2.78s/it]
[rank0]: Loading checkpoint shards: 33%|███▎ | 1/3 [00:05<00:10, 5.02s/it]
[rank0]: Loading checkpoint shards: 67%|██████▋ | 2/3 [00:06<00:03, 3.17s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 1.81s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.14s/it]
[rank0]: (AReaL) 20260320-11:19:53.528 IOStruct INFO: Memory-Usage after model creation/loading: memory allocated (GB): 7.52, memory reserved (GB): 7.54, device memory used/total (GB): 8.76/79.32
[rank0]: (AReaL) 20260320-11:19:53.532 [FSDPEngine Rank 0] INFO: Model creation and loading time: 7.77s
[rank0]: (AReaL) 20260320-11:19:54.305 [FSDPEngine Rank 0] INFO: Applying FSDP2 with N-D parallelism for 0.77 seconds
[rank0]: (AReaL) 20260320-11:19:54.310 [FSDPEngine Rank 0] INFO: Create optimizer time: 0.004012462683022022
[rank0]: Loading checkpoint shards: 67%|██████▋ | 2/3 [00:11<00:05, 5.75s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.23s/it]
[rank0]: Loading checkpoint shards: 100%|██████████| 3/3 [00:11<00:00, 3.84s/it]
[rank0]: (AReaL) 20260320-11:19:58.793 [FSDPEngine Rank 1] INFO: Model creation and loading time: 13.05s
[rank0]: (AReaL) 20260320-11:19:59.737 [FSDPEngine Rank 1] INFO: Applying FSDP2 with N-D parallelism for 0.94 seconds
[rank0]: (AReaL) 20260320-11:19:59.742 [FSDPEngine Rank 1] INFO: Create optimizer time: 0.004047151654958725
(AReaL) 20260320-11:20:06.243 LocalScheduler INFO: Deleting 2 workers for role 'actor'
(AReaL) 20260320-11:20:06.245 LocalScheduler ERROR: Error cleaning up worker actor/0: 'psutil'
Traceback (most recent call last):
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 978, in _cleanup_workers
kill_process_tree(worker_info.process.pid, timeout=3, graceful=True)
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/proc.py", line 148, in kill_process_tree
parent = psutil.Process(parent_pid)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 314, in __init__
self._init(pid)
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 341, in _init
self._proc = _psplatform.Process(pid)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1628, in __init__
self._procfs_path = get_procfs_path()
^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_common.py", line 754, in get_procfs_path
return sys.modules['psutil'].PROCFS_PATH
^^^^^^^^^^^^^^^^^^^^^
KeyError: 'psutil'
(AReaL) 20260320-11:20:06.560 LocalScheduler ERROR: Error cleaning up worker actor/1: 'psutil'
Traceback (most recent call last):
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/scheduler/local.py", line 978, in _cleanup_workers
kill_process_tree(worker_info.process.pid, timeout=3, graceful=True)
File "/ai/shenwei/workspace/codes/AReaL/areal/infra/utils/proc.py", line 148, in kill_process_tree
parent = psutil.Process(parent_pid)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 314, in __init__
self._init(pid)
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/__init__.py", line 341, in _init
self._proc = _psplatform.Process(pid)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_pslinux.py", line 1628, in __init__
self._procfs_path = get_procfs_path()
^^^^^^^^^^^^^^^^^
File "/ai/shenwei/workspace/codes/AReaL/.venv/lib/python3.11/site-packages/ray/thirdparty_files/psutil/_common.py", line 754, in get_procfs_path
return sys.modules['psutil'].PROCFS_PATH
^^^^^^^^^^^^^^^^^^^^^
KeyError: 'psutil'
(AReaL) 20260320-11:20:06.561 LocalScheduler INFO: Successfully deleted workers for role 'actor'
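The cleanup traceback shows the psutil that gets imported is Ray's vendored copy (`ray/thirdparty_files/psutil`). That copy is not registered in `sys.modules` under the name `psutil`, so psutil's own `get_procfs_path()` call to `sys.modules['psutil'].PROCFS_PATH` raises `KeyError: 'psutil'`. A minimal defensive sketch of what a patched `kill_process_tree` in `areal/infra/utils/proc.py` could look like (the function name `safe_kill_process_tree` and the `sys.modules` registration are my assumptions, not existing AReaL code):

```python
import sys

import psutil

# Register the resolved psutil module under its canonical name so that
# psutil's internal sys.modules['psutil'] lookups succeed even when the
# import resolved to Ray's vendored copy.
sys.modules.setdefault("psutil", psutil)


def safe_kill_process_tree(pid: int, timeout: float = 3.0) -> None:
    """Terminate a process and its children, tolerating psutil failures."""
    try:
        parent = psutil.Process(pid)
    except (psutil.NoSuchProcess, KeyError):
        # KeyError covers the vendored-psutil case from the log above;
        # NoSuchProcess covers a worker that already exited.
        return
    children = parent.children(recursive=True)
    for proc in [*children, parent]:
        try:
            proc.terminate()  # graceful SIGTERM first
        except psutil.NoSuchProcess:
            pass
    # Wait briefly for the tree to exit; survivors could be SIGKILLed here.
    psutil.wait_procs([*children, parent], timeout=timeout)
```

Note this only suppresses the cleanup error; the workers themselves died earlier (`WorkerFailedError: Worker 'actor/0' failed with exit code 0`), so the underlying failure would still need separate investigation. Installing `psutil` directly into the environment (so imports stop resolving to Ray's vendored copy) may also avoid the `KeyError`.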