Open
Labels
bug: Something isn't working
Description
Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
Training fails at step 0 during compute_logp.
The error is [Errno 104] Connection reset by peer, raised while the actor is localizing RTensor inputs fetched from rollout workers over HTTP. The stack trace in actor.log shows that the failure occurs inside RTensor.localize(...), during session.get(url).
The issue does not reproduce in small-scale tests: small test data passes, but the failure appears in real multimodal training runs with large data payloads, especially when the batch contains multi-image samples.
Full logs
From main.log:
20260320-13:42:12.291 LocalScheduler INFO: Starting worker rollout/0: python3 -m areal.infra.rpc.rpc_server --port 48890 ...
20260320-13:42:12.394 LocalScheduler INFO: Worker rollout/0 started (PID: 1678125, GPUs: [7], ports: [48890, 55192])
20260320-13:56:18.566 RLTrainer ERROR: Training failed with exception: Failed to call method 'compute_logp' on worker 'actor/4': Internal server error: [Errno 104] Connection reset by peer
areal.infra.scheduler.exceptions.EngineCallError: Failed to call method 'compute_logp' on worker 'actor/4': Internal server error: [Errno 104] Connection reset by peer
From actor.log:
20260320-13:56:18.564 SyncRPCServer ERROR: Unexpected error in call: [Errno 104] Connection reset by peer
ConnectionResetError: [Errno 104] Connection reset by peer
args = RTensor.localize(raw_args)
async with session.get(url) as resp:
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer