[BUG] Connection reset by peer during RTensor fetch in compute_logp with large multimodal batches #1071

@Wangxiaoxiaoa

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

Training fails at step 0 during compute_logp.

The actor raises Connection reset by peer ([Errno 104]) while localizing RTensor inputs that it fetches from rollout workers over HTTP. The stack trace in actor.log places the failure inside RTensor.localize(...), at the session.get(url) call.

Small-scale tests pass. The failure appears only in real multimodal training runs with large data payloads, especially when the batch contains multi-image samples.

Full logs

From main.log:

20260320-13:42:12.291 LocalScheduler INFO: Starting worker rollout/0: python3 -m areal.infra.rpc.rpc_server --port 48890 ...
20260320-13:42:12.394 LocalScheduler INFO: Worker rollout/0 started (PID: 1678125, GPUs: [7], ports: [48890, 55192])

20260320-13:56:18.566 RLTrainer ERROR: Training failed with exception: Failed to call method 'compute_logp' on worker 'actor/4': Internal server error: [Errno 104] Connection reset by peer
areal.infra.scheduler.exceptions.EngineCallError: Failed to call method 'compute_logp' on worker 'actor/4': Internal server error: [Errno 104] Connection reset by peer

From actor.log:

20260320-13:56:18.564 SyncRPCServer ERROR: Unexpected error in call: [Errno 104] Connection reset by peer
ConnectionResetError: [Errno 104] Connection reset by peer
    args = RTensor.localize(raw_args)
    async with session.get(url) as resp:
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
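
The trace above fails inside an HTTP fetch, so one way to check whether the reset is transient is to retry the fetch with exponential backoff. The following is only a minimal sketch under stated assumptions: `fetch_with_retry`, `attempts`, and `base_delay` are hypothetical names for illustration and are not part of AReaL, and the sketch does not address whatever causes the rollout worker to drop the connection. It catches `OSError`, which covers both `ConnectionResetError` and `aiohttp.ClientOSError` (the latter subclasses `OSError`).

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def fetch_with_retry(
    fetch: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await fetch(), retrying on OSError with exponential backoff.

    Hypothetical helper, not part of AReaL. `fetch` is any zero-argument
    async callable that performs one complete request and returns its result.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await fetch()
        except OSError:
            # ConnectionResetError and aiohttp.ClientOSError both land here.
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With aiohttp, `fetch` could be a closure that opens `session.get(url)` and returns `await resp.read()`, so that each retry issues a fresh request. If the error disappears with retries, the reset is likely transient (for example a timeout or a saturated server); if every attempt fails on the same payload, a size limit or a server-side crash is a more likely suspect.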

Labels: bug
