Two full Exo instances fail to establish libp2p connection (macOS Apple Silicon) #1919

@agent504330-ux

Description

When two full Exo instances are running on separate Apple Silicon Macs connected via Thunderbolt bridge (link-local network), they fail to establish a libp2p connection. However, simpler configurations using the same networking stack work correctly.

Environment

  • macOS (Darwin), Apple Silicon (M4 Pro + M4)
  • Exo commit: fd5b232
  • Python 3.13, libp2p with gossipsub, noise, yamux, pnet
  • Two machines connected via Thunderbolt bridge (169.254.x.x link-local)

Reproduction

Works ✅

  • Simple test ↔ Simple test: Two minimal scripts that create NetworkingHandle, subscribe to topics, and call recv() — connection established, messages exchanged.
  • Simple test ↔ Full Exo: One side runs a minimal test script, the other runs full exo — connection established.
  • Router + Election only ↔ Router + Election only: Using Router and Election without other components — connection established.

Fails ❌

  • Full Exo ↔ Full Exo: Both sides running the exo main entry point — the connection is never established, and no PyFromSwarm_Connection event is received on either side.

Observations

  1. The NetworkingHandle and libp2p swarm work correctly in isolation — the Rust networking layer successfully discovers peers, dials, and exchanges gossipsub messages in simpler configurations.

  2. When running full Exo on both sides, the swarm event loop (using tokio::select! between from_client.recv() and swarm.next()) appears not to yield connection events, despite using the same bootstrap peer configuration that works in the simpler tests.

  3. The issue may be related to how the swarm event loop handles concurrent subscriptions and message processing when many topics are registered simultaneously during startup. In the full Exo startup, 6 topics are subscribed sequentially, and multiple components (Router, EventRouter, Election, Master, Worker, DownloadCoordinator, API) are started concurrently.

  4. Both machines use identical Rust binaries (same .so file).

Suspected Root Cause

The tokio::select! loop in swarm.rs may experience starvation or event-processing issues when the full set of Exo components generates high channel traffic during startup. The from_client.recv() arm may be consuming commands (topic subscriptions, publish requests) at a rate that prevents the swarm.next() arm from processing connection-related swarm events.
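To make the suspected dynamic concrete, here is a minimal Python model of the two channels (not Exo's actual code — the names and the queue-as-deque representation are illustrative). A loop that always finishes draining the busy command side before polling the event side only observes the connection event after the entire startup burst; capping the commands drained per iteration is one possible fairness mitigation.

```python
from collections import deque

def starvation_prone_loop(commands, events):
    # Analog of a loop whose command arm wins while the channel is
    # busy: the connection event is only observed after every queued
    # command has drained.
    drained = 0
    while commands:
        commands.popleft()
        drained += 1
    return drained, events.popleft()

def fair_loop(commands, events, budget=8):
    # Mitigation sketch: drain at most `budget` commands per tick,
    # then always give the event side a chance to be observed.
    drained = 0
    while True:
        for _ in range(budget):
            if not commands:
                break
            commands.popleft()
            drained += 1
        if events:
            return drained, events.popleft()

def demo(loop_fn):
    commands = deque()  # stand-in for the from_client channel
    events = deque()    # stand-in for swarm events
    for i in range(1000):  # startup burst: subscribes + publishes
        commands.append(f"cmd-{i}")
    events.append("ConnectionEstablished")
    return loop_fn(commands, events)

print(demo(starvation_prone_loop))  # (1000, 'ConnectionEstablished')
print(demo(fair_loop))              # (8, 'ConnectionEstablished')
```

Note that tokio::select! randomizes branch polling by default, so a true lock-out would require something more than an always-ready arm (e.g. a `biased` select or a loop structured like the model above); the sketch only shows why event delivery could be delayed behind a command burst.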

Additional Context

An election race condition was also identified and fixed separately: when two connection events arrive in quick succession, the second event's campaign cancels the first, causing both nodes to elect themselves as master. This was fixed by skipping connection-triggered elections when a campaign is already in progress. However, this fix is irrelevant to the core connection failure described above, since connections are never established in the first place.
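The election fix described above can be sketched as a simple guard (hypothetical class and method names, not Exo's actual API): a connection-triggered election is skipped while a campaign is already in flight, so a second connection event arriving in quick succession can no longer cancel the first campaign.

```python
import asyncio

class ElectionSketch:
    """Illustrative model of the race-condition fix, not Exo's API."""

    def __init__(self):
        self.campaign = None
        self.campaigns_started = 0

    def on_connection_event(self):
        # The fix: don't restart (and thereby cancel) a live campaign.
        if self.campaign is not None and not self.campaign.done():
            return
        self.campaign = asyncio.ensure_future(self._run_campaign())

    async def _run_campaign(self):
        self.campaigns_started += 1
        await asyncio.sleep(0.01)  # stand-in for the election timeout

async def main():
    election = ElectionSketch()
    election.on_connection_event()
    election.on_connection_event()  # quick succession: guard skips it
    await election.campaign
    print(election.campaigns_started)  # 1

asyncio.run(main())
```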
