## Description
When two full Exo instances are running on separate Apple Silicon Macs connected via Thunderbolt bridge (link-local network), they fail to establish a libp2p connection. However, simpler configurations using the same networking stack work correctly.
## Environment
- macOS (Darwin), Apple Silicon (M4 Pro + M4)
- Exo commit: fd5b232
- Python 3.13, libp2p with gossipsub, noise, yamux, pnet
- Two machines connected via Thunderbolt bridge (169.254.x.x link-local)
## Reproduction
### Works ✅
- Simple test ↔ Simple test: Two minimal scripts that create a `NetworkingHandle`, subscribe to topics, and call `recv()` — connection established, messages exchanged.
- Simple test ↔ Full Exo: One side runs a minimal test script, the other runs full `exo` — connection established.
- Router + Election only ↔ Router + Election only: Using `Router` and `Election` without other components — connection established.
### Fails ❌
- Full Exo ↔ Full Exo: Both sides running the `exo` main entry point — connection never established. No `PyFromSwarm_Connection` event is received on either side.
## Observations
- The `NetworkingHandle` and libp2p swarm work correctly in isolation — the Rust networking layer successfully discovers peers, dials, and exchanges gossipsub messages in simpler configurations.
- When running full Exo on both sides, the swarm event loop (using `tokio::select!` between `from_client.recv()` and `swarm.next()`) appears not to yield connection events, despite the same bootstrap peer configuration that works in simpler tests.
- The issue may be related to how the swarm event loop handles concurrent subscriptions and message processing when many topics are registered simultaneously during startup. In the full Exo startup, 6 topics are subscribed sequentially, and multiple components (`Router`, `EventRouter`, `Election`, `Master`, `Worker`, `DownloadCoordinator`, `API`) are started concurrently.
- Both machines use identical Rust binaries (the same `.so` file).
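The suspected interleaving can be modeled outside Rust. The following stdlib-only Python sketch is an assumption about the loop's shape, not the actual `swarm.rs` code: it handles one item per iteration and prefers the command arm whenever a command is ready, so a startup burst of commands pushes the first connection event behind the entire backlog.

```python
import asyncio

async def swarm_loop(commands: asyncio.Queue, swarm_events: asyncio.Queue,
                     processed: list) -> None:
    # Hypothetical model of a two-arm event loop: one item per iteration,
    # preferring the client-command arm when both arms are ready.
    while True:
        if not commands.empty():
            item = await commands.get()
        elif not swarm_events.empty():
            item = await swarm_events.get()
        else:
            return  # both drained; the real loop would park here instead
        processed.append(item)

async def main() -> list:
    commands: asyncio.Queue = asyncio.Queue()
    swarm_events: asyncio.Queue = asyncio.Queue()
    # Startup burst: many subscribe/publish commands are queued before the
    # first connection event is ever examined.
    for i in range(100):
        commands.put_nowait(("cmd", i))
    swarm_events.put_nowait(("connection", 0))
    processed: list = []
    await swarm_loop(commands, swarm_events, processed)
    return processed

processed = asyncio.run(main())
print(processed.index(("connection", 0)))  # → 100
```

Under this (assumed) prioritization, the connection event is only handled after all 100 queued commands, which matches the symptom of connection events never surfacing while startup traffic is heavy.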
## Suspected Root Cause
The `tokio::select!` loop in `swarm.rs` may experience starvation or event-processing issues when the full set of Exo components generates high channel traffic during startup. The `from_client.recv()` arm may be consuming events (topic subscriptions, publish commands) at a rate that prevents the `swarm.next()` arm from processing connection-related swarm events.
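One way to probe this hypothesis is to bound how many client commands are drained before the swarm arm is polled again. A minimal Python model of such batched draining (hypothetical names and structure, not the real Exo loop):

```python
import asyncio

async def fair_swarm_loop(commands: asyncio.Queue, swarm_events: asyncio.Queue,
                          processed: list, batch_limit: int = 8) -> None:
    # Sketch of a mitigation: cap how many client commands are drained per
    # iteration, so the swarm arm is polled at least once per batch.
    while not (commands.empty() and swarm_events.empty()):
        drained = 0
        while drained < batch_limit and not commands.empty():
            processed.append(commands.get_nowait())
            drained += 1
        if not swarm_events.empty():
            processed.append(swarm_events.get_nowait())

async def main() -> list:
    commands: asyncio.Queue = asyncio.Queue()
    swarm_events: asyncio.Queue = asyncio.Queue()
    for i in range(100):
        commands.put_nowait(("cmd", i))
    swarm_events.put_nowait(("connection", 0))
    processed: list = []
    await fair_swarm_loop(commands, swarm_events, processed)
    return processed

processed = asyncio.run(main())
print(processed.index(("connection", 0)))  # → 8
```

With the same 100-command startup burst as before, the connection event is now handled after one batch rather than after the whole backlog. In the real Rust loop, an equivalent effect could come from bounding the command channel or restructuring the select arms; this sketch only demonstrates the scheduling idea.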
## Additional Context
An election race condition was also identified and fixed separately: when two connection events arrive in quick succession, the second event's campaign cancels the first, causing both nodes to elect themselves as master. This was fixed by skipping connection-triggered elections when a campaign is already in progress. However, that fix is orthogonal to the connection failure described above, since connections are never established in the first place.
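The election fix described above amounts to a simple guard. The following sketch is illustrative Python; names like `campaign_in_progress` are assumptions, not the actual Exo code:

```python
class Election:
    # Hypothetical sketch of the race-condition fix: a second connection
    # event arriving mid-campaign must not cancel and restart the campaign,
    # or both nodes can end up electing themselves master.
    def __init__(self) -> None:
        self.campaign_in_progress = False
        self.campaigns_started = 0

    def on_connection(self, peer_id: str) -> None:
        if self.campaign_in_progress:
            # Skip connection-triggered elections while already campaigning.
            return
        self.campaign_in_progress = True
        self.campaigns_started += 1

election = Election()
election.on_connection("peer-a")  # starts a campaign
election.on_connection("peer-b")  # arrives mid-campaign; ignored
print(election.campaigns_started)  # → 1
```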