Two full Exo instances fail to establish libp2p connection (macOS Apple Silicon) #1919

@agent504330-ux

Description

When two full Exo instances are running on separate Apple Silicon Macs connected via Thunderbolt bridge (link-local network), they fail to establish a libp2p connection. However, simpler configurations using the same networking stack work correctly.

Environment

  • macOS (Darwin), Apple Silicon (M4 Pro + M4)
  • Exo commit: fd5b232
  • Python 3.13, libp2p with gossipsub, noise, yamux, pnet
  • Two machines connected via Thunderbolt bridge (169.254.x.x link-local)

Reproduction

Works ✅

  • Simple test ↔ Simple test: Two minimal scripts that create NetworkingHandle, subscribe to topics, and call recv() — connection established, messages exchanged.
  • Simple test ↔ Full Exo: One side runs a minimal test script, the other runs full exo — connection established.
  • Router + Election only ↔ Router + Election only: Using Router and Election without other components — connection established.

Fails ❌

  • Full Exo ↔ Full Exo: Both sides running the exo main entry point — the connection is never established, and no PyFromSwarm_Connection event is received on either side.

Observations

  1. The NetworkingHandle and libp2p swarm work correctly in isolation — the Rust networking layer successfully discovers peers, dials, and exchanges gossipsub messages in simpler configurations.

  2. When running full Exo on both sides, the swarm event loop (using tokio::select! between from_client.recv() and swarm.next()) appears not to yield connection events, despite using the same bootstrap peer configuration that works in the simpler tests.

  3. The issue may be related to how the swarm event loop handles concurrent subscriptions and message processing when many topics are registered simultaneously during startup. In the full Exo startup, 6 topics are subscribed sequentially, and multiple components (Router, EventRouter, Election, Master, Worker, DownloadCoordinator, API) are started concurrently.

  4. Both machines use identical Rust binaries (same .so file).

Suspected Root Cause

The tokio::select! loop in swarm.rs may experience starvation or event-processing issues when the full set of Exo components generates high channel traffic during startup. The from_client.recv() arm may be consuming commands (topic subscriptions, publish requests) at a rate that prevents the swarm.next() arm from processing connection-related swarm events.
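To make the suspected dynamic concrete, here is a minimal Python model of the two channels (not Exo's actual code — the names and the queue-as-deque representation are illustrative). A loop that always finishes draining the busy command side before polling the event side only observes the connection event after the entire startup burst; capping the commands drained per iteration is one possible fairness mitigation.

```python
from collections import deque

def starvation_prone_loop(commands, events):
    # Analog of a loop whose command arm wins while the channel is
    # busy: the connection event is only observed after every queued
    # command has drained.
    drained = 0
    while commands:
        commands.popleft()
        drained += 1
    return drained, events.popleft()

def fair_loop(commands, events, budget=8):
    # Mitigation sketch: drain at most `budget` commands per tick,
    # then always give the event side a chance to be observed.
    drained = 0
    while True:
        for _ in range(budget):
            if not commands:
                break
            commands.popleft()
            drained += 1
        if events:
            return drained, events.popleft()

def demo(loop_fn):
    commands = deque()  # stand-in for the from_client channel
    events = deque()    # stand-in for swarm events
    for i in range(1000):  # startup burst: subscribes + publishes
        commands.append(f"cmd-{i}")
    events.append("ConnectionEstablished")
    return loop_fn(commands, events)

print(demo(starvation_prone_loop))  # (1000, 'ConnectionEstablished')
print(demo(fair_loop))              # (8, 'ConnectionEstablished')
```

Note that tokio::select! randomizes branch polling by default, so a true lock-out would require something more than an always-ready arm (e.g. a `biased` select or a loop structured like the model above); the sketch only shows why event delivery could be delayed behind a command burst.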

Additional Context

An election race condition was also identified and fixed separately: when two connection events arrive in quick succession, the second event's campaign cancels the first, causing both nodes to elect themselves as master. This was fixed by skipping connection-triggered elections when a campaign is already in progress. However, this fix is irrelevant to the core connection failure described above, since connections are never established in the first place.
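The election fix described above can be sketched as a simple guard (hypothetical class and method names, not Exo's actual API): a connection-triggered election is skipped while a campaign is already in flight, so a second connection event arriving in quick succession can no longer cancel the first campaign.

```python
import asyncio

class ElectionSketch:
    """Illustrative model of the race-condition fix, not Exo's API."""

    def __init__(self):
        self.campaign = None
        self.campaigns_started = 0

    def on_connection_event(self):
        # The fix: don't restart (and thereby cancel) a live campaign.
        if self.campaign is not None and not self.campaign.done():
            return
        self.campaign = asyncio.ensure_future(self._run_campaign())

    async def _run_campaign(self):
        self.campaigns_started += 1
        await asyncio.sleep(0.01)  # stand-in for the election timeout

async def main():
    election = ElectionSketch()
    election.on_connection_event()
    election.on_connection_event()  # quick succession: guard skips it
    await election.campaign
    print(election.campaigns_started)  # 1

asyncio.run(main())
```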
