Closed
Description
Right now you cannot reuse a lighthouse after a worker dies, because it keeps participants indefinitely while trying to form a quorum.
Method to reproduce issue
- Start lighthouse
RUST_BACKTRACE=1 uv run torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 100
- Start a worker then kill it
minimal_join.py
from datetime import timedelta

from torchft import Manager, ProcessGroupGloo

pg = ProcessGroupGloo()
manager = Manager(
    pg=pg,
    min_replica_size=2,
    load_state_dict=lambda x: None,
    state_dict=lambda: {},
    replica_id="test",
    timeout=timedelta(seconds=30),
    use_async_quorum=False,
)

# Keep requesting a quorum until the step counter passes 20.
while True:
    manager.start_quorum("start", allow_heal=False)
    manager.should_commit()
    if manager.current_step() > 20:
        break
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py
Then press Ctrl+C to terminate the joining process.
- The lighthouse will think there is 1 participant waiting for quorum indefinitely
torchft::lighthouse: 2025-01-09T19:50:55.422+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.523+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.625+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.725+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.826+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.927+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:56.029+00:00 - INFO start: No quorum, only have 1 participants, need 2
- A subsequent join will get a stale quorum, because the killed worker is still counted as a participant
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py
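
The failure mode in the steps above can be illustrated with a small toy model. This is only a sketch and not torchft's lighthouse code: `ToyLighthouse`, `expiry_s`, `join`, `tick`, and the replica names are all hypothetical, invented for this illustration. It assumes the coordinator stores a last-seen time per participant, and it compares never expiring entries (the behavior reported here) with dropping entries that have not been seen recently, which is one possible mitigation.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyLighthouse:
    """Toy quorum coordinator for illustration; NOT torchft's implementation."""

    min_replicas: int
    # Hypothetical knob: drop participants not seen for this many seconds.
    # None models the reported behavior of keeping members indefinitely.
    expiry_s: Optional[float] = None
    # replica_id -> time the replica was last seen
    participants: dict[str, float] = field(default_factory=dict)

    def join(self, replica_id: str, now: float) -> None:
        """Record that a replica asked to join at time `now`."""
        self.participants[replica_id] = now

    def tick(self, now: float) -> Optional[list[str]]:
        """Return the quorum members if enough are present, else None."""
        if self.expiry_s is not None:
            # Forget participants that have been silent for too long.
            self.participants = {
                rid: seen
                for rid, seen in self.participants.items()
                if now - seen <= self.expiry_s
            }
        if len(self.participants) >= self.min_replicas:
            return sorted(self.participants)
        return None


def simulate(expiry_s: Optional[float]) -> Optional[list[str]]:
    lh = ToyLighthouse(min_replicas=2, expiry_s=expiry_s)
    lh.join("dead_worker", now=0.0)   # joins, then is killed
    assert lh.tick(now=1.0) is None   # only 1 participant, need 2
    lh.join("new_worker", now=60.0)   # a fresh worker joins much later
    return lh.tick(now=60.0)


stale = simulate(expiry_s=None)  # no expiry: dead worker is still counted
fresh = simulate(expiry_s=5.0)   # with expiry: dead worker was dropped
print(stale)  # ['dead_worker', 'new_worker']
print(fresh)  # None
```

Without expiry, the late joiner is handed a quorum that still contains the killed worker. With expiry, the toy coordinator correctly reports that there is still no quorum.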