Use heartbeat to invalidate stale entries #63

Closed
@Jackmin801

Description

Right now a lighthouse cannot be reused once a worker has died: it keeps that worker as a member indefinitely while trying to form a quorum.

Method to reproduce issue

  1. Start lighthouse
RUST_BACKTRACE=1 uv run torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 100
  1. Start a worker, then kill it
    minimal_join.py
from datetime import timedelta

from torchft import Manager, ProcessGroupGloo

pg = ProcessGroupGloo()
manager = Manager(
    pg=pg,
    min_replica_size=2,
    # No real model state is needed to reproduce the issue.
    load_state_dict=lambda x: None,
    state_dict=lambda: {},
    replica_id="test",
    timeout=timedelta(seconds=30),
    use_async_quorum=False,
)

# Keep joining quorums until 20 steps have been committed.
while True:
    manager.start_quorum("start", allow_heal=False)
    manager.should_commit()
    if manager.current_step() > 20:
        break
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py

Then press Ctrl+C to terminate the joining process.

  1. The lighthouse will think there is 1 participant waiting for quorum indefinitely
torchft::lighthouse: 2025-01-09T19:50:55.422+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.523+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.625+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.725+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.826+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.927+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:56.029+00:00 - INFO start: No quorum, only have 1 participants, need 2
  1. A subsequent join will get a stale quorum
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py
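
As a rough illustration of the fix suggested in the title, here is a minimal sketch of heartbeat-based invalidation. This is not the lighthouse's actual code (the lighthouse is a separate Rust binary); `HeartbeatRegistry`, `heartbeat_timeout_s`, and every method name below are hypothetical and only show the idea: each participant refreshes a timestamp, and entries that have not been refreshed within the timeout are dropped before quorum is computed.

```python
import time
from typing import Callable, Dict, List


class HeartbeatRegistry:
    """Hypothetical participant table that drops members with stale heartbeats."""

    def __init__(
        self,
        heartbeat_timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = heartbeat_timeout_s
        self._clock = clock  # injectable so the expiry logic can be tested
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, replica_id: str) -> None:
        # Record (or refresh) the time this replica was last heard from.
        self._last_seen[replica_id] = self._clock()

    def live_participants(self) -> List[str]:
        # Evict every replica whose last heartbeat is older than the timeout,
        # then return the remaining replica ids in sorted order.
        now = self._clock()
        stale = [r for r, t in self._last_seen.items() if now - t > self._timeout]
        for replica_id in stale:
            del self._last_seen[replica_id]
        return sorted(self._last_seen)

    def has_quorum(self, min_replicas: int) -> bool:
        # Only replicas with a fresh heartbeat count towards quorum.
        return len(self.live_participants()) >= min_replicas
```

With a scheme like this, the worker killed in step 2 would stop sending heartbeats and fall out of the participant list after the timeout, so a later join would not be matched against its stale entry.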
