Use heartbeat to invalidate stale entries #63

Closed
@Jackmin801

Description

Right now you cannot reuse a lighthouse after a worker dies, because the lighthouse keeps members in its participant list indefinitely while trying to form a quorum.

Method to reproduce issue

  1. Start lighthouse
RUST_BACKTRACE=1 uv run torchft_lighthouse --min_replicas 2 --quorum_tick_ms 100 --join_timeout_ms 100
  2. Start a worker, then kill it
    minimal_join.py
from datetime import timedelta

from torchft import Manager, ProcessGroupGloo

pg = ProcessGroupGloo()
manager = Manager(
	pg=pg,
	min_replica_size=2,
	# No model state to save or restore in this minimal repro.
	load_state_dict=lambda x: None,
	state_dict=lambda: {},
	replica_id="test",
	timeout=timedelta(seconds=30),
	use_async_quorum=False,
)

# Request a quorum and try to commit on every step; stop once past step 20.
while True:
	manager.start_quorum("start", allow_heal=False)
	manager.should_commit()
	if manager.current_step() > 20:
		break
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py

Then press Ctrl+C to terminate the joining process.

  3. The lighthouse will keep reporting 1 participant waiting for quorum indefinitely, even though the worker is gone
torchft::lighthouse: 2025-01-09T19:50:55.422+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.523+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.625+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.725+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.826+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:55.927+00:00 - INFO start: No quorum, only have 1 participants, need 2
torchft::lighthouse: 2025-01-09T19:50:56.029+00:00 - INFO start: No quorum, only have 1 participants, need 2
  4. A subsequent join will get a stale quorum
TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nproc-per-node 1 minimal_join.py
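
For illustration, the heartbeat idea from the title could look roughly like the sketch below. This is not the lighthouse's actual code: `ParticipantTable`, `heartbeat`, `live_participants`, `has_quorum`, and the 5 second timeout are all hypothetical names and values chosen for the example. The point is only that an entry whose last heartbeat is older than the timeout gets dropped before quorum is evaluated, so a killed worker stops counting as a participant.

```python
from __future__ import annotations

import time


class ParticipantTable:
    """Hypothetical stand-in for the lighthouse's participant bookkeeping."""

    def __init__(self, heartbeat_timeout_s: float = 5.0) -> None:
        # Assumed timeout for the example, not a torchft default.
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self._last_seen: dict[str, float] = {}

    def heartbeat(self, replica_id: str, now: float | None = None) -> None:
        # Record when this replica was last heard from.
        self._last_seen[replica_id] = time.monotonic() if now is None else now

    def live_participants(self, now: float | None = None) -> list[str]:
        # Drop entries whose last heartbeat is older than the timeout,
        # then return the remaining replica ids in sorted order.
        now = time.monotonic() if now is None else now
        self._last_seen = {
            rid: ts
            for rid, ts in self._last_seen.items()
            if now - ts <= self.heartbeat_timeout_s
        }
        return sorted(self._last_seen)

    def has_quorum(self, min_replicas: int, now: float | None = None) -> bool:
        # Only replicas with a recent heartbeat count toward quorum.
        return len(self.live_participants(now)) >= min_replicas


# Timestamps are passed explicitly so the example is deterministic.
table = ParticipantTable(heartbeat_timeout_s=5.0)
table.heartbeat("test", now=0.0)
print(table.live_participants(now=1.0))   # ['test']
print(table.live_participants(now=10.0))  # []
```

With something like this in place, the killed worker from step 2 would age out after the timeout instead of being handed to the next joiner as part of a stale quorum.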
