@srogawski-nvidia
Contributor

No description provided.

@srogawski-nvidia srogawski-nvidia merged commit 6079124 into main Oct 15, 2024
1 check passed
@srogawski-nvidia srogawski-nvidia deleted the add-build-docs branch October 15, 2024 07:46
hexinw-nvidia added a commit to hexinw-nvidia/nvidia-resiliency-ext that referenced this pull request May 21, 2025
Example #1: Simulate NIC link down on local rank 1 after 60 seconds.

ft_launcher --nproc_per_node=4 --max-restarts=3 \
    --ft-initial-rank-heartbeat-timeout=30 \
    --ft-rank-heartbeat-timeout=15 \
    --ft-simulate-failure-type=nic \
    --ft-simulate-failure-rank=1 \
    --ft-simulate-failure-time=60 \
    examples/fault_tolerance/basic_ft_example.py

Example #2: Simulate GPU healthcheck failure on local rank 1 after 60 seconds.

ft_launcher --nproc_per_node=4 --max-restarts=3 \
    --ft-initial-rank-heartbeat-timeout=30 \
    --ft-rank-heartbeat-timeout=15 \
    --ft-simulate-failure-type=gpu \
    --ft-simulate-failure-rank=1 \
    --ft-simulate-failure-time=60 \
    --ft-simulate-recovery-action=gpu_reset \
    examples/fault_tolerance/basic_ft_example.py

2 participants