@srogawski-nvidia
Contributor

No description provided.

@srogawski-nvidia srogawski-nvidia deleted the lint_code branch October 15, 2024 07:59
jbieniusiewi added a commit that referenced this pull request Oct 15, 2024
jbieniusiewi added a commit that referenced this pull request Oct 15, 2024
hexinw-nvidia added a commit to hexinw-nvidia/nvidia-resiliency-ext that referenced this pull request May 21, 2025
Example #1: Simulate a NIC link-down event on local rank 1 after 60 seconds.

ft_launcher --nproc_per_node=4 --max-restarts=3 \
    --ft-initial-rank-heartbeat-timeout=30 \
    --ft-rank-heartbeat-timeout=15 \
    --ft-simulate-failure-type=nic \
    --ft-simulate-failure-rank=1 \
    --ft-simulate-failure-time=60 \
    examples/fault_tolerance/basic_ft_example.py

Example #2: Simulate a GPU health check failure on local rank 1 after 60 seconds.

ft_launcher --nproc_per_node=4 --max-restarts=3 \
    --ft-initial-rank-heartbeat-timeout=30 \
    --ft-rank-heartbeat-timeout=15 \
    --ft-simulate-failure-type=gpu \
    --ft-simulate-failure-rank=1 \
    --ft-simulate-failure-time=60 \
    --ft-simulate-recovery-action=gpu_reset \
    examples/fault_tolerance/basic_ft_example.py
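
The two invocations above differ only in the failure type and the extra recovery-action flag for the GPU case. As a rough sketch, a small wrapper can print the command for review before it is run. The function name `build_ft_cmd` and its argument order are hypothetical and not part of `ft_launcher`; every flag is copied from the examples above, and whether the rank and time flags accept values other than those shown is an assumption.

```shell
# Hypothetical helper, not part of ft_launcher: prints the launcher command
# for a simulated failure so it can be inspected before running it.
# Usage: build_ft_cmd <nic|gpu> [local_rank] [seconds]
build_ft_cmd() {
    failure_type="$1"   # "nic" or "gpu", the two values used in the examples
    rank="${2:-1}"      # local rank that should fail (default taken from the examples)
    delay="${3:-60}"    # seconds before the failure is injected (default from the examples)

    cmd="ft_launcher --nproc_per_node=4 --max-restarts=3"
    cmd="$cmd --ft-initial-rank-heartbeat-timeout=30"
    cmd="$cmd --ft-rank-heartbeat-timeout=15"
    cmd="$cmd --ft-simulate-failure-type=$failure_type"
    cmd="$cmd --ft-simulate-failure-rank=$rank"
    cmd="$cmd --ft-simulate-failure-time=$delay"

    # Example #2 pairs the GPU failure with a gpu_reset recovery action.
    if [ "$failure_type" = "gpu" ]; then
        cmd="$cmd --ft-simulate-recovery-action=gpu_reset"
    fi

    cmd="$cmd examples/fault_tolerance/basic_ft_example.py"
    printf '%s\n' "$cmd"
}

# Print both variants without launching anything.
build_ft_cmd nic
build_ft_cmd gpu
```

Printing the command instead of executing it keeps the sketch side-effect free; the output can be pasted into a shell once it looks right.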