ycchenzheng commented Sep 30, 2025

Description

Update disruption manager to work with the refactored recipes

FIXES: b/441333068

Tests

python3 -m benchmarks.recipes.pw_suspend_resume
The SIGTERM disruption is triggered at the target step (step 3 in this run):

Workload 'chz-pw-llama3--2-0xt', Pod 'chz-pw-llama3--2-0xt-pathways-head-0-0-tnb77': completed step: 3, seconds: 8.813, TFLOP/s/device: 95.687, Tokens/s/device: 1859.064, total_weights: 1048576, loss: 11.411
Workload 'chz-pw-llama3--2-0xt', Pod 'chz-pw-llama3--2-0xt-pathways-head-0-0-tnb77': STEP trigger reached! Detected step: 3, Trigger Value: 3.
🔥🔥🔥 Trigger detected for workload: chz-pw-llama3--2-0xt, triggering DisruptionMethod.SIGTERM 🔥🔥🔥
🔥🔥🔥 Beginning SIGTERM for workload: chz-pw-llama3--2-0xt with pod regex: chz-pw-llama3--2-0xt.*worker-0-0.* 🔥🔥🔥
Workload 'chz-pw-llama3--2-0xt': Getting pod name matching 'chz-pw-llama3--2-0xt.*worker-0-0.*'...
Workload 'chz-pw-llama3--2-0xt': Found pod: chz-pw-llama3--2-0xt-worker-0-0-clmzf
🔥🔥🔥 Executing command in pod: kubectl exec -it chz-pw-llama3--2-0xt-worker-0-0-clmzf -c pathways-worker -- /bin/sh -c "kill -s SIGTERM 1" 🔥🔥🔥
Executing command: kubectl exec -it chz-pw-llama3--2-0xt-worker-0-0-clmzf -c pathways-worker -- /bin/sh -c "kill -s SIGTERM 1"
✅ Successfully executed command: kubectl exec -it chz-pw-llama3--2-0xt-worker-0-0-clmzf -c pathways-worker -- /bin/sh -c "kill -s SIGTERM 1"
All disruptions completed.
Benchmark recipe disruptions completed. Please check logs for results.
Suspend/Resume disruptions completed. Please check logs for results.
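
The step trigger shown in this log can be sketched as follows. This is a minimal illustration, not the disruption manager's actual code: `STEP_PATTERN`, `parse_step`, and `step_trigger_reached` are hypothetical names, the pattern is modeled on the "completed step: N" text in the log lines above, and the `>=` comparison is an assumption.

```python
import re
from typing import Optional

# Hypothetical pattern, modeled on the "completed step: N" text in the log above.
STEP_PATTERN = re.compile(r"completed step: (\d+)")


def parse_step(log_line: str) -> Optional[int]:
    """Return the step number reported in a log line, or None if there is none."""
    match = STEP_PATTERN.search(log_line)
    return int(match.group(1)) if match else None


def step_trigger_reached(log_line: str, trigger_step: int) -> bool:
    """Return True once the reported step is at or past the trigger value.

    Using >= (rather than ==) is an assumption; it avoids missing the trigger
    if a log line for the exact step is skipped.
    """
    step = parse_step(log_line)
    return step is not None and step >= trigger_step


line = "completed step: 3, seconds: 8.813, TFLOP/s/device: 95.687"
print(step_trigger_reached(line, trigger_step=3))  # True
```
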
python3 -m benchmarks.recipes.pw_elastic_training_recipe

The SIGILL disruption is triggered once the time trigger elapses (after 120 seconds in this run):

Starting disruption monitoring! 🔥🩺
😴 Using TimeMonitor for workload: chz-pw-llama3--2-7dm, sleeping for 120 seconds 😴.
🔥🩺 Started monitoring thread for workload: chz-pw-llama3--2-7dm
😴 Using TimeMonitor for workload: chz-pw-llama3--2-7dm, sleeping for 600 seconds 😴.
🔥🩺 Started monitoring thread for workload: chz-pw-llama3--2-7dm
😴 Using TimeMonitor for workload: chz-llama3-1-8-2-100119-oip, sleeping for 120 seconds 😴.
🔥🩺 Started monitoring thread for workload: chz-llama3-1-8-2-100119-oip
😴 Using TimeMonitor for workload: chz-llama3-1-8-2-100119-oip, sleeping for 600 seconds 😴.
🔥🩺 Started monitoring thread for workload: chz-llama3-1-8-2-100119-oip
😳 Time trigger reached after 120 seconds 😳
🔥🔥🔥 Trigger detected for workload: chz-llama3-1-8-2-100119-oip, triggering DisruptionMethod.SIGILL 🔥🔥🔥
🔥🔥🔥 Beginning SIGILL for workload: chz-llama3-1-8-2-100119-oip with pod regex: chz-llama3-1-8-2-100119-oip.*slice-job-0-0.* 🔥🔥🔥
Workload 'chz-llama3-1-8-2-100119-oip': Getting pod name matching 'chz-llama3-1-8-2-100119-oip.*slice-job-0-0.*'...
😳 Time trigger reached after 120 seconds 😳
🔥🔥🔥 Trigger detected for workload: chz-pw-llama3--2-7dm, triggering DisruptionMethod.SIGILL 🔥🔥🔥
🔥🔥🔥 Beginning SIGILL for workload: chz-pw-llama3--2-7dm with pod regex: chz-pw-llama3--2-7dm.*worker-0-0.* 🔥🔥🔥
Workload 'chz-pw-llama3--2-7dm': Getting pod name matching 'chz-pw-llama3--2-7dm.*worker-0-0.*'...
Workload 'chz-llama3-1-8-2-100119-oip': Error getting pod information: Command 'kubectl get pods -o=custom-columns=NAME:.metadata.name --no-headers | grep -E 'chz-llama3-1-8-2-100119-oip.*slice-job-0-0.*'' returned non-zero exit status 1.
Workload 'chz-pw-llama3--2-7dm': Found pod: chz-pw-llama3--2-7dm-worker-0-0-ssjc8
🔥🔥🔥 Executing command in pod: kubectl exec -it chz-pw-llama3--2-7dm-worker-0-0-ssjc8 -c pathways-worker -- /bin/sh -c "kill -s SIGILL 1" 🔥🔥🔥
Executing command: kubectl exec -it chz-pw-llama3--2-7dm-worker-0-0-ssjc8 -c pathways-worker -- /bin/sh -c "kill -s SIGILL 1"
✅ Successfully executed command: kubectl exec -it chz-pw-llama3--2-7dm-worker-0-0-ssjc8 -c pathways-worker -- /bin/sh -c "kill -s SIGILL 1"
...
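
The "non-zero exit status 1" error in the log above is what `grep` returns when no line matches, so the pipeline fails whenever no pod name fits the regex. The log alone does not show why no pod matched. One possible way to handle the no-match case without an exception is to filter the pod names in Python. The sketch below is an assumption-laden illustration: `find_pod` is a hypothetical helper, and the pod list is a hard-coded sample rather than real `kubectl get pods` output.

```python
import re
from typing import Iterable, Optional


def find_pod(pod_names: Iterable[str], pod_regex: str) -> Optional[str]:
    """Return the first pod name matching pod_regex, or None when nothing matches.

    re.search is used because `grep -E` matches anywhere in the line.
    """
    pattern = re.compile(pod_regex)
    for name in pod_names:
        if pattern.search(name):
            return name
    return None


# Sample input: the one pod name that appears in the log above.
pods = ["chz-pw-llama3--2-7dm-worker-0-0-ssjc8"]

found = find_pod(pods, r"chz-pw-llama3--2-7dm.*worker-0-0.*")
missing = find_pod(pods, r"chz-llama3-1-8-2-100119-oip.*slice-job-0-0.*")
print(found)    # chz-pw-llama3--2-7dm-worker-0-0-ssjc8
print(missing)  # None
```

Returning `None` lets the caller decide whether to retry, skip the disruption, or report an error.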
python3 -m benchmarks.recipes.pw_mcjax_benchmark_recipe

This recipe keeps its previous behavior:

Waiting for `chz-pw-llama3--2-hvz`, for 17 seconds
Waiting for `chz-pw-llama3--2-hvz`, for 18 seconds
[XPK] Check statistics and outlier mode of GKE metrics here: https://console.cloud.google.com/monitoring/dashboards/builder/5a485748-c2be-435a-9c5d-f117b464dbe1?project=cloud-tpu-multipod-dev&f.rlabel.cluster_name.ClusterName=pw-scale-test-v5e-32. To view the metric data for your workload, select chz-pw-llama3--2-hvz from the JobName filter on the dashboard.
[XPK] Follow your Pathways workload and other resources here : https://console.cloud.google.com/logs/query;query=resource.type%3D"k8s_container"%0Aresource.labels.project_id%3D"cloud-tpu-multipod-dev"%0Aresource.labels.location%3D"us-south1"%0Aresource.labels.cluster_name%3D"pw-scale-test-v5e-32"%0Aresource.labels.pod_name:"chz-pw-llama3--2-hvz-"%0Aseverity>%3DDEFAULT
[XPK] Exiting XPK cleanly
Task: `chz-pw-llama3--2-hvz` terminated with code `0`
Benchmark recipe completed. Please check logs for results.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

ycchenzheng force-pushed the chzheng/disruption_manager branch from 782a480 to 97604b2 on September 30, 2025 23:47
ycchenzheng self-assigned this Sep 30, 2025
ycchenzheng force-pushed the chzheng/disruption_manager branch 6 times, most recently from 58e629c to 9c3df71 on October 1, 2025 20:15
ycchenzheng marked this pull request as ready for review October 1, 2025 20:35
ycchenzheng force-pushed the chzheng/disruption_manager branch 2 times, most recently from cacca03 to e57aae1 on October 2, 2025 16:55

SujeethJinesh left a comment

Thank you! Just some minor comments

ycchenzheng force-pushed the chzheng/disruption_manager branch 3 times, most recently from e2bc864 to a9e1b99 on October 13, 2025 20:34
ycchenzheng force-pushed the chzheng/disruption_manager branch 2 times, most recently from 54203cd to 6363e4f on October 14, 2025 17:19

SujeethJinesh left a comment

Thank you!

ycchenzheng force-pushed the chzheng/disruption_manager branch from 6363e4f to b65dfb7 on November 6, 2025 00:01
Comment on lines +28 to +30
class Framework(Enum):
    PATHWAYS = "pathways"
    MCJAX = "mcjax"

Nit: I think CONTROLLER_TYPE or similar would be a better name here
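
For context, an enum like the one under review could be used to pick the pod regex per framework. The sketch below is hypothetical: `_POD_SUFFIX` and `pod_regex` are illustrative names, and the suffix mapping is only inferred from the pod regexes in the test logs (`worker-0-0` for the Pathways workload, `slice-job-0-0` for the other one), not taken from the PR's code. The enum keeps the name from the diff; the naming question raised in this comment is left open.

```python
from enum import Enum


class Framework(Enum):
    PATHWAYS = "pathways"
    MCJAX = "mcjax"


# Assumed mapping, inferred from the pod regexes that appear in the test logs.
_POD_SUFFIX = {
    Framework.PATHWAYS: "worker-0-0",
    Framework.MCJAX: "slice-job-0-0",
}


def pod_regex(workload_name: str, framework: Framework) -> str:
    """Build a pod-name regex for the first worker of a workload."""
    return f"{workload_name}.*{_POD_SUFFIX[framework]}.*"


print(pod_regex("chz-pw-llama3--2-0xt", Framework.PATHWAYS))
# chz-pw-llama3--2-0xt.*worker-0-0.*
```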


4 participants