Add DAG for verifying TensorCore utilization metrics with jobset monitoring #1119
Description
This PR introduces a new DAG, jobset_uptime_validation, which automates the functional testing and verification of JobSet uptime metrics on GKE TPU clusters.
Workload Execution
JobSet Deployment: Deploys a JobSet with 4 completions and 1 replica, using the tpu-info:v0.5.1 image to run a JAX benchmark.
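As a rough illustration of the deployed workload, the manifest below sketches a JobSet with 1 replicated job and 4 completions as a Python dict. The field layout follows the jobset.x-k8s.io/v1alpha2 API; the container command and names are placeholders, not the DAG's actual configuration.

```python
def build_jobset_manifest(name: str = "jobset-uptime-validation") -> dict:
    """Sketch of a JobSet with 1 replica and 4 completions, per the PR description."""
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            "replicatedJobs": [
                {
                    "name": "workload",
                    "replicas": 1,  # 1 replica, as stated above
                    "template": {
                        "spec": {
                            "completions": 4,  # 4 completions, as stated above
                            "parallelism": 4,
                            "template": {
                                "spec": {
                                    "restartPolicy": "Never",
                                    "containers": [
                                        {
                                            "name": "benchmark",
                                            # Image from the PR; the command is a placeholder.
                                            "image": "tpu-info:v0.5.1",
                                            "command": ["python3", "-m", "benchmark"],
                                        }
                                    ],
                                }
                            },
                        }
                    },
                }
            ]
        },
    }
```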
Metric Validation Logic
The validation is split into two phases:
Positive Test (Wait for Uptime)
Objective: Confirms that the uptime metric is successfully exported to Cloud Monitoring after the JobSet is applied.
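The positive phase can be sketched as a polling loop that waits for the metric to show up. The helper below is illustrative only: the `query` callable is injected so the logic stays testable, and in the real DAG it would wrap a Cloud Monitoring time-series query (an assumption, not the DAG's actual API).

```python
import time
from typing import Callable, Sequence


def wait_for_uptime_metric(
    query: Callable[[], Sequence[float]],
    timeout_s: float = 600.0,
    poll_interval_s: float = 1.0,
) -> Sequence[float]:
    """Poll `query` until it returns a non-empty batch of uptime points.

    Raises TimeoutError if the metric never appears within `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        points = query()
        if points:  # the uptime metric has been exported to Cloud Monitoring
            return points
        time.sleep(poll_interval_s)
    raise TimeoutError("uptime metric never appeared in Cloud Monitoring")
```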
Negative Test (Verify No Data)
Objective: Ensures that no "ghost data" or incorrect historical values persist during stability checks.
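One way to express the negative check is as an assertion that no sampled data points fall outside the window in which the JobSet was actually running. This is a hedged sketch with hypothetical names, not the DAG's actual implementation.

```python
from typing import Sequence, Tuple


def assert_no_ghost_data(
    points: Sequence[Tuple[float, float]],
    window_start: float,
    window_end: float,
) -> None:
    """Fail if any (timestamp, value) sample lies outside the JobSet's lifetime.

    Samples outside [window_start, window_end] are the "ghost data" the
    negative test guards against.
    """
    ghosts = [(ts, v) for ts, v in points if ts < window_start or ts > window_end]
    if ghosts:
        raise AssertionError(
            f"unexpected data points outside JobSet lifetime: {ghosts}"
        )
```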
Tests
cloud-ml-auto-solutions 2.13.1
Checklist
Before submitting this PR, please make sure (put X in square brackets):