@chengpinglin
Contributor

Description

This PR introduces a new DAG, `jobset_uptime_validation`, designed to automate the functional testing and verification of JobSet uptime metrics on GKE TPU clusters.

  • Workload Execution
    JobSet Deployment: Deploys a JobSet with 4 completions and 1 replica. It uses the tpu-info:v0.5.1 image to run a JAX benchmark.

  • Metric Validation Logic
    The validation is split into two phases:

    Positive Test (Wait for Uptime)
    Objective: Confirms that the uptime metric is successfully exported to Cloud Monitoring after the JobSet is applied.

    Negative Test (Verify No Data)
    Objective: Ensures that no "ghost data" or incorrect historical values persist during stability checks.
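
The two validation phases can be sketched roughly as below. This is only an illustration, not the DAG's actual code: `fetch_points` is a hypothetical callable that would wrap a Cloud Monitoring time-series query for the JobSet uptime metric, and the timeout, poll interval, and number of checks are assumed values. When the negative check runs relative to the JobSet lifecycle is not specified in this description, so the sketch leaves that to the caller.

```python
import time
from typing import Callable, Sequence


def wait_for_uptime(
    fetch_points: Callable[[], Sequence[float]],  # hypothetical metric query wrapper
    timeout_s: float = 600.0,  # assumed value, not taken from the PR
    poll_s: float = 30.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Positive test: poll until at least one uptime data point is returned."""
    deadline = clock() + timeout_s
    while True:
        if fetch_points():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def verify_no_data(
    fetch_points: Callable[[], Sequence[float]],  # hypothetical metric query wrapper
    checks: int = 5,  # assumed value, not taken from the PR
    poll_s: float = 30.0,  # assumed value, not taken from the PR
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Negative test: pass only if every one of `checks` queries returns no points."""
    for i in range(checks):
        if fetch_points():
            return False
        if i < checks - 1:
            sleep(poll_s)
    return True
```

Passing the query in as a callable keeps the polling logic testable without a live cluster; in a DAG, each function could back a sensor-style task that fails the run when it returns `False`.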

Tests

  • GCP Composer name: tony-test (under GCP project: cloud-ml-auto-solutions)
  • GCP Composer version: 2.13.1

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run one-shot tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.
