Add nightly scale tests for self-hosted runners #23
base: main
Changes from 7 commits
a0699f7
1fce65a
dace222
d9e4c47
eb6df84
1b7ab90
c0cd0c5
766a39c
eb58999
5132a5e
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # Nightly Stress Test for self-hosted runners | ||
| name: Self-hosted Runners Nightly Stress Test | ||
| on: | ||
| schedule: | ||
| # Triggers at 11pm UTC every night (GitHub Actions schedules run in UTC). | ||
| - cron: '0 23 * * *' | ||
| # Cancel any previous iterations if a new commit is pushed. | ||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: true | ||
| jobs: | ||
| nightly-stress-test: | ||
| name: "Stress Test ${{ matrix.runners }} - ${{ matrix.instances }}" | ||
| strategy: | ||
| fail-fast: false # don't cancel all jobs on failure | ||
| matrix: | ||
| instances: [1, 2, 3, 4, 5] | ||
| runners: ["arc-linux-x86-n2-64", "arc-linux-x86-n2-128", "arc-linux-arm64-t2a-48", "arc-linux-x86-g2-96-l4-8gpu", "arc-linux-x86-ct5lp-224-8tpu"] | ||
| # TODO: Needs final runs-on value | ||
| runs-on: ${{ matrix.runners }} | ||
| container: | ||
| image: ${{ (contains(matrix.runners, 't2a') && 'us-central1-docker.pkg.dev/tensorflow-sigs/tensorflow/build-arm64:jax-latest-multi-python') || 'index.docker.io/tensorflow/build@sha256:7fb38f0319bda36393cad7f40670aa22352b44421bb906f5cf34d543acd8e1d2' }} | ||
| timeout-minutes: 10 | ||
| Review comment: Do you know, does this timeout include the initialization of the runner? | ||
| Reply: I think it doesn't, because it's supposed to be the execution time. | ||
| Reply: In that case it's fine. | ||
| defaults: | ||
| run: | ||
| shell: bash -ex {0} | ||
| steps: | ||
| - uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # ratchet:actions/checkout@v4 | ||
| - name: Install JAX test requirements | ||
| run: | | ||
| pip install -U -r build/test-requirements.txt | ||
| - name: DEBUG HALT | ||
| run: | | ||
| echo "Halting" | ||
| sleep 5m | ||
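On the timeout question raised in the review thread above: GitHub Actions accepts `timeout-minutes` at both the job level and the step level. If the goal is to bound only the execution of a particular step, a step-level limit is one option. The fragment below is a minimal sketch, not part of this PR; the step-level value of 6 minutes is a hypothetical choice, and the matrix definition is omitted for brevity.

```yaml
# Sketch only: the step-level timeout value is an assumption, not from the PR.
jobs:
  nightly-stress-test:
    # strategy/matrix omitted here; see the diff above for the full definition.
    runs-on: ${{ matrix.runners }}
    timeout-minutes: 10        # job-level limit, as in the diff
    steps:
      - name: DEBUG HALT
        timeout-minutes: 6     # hypothetical step-level limit covering the 5-minute sleep
        run: |
          echo "Halting"
          sleep 5m
```

A step-level limit keeps a hung step from consuming the whole job budget, while the job-level limit still caps the total run.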
Instead of "stress", maybe we should call this a scale test. We don't actually do anything with the runners, which makes me lean away from calling it "stress" long term.
Changed to scale.
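For readers following the rename, the header of the workflow might look roughly like the fragment below once "stress" is swapped for "scale". This is a guess at the result, not the author's actual follow-up commit: the job id `nightly-scale-test` and the exact display names are assumptions, and the rest of the job (matrix, `runs-on`, container, steps) is left out.

```yaml
# Hypothetical renamed header; only the word "stress" is swapped for "scale".
# Fragment: the job body is unchanged from the diff above and omitted here.
name: Self-hosted Runners Nightly Scale Test
on:
  schedule:
    - cron: '0 23 * * *'
jobs:
  nightly-scale-test:
    name: "Scale Test ${{ matrix.runners }} - ${{ matrix.instances }}"
```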