Add qb2-blackhole support to Nightly CI support + 4x TP tests in onPush/onPR #3174
- Add qb2 specific nightly CI jobs (main branch only, limited CI resources)
- Add qb2-blackhole to ALLOWED_ARCHES and default_archs in conftest.py
- Add qb2-blackhole to supported_archs in ~450 test config entries
- Add ~65 explicit qb2-blackhole arch_overrides
- Add model-test-xfail-qb2.json
- Add qb2-blackhole arch_overrides for 2x lower PCC (0.98) and 8x s3-bucket-missing fails
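To illustrate how `supported_archs` and `arch_overrides` entries like those listed above might interact, here is a minimal sketch. The config layout, the `resolve_test_config` helper, the `example_model` and `example-arch` names, and the `status` and `required_pcc` fields are all assumptions made for illustration, not the repository's actual schema. Only `ALLOWED_ARCHES`, `supported_archs`, `arch_overrides`, `qb2-blackhole`, and the 0.98 PCC value come from the PR description.

```python
from copy import deepcopy

# Hypothetical allow-list; only "qb2-blackhole" is taken from the PR text.
ALLOWED_ARCHES = {"example-arch", "qb2-blackhole"}

# Hypothetical test config entry (field names are assumptions).
TEST_CONFIG = {
    "example_model": {
        "supported_archs": ["example-arch", "qb2-blackhole"],
        "status": "EXPECTED_PASSING",
        "required_pcc": 0.99,
        "arch_overrides": {
            # Lower PCC threshold on qb2-blackhole only.
            "qb2-blackhole": {"required_pcc": 0.98},
        },
    },
}


def resolve_test_config(entry, arch):
    """Return the effective settings for `arch`, or None if the arch is not supported."""
    if arch not in ALLOWED_ARCHES:
        raise ValueError(f"unknown arch: {arch}")
    if arch not in entry.get("supported_archs", []):
        return None
    # Start from the base entry, then layer the per-arch override on top.
    resolved = {k: deepcopy(v) for k, v in entry.items() if k != "arch_overrides"}
    resolved.update(entry.get("arch_overrides", {}).get(arch, {}))
    return resolved


qb2 = resolve_test_config(TEST_CONFIG["example_model"], "qb2-blackhole")
base = resolve_test_config(TEST_CONFIG["example_model"], "example-arch")
print(qb2["required_pcc"], base["required_pcc"])
```

Under these assumptions, an arch without an explicit override keeps the base settings, which would explain why only ~65 of the ~450 entries need an `arch_overrides` block.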
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3174      +/-   ##
==========================================
- Coverage   28.38%   28.37%   -0.01%
==========================================
  Files          31       33       +2
  Lines        4154     4088      -66
==========================================
- Hits         1179     1160      -19
+ Misses       2975     2928      -47
|
I wouldn't include this in nightly tests until we have a more stable setup with some redundancy (at least 3 runners sharing a label), since we use nightly as the regression gate for releases and this could block us. |
- Create new schedule-nightly-qb2.yml workflow for QB2-specific nightly tests to reduce risk on official nightly job while we only have single CI machine
- Remove QB2 jobs from main schedule-nightly.yml workflow
- Update workflow-run-collect-data.yml to collect data from "On nightly QB2" workflow
- QB2 nightly workflow runs at same time as main nightly (cron: '0 0 * * *')
- Includes both model-test-passing-qb2.json and model-test-xfail-qb2.json jobs
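For reference, the cron expression `'0 0 * * *'` means minute 0, hour 0, every day, and GitHub Actions evaluates schedules in UTC. The sketch below computes the next nominal trigger time for that expression; the helper name `next_daily_midnight_utc` is hypothetical, and actual workflow start times can lag behind the nominal schedule.

```python
from datetime import datetime, timedelta, timezone


def next_daily_midnight_utc(now):
    """Next nominal firing time of cron '0 0 * * *', strictly after `now`.

    `now` should be timezone-aware; a naive datetime would be interpreted
    as local time by astimezone().
    """
    now_utc = now.astimezone(timezone.utc)
    midnight_today = now_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    # Today's midnight is never after `now`, so the next run is tomorrow's.
    return midnight_today + timedelta(days=1)


example = next_daily_midnight_utc(datetime(2025, 1, 1, 23, 30, tzinfo=timezone.utc))
print(example.isoformat())
```

Because both workflows use the same expression, they are triggered at the same nominal time and contend for runners together, which matters while only a single QB2 machine is available.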
|
Thanks @vmilosevic, I pushed a change here to split these tests into their own dedicated QB2 nightly job after an offline discussion (TL;DR: folks want these stable, expected-passing tests reported on in Superset, while the experimental nightly is for unstable or not-expected-passing models and isn't reported on). Could you take another look? |
|
Do we need both single_chip and data_parallel for all the models, as graph difference should only be in |
Ticket
None
Problem description
What's changed
Checklist