Problem Description
The SDXL unet_loop test failed in the L2 nightly after hitting the 3000s timeout (link), but passed on rerun (link).
After digging in, this looks machine-dependent.
On most CI machines the base_unet case completes in roughly 33 to 37 minutes (1957s to 2235s), while on tt-metal-ci-vm-14 it took about 49 minutes (2950s) on 8 Feb and exceeded the 3000s timeout on 10 Feb.
- Sheet with job links, test completion times, and machine names: [link](https://docs.google.com/spreadsheets/d/1ylgOlSw8Rbg3PUevZbW3PkL0fOsYlhnhaLFZL3k600k/edit?usp=sharing)
| Date | Machine | base_unet | refiner_unet | Job |
|---|---|---|---|---|
| 10 Feb 11:21 AM | tt-metal-ci-vm-27 | 1980s | 952s | https://github.com/tenstorrent/tt-metal/actions/runs/21854126492/job/63087302056 |
| 10 Feb 7:21 AM | tt-metal-ci-vm-14 | >3000s | - | https://github.com/tenstorrent/tt-metal/actions/runs/21854126492/job/63067635989 |
| 9 Feb 7:21 AM | tt-metal-ci-vm-71 | 2003s | 967s | https://github.com/tenstorrent/tt-metal/actions/runs/21814599942/job/62947574846 |
| 8 Feb 7:14 AM | tt-metal-ci-vm-14 | 2950s | 1449s | https://github.com/tenstorrent/tt-metal/actions/runs/21793495785/job/62877313068 |
| 7 Feb 7:09 AM | tt-metal-ci-vm-97 | 2111s | 979s | - |
| 6 Feb 7:15 AM | tt-metal-ci-vm-121 | 1957s | 958s | - |
| 5 Feb 7:17 AM | tt-metal-ci-vm-166 | 2235s | 869s | - |
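
To make the per-machine gap easier to see, here is a minimal sketch that computes the worst base_unet time and the remaining headroom against the timeout for each machine. The timings are copied from the table above. The names `TIMEOUT_S`, `base_unet_runs`, and `summarize` are hypothetical and not part of the test suite. It assumes the 3000s timeout applies to the base_unet case on its own, as the `>3000s` entry suggests, and it counts the timed-out run as exactly 3000s, which is only a lower bound.

```python
# Timeout quoted in the issue; assumed to apply to the base_unet case alone.
TIMEOUT_S = 3000

# (machine, base_unet seconds) copied from the table; None marks the run
# that hit the timeout (reported as ">3000s").
base_unet_runs = [
    ("tt-metal-ci-vm-27", 1980),
    ("tt-metal-ci-vm-14", None),
    ("tt-metal-ci-vm-71", 2003),
    ("tt-metal-ci-vm-14", 2950),
    ("tt-metal-ci-vm-97", 2111),
    ("tt-metal-ci-vm-121", 1957),
    ("tt-metal-ci-vm-166", 2235),
]


def summarize(runs, timeout_s=TIMEOUT_S):
    """Return {machine: (worst_seconds, headroom_seconds)}.

    A timed-out run is counted as timeout_s, i.e. a lower bound on its
    real duration.
    """
    worst = {}
    for machine, seconds in runs:
        effective = timeout_s if seconds is None else seconds
        worst[machine] = max(worst.get(machine, 0), effective)
    return {m: (w, timeout_s - w) for m, w in worst.items()}


summary = summarize(base_unet_runs)

# Print machines with the least headroom first.
for machine, (worst_s, headroom_s) in sorted(
    summary.items(), key=lambda item: item[1][1]
):
    print(
        f"{machine}: worst {worst_s}s ({worst_s / 60:.1f} min), "
        f"headroom {headroom_s}s"
    )
```

With only seven runs this is a rough indication rather than a statistical result, but it shows tt-metal-ci-vm-14 as the only machine with no headroom left.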
A potential solution is to increase the test timeout, but it is unclear whether we should, because 50 minutes is already a lot.