Closed as not planned
Description
Problem Statement
Tests are failing consistently on the latest main branch with a device timeout error on BH GLX systems.
Background Information
- Context: Users (Yu Gao, Allan Liu) are reporting consistent failures on Blackhole Galaxy (BH GLX) systems when running tests on the latest main branch.
- Regression Status: This is identified as a regression. The failure rate has increased from intermittent (10-20%) to consistent (100%) in the ~100 commits preceding Jan 30, 2026.
- Environment: BH GLX (observed on both a single-chip configuration and an 8-chip submesh).
- Impact: High - Currently blocking tests on BH GLX; failure rate is effectively 100%.
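For anyone planning to bisect the regression window described above, the following is a rough sketch of the effort involved, not a measured result. It assumes the regression comes from a single commit, that the bad state fails every run (100%), that the pre-regression state fails independently at the upper end of the reported range (20%), and that a 1% chance of mislabeling a good commit as bad is acceptable. The function names and the 1% threshold are illustrative choices, not part of the report.

```python
import math

def bisect_steps(num_commits):
    """Bisection steps needed to isolate one commit among num_commits."""
    return math.ceil(math.log2(num_commits))

def runs_needed(baseline_fail_rate, max_false_bad):
    """Consecutive failures required before labeling a commit 'bad'.

    A good commit still fails each run with probability baseline_fail_rate,
    so k consecutive failures mislabel it with probability
    baseline_fail_rate ** k. A single pass is enough to label a commit
    'good', because the bad state is assumed to fail every run.
    """
    return math.ceil(math.log(max_false_bad) / math.log(baseline_fail_rate))

steps = bisect_steps(100)        # ~100 commits in the regression window
runs = runs_needed(0.20, 0.01)   # assumed 20% baseline failure, 1% tolerance
print(f"{steps} bisect steps, up to {runs} runs per step, "
      f"at most {steps * runs} test runs in total")
```

Under these assumptions, a commit that passes once can be marked good immediately, so the worst case only applies to commits that keep failing.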
Example / Logs
The system fails to initialize, throwing a Timeout (10000 ms) error waiting for physical cores, followed by a firmware initialization failure.
Key Logs:
Fabric | TopologyMapper: Using 2 pinning(s) for mesh 0...
2026-01-30 15:20:35.843 | error | Metal | Timeout detected (metal_context.cpp:1921)
2026-01-30 15:20:35.843 | critical | Always | TT_THROW: Device 2: Timeout (10000 ms) waiting for physical cores to finish: (x=29,y=25). (assert.hpp:104)
2026-01-30 15:20:35.844 | critical | Always | TT_THROW: Device 2 init: failed to initialize FW! Try resetting the board. (assert.hpp:104)
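When comparing failures across runs or machines, it can help to pull the device ID, timeout, and core coordinates out of the timeout line. The sketch below is a minimal example that assumes the message format matches the log lines above exactly and that each timeout line names a single core. The regular expression and the `parse_timeouts` helper are illustrative and not part of tt-metal.

```python
import re

# Matches the TT_THROW timeout message in the format shown in the logs above.
TIMEOUT_RE = re.compile(
    r"Device (?P<device>\d+): Timeout \((?P<timeout_ms>\d+) ms\) "
    r"waiting for physical cores to finish: \(x=(?P<x>\d+),y=(?P<y>\d+)\)"
)

def parse_timeouts(log_text):
    """Return one dict per timeout line: device, timeout_ms, x, y (all ints)."""
    results = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            results.append({key: int(val) for key, val in match.groupdict().items()})
    return results

# Log lines copied from the report above.
SAMPLE = """\
2026-01-30 15:20:35.843 | error | Metal | Timeout detected (metal_context.cpp:1921)
2026-01-30 15:20:35.843 | critical | Always | TT_THROW: Device 2: Timeout (10000 ms) waiting for physical cores to finish: (x=29,y=25). (assert.hpp:104)
2026-01-30 15:20:35.844 | critical | Always | TT_THROW: Device 2 init: failed to initialize FW! Try resetting the board. (assert.hpp:104)
"""

for hit in parse_timeouts(SAMPLE):
    print(hit)
```

Collecting these fields from several failing runs would show whether the stall always lands on the same device and core or moves around.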
Expected Behaviour
The device should initialize correctly, and tests should pass without timeouts on the main branch, restoring stability to the BH GLX CI pipeline.
Testing / Steps to Reproduce
- Access a BH GLX system.
- Check out the latest main branch of tenstorrent/tt-metal.
- Run standard fabric/metal tests (e.g., as observed by Yu Gao on an 8-chip submesh or Allan Liu on chip 0).
- Observe the timeout error during device initialization.
Reference: Slack Thread