**Problem**
tt-smi reset fails while a long-running process that uses UMD is active. Specifically, tt-telemetry is an application that uses UMD to poll L1 continuously (on a ~5 second loop) for various metrics: ARC, Ethernet link status, and more. We want to deploy this service safely to customers, but have discovered that it interferes with tt-smi reset.
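To make the access pattern concrete, here is a minimal sketch of a fixed-interval polling loop of the kind described above. It is not tt-telemetry's actual code and it does not call UMD: `poll_device`, `read_metrics`, and `on_sample` are hypothetical names used only for illustration, and the device read is a caller-supplied function.

```python
import time
from typing import Callable, Dict, Optional


def poll_device(
    read_metrics: Callable[[], Dict[str, int]],
    on_sample: Callable[[Dict[str, int]], None],
    interval_s: float = 5.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call read_metrics() every interval_s seconds and pass each result to on_sample.

    read_metrics is a hypothetical stand-in for whatever device read the
    service performs (for example ARC or Ethernet link status).
    max_iterations=None means "run until the process is stopped".
    Returns the number of samples taken.
    """
    taken = 0
    while max_iterations is None or taken < max_iterations:
        # Each pass touches the device again, so the device is accessed
        # repeatedly for the whole lifetime of the process.
        on_sample(read_metrics())
        taken += 1
        sleep(interval_s)
    return taken


if __name__ == "__main__":
    # Fake read function so the sketch runs without any hardware.
    poll_device(lambda: {"arc_heartbeat": 1}, print, interval_s=0.1, max_iterations=3)
```

Because a loop like this runs for as long as the service is up, any reset issued in the meantime necessarily happens while the poller is alive, which is the situation this issue describes.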
**Description**
The issue was first observed here. Please note the stack trace there.
On LoudBox (n300-based) machines, if tt-telemetry avoids reading the non-MMIO chips, tt-smi reset tends to succeed. Otherwise, it fails with the error shown in the aforementioned issue or, sometimes, with a different stack trace.
On a Wormhole Galaxy box, where all devices are MMIO-capable, tt-telemetry was started and then tt-smi -glx_reset was issued. This resulted in a similar failure of tt-smi:
btrzynadlowski@wh-glx-a03u02:~$ tt-smi -glx_reset
/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.10/site-packages/tt_smi/tt_smi.py:19: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
Resetting WH Galaxy trays with reset command...
Executing command: sudo ipmitool raw 0x30 0x8B 0xF 0xFF 0x0 0xF
Waiting for 30 seconds: 30
Driver loaded
Re-initializing boards after reset....
Detected Chips: 1
Error when re-initializing chips!
Chip initialization failed (aborted): Read 0xffffffff from ARC scratch[6]: you should reset the board.
Communication Status: Success
DRAM Status: In Progress, 0 out of 4 initialized
CPU Status: Timeout
ARC Status: In Progress, 0 out of 1 initialized
Ethernet Status: In Progress, 0 out of 16 initialized
Noc Safe: true
Unknown State: false
0: luwen_if::chip::hl_comms::HlCommsInterface::axi_read_field
1: luwen_if::chip::wormhole::Wormhole::check_arc_msg_safe
2: <luwen_if::chip::wormhole::Wormhole as luwen_if::chip::ChipImpl>::update_init_state
3: <luwen_if::chip::Chip as luwen_if::chip::ChipImpl>::update_init_state
4: luwen_if::chip::init::wait_for_init
5: luwen_if::detect_chips::detect_chips
6: pyluwen::detect_chips_fallible
7: pyluwen::_::__pyfunction_detect_chips_fallible
8: pyo3::impl_::trampoline::trampoline
9: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
10: <unknown>
11: _PyEval_EvalFrameDefault
12: _PyFunction_Vectorcall
13: _PyEval_EvalFrameDefault
14: _PyFunction_Vectorcall
15: _PyEval_EvalFrameDefault
16: _PyFunction_Vectorcall
17: _PyEval_EvalFrameDefault
18: <unknown>
19: PyEval_EvalCode
20: <unknown>
21: <unknown>
22: <unknown>
23: _PyRun_SimpleFileObject
24: _PyRun_AnyFileObject
25: Py_RunMain
26: Py_BytesMain
27: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
28: __libc_start_main_impl
at ./csu/../csu/libc-start.c:392:3
29: _start
This is a blocker for tt-telemetry and will likely soon become a blocker for the upcoming fabric manager service.