tt-smi reset crashes and fails when run simultaneously with a UMD-based app #172

@btrzynadlowski-tt

**Problem**

tt-smi reset fails while a long-running process that uses UMD is active. One such process is tt-telemetry, an application that uses UMD to poll L1 continuously (on a ~5 second loop) for various metrics: ARC, Ethernet link status, and more. We want to deploy this service to customers safely, but have discovered that it interferes with tt-smi reset.

**Description**

The issue was first observed here. Please note the stack trace there.

On LoudBox (n300-based) machines, if tt-telemetry avoids reading the non-MMIO chips, tt-smi reset tends to succeed. Otherwise, it fails with the error shown in the aforementioned issue or sometimes a different stack trace.

On a Wormhole Galaxy box, where all devices are MMIO-capable, tt-telemetry was started and then tt-smi -glx_reset was issued. This resulted in a similar failure of tt-smi:

```
btrzynadlowski@wh-glx-a03u02:~$ tt-smi -glx_reset
/opt/tenstorrent/tt-tools/tt-smi/.venv/lib/python3.10/site-packages/tt_smi/tt_smi.py:19: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
 Resetting WH Galaxy trays with reset command... 
Executing command: sudo ipmitool raw 0x30 0x8B 0xF 0xFF 0x0 0xF
Waiting for 30 seconds: 30
Driver loaded
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Error when re-initializing chips!
 Chip initialization failed (aborted): Read 0xffffffff from ARC scratch[6]: you should reset the board. 
   Communication Status: Success
   DRAM Status: In Progress, 0 out of 4 initialized
   CPU Status: Timeout
   ARC Status: In Progress, 0 out of 1 initialized
   Ethernet Status: In Progress, 0 out of 16 initialized
   Noc Safe: true
   Unknown State: false

   0: luwen_if::chip::hl_comms::HlCommsInterface::axi_read_field
   1: luwen_if::chip::wormhole::Wormhole::check_arc_msg_safe
   2: <luwen_if::chip::wormhole::Wormhole as luwen_if::chip::ChipImpl>::update_init_state
   3: <luwen_if::chip::Chip as luwen_if::chip::ChipImpl>::update_init_state
   4: luwen_if::chip::init::wait_for_init
   5: luwen_if::detect_chips::detect_chips
   6: pyluwen::detect_chips_fallible
   7: pyluwen::_::__pyfunction_detect_chips_fallible
   8: pyo3::impl_::trampoline::trampoline
   9: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
  10: <unknown>
  11: _PyEval_EvalFrameDefault
  12: _PyFunction_Vectorcall
  13: _PyEval_EvalFrameDefault
  14: _PyFunction_Vectorcall
  15: _PyEval_EvalFrameDefault
  16: _PyFunction_Vectorcall
  17: _PyEval_EvalFrameDefault
  18: <unknown>
  19: PyEval_EvalCode
  20: <unknown>
  21: <unknown>
  22: <unknown>
  23: _PyRun_SimpleFileObject
  24: _PyRun_AnyFileObject
  25: Py_RunMain
  26: Py_BytesMain
  27: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  28: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:392:3
  29: _start
```

This is a blocker for tt-telemetry and will likely soon become a blocker for the upcoming fabric manager service.
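
For anyone who wants to reproduce this, here is a minimal sketch of a harness. It is not part of tt-smi or tt-telemetry. It starts a background command, waits long enough for at least one ~5 second polling loop, runs the reset command, and checks the captured output for the `Error when re-initializing chips!` line from the log above. The names `reset_failed` and `run_repro` and the 10 second warm-up are assumptions made for illustration, and the command line that launches tt-telemetry is deliberately left to the caller because it is not given in this issue.

```python
import subprocess
import time

# Failure line copied from the tt-smi output quoted in this issue.
FAILURE_MARKER = "Error when re-initializing chips!"


def reset_failed(output: str) -> bool:
    """Return True if captured reset output contains the failure line."""
    return FAILURE_MARKER in output


def run_repro(background_cmd, reset_cmd, warmup_s=10.0):
    """Run reset_cmd while background_cmd is active.

    background_cmd: argv list for the long-running UMD user (for example,
                    whatever command launches tt-telemetry on your machine).
    reset_cmd:      argv list for the reset, e.g. ["tt-smi", "-glx_reset"].
    warmup_s:       assumed delay so the poller completes at least one loop.

    Returns (failed, combined_output) for the reset command.
    """
    background = subprocess.Popen(background_cmd)
    try:
        time.sleep(warmup_s)
        result = subprocess.run(reset_cmd, capture_output=True, text=True)
        output = result.stdout + result.stderr
        return reset_failed(output), output
    finally:
        # Always stop the background process, even if the reset raised.
        background.terminate()
        try:
            background.wait(timeout=10)
        except subprocess.TimeoutExpired:
            background.kill()
```

On a Galaxy box this could be called as `run_repro(telemetry_argv, ["tt-smi", "-glx_reset"])`, where `telemetry_argv` is a placeholder for your own tt-telemetry launch command. Matching on the error line rather than the exit code is a deliberate choice, since this issue does not say what exit code tt-smi returns on failure.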
