tt-smi -r for metal-wh-01 (03, 05, 04, 06) #148

@rfurko-tt

Description

Running tt-smi -r, on a single machine as well as across multiple machines, results in an extremely long reset that ends with an error.

Script that I use to run it on multiple hosts:

#!/bin/bash

# List of hostnames or IPs
HOSTS=(
    ttuser@10.140.20.204
    ttuser@10.140.20.210
    ttuser@10.140.20.213
    ttuser@10.140.20.205
    ttuser@10.140.20.214
)

REMOTE_CMD="tt-smi -r"
PIDS=()

# Define a cleanup function that kills background jobs
cleanup() {
    echo "Caught signal. Killing background jobs..."
    for pid in "${PIDS[@]}"; do
        kill "$pid" 2>/dev/null
    done
    exit 1
}

# Trap INT (Ctrl+C) and TERM signals
trap cleanup INT TERM

# Launch SSH commands in background and track their PIDs
for HOST in "${HOSTS[@]}"; do
    echo "Running on $HOST..."
    ssh -o StrictHostKeyChecking=no "$HOST" "$REMOTE_CMD" &
    PIDS+=($!)
done

# Wait for all to complete
wait
echo "Done."

Error message:

Running on ttuser@10.140.20.204...
Running on ttuser@10.140.20.210...
Running on ttuser@10.140.20.213...
Running on ttuser@10.140.20.205...
Running on ttuser@10.140.20.214...
 Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Detecting ARC: |
 Detecting DRAM: |
 [0/900] [0/16] ETH: [0, 0, 1, 0]: Waiting for initial training to complete: |
 Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3 
 Re-initializing boards after reset.... 
 Detected Chips: 1
 Detecting ARC: |
 Detecting DRAM: |
 [] [0/16] ETH: |
 Error when re-initializing chips!
 Chip initialization failed:
   Communication Status: Success
   DRAM Status: Timeout, 4 out of 4 initialized
   CPU Status: Timeout
   ARC Status: Timeout, 1 out of 1 initialized
   Ethernet Status: Timeout, 0 out of 16 initialized
   Noc Safe: false
   Unknown State: false
 
   0: luwen_if::chip::init::wait_for_init
   1: luwen_if::detect_chips::detect_chips
   2: pyluwen::detect_chips_fallible
   3: pyluwen::_::__pyfunction_detect_chips_fallible
   4: pyo3::impl_::trampoline::trampoline
   5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
   6: <unknown>
   7: _PyEval_EvalFrameDefault
   8: _PyFunction_Vectorcall
   9: _PyEval_EvalFrameDefault
  10: _PyFunction_Vectorcall
  11: _PyEval_EvalFrameDefault
  12: _PyFunction_Vectorcall
  13: _PyEval_EvalFrameDefault
  14: <unknown>
  15: PyEval_EvalCode
  16: <unknown>
  17: <unknown>
  18: <unknown>
  19: _PyRun_SimpleFileObject
  20: _PyRun_AnyFileObject
  21: Py_RunMain
  22: Py_BytesMain
 Detected Chips: 2
 Detecting ARC: \
 Detecting DRAM: \
 [] [0/16] ETH: \
 Error when re-initializing chips!
 Chip initialization failed:
   Communication Status: Success
   DRAM Status: Timeout, 4 out of 4 initialized
   CPU Status: Timeout
   ARC Status: Timeout, 1 out of 1 initialized
   Ethernet Status: Timeout, 0 out of 16 initialized
   Noc Safe: false
   Unknown State: false
 
   0: luwen_if::chip::init::wait_for_init
   1: luwen_if::detect_chips::detect_chips
   2: pyluwen::detect_chips_fallible
   3: pyluwen::_::__pyfunction_detect_chips_fallible
   4: pyo3::impl_::trampoline::trampoline
   5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
   6: <unknown>
   7: _PyEval_EvalFrameDefault
   8: _PyFunction_Vectorcall
   9: _PyEval_EvalFrameDefault
  10: _PyFunction_Vectorcall
  11: _PyEval_EvalFrameDefault
  12: _PyFunction_Vectorcall
  13: _PyEval_EvalFrameDefault
  14: <unknown>
  15: PyEval_EvalCode
  16: <unknown>
  17: <unknown>
  18: <unknown>
  19: _PyRun_SimpleFileObject
  20: _PyRun_AnyFileObject
  21: Py_RunMain
  22: Py_BytesMain
  23: <unknown>
  24: __libc_start_main
  25: _start
 
Done.
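
Because all hosts write to the same terminal, the output above is interleaved and it is hard to tell which host produced which failure. A minimal sketch of a variant that keeps one log file per host and reports each host's exit status might look like the following. The names run_on_hosts, RUNNER, and LOG_DIR are hypothetical and not part of tt-smi or the script above, and the sketch assumes tt-smi -r exits non-zero when re-initialization fails, which this report does not confirm.

```bash
#!/bin/bash
# Sketch only: run one command on several hosts in parallel, with a
# separate log file per host and a per-host pass/fail summary.

# Hypothetical knobs: the command used to reach a host, and the log directory.
RUNNER=${RUNNER:-ssh}
LOG_DIR=${LOG_DIR:-./reset-logs}

run_on_hosts() {
    local remote_cmd=$1; shift
    local -a hosts=("$@") pids=()
    local host i status failed=0

    mkdir -p "$LOG_DIR"

    # Start every host in the background; stdout and stderr go to its own log.
    for host in "${hosts[@]}"; do
        "$RUNNER" "$host" "$remote_cmd" >"$LOG_DIR/$host.log" 2>&1 </dev/null &
        pids+=("$!")
    done

    # Wait for each PID individually so the exit status can be tied to a host.
    for i in "${!hosts[@]}"; do
        if wait "${pids[$i]}"; then
            echo "${hosts[$i]}: ok"
        else
            status=$?
            echo "${hosts[$i]}: FAILED (exit $status), see $LOG_DIR/${hosts[$i]}.log"
            failed=1
        fi
    done
    return "$failed"
}
```

With this in place, a call such as run_on_hosts "tt-smi -r" ttuser@10.140.20.204 ttuser@10.140.20.210 would leave one log per host under LOG_DIR, so a failing host's stack trace can be read without the other hosts' progress lines mixed in.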

