Description
Running tt-smi -r on a single machine, as well as on multiple devices, results in an extremely long reset that ends with an error.
Script that I use to run it on multiple hosts:
#!/bin/bash
# List of hostnames or IPs
HOSTS=(
    ttuser@10.140.20.204
    ttuser@10.140.20.210
    ttuser@10.140.20.213
    ttuser@10.140.20.205
    ttuser@10.140.20.214
)
REMOTE_CMD="tt-smi -r"
PIDS=()
# Define a cleanup function that kills background jobs
cleanup() {
    echo "Caught signal. Killing background jobs..."
    for pid in "${PIDS[@]}"; do
        kill "$pid" 2>/dev/null
    done
    exit 1
}
# Trap INT (Ctrl+C) and TERM signals
trap cleanup INT TERM
# Launch SSH commands in background and track their PIDs
for HOST in "${HOSTS[@]}"; do
    echo "Running on $HOST..."
    ssh -o StrictHostKeyChecking=no "$HOST" "$REMOTE_CMD" &
    PIDS+=($!)
done
# Wait for all to complete
wait
echo "Done."
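Because all hosts write to the same terminal, their output gets interleaved and it is hard to tell which host produced which line or which one failed. A minimal sketch of one way to make the output attributable is below. It is not part of the original script: the function names run_remote and run_all are hypothetical, and it assumes bash and sed are available on the machine that launches the SSH sessions.

```shell
#!/bin/bash
# Sketch (assumption, not the original script): run one command on several
# hosts in parallel, prefix every output line with its host, and report
# each host's exit status at the end.

# Hypothetical helper: run command $2 on host $1 over SSH.
run_remote() {
    ssh -o StrictHostKeyChecking=no "$1" "$2"
}

# Usage: run_all "<command>" host1 host2 ...
run_all() {
    local cmd=$1
    shift
    local -a hosts=("$@") pids=()
    local host i failed=0

    for host in "${hosts[@]}"; do
        # The backgrounded brace group runs in a subshell: merge stderr into
        # stdout, tag each line with the host, then exit with the status of
        # run_remote (not of sed).
        {
            run_remote "$host" "$cmd" 2>&1 | sed "s|^|[$host] |"
            exit "${PIPESTATUS[0]}"
        } &
        pids+=("$!")
    done

    # Wait for each job individually so its exit status can be reported.
    for i in "${!pids[@]}"; do
        if wait "${pids[$i]}"; then
            echo "[${hosts[$i]}] exit status: 0"
        else
            echo "[${hosts[$i]}] exit status: $?"
            failed=1
        fi
    done
    return "$failed"
}

# Example call (commented out so that sourcing this file has no side effects):
# run_all "tt-smi -r" ttuser@10.140.20.204 ttuser@10.140.20.210
```

The signal trap from the script above can be kept alongside this. Note that sed emits output only at line ends, so progress spinners that redraw a single line may appear delayed.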
Error message:
Running on ttuser@10.140.20.204...
Running on ttuser@10.140.20.210...
Running on ttuser@10.140.20.213...
Running on ttuser@10.140.20.205...
Running on ttuser@10.140.20.214...
Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Detecting ARC: |
Detecting DRAM: |
[0/900] [0/16] ETH: [0, 0, 1, 0]: Waiting for initial training to complete: |
Starting PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Finishing PCI link reset on WH devices at PCI indices: 0, 1, 2, 3
Re-initializing boards after reset....
Detected Chips: 1
Detecting ARC: |
Detecting DRAM: |
[] [0/16] ETH: |
Error when re-initializing chips!
Chip initialization failed:
Communication Status: Success
DRAM Status: Timeout, 4 out of 4 initialized
CPU Status: Timeout
ARC Status: Timeout, 1 out of 1 initialized
Ethernet Status: Timeout, 0 out of 16 initialized
Noc Safe: false
Unknown State: false
0: luwen_if::chip::init::wait_for_init
1: luwen_if::detect_chips::detect_chips
2: pyluwen::detect_chips_fallible
3: pyluwen::_::__pyfunction_detect_chips_fallible
4: pyo3::impl_::trampoline::trampoline
5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
6: <unknown>
7: _PyEval_EvalFrameDefault
8: _PyFunction_Vectorcall
9: _PyEval_EvalFrameDefault
10: _PyFunction_Vectorcall
11: _PyEval_EvalFrameDefault
12: _PyFunction_Vectorcall
13: _PyEval_EvalFrameDefault
14: <unknown>
15: PyEval_EvalCode
16: <unknown>
17: <unknown>
18: <unknown>
19: _PyRun_SimpleFileObject
20: _PyRun_AnyFileObject
21: Py_RunMain
22: Py_BytesMain
Detected Chips: 2
Detecting ARC: \
Detecting DRAM: \
[] [0/16] ETH: \
Error when re-initializing chips!
Chip initialization failed:
Communication Status: Success
DRAM Status: Timeout, 4 out of 4 initialized
CPU Status: Timeout
ARC Status: Timeout, 1 out of 1 initialized
Ethernet Status: Timeout, 0 out of 16 initialized
Noc Safe: false
Unknown State: false
0: luwen_if::chip::init::wait_for_init
1: luwen_if::detect_chips::detect_chips
2: pyluwen::detect_chips_fallible
3: pyluwen::_::__pyfunction_detect_chips_fallible
4: pyo3::impl_::trampoline::trampoline
5: pyluwen::_::<impl pyluwen::detect_chips_fallible::MakeDef>::DEF::trampoline
6: <unknown>
7: _PyEval_EvalFrameDefault
8: _PyFunction_Vectorcall
9: _PyEval_EvalFrameDefault
10: _PyFunction_Vectorcall
11: _PyEval_EvalFrameDefault
12: _PyFunction_Vectorcall
13: _PyEval_EvalFrameDefault
14: <unknown>
15: PyEval_EvalCode
16: <unknown>
17: <unknown>
18: <unknown>
19: _PyRun_SimpleFileObject
20: _PyRun_AnyFileObject
21: Py_RunMain
22: Py_BytesMain
23: <unknown>
24: __libc_start_main
25: _start
Done.
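With several hosts printing the same failure block, a quick tally of the per-component status lines can help when comparing runs. The sketch below assumes the console output was saved to a file; the function name status_summary and the file name reset.log are hypothetical, and it relies only on the "Status: " line format visible in the log above.

```shell
#!/bin/bash
# Sketch (assumption): count how often each "<Component> Status: ..." line
# appears in a saved copy of the console output.
# $1 = path to the saved log file.
status_summary() {
    grep -E 'Status: ' "$1" | sort | uniq -c
}

# Example call (commented out; reset.log is a hypothetical file name):
# status_summary reset.log
```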