Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR implements automated stalled Virtual Channel (VC) detection for NoC (Network-on-Chip) systems to identify problem cores when VCs are stalled, replacing a previously manual process.
- Adds a new
noc vc-stallCLI command for detecting stalled virtual channels - Implements traversal algorithm that follows error directions to identify the source core of VC stalls
- Provides detection across multiple NoC instances (NoC0 and NoC1)
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
| if vc_value != expected_vc_value: | ||
| invalid_bits_upper = _get_invalid_bits(expected_vc_value >> 16, vc_value >> 16) | ||
| problem_vc = invalid_bits_upper[0] | ||
| current_worker_has_error = True |
There was a problem hiding this comment.
The function assumes there will always be at least one invalid bit when vc_value != expected_vc_value, but if invalid_bits_upper is empty, accessing invalid_bits_upper[0] on line 430 will raise an IndexError.
| current_worker_has_error = True | |
| if invalid_bits_upper: | |
| problem_vc = invalid_bits_upper[0] | |
| current_worker_has_error = True |
| for direction_index, offset in enumerate(vc_register_offsets): | ||
| reg_addr = base_address + offset | ||
| vc_value = read_word_from_device( | ||
| current_worker, reg_addr, noc_id ^ 1 |
There was a problem hiding this comment.
[nitpick] The comment 'Use the other NoC to read the value' suggests this is intentional, but using 'noc_id ^ 1' is a magic operation that would be clearer as a named variable or function like 'get_alternate_noc_id(noc_id)'.
| current_worker, reg_addr, noc_id ^ 1 | |
| current_worker, reg_addr, get_alternate_noc_id(noc_id) |
| try: | ||
| return OnChipCoordinate(next_y % num_rows, next_x % num_cols, "logical", device, core_type) | ||
| except Exception as e: | ||
| util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}") |
There was a problem hiding this comment.
Using modulo operation on next_y and next_x assumes the grid wraps around, but this may not be correct for all chip topologies. The function should validate that the calculated coordinates are within valid bounds before creating the OnChipCoordinate.
| util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}") | |
| # Check bounds before creating the coordinate | |
| if 0 <= next_y < num_rows and 0 <= next_x < num_cols: | |
| try: | |
| return OnChipCoordinate(next_y, next_x, "logical", device, core_type) | |
| except Exception as e: | |
| util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}") | |
| return None | |
| else: | |
| util.VERBOSE(f"Next coordinate ({next_x},{next_y}) is out of bounds (rows: 0-{num_rows-1}, cols: 0-{num_cols-1})") |
| noc_ids: List of NOC IDs to check (0 and/or 1) | ||
| """ | ||
| # VC stall detection constants | ||
| noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1 |
There was a problem hiding this comment.
[nitpick] Hard-coded base addresses should be defined as named constants at the module level or retrieved from a configuration to improve maintainability and make them easier to update.
| noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1 | |
| noc_base_addresses = [NOC0_NIU_BASE, NOC1_NIU_BASE] # NIU_BASE for NoC0, NIU_BASE for NoC1 |
| num_rows = max(num_rows, logical_coords[0]) | ||
| num_cols = max(num_cols, logical_coords[1]) | ||
|
|
||
| return worker_locations, num_rows, num_cols |
There was a problem hiding this comment.
The grid size calculation is incorrect. Using max() on coordinates gives the maximum coordinate value, but grid size should be max_coordinate + 1 since coordinates are 0-indexed.
| return worker_locations, num_rows, num_cols | |
| return worker_locations, num_rows + 1, num_cols + 1 |
| noc all [-d <device>] [-l <loc>] [-s] | ||
| noc register (<reg-names> | --search <reg-pattern> [--max <max-regs>]) [-d <device>] [--noc <noc-id>] [-l <loc>] [-s] | ||
|
|
||
| noc vc-stall [-d <device>] [--noc <noc-id>] |
There was a problem hiding this comment.
Add description of the command in Description: section at line 24
| """ | ||
| # VC stall detection constants | ||
| noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1 | ||
| vc_register_offsets = [0x00000008, 0x00000010] # Y and X direction VC data |
There was a problem hiding this comment.
Try not to hard code register addresses nor offsets. Here are NIU register definitions:
- https://github.com/tenstorrent/tt-exalens/blob/main/ttexalens/hardware/wormhole/niu_registers.py#L17
- https://github.com/tenstorrent/tt-exalens/blob/main/ttexalens/hardware/blackhole/niu_registers.py#L17
- https://github.com/tenstorrent/tt-exalens/blob/main/ttexalens/hardware/quasar/niu_registers.py#L17
Add registers that you need there. Then you can use get_register_store method to fetch register store you need (mostly you would want to providenoc_id.
| num_cols = 0 | ||
|
|
||
| for block_loc in device.get_block_locations(block_type="functional_workers"): | ||
| worker_locations.append(block_loc) |
There was a problem hiding this comment.
You can assign worker_locations result of get_block_locations method, then loop over that list, instead of creating a new list and appending values to it...
| for noc_id in [0, 1]: | ||
| display_specific_noc_registers(loc, device, reg_names, noc_id, simple_print) | ||
|
|
||
| elif dopt.args["vc-stall"]: |
There was a problem hiding this comment.
If this command is not related to --loc, it shouldn't be placed in a loop:
for loc in dopt.for_each("--loc", context, ui_state, device=device):
| noc all [-d <device>] [-l <loc>] [-s] | ||
| noc register (<reg-names> | --search <reg-pattern> [--max <max-regs>]) [-d <device>] [--noc <noc-id>] [-l <loc>] [-s] | ||
|
|
||
| noc vc-stall [-d <device>] [--noc <noc-id>] |
There was a problem hiding this comment.
Currently, you are checking only functional_workers. What about dram, eth? Something else that might be interesting?
Implement NoC Virtual Channel stall detection command, used to identify the problem core when the VC is stalled. This process was done manually.