Skip to content

Implement stalled VCs detection#557

Open
nstamatovicTT wants to merge 1 commit intomainfrom
nstamatovic/noc_vc_stall_detection
Open

Implement stalled VCs detection#557
nstamatovicTT wants to merge 1 commit intomainfrom
nstamatovic/noc_vc_stall_detection

Conversation

@nstamatovicTT
Copy link
Collaborator

Implement NoC Virtual Channel stall detection command, used to identify the problem core when the VC is stalled. This process was done manually.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements automated stalled Virtual Channel (VC) detection for NoC (Network-on-Chip) systems to identify problem cores when VCs are stalled, replacing a previously manual process.

  • Adds a new noc vc-stall CLI command for detecting stalled virtual channels
  • Implements traversal algorithm that follows error directions to identify the source core of VC stalls
  • Provides detection across multiple NoC instances (NoC0 and NoC1)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

if vc_value != expected_vc_value:
invalid_bits_upper = _get_invalid_bits(expected_vc_value >> 16, vc_value >> 16)
problem_vc = invalid_bits_upper[0]
current_worker_has_error = True
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function assumes there will always be at least one invalid bit when vc_value != expected_vc_value, but if invalid_bits_upper is empty, accessing invalid_bits_upper[0] on line 430 will raise an IndexError.

Suggested change
current_worker_has_error = True
if invalid_bits_upper:
problem_vc = invalid_bits_upper[0]
current_worker_has_error = True

Copilot uses AI. Check for mistakes.
for direction_index, offset in enumerate(vc_register_offsets):
reg_addr = base_address + offset
vc_value = read_word_from_device(
current_worker, reg_addr, noc_id ^ 1
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment 'Use the other NoC to read the value' suggests this is intentional, but using 'noc_id ^ 1' is a magic operation that would be clearer as a named variable or function like 'get_alternate_noc_id(noc_id)'.

Suggested change
current_worker, reg_addr, noc_id ^ 1
current_worker, reg_addr, get_alternate_noc_id(noc_id)

Copilot uses AI. Check for mistakes.
try:
return OnChipCoordinate(next_y % num_rows, next_x % num_cols, "logical", device, core_type)
except Exception as e:
util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}")
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using modulo operation on next_y and next_x assumes the grid wraps around, but this may not be correct for all chip topologies. The function should validate that the calculated coordinates are within valid bounds before creating the OnChipCoordinate.

Suggested change
util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}")
# Check bounds before creating the coordinate
if 0 <= next_y < num_rows and 0 <= next_x < num_cols:
try:
return OnChipCoordinate(next_y, next_x, "logical", device, core_type)
except Exception as e:
util.VERBOSE(f"Could not move to ({next_x},{next_y}): {e}")
return None
else:
util.VERBOSE(f"Next coordinate ({next_x},{next_y}) is out of bounds (rows: 0-{num_rows-1}, cols: 0-{num_cols-1})")

Copilot uses AI. Check for mistakes.
noc_ids: List of NOC IDs to check (0 and/or 1)
"""
# VC stall detection constants
noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Hard-coded base addresses should be defined as named constants at the module level or retrieved from a configuration to improve maintainability and make them easier to update.

Suggested change
noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1
noc_base_addresses = [NOC0_NIU_BASE, NOC1_NIU_BASE] # NIU_BASE for NoC0, NIU_BASE for NoC1

Copilot uses AI. Check for mistakes.
num_rows = max(num_rows, logical_coords[0])
num_cols = max(num_cols, logical_coords[1])

return worker_locations, num_rows, num_cols
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grid size calculation is incorrect. Using max() on coordinates gives the maximum coordinate value, but grid size should be max_coordinate + 1 since coordinates are 0-indexed.

Suggested change
return worker_locations, num_rows, num_cols
return worker_locations, num_rows + 1, num_cols + 1

Copilot uses AI. Check for mistakes.
noc all [-d <device>] [-l <loc>] [-s]
noc register (<reg-names> | --search <reg-pattern> [--max <max-regs>]) [-d <device>] [--noc <noc-id>] [-l <loc>] [-s]

noc vc-stall [-d <device>] [--noc <noc-id>]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add description of the command in Description: section at line 24

"""
# VC stall detection constants
noc_base_addresses = [0xFFB20300, 0xFFB30300] # NIU_BASE for NoC0, NIU_BASE for NoC1
vc_register_offsets = [0x00000008, 0x00000010] # Y and X direction VC data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try not to hard code register addresses nor offsets. Here are NIU register definitions:

num_cols = 0

for block_loc in device.get_block_locations(block_type="functional_workers"):
worker_locations.append(block_loc)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can assign worker_locations result of get_block_locations method, then loop over that list, instead of creating a new list and appending values to it...

for noc_id in [0, 1]:
display_specific_noc_registers(loc, device, reg_names, noc_id, simple_print)

elif dopt.args["vc-stall"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this command is not related to --loc, it shouldn't be placed in a loop:

for loc in dopt.for_each("--loc", context, ui_state, device=device):

noc all [-d <device>] [-l <loc>] [-s]
noc register (<reg-names> | --search <reg-pattern> [--max <max-regs>]) [-d <device>] [--noc <noc-id>] [-l <loc>] [-s]

noc vc-stall [-d <device>] [--noc <noc-id>]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, you are checking only functional_workers. What about dram, eth? Something else that might be interesting?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants