Skip to content

Conversation

bwelton
Copy link
Contributor

@bwelton bwelton commented Oct 17, 2025

…ures

This change addresses an issue where counter dimensions were incorrectly shared across all GPU agents with the same architecture name, even when those agents had different hardware configurations (e.g., different CU counts).

Changes:

  • Updated getBlockDimensions() to accept agent ID instead of architecture name
  • Made dimension cache agent-specific instead of architecture-specific
  • Updated set_dimensions() in AST evaluation to use specific agent ID
  • Modified all API functions to handle agent-specific dimension lookups
  • Updated tests to work with agent-specific dimensions

This fix ensures that dimensions accurately reflect the actual hardware configuration of each individual GPU agent, preventing dimension mismatches in multi-GPU systems where GPUs share the same architecture but have different physical configurations.

Counter ID Representation Changes:

  • Modified counter_id encoding to include agent information in bits 37-32
  • Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
  • Counter records internally store only 16-bit base metric IDs (bits 15-0)
  • Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
  • Instance record counter_id field uses bitwise AND mask to extract base metric ID (counter_id.handle & 0xFFFF) to fit in 16-bit storage
  • Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
  • Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
  • All counter lookups now properly handle agent-encoded vs base metric IDs

This ensures counter IDs are consistent between metadata and output records while maintaining compact storage in instance records.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@Copilot Copilot AI review requested due to automatic review settings October 17, 2025 04:12
…ures

This change addresses an issue where counter dimensions were incorrectly
shared across all GPU agents with the same architecture name, even when
those agents had different hardware configurations (e.g., different CU counts).

Changes:
- Updated getBlockDimensions() to accept agent ID instead of architecture name
- Made dimension cache agent-specific instead of architecture-specific
- Updated set_dimensions() in AST evaluation to use specific agent ID
- Modified all API functions to handle agent-specific dimension lookups
- Updated tests to work with agent-specific dimensions

This fix ensures that dimensions accurately reflect the actual hardware
configuration of each individual GPU agent, preventing dimension mismatches
in multi-GPU systems where GPUs share the same architecture but have
different physical configurations.

Counter ID Representation Changes:
- Modified counter_id encoding to include agent information in bits 37-32
- Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
- Counter records internally store only 16-bit base metric IDs (bits 15-0)
- Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
- Instance record counter_id field uses bitwise AND mask to extract base metric ID
  (counter_id.handle & 0xFFFF) to fit in 16-bit storage
- Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
- Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
- All counter lookups now properly handle agent-encoded vs base metric IDs

This ensures counter IDs are consistent between metadata and output records while
maintaining compact storage in instance records.

Co-Authored-By: Claude <[email protected]>
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes dimension mismatches in multi-GPU systems where GPUs share the same architecture name but have different hardware configurations (e.g., different CU counts). The fix makes counter dimensions and IDs agent-specific rather than architecture-specific, ensuring accurate hardware representation per GPU.

Key changes:

  • Counter IDs now encode agent information (bits 37-32) to uniquely identify counters per agent
  • Dimension caching and lookup changed from architecture-based to agent-specific
  • Counter ID encoding/decoding utilities added to handle agent-specific IDs throughout the system

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
metrics_test.cpp Updated tests to handle agent-specific dimensions through API calls instead of direct cache access
dimension.cpp Modified to use agent ID instead of architecture name for dimension lookups
core.cpp Updated agent iteration test to extract base metric IDs from agent-encoded counter IDs
metrics.hpp/cpp Added agent-specific metric retrieval and counter ID encoding logic
id_decode.hpp Introduced counter ID encoding/decoding functions for agent-specific IDs
evaluate_ast.hpp/cpp Updated AST dimension evaluation to accept agent ID parameter
dimensions.hpp/cpp Changed dimension cache and block dimension functions to be agent-specific
counters.cpp Updated all API functions to handle agent-encoded counter IDs
counter_config.cpp Modified to extract base metric IDs from agent-encoded counter IDs
helpers.cpp Removed obsolete v1 dimension comparison test
tool.cpp Added logic to reconstruct agent-encoded counter IDs from instance records
generatePerfetto.cpp Added clarifying comments about agent-encoded IDs in output
generateCSV.cpp Added clarifying comments about agent-encoded IDs in output
print_functional_counters_client.cpp Updated to query instance count for specific agent

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +127 to +129
using DimSync =
common::Synchronized<std::unordered_map<uint64_t, std::shared_ptr<const metric_dims>>>;
static DimSync*& dim_data = common::static_object<DimSync>::construct();
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static dimension cache is now keyed by agent_id.handle but is never cleared or bounded. In systems with many agents or dynamic agent addition/removal, this could lead to unbounded memory growth. Consider implementing a cleanup mechanism or documenting the memory implications.

Copilot uses AI. Check for mistakes.

Comment on lines +188 to +189
CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH))
<< "Agent logical_node_id exceeds 6-bit limit (max 63)";
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check allows values 0-63 (64 values), but the code adds AGENT_ENCODING_OFFSET (1) to the logical_node_id before encoding. This means logical_node_id 63 + 1 = 64, which exceeds the 6-bit limit (max value 63). The check should be agent_logical_node_id < (1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET or equivalent to account for the offset.

Suggested change
CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH))
<< "Agent logical_node_id exceeds 6-bit limit (max 63)";
CHECK(agent_logical_node_id < ((1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET))
<< "Agent logical_node_id + AGENT_ENCODING_OFFSET exceeds 6-bit limit (max 63)";

Copilot uses AI. Check for mistakes.

Comment on lines +360 to +363
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET
// (We add offset so that agent 0 has non-zero encoding and is detectable as
// agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits
// can hold 0-63 and we reserve 0 to detect unencoded IDs
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is inconsistent with the encoding in id_decode.hpp. The 6-bit field can hold values 0-63, and after adding AGENT_ENCODING_OFFSET (1), the encoded value would be 1-64. Since 64 exceeds 6 bits, the check should be >= 63 (which it is), but this creates an off-by-one issue. The actual limit should be logical_node_id < 63 (i.e., 0-62) to ensure encoded values fit in 6 bits (1-63). The check should use >= 63 or the comment should clarify that only 63 agents (0-62) are supported.

Suggested change
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET
// (We add offset so that agent 0 has non-zero encoding and is detectable as
// agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits
// can hold 0-63 and we reserve 0 to detect unencoded IDs
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET.
// We add offset so that agent 0 has non-zero encoding and is detectable as agent-encoded.
// Only logical_node_id values 0-62 (i.e., 63 agents) are supported, since
// adding AGENT_ENCODING_OFFSET (1) results in encoded values 1-63, which fit in a 6-bit field.

Copilot uses AI. Check for mistakes.

Comment on lines +458 to +467

// Otherwise, find an agent with matching architecture name
for(const auto* agent : rocprofiler::agent::get_agents())
{
if(agent && std::string(agent->name) == _agent)
{
return getBlockDimensions(agent->id, metric);
}
}
// If no agent found, return empty dimensions
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When multiple agents share the same architecture name (the scenario this PR aims to fix), this loop returns dimensions for the first matching agent, which may not be the correct one. The fallback logic should either require agent_id to be provided or document that using architecture name alone is unreliable in multi-GPU scenarios.

Suggested change
// Otherwise, find an agent with matching architecture name
for(const auto* agent : rocprofiler::agent::get_agents())
{
if(agent && std::string(agent->name) == _agent)
{
return getBlockDimensions(agent->id, metric);
}
}
// If no agent found, return empty dimensions
// In multi-GPU scenarios, architecture name alone is unreliable.
// Require agent_id to be provided.
ROC_LOG_ERROR("set_dimensions: agent_id must be provided; architecture name alone is unreliable in multi-GPU scenarios.");

Copilot uses AI. Check for mistakes.

Comment on lines +175 to +180
auto agent_index =
counters::get_agent_from_counter_id(counter_id) - counters::AGENT_ENCODING_OFFSET;
for(const auto* agent : rocprofiler::agent::get_agents())
{
if(!agent) continue;
if(agent->logical_node_id == agent_index)
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern of extracting agent_index and iterating through agents to find matching logical_node_id is duplicated multiple times in this file (lines 175-185, 296-306, 423-432). Consider extracting this into a helper function like get_agent_from_counter_id(counter_id) to improve maintainability and reduce code duplication.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant