Fix dimension mismatch for multi-GPU systems with identical architect… #1440

bwelton · 2025-10-17T04:12:57Z

…ures

This change addresses an issue where counter dimensions were incorrectly shared across all GPU agents with the same architecture name, even when those agents had different hardware configurations (e.g., different CU counts).

Changes:

Updated getBlockDimensions() to accept agent ID instead of architecture name
Made dimension cache agent-specific instead of architecture-specific
Updated set_dimensions() in AST evaluation to use specific agent ID
Modified all API functions to handle agent-specific dimension lookups
Updated tests to work with agent-specific dimensions

This fix ensures that dimensions accurately reflect the actual hardware configuration of each individual GPU agent, preventing dimension mismatches in multi-GPU systems where GPUs share the same architecture but have different physical configurations.

Counter ID Representation Changes:

Modified counter_id encoding to include agent information in bits 37-32
Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable
Counter records internally store only 16-bit base metric IDs (bits 15-0)
Tool reconstructs agent-encoded counter IDs from base metric ID & agent info
Instance record counter_id field uses bitwise AND mask to extract base metric ID (counter_id.handle & 0xFFFF) to fit in 16-bit storage
Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency
Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed
All counter lookups now properly handle agent-encoded vs base metric IDs

This ensures counter IDs are consistent between metadata and output records while maintaining compact storage in instance records.

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…ures This change addresses an issue where counter dimensions were incorrectly shared across all GPU agents with the same architecture name, even when those agents had different hardware configurations (e.g., different CU counts). Changes: - Updated getBlockDimensions() to accept agent ID instead of architecture name - Made dimension cache agent-specific instead of architecture-specific - Updated set_dimensions() in AST evaluation to use specific agent ID - Modified all API functions to handle agent-specific dimension lookups - Updated tests to work with agent-specific dimensions This fix ensures that dimensions accurately reflect the actual hardware configuration of each individual GPU agent, preventing dimension mismatches in multi-GPU systems where GPUs share the same architecture but have different physical configurations. Counter ID Representation Changes: - Modified counter_id encoding to include agent information in bits 37-32 - Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable - Counter records internally store only 16-bit base metric IDs (bits 15-0) - Tool reconstructs agent-encoded counter IDs from base metric ID & agent info - Instance record counter_id field uses bitwise AND mask to extract base metric ID (counter_id.handle & 0xFFFF) to fit in 16-bit storage - Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency - Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed - All counter lookups now properly handle agent-encoded vs base metric IDs This ensures counter IDs are consistent between metadata and output records while maintaining compact storage in instance records. Co-Authored-By: Claude <[email protected]>

Copilot

Pull Request Overview

This PR fixes dimension mismatches in multi-GPU systems where GPUs share the same architecture name but have different hardware configurations (e.g., different CU counts). The fix makes counter dimensions and IDs agent-specific rather than architecture-specific, ensuring accurate hardware representation per GPU.

Key changes:

Counter IDs now encode agent information (bits 37-32) to uniquely identify counters per agent
Dimension caching and lookup changed from architecture-based to agent-specific
Counter ID encoding/decoding utilities added to handle agent-specific IDs throughout the system

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
metrics_test.cpp	Updated tests to handle agent-specific dimensions through API calls instead of direct cache access
dimension.cpp	Modified to use agent ID instead of architecture name for dimension lookups
core.cpp	Updated agent iteration test to extract base metric IDs from agent-encoded counter IDs
metrics.hpp/cpp	Added agent-specific metric retrieval and counter ID encoding logic
id_decode.hpp	Introduced counter ID encoding/decoding functions for agent-specific IDs
evaluate_ast.hpp/cpp	Updated AST dimension evaluation to accept agent ID parameter
dimensions.hpp/cpp	Changed dimension cache and block dimension functions to be agent-specific
counters.cpp	Updated all API functions to handle agent-encoded counter IDs
counter_config.cpp	Modified to extract base metric IDs from agent-encoded counter IDs
helpers.cpp	Removed obsolete v1 dimension comparison test
tool.cpp	Added logic to reconstruct agent-encoded counter IDs from instance records
generatePerfetto.cpp	Added clarifying comments about agent-encoded IDs in output
generateCSV.cpp	Added clarifying comments about agent-encoded IDs in output
print_functional_counters_client.cpp	Updated to query instance count for specific agent

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-10-17T04:14:40Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/dimensions.cpp

+    using DimSync =
+        common::Synchronized<std::unordered_map<uint64_t, std::shared_ptr<const metric_dims>>>;
+    static DimSync*& dim_data = common::static_object<DimSync>::construct();


The static dimension cache is now keyed by agent_id.handle but is never cleared or bounded. In systems with many agents or dynamic agent addition/removal, this could lead to unbounded memory growth. Consider implementing a cleanup mechanism or documenting the memory implications.

Copilot · 2025-10-17T04:14:40Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/id_decode.hpp

+    CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH))
+        << "Agent logical_node_id exceeds 6-bit limit (max 63)";


The check allows values 0-63 (64 values), but the code adds AGENT_ENCODING_OFFSET (1) to the logical_node_id before encoding. This means logical_node_id 63 + 1 = 64, which exceeds the 6-bit limit (max value 63). The check should be agent_logical_node_id < (1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET or equivalent to account for the offset.

Suggested change

CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH))

<< "Agent logical_node_id exceeds 6-bit limit (max 63)";

CHECK(agent_logical_node_id < ((1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET))

<< "Agent logical_node_id + AGENT_ENCODING_OFFSET exceeds 6-bit limit (max 63)";

Copilot · 2025-10-17T04:14:41Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/metrics.cpp

+            // Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET
+            // (We add offset so that agent 0 has non-zero encoding and is detectable as
+            // agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits
+            // can hold 0-63 and we reserve 0 to detect unencoded IDs


This check is inconsistent with the encoding in id_decode.hpp. The 6-bit field can hold values 0-63, and after adding AGENT_ENCODING_OFFSET (1), the encoded value would be 1-64. Since 64 exceeds 6 bits, the check should be >= 63 (which it is), but this creates an off-by-one issue. The actual limit should be logical_node_id < 63 (i.e., 0-62) to ensure encoded values fit in 6 bits (1-63). The check should use >= 63 or the comment should clarify that only 63 agents (0-62) are supported.

Suggested change

// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET

// (We add offset so that agent 0 has non-zero encoding and is detectable as

// agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits

// can hold 0-63 and we reserve 0 to detect unencoded IDs

// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET.

// We add offset so that agent 0 has non-zero encoding and is detectable as agent-encoded.

// Only logical_node_id values 0-62 (i.e., 63 agents) are supported, since

// adding AGENT_ENCODING_OFFSET (1) results in encoded values 1-63, which fit in a 6-bit field.

Copilot · 2025-10-17T04:14:41Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters/evaluate_ast.cpp

+
+        // Otherwise, find an agent with matching architecture name
+        for(const auto* agent : rocprofiler::agent::get_agents())
+        {
+            if(agent && std::string(agent->name) == _agent)
+            {
+                return getBlockDimensions(agent->id, metric);
+            }
+        }
+        // If no agent found, return empty dimensions


When multiple agents share the same architecture name (the scenario this PR aims to fix), this loop returns dimensions for the first matching agent, which may not be the correct one. The fallback logic should either require agent_id to be provided or document that using architecture name alone is unreliable in multi-GPU scenarios.

Suggested change

// Otherwise, find an agent with matching architecture name

for(const auto* agent : rocprofiler::agent::get_agents())

{

if(agent && std::string(agent->name) == _agent)

{

return getBlockDimensions(agent->id, metric);

}

}

// If no agent found, return empty dimensions

// In multi-GPU scenarios, architecture name alone is unreliable.

// Require agent_id to be provided.

ROC_LOG_ERROR("set_dimensions: agent_id must be provided; architecture name alone is unreliable in multi-GPU scenarios.");

Copilot · 2025-10-17T04:14:41Z

projects/rocprofiler-sdk/source/lib/rocprofiler-sdk/counters.cpp

+        auto agent_index =
+            counters::get_agent_from_counter_id(counter_id) - counters::AGENT_ENCODING_OFFSET;
+        for(const auto* agent : rocprofiler::agent::get_agents())
+        {
+            if(!agent) continue;
+            if(agent->logical_node_id == agent_index)


This pattern of extracting agent_index and iterating through agents to find matching logical_node_id is duplicated multiple times in this file (lines 175-185, 296-306, 423-432). Consider extracting this into a helper function like get_agent_from_counter_id(counter_id) to improve maintainability and reduce code duplication.

Copilot AI review requested due to automatic review settings October 17, 2025 04:12

bwelton force-pushed the bewelton/dim-fx branch from 809b2eb to 6d8ce5e Compare October 17, 2025 04:13

github-actions bot added the project: rocprofiler-sdk label Oct 17, 2025

Copilot AI reviewed Oct 17, 2025

View reviewed changes

systems-assistant bot added the organization: ROCm label Oct 17, 2025

Debug prints

2d69242

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix dimension mismatch for multi-GPU systems with identical architect… #1440

Fix dimension mismatch for multi-GPU systems with identical architect… #1440

Uh oh!

bwelton commented Oct 17, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Oct 17, 2025

Uh oh!

Copilot AI Oct 17, 2025

Uh oh!

Copilot AI Oct 17, 2025

Uh oh!

Copilot AI Oct 17, 2025

Uh oh!

Copilot AI Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH))
		<< "Agent logical_node_id exceeds 6-bit limit (max 63)";

Fix dimension mismatch for multi-GPU systems with identical architect… #1440

Are you sure you want to change the base?

Fix dimension mismatch for multi-GPU systems with identical architect… #1440

Uh oh!

Conversation

bwelton commented Oct 17, 2025

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant