-
Notifications
You must be signed in to change notification settings - Fork 51
Fix dimension mismatch for multi-GPU systems with identical architect… #1440
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…ures This change addresses an issue where counter dimensions were incorrectly shared across all GPU agents with the same architecture name, even when those agents had different hardware configurations (e.g., different CU counts). Changes: - Updated getBlockDimensions() to accept agent ID instead of architecture name - Made dimension cache agent-specific instead of architecture-specific - Updated set_dimensions() in AST evaluation to use specific agent ID - Modified all API functions to handle agent-specific dimension lookups - Updated tests to work with agent-specific dimensions This fix ensures that dimensions accurately reflect the actual hardware configuration of each individual GPU agent, preventing dimension mismatches in multi-GPU systems where GPUs share the same architecture but have different physical configurations. Counter ID Representation Changes: - Modified counter_id encoding to include agent information in bits 37-32 - Agent logical_node_id is encoded as (value + 1) to ensure agent 0 is detectable - Counter records internally store only 16-bit base metric IDs (bits 15-0) - Tool reconstructs agent-encoded counter IDs from base metric ID & agent info - Instance record counter_id field uses bitwise AND mask to extract base metric ID (counter_id.handle & 0xFFFF) to fit in 16-bit storage - Output generators (CSV, JSON, Perfetto) use agent-encoded IDs for consistency - Updated counter_config.cpp and metrics.cpp to extract base metric ID when needed - All counter lookups now properly handle agent-encoded vs base metric IDs This ensures counter IDs are consistent between metadata and output records while maintaining compact storage in instance records. Co-Authored-By: Claude <[email protected]>
809b2eb
to
6d8ce5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes dimension mismatches in multi-GPU systems where GPUs share the same architecture name but have different hardware configurations (e.g., different CU counts). The fix makes counter dimensions and IDs agent-specific rather than architecture-specific, ensuring accurate hardware representation per GPU.
Key changes:
- Counter IDs now encode agent information (bits 37-32) to uniquely identify counters per agent
- Dimension caching and lookup changed from architecture-based to agent-specific
- Counter ID encoding/decoding utilities added to handle agent-specific IDs throughout the system
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.
Show a summary per file
File | Description |
---|---|
metrics_test.cpp | Updated tests to handle agent-specific dimensions through API calls instead of direct cache access |
dimension.cpp | Modified to use agent ID instead of architecture name for dimension lookups |
core.cpp | Updated agent iteration test to extract base metric IDs from agent-encoded counter IDs |
metrics.hpp/cpp | Added agent-specific metric retrieval and counter ID encoding logic |
id_decode.hpp | Introduced counter ID encoding/decoding functions for agent-specific IDs |
evaluate_ast.hpp/cpp | Updated AST dimension evaluation to accept agent ID parameter |
dimensions.hpp/cpp | Changed dimension cache and block dimension functions to be agent-specific |
counters.cpp | Updated all API functions to handle agent-encoded counter IDs |
counter_config.cpp | Modified to extract base metric IDs from agent-encoded counter IDs |
helpers.cpp | Removed obsolete v1 dimension comparison test |
tool.cpp | Added logic to reconstruct agent-encoded counter IDs from instance records |
generatePerfetto.cpp | Added clarifying comments about agent-encoded IDs in output |
generateCSV.cpp | Added clarifying comments about agent-encoded IDs in output |
print_functional_counters_client.cpp | Updated to query instance count for specific agent |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
using DimSync = | ||
common::Synchronized<std::unordered_map<uint64_t, std::shared_ptr<const metric_dims>>>; | ||
static DimSync*& dim_data = common::static_object<DimSync>::construct(); |
Copilot
AI
Oct 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The static dimension cache is now keyed by agent_id.handle but is never cleared or bounded. In systems with many agents or dynamic agent addition/removal, this could lead to unbounded memory growth. Consider implementing a cleanup mechanism or documenting the memory implications.
Copilot uses AI. Check for mistakes.
CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH)) | ||
<< "Agent logical_node_id exceeds 6-bit limit (max 63)"; |
Copilot
AI
Oct 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check allows values 0-63 (64 values), but the code adds AGENT_ENCODING_OFFSET (1) to the logical_node_id before encoding. This means logical_node_id 63 + 1 = 64, which exceeds the 6-bit limit (max value 63). The check should be agent_logical_node_id < (1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET
or equivalent to account for the offset.
CHECK(agent_logical_node_id < (1 << AGENT_BIT_LENGTH)) | |
<< "Agent logical_node_id exceeds 6-bit limit (max 63)"; | |
CHECK(agent_logical_node_id < ((1 << AGENT_BIT_LENGTH) - AGENT_ENCODING_OFFSET)) | |
<< "Agent logical_node_id + AGENT_ENCODING_OFFSET exceeds 6-bit limit (max 63)"; |
Copilot uses AI. Check for mistakes.
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET | ||
// (We add offset so that agent 0 has non-zero encoding and is detectable as | ||
// agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits | ||
// can hold 0-63 and we reserve 0 to detect unencoded IDs |
Copilot
AI
Oct 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is inconsistent with the encoding in id_decode.hpp. The 6-bit field can hold values 0-63, and after adding AGENT_ENCODING_OFFSET (1), the encoded value would be 1-64. Since 64 exceeds 6 bits, the check should be >= 63
(which it is), but this creates an off-by-one issue. The actual limit should be logical_node_id < 63 (i.e., 0-62) to ensure encoded values fit in 6 bits (1-63). The check should use >= 63
or the comment should clarify that only 63 agents (0-62) are supported.
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET | |
// (We add offset so that agent 0 has non-zero encoding and is detectable as | |
// agent-encoded) Note: This limits us to encoding agents 0-62 (63 agents), since 6 bits | |
// can hold 0-63 and we reserve 0 to detect unencoded IDs | |
// Encode agent into counter ID using agent's logical_node_id + AGENT_ENCODING_OFFSET. | |
// We add offset so that agent 0 has non-zero encoding and is detectable as agent-encoded. | |
// Only logical_node_id values 0-62 (i.e., 63 agents) are supported, since | |
// adding AGENT_ENCODING_OFFSET (1) results in encoded values 1-63, which fit in a 6-bit field. |
Copilot uses AI. Check for mistakes.
|
||
// Otherwise, find an agent with matching architecture name | ||
for(const auto* agent : rocprofiler::agent::get_agents()) | ||
{ | ||
if(agent && std::string(agent->name) == _agent) | ||
{ | ||
return getBlockDimensions(agent->id, metric); | ||
} | ||
} | ||
// If no agent found, return empty dimensions |
Copilot
AI
Oct 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When multiple agents share the same architecture name (the scenario this PR aims to fix), this loop returns dimensions for the first matching agent, which may not be the correct one. The fallback logic should either require agent_id to be provided or document that using architecture name alone is unreliable in multi-GPU scenarios.
// Otherwise, find an agent with matching architecture name | |
for(const auto* agent : rocprofiler::agent::get_agents()) | |
{ | |
if(agent && std::string(agent->name) == _agent) | |
{ | |
return getBlockDimensions(agent->id, metric); | |
} | |
} | |
// If no agent found, return empty dimensions | |
// In multi-GPU scenarios, architecture name alone is unreliable. | |
// Require agent_id to be provided. | |
ROC_LOG_ERROR("set_dimensions: agent_id must be provided; architecture name alone is unreliable in multi-GPU scenarios."); |
Copilot uses AI. Check for mistakes.
auto agent_index = | ||
counters::get_agent_from_counter_id(counter_id) - counters::AGENT_ENCODING_OFFSET; | ||
for(const auto* agent : rocprofiler::agent::get_agents()) | ||
{ | ||
if(!agent) continue; | ||
if(agent->logical_node_id == agent_index) |
Copilot
AI
Oct 17, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pattern of extracting agent_index and iterating through agents to find matching logical_node_id is duplicated multiple times in this file (lines 175-185, 296-306, 423-432). Consider extracting this into a helper function like get_agent_from_counter_id(counter_id)
to improve maintainability and reduce code duplication.
Copilot uses AI. Check for mistakes.
…ures
This change addresses an issue where counter dimensions were incorrectly shared across all GPU agents with the same architecture name, even when those agents had different hardware configurations (e.g., different CU counts).
Changes:
This fix ensures that dimensions accurately reflect the actual hardware configuration of each individual GPU agent, preventing dimension mismatches in multi-GPU systems where GPUs share the same architecture but have different physical configurations.
Counter ID Representation Changes:
This ensures counter IDs are consistent between metadata and output records while maintaining compact storage in instance records.
Motivation
Technical Details
Test Plan
Test Result
Submission Checklist