-
Notifications
You must be signed in to change notification settings - Fork 42
Remove hardcoded --distribution=arbitrary
#766
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
📝 WalkthroughWalkthroughDistribution directive emission was moved to the top when configured (emit Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
Pre-merge checks and finishing touches✅ Passed checks (2 passed)
✨ Finishing touches
📜 Recent review detailsConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro 📒 Files selected for processing (2)
🧰 Additional context used🧠 Learnings (3)📓 Common learnings📚 Learning: 2025-12-16T19:47:41.994ZApplied to files:
📚 Learning: 2025-12-05T13:59:40.479ZApplied to files:
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Fix all issues with AI Agents 🤖
In @tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py:
- Line 345: Remove the incorrect assertion that checks for a hardcoded
distribution when nodes are explicitly provided: in the test
test_common_slurm_command_gen_strategy.py (test that contains the line `assert
"#SBATCH --distribution=block" in content`), either delete that assertion or
replace it with `assert "#SBATCH --distribution=block" not in content` so the
test verifies that no distribution directive is added when `#SBATCH
--nodelist=...` is present.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
src/cloudai/systems/slurm/slurm_command_gen_strategy.pytests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
💤 Files with no reviewable changes (1)
- src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.
Applied to files:
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Applied to files:
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
🪛 GitHub Actions: CI
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
[error] 345-345: Test 'test_append_distribution_and_hostfile_with_nodes' failed: expected '#SBATCH --distribution=block' in content but found '#SBATCH --nodelist=node1,node2'.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
Show resolved
Hide resolved
Greptile SummaryRemoved hardcoded Key Changes:
Impact:
Confidence Score: 5/5
Important Files Changed
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py, line 345 (link)logic: this assertion will fail - when
node_listis provided, the code takes theif node_list:branch in_append_nodes_related_directives()(line 406-416) and returns early without adding any distribution directive
2 files reviewed, 1 comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
1-2: Fix copyright year to resolve CI failure.The CI pipeline is failing because the copyright header expects 2026 but the file contains 2025. Update line 2 to include 2026.
🔎 Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.
Applied to files:
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.
Applied to files:
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧬 Code graph analysis (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
src/cloudai/parser.py (1)
system(57-65)
🪛 GitHub Actions: CI
src/cloudai/systems/slurm/slurm_command_gen_strategy.py
[error] 1-1: Copyright year is not valid. The header expects the year 2026, but the file has 2025 (SPDX header check failing in test_src_copyright_header).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🔇 Additional comments (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
406-408: Conditional distribution directive correctly implements system-level configuration.The code at lines 406-408 properly checks for an explicitly configured distribution before emitting the
#SBATCHdirective. This allows systems to specify their preferred distribution strategy or omit it to use Slurm's default behavior.However, system configuration files in the repository (e.g.,
conf/common/system/example_slurm_cluster.toml) do not currently specify adistributionsetting, which means they will rely on Slurm's default behavior. Verify that this aligns with expected performance characteristics for all target systems.
Greptile found no issues!From now on, if a review finishes and we haven't found any issues, we will not post anything, but you can confirm that we reviewed your changes in the status check section. This feature can be toggled off in your Code Review Settings by deselecting "Create a status check for each PR". |
amaslenn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution! Please address CI issues and we will proceed.
|
Thanks! The CI issue seems to be the copyright year, should I fix it in this PR or create a new one? |
In this PR, please. We can't merge changes that do not pass CI. |
Summary
Remove hardcoded
--distribution=arbitrary, this mode can lead to weaker affinity/pinning behavior unless explicit binding is added, causing a noticeable performance drop.Also the distribution mode could be specified in the system configuration file, could safely remove it here.
Testing Notes
Compared walltime/throughput before vs after, regression disappears when removing
--distribution=arbitrary