@juntaowww
Contributor

Summary

Remove the hardcoded --distribution=arbitrary. This mode can lead to weaker affinity/pinning behavior unless explicit binding is added, causing a noticeable performance drop.

The distribution mode can also be specified in the system configuration file, so it is safe to remove it here.

Testing Notes

Compared walltime/throughput before and after the change; the regression disappears when --distribution=arbitrary is removed.

@coderabbitai
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

Distribution directive emission was moved to the top when configured (emit --distribution before other SBATCH directives). Distribution directives were removed from the node_list and -N paths. Hostfile generation and other node-list/-N behaviors are unchanged; one test expectation updated from arbitrary to block.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| SLURM distribution placement: src/cloudai/systems/slurm/slurm_command_gen_strategy.py | Emit #SBATCH --distribution=... at the top if self.system.distribution is set; remove insertion of --distribution=arbitrary from node_list and -N branches. Hostfile and nodelist/-N logic otherwise unchanged. |
| Test update: tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py | Update test_append_distribution_and_hostfile_with_nodes to expect #SBATCH --distribution=block (and adjust file header year). |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through lines of SBATCH light,
Moved distribution to the topmost right,
Nodes keep their files, the list stays true,
Tests changed a word — now "block" hops through. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately reflects the main change: removing hardcoded --distribution=arbitrary flag from the Slurm command generation. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining why the hardcoded flag is being removed and providing testing evidence. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8fcc20a and 812cf11.

📒 Files selected for processing (2)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.

Applied to files:

  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Greptile Review
  • GitHub Check: Run pytest (3.10)
  • GitHub Check: Run pytest (3.12)
🔇 Additional comments (4)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (2)

2-2: LGTM!

Copyright year updated correctly to include 2026, as noted in the PR comments addressing the CI failure.


406-408: Correct implementation of configurable distribution directive.

The distribution directive is now emitted only when explicitly configured via self.system.distribution, using the configured value instead of a hardcoded "arbitrary". This placement at the top of _append_nodes_related_directives ensures the directive is emitted regardless of whether nodes are specified via --nodelist or -N, which aligns with Slurm semantics where --distribution and node specification are independent, complementary options.

This implementation correctly addresses the PR objective of removing the hardcoded --distribution=arbitrary that was causing performance regression.
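
The control flow described in this comment can be sketched roughly as follows. This is a simplified, hypothetical stand-in and not the repository's actual method: the function name `nodes_related_directives`, its parameters, and the omission of hostfile creation are assumptions made for illustration only.

```python
from typing import List, Optional


def nodes_related_directives(
    distribution: Optional[str],
    node_list: List[str],
    num_nodes: int,
) -> List[str]:
    """Return SBATCH lines for node selection (hypothetical, simplified sketch)."""
    lines: List[str] = []

    # Emit the distribution first, and only when one is configured.
    # Nothing is hardcoded; with None, Slurm's default behavior applies.
    if distribution:
        lines.append(f"#SBATCH --distribution={distribution}")

    if node_list:
        # Explicit node list. The real code also writes a hostfile on this path,
        # which this sketch leaves out.
        lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
    else:
        lines.append(f"#SBATCH -N {num_nodes}")

    return lines


with_nodes = nodes_related_directives("block", ["node1", "node2"], 2)
without_distribution = nodes_related_directives(None, [], 4)
print(with_nodes)
print(without_distribution)
```

With a configured value of `block`, the distribution line appears on both the node-list and the node-count path; with no configured value, no distribution line is emitted at all.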

tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py (2)

2-2: LGTM!

Copyright year updated correctly to include 2026.


345-345: Test assertion is correct; past review concern has been resolved.

The assertion correctly validates that the distribution directive is emitted when slurm_system.distribution is configured, even with an explicit node list. The implementation (lines 406-408 in slurm_command_gen_strategy.py) was updated to emit the distribution directive at the top of SBATCH directives when configured, making it independent of the node specification method. This is the proper Slurm behavior, as --distribution and --nodelist are complementary options.

The past review comment suggesting removal of this assertion is now outdated—the implementation was correctly fixed to match the test expectation rather than the other way around.


Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Fix all issues with AI Agents 🤖
In @tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py:
- Line 345: Remove the incorrect assertion that checks for a hardcoded
distribution when nodes are explicitly provided: in the test
test_common_slurm_command_gen_strategy.py (test that contains the line `assert
"#SBATCH --distribution=block" in content`), either delete that assertion or
replace it with `assert "#SBATCH --distribution=block" not in content` so the
test verifies that no distribution directive is added when `#SBATCH
--nodelist=...` is present.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33ab41e and c8d33c4.

📒 Files selected for processing (2)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
💤 Files with no reviewable changes (1)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.

Applied to files:

  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py
🪛 GitHub Actions: CI
tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py

[error] 345-345: Test 'test_append_distribution_and_hostfile_with_nodes' failed: expected '#SBATCH --distribution=block' in content but found '#SBATCH --nodelist=node1,node2'.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review

@greptile-apps
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile Summary

Removed hardcoded --distribution=arbitrary from Slurm command generation, which was causing performance degradation due to weak affinity/pinning behavior. The distribution mode is now consistently applied from system configuration for both explicit node lists and node count scenarios.

Key Changes:

  • Moved system.distribution check to apply before both node list and node count paths
  • Removed hardcoded --distribution=arbitrary that was only applied when using explicit node lists
  • Updated test to validate system-configured distribution instead of hardcoded value
  • Copyright year updated to 2026

Impact:

  • Fixes performance regression caused by arbitrary distribution mode
  • Ensures consistent distribution behavior across all Slurm job configurations
  • Allows system administrators to control distribution policy through system configuration files

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a performance regression by removing hardcoded behavior that conflicted with system configuration
  • The change is well-tested, logically sound, and addresses a real performance issue. The logic consolidation improves code quality by applying distribution settings consistently. Tests were updated appropriately to validate the new behavior, and the change aligns with the design principle of allowing system-level configuration control
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/cloudai/systems/slurm/slurm_command_gen_strategy.py | Removed hardcoded --distribution=arbitrary, now applies system distribution setting consistently for both node list and node count scenarios |
| tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py | Updated test to expect system-configured distribution (block) instead of hardcoded arbitrary, properly validates the fix |

Sequence Diagram

sequenceDiagram
    participant User
    participant SlurmCommandGenStrategy
    participant SlurmSystem
    participant BatchScript

    User->>SlurmCommandGenStrategy: _append_nodes_related_directives()
    SlurmCommandGenStrategy->>SlurmCommandGenStrategy: get_cached_nodes_spec()
    SlurmCommandGenStrategy->>SlurmSystem: Check system.distribution
    
    alt system.distribution is set
        SlurmSystem-->>SlurmCommandGenStrategy: distribution value
        SlurmCommandGenStrategy->>BatchScript: Add --distribution={value}
    end
    
    alt node_list exists
        SlurmCommandGenStrategy->>BatchScript: Add --nodelist={nodes}
        SlurmCommandGenStrategy->>SlurmCommandGenStrategy: Create hostfile.txt
    else num_nodes only
        SlurmCommandGenStrategy->>BatchScript: Add -N {num_nodes}
    end

Contributor

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. tests/slurm_command_gen_strategy/test_common_slurm_command_gen_strategy.py, line 345

    logic: this assertion will fail - when node_list is provided, the code takes the if node_list: branch in _append_nodes_related_directives() (line 406-416) and returns early without adding any distribution directive
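
A minimal sketch of the control flow this comment describes, assuming the reviewed revision consulted the configured distribution only on the `-N` path. The function below is hypothetical and simplified; it is not the repository's code, and it omits hostfile creation.

```python
from typing import List, Optional


def directives_before_fix(
    distribution: Optional[str],
    node_list: List[str],
    num_nodes: int,
) -> List[str]:
    """Hypothetical sketch of the reviewed revision's control flow."""
    if node_list:
        # Early return: the configured distribution is never consulted here.
        return [f"#SBATCH --nodelist={','.join(node_list)}"]

    lines = [f"#SBATCH -N {num_nodes}"]
    if distribution:
        lines.append(f"#SBATCH --distribution={distribution}")
    return lines


content = "\n".join(directives_before_fix("block", ["node1", "node2"], 2))
# Mirrors the test's check for the distribution directive in the script content.
assertion_holds = "#SBATCH --distribution=block" in content
print(assertion_holds)
```

Under that assumption the script content contains only the `--nodelist` line, so a check for `#SBATCH --distribution=block` cannot succeed when a node list is given.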

2 files reviewed, 1 comment

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)

1-2: Fix copyright year to resolve CI failure.

The CI pipeline is failing because the copyright header expects 2026 but the file contains 2025. Update line 2 to include 2026.

🔎 Proposed fix
-# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c8d33c4 and 8fcc20a.

📒 Files selected for processing (1)
  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: srivatsankrishnan
Repo: NVIDIA/cloudai PR: 760
File: tests/standalone_command_gen_strategy/test_aiconfigurator_standalone_command_gen_strategy.py:33-122
Timestamp: 2025-12-17T22:24:51.805Z
Learning: In the NVIDIA/cloudai repository, avoid suggesting overly nitpick refactor comments such as test parametrization when there are only two test cases with different modes (e.g., agg vs disagg). Such refactoring suggestions are not needed unless explicitly requested.
📚 Learning: 2025-12-05T13:59:40.479Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 739
File: src/cloudai/workloads/ai_dynamo/report_generation_strategy.py:123-138
Timestamp: 2025-12-05T13:59:40.479Z
Learning: In the AI Dynamo workload for CloudAI, num_nodes fields in WorkerBaseArgs can be typed as `int | list[int]`, but lists are unrolled at the cmd_gen/json_gen level. By the time report generation runs, only scalar integer values are present in num_nodes fields. The Slurm command generation strategy enforces this with explicit assertions.

Applied to files:

  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
📚 Learning: 2025-12-16T19:47:41.994Z
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Applied to files:

  • src/cloudai/systems/slurm/slurm_command_gen_strategy.py
🧬 Code graph analysis (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)
src/cloudai/parser.py (1)
  • system (57-65)
🪛 GitHub Actions: CI
src/cloudai/systems/slurm/slurm_command_gen_strategy.py

[error] 1-1: Copyright year is not valid. The header expects the year 2026, but the file has 2025 (SPDX header check failing in test_src_copyright_header).
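
For illustration, a header-year check of this kind might look like the sketch below. The implementation of `test_src_copyright_header` is not shown in this thread, so the regular expression, the helper name `header_end_year`, and the `expected_year` variable are all assumptions.

```python
import re
from typing import Optional

# Matches "Copyright (c) 2024 NVIDIA ..." or "Copyright (c) 2024-2026 NVIDIA ...".
HEADER_RE = re.compile(r"Copyright \(c\) (\d{4})(?:-(\d{4}))? NVIDIA CORPORATION")


def header_end_year(header_line: str) -> Optional[int]:
    """Return the last year named in the header, or None if no header matches."""
    match = HEADER_RE.search(header_line)
    if match is None:
        return None
    # Use the end of the range when present, otherwise the single year.
    return int(match.group(2) or match.group(1))


old = "# Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
new = "# Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved."
expected_year = 2026  # the year the CI message says the header should carry

print(header_end_year(old) == expected_year)
print(header_end_year(new) == expected_year)
```

A file modified in 2026 whose header still ends in 2025 would fail such a check, which matches the error reported above.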

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🔇 Additional comments (1)
src/cloudai/systems/slurm/slurm_command_gen_strategy.py (1)

406-408: Conditional distribution directive correctly implements system-level configuration.

The code at lines 406-408 properly checks for an explicitly configured distribution before emitting the #SBATCH directive. This allows systems to specify their preferred distribution strategy or omit it to use Slurm's default behavior.

However, system configuration files in the repository (e.g., conf/common/system/example_slurm_cluster.toml) do not currently specify a distribution setting, which means they will rely on Slurm's default behavior. Verify that this aligns with expected performance characteristics for all target systems.

@greptile-apps
Contributor

greptile-apps bot commented Jan 5, 2026

Greptile found no issues!

Contributor

@amaslenn amaslenn left a comment

Thank you for your contribution! Please address CI issues and we will proceed.

@juntaowww
Contributor Author

Thanks! The CI issue seems to be the copyright year, should I fix it in this PR or create a new one?

@amaslenn
Contributor

amaslenn commented Jan 5, 2026

Thanks! The CI issue seems to be the copyright year, should I fix it in this PR or create a new one?

In this PR, please. We can't merge changes that do not pass CI.

@amaslenn amaslenn merged commit 6da438c into NVIDIA:main Jan 5, 2026
5 checks passed
@juntaowww juntaowww deleted the sbatch_distribution branch January 22, 2026 06:30