-
Notifications
You must be signed in to change notification settings - Fork 140
Improve Slurm Testing Reproducibility #1050
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
Signed-off-by: George Armstrong <[email protected]>
📝 WalkthroughWalkthroughThe PR refactors Slurm test infrastructure to centralize cluster configuration handling. New utility functions ( Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/slurm-tests/run_all.sh (1)
3-4: Remove redundant initial assignments.Lines 3-4 assign
CLUSTERandRUN_NAMEfrom positional parameters, but these are immediately reassigned at lines 25-26 after the argument parsing loop. The initial assignments are never used.-CLUSTER=$1 -RUN_NAME=${2:-$(date +%Y-%m-%d)} # Parse --cluster_config_mode flag with default 'assert' CLUSTER_CONFIG_MODE="assert" POSITIONAL_ARGS=()tests/slurm-tests/utils.py (3)
184-192: Clarify redundant job_dir assignment.Line 191 appears redundant. In the
ifbranch (lines 184-186),ssh_tunnel.job_diris set butcluster_config["job_dir"]might not exist. In theelsebranch (lines 187-189),cluster_config["job_dir"]is explicitly set. Then line 191 usesget(..., test_job_dir)which would only do something if neither branch was taken—but one always is.Consider simplifying:
if "ssh_tunnel" in cluster_config: cluster_config["ssh_tunnel"]["job_dir"] = test_job_dir job_dir = cluster_config["ssh_tunnel"]["job_dir"] else: cluster_config["job_dir"] = test_job_dir job_dir = cluster_config["job_dir"] - cluster_config["job_dir"] = cluster_config.get("job_dir", test_job_dir) + # Ensure top-level job_dir is always set for consistency + if "job_dir" not in cluster_config: + cluster_config["job_dir"] = test_job_dir _resolve_container_image_paths(cluster_config)
228-238: Consider narrowing the exception type.The bare
Exceptioncatch (line 234) is intentional for graceful fallback, but could mask unexpected errors. Consider catching more specific exceptions or at minimum logging the exception for debugging.try: tunnel = get_tunnel(cluster_config) result = tunnel.run(f"readlink -f {shlex.quote(path)}", hide=True, warn=True) resolved_remote = result.stdout.strip() if result.exited == 0 else "" return resolved_remote or local_resolved - except Exception: + except (OSError, IOError, RuntimeError) as exc: + # Log for debugging but continue with local fallback return local_resolved finally: if tunnel is not None: tunnel.cleanup()
365-390: Consider using non-deprecated datetime API.
datetime.utcnow()is deprecated since Python 3.12 in favor of timezone-aware alternatives.+from datetime import datetime, timezone + def _collect_repo_metadata() -> dict: """Gather information about the current NeMo-Skills checkout.""" repo_root = _get_repo_root() metadata = { "repo_root": str(repo_root), - "timestamp_utc": datetime.utcnow().isoformat(timespec="seconds") + "Z", + "timestamp_utc": datetime.now(timezone.utc).isoformat(timespec="seconds").replace("+00:00", "Z"), }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
tests/slurm-tests/gpt_oss_python_aime25/run_test.py(3 hunks)tests/slurm-tests/omr_simple_recipe/run_test.py(3 hunks)tests/slurm-tests/qwen3_4b_evals/run_test.py(3 hunks)tests/slurm-tests/qwen3coder_30b_swebench/run_test.py(4 hunks)tests/slurm-tests/run_all.sh(1 hunks)tests/slurm-tests/super_49b_evals/run_test.py(3 hunks)tests/slurm-tests/utils.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/slurm-tests/gpt_oss_python_aime25/run_test.py (2)
tests/slurm-tests/utils.py (2)
add_common_args(287-319)prepare_cluster_config_for_test(118-201)nemo_skills/pipeline/cli.py (1)
wrap_arguments(43-52)
tests/slurm-tests/qwen3_4b_evals/run_test.py (2)
tests/slurm-tests/utils.py (2)
add_common_args(287-319)prepare_cluster_config_for_test(118-201)nemo_skills/pipeline/cli.py (1)
wrap_arguments(43-52)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
tests/slurm-tests/utils.py (2)
add_common_args(287-319)prepare_cluster_config_for_test(118-201)
tests/slurm-tests/omr_simple_recipe/run_test.py (2)
tests/slurm-tests/utils.py (2)
add_common_args(287-319)prepare_cluster_config_for_test(118-201)nemo_skills/pipeline/cli.py (1)
wrap_arguments(43-52)
tests/slurm-tests/super_49b_evals/run_test.py (2)
tests/slurm-tests/utils.py (2)
add_common_args(287-319)prepare_cluster_config_for_test(118-201)nemo_skills/pipeline/cli.py (1)
wrap_arguments(43-52)
tests/slurm-tests/utils.py (3)
nemo_skills/pipeline/utils/cluster.py (5)
cluster_download_file(480-482)cluster_path_exists(485-488)cluster_upload(565-579)get_cluster_config(314-368)get_tunnel(451-456)nemo_skills/pipeline/utils/mounts.py (1)
get_mounts_from_config(399-455)nemo_skills/pipeline/utils/exp.py (1)
stdout(124-125)
🪛 Ruff (0.14.7)
tests/slurm-tests/utils.py
61-61: Starting a process with a partial executable path
(S607)
153-156: Avoid specifying long messages outside the exception class
(TRY003)
233-233: Consider moving this statement to an else block
(TRY300)
234-234: Do not catch blind exception: Exception
(BLE001)
258-260: Avoid specifying long messages outside the exception class
(TRY003)
263-263: Avoid specifying long messages outside the exception class
(TRY003)
275-278: Avoid specifying long messages outside the exception class
(TRY003)
348-348: Avoid specifying long messages outside the exception class
(TRY003)
356-359: Avoid specifying long messages outside the exception class
(TRY003)
374-374: subprocess call: check for execution of untrusted input
(S603)
375-375: Starting a process with a partial executable path
(S607)
🔇 Additional comments (13)
tests/slurm-tests/run_all.sh (1)
9-39: LGTM!The argument parsing loop correctly handles the optional
--cluster_config_modeflag while preserving positional parameters, and the flag is consistently propagated to all test invocations.tests/slurm-tests/omr_simple_recipe/run_test.py (2)
16-21: LGTM on the import structure.The
sys.pathmanipulation to import from the parent directory is consistent with other test scripts in this PR.
39-59: Cluster config inconsistency: subprocess bypasses job_dir modificationsThe
clusterobject prepared viaprepare_cluster_config_for_test(which setsjob_dirto{workspace}/nemo-run-experiments) is only applied to therun_cmdcall on line 36-43, but thesimplified_recipesubprocess invocation on line 30 passesargs.cluster(raw cluster name/path). This means all experiments launched bysimplified_recipewill use the default cluster config with the originaljob_dir, causing experiment artifacts to be stored outside the test workspace—only the check_results job will use the modified config.Either pass the prepared cluster config to
simplified_recipe(e.g., via a saved config file or new command-line argument) or clarify if this inconsistency is intentional.tests/slurm-tests/super_49b_evals/run_test.py (2)
16-21: LGTM on centralized imports.The import pattern is consistent with other test files in this PR.
319-361: LGTM on cluster object propagation.The prepared cluster object is correctly passed through to
setup(),eval_reasoning_on(),eval_reasoning_off(), and the finalrun_cmd()call, ensuring consistent use of the modified cluster configuration throughout the test.tests/slurm-tests/qwen3_4b_evals/run_test.py (1)
142-188: LGTM!The cluster configuration is correctly prepared and consistently passed to all evaluation functions (
eval_qwen3_bfcl,eval_qwen3_online_genselect,eval_qwen3_offline_genselect) and the finalrun_cmdcall.tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
51-96: LGTM!The cluster object is correctly prepared and passed to both
eval_qwen3coderandrun_cmdwithin the loop. The localworkspacevariable shadowingargs.workspaceat line 72 is intentional to create agent-specific workspace paths.tests/slurm-tests/gpt_oss_python_aime25/run_test.py (1)
57-88: LGTM!The refactoring correctly uses
add_common_args, prepares the cluster configuration, and consistently passes the cluster object toeval_gpt_oss_pythonandrun_cmd.tests/slurm-tests/utils.py (5)
15-49: LGTM on imports and constants.The imports are appropriate for the functionality, and the constants clearly define the supported modes and provide helpful error messaging for uncommitted changes.
56-72: LGTM on repository root detection.The
lru_cacheis appropriate since the repo root won't change during execution. Usinggitwithout a full path is acceptable as it's expected to be inPATHon systems running these tests.
241-284: LGTM on snapshot synchronization logic.The three-mode handling (
reuse,assert,overwrite) is implemented correctly:
reusedownloads and validates existing configassertwith existing config verifies equalityassertwithout existing (oroverwrite) uploads the new configThe implicit creation of a new snapshot when none exists in
assertmode is sensible behavior.
287-319: LGTM on common argument registration.The helper cleanly consolidates shared CLI arguments across test entrypoints. The optional
include_wandbparameter provides flexibility for tests that don't use W&B.
393-432: LGTM on remote file operations.The temp file handling with
try/finallycleanup is correct. Usingyaml.safe_load/safe_dumpis the right choice for security.
activatedgeek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Added some comments for consideration, and some simplifications of potentially redundant args.
|
|
||
|
|
||
| def _upload_json(cluster_config: dict, data: dict, remote_path: str): | ||
| with tempfile.NamedTemporaryFile(mode="wt", encoding="utf-8", delete=False) as tmp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't really need the explicit encoding argument since it is always default in Python 3?
|
|
||
|
|
||
| def _download_remote_yaml(cluster_config: dict, remote_path: str) -> dict: | ||
| with tempfile.NamedTemporaryFile(delete=False) as tmp: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that the reason for operating on the tempfile outside the context is to allow separate open calls. If that is the case, tempfile.mkstemp would be the more appropriate and less laborious option.
| from pathlib import Path | ||
|
|
||
| # Add parent directory to path to import utils | ||
| sys.path.insert(0, str(Path(__file__).parents[1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consideration: If there are common utilities, then perhaps we can move them under nemo_skills itself, and avoid the ugly sys.path.
Greptile SummaryRefactored Slurm test infrastructure to improve reproducibility by introducing centralized cluster configuration management.
Confidence Score: 5/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Test as Test Script (run_test.py)
participant Utils as utils.prepare_cluster_config_for_test
participant Cluster as Cluster/Remote
participant NemoSkills as nemo_skills.pipeline
Test->>Utils: prepare_cluster_config_for_test(cluster, workspace, mode)
Utils->>NemoSkills: get_cluster_config(cluster)
NemoSkills-->>Utils: cluster_config dict
Utils->>Utils: Resolve workspace mount path to source
Utils->>Utils: Override job_dir = {workspace}/nemo-run-experiments
Utils->>Utils: Resolve container image symlinks
Utils->>Utils: Collect git metadata (commit, status)
alt mode == "reuse"
Utils->>Cluster: Check for existing cluster_config.yaml
Cluster-->>Utils: Download existing config
Utils-->>Test: Return persisted config
else mode == "assert"
Utils->>Cluster: Check for existing cluster_config.yaml
alt exists
Cluster-->>Utils: Download existing config
Utils->>Utils: Compare existing vs new config
alt configs match
Utils-->>Test: Return config
else configs don't match
Utils-->>Test: Raise AssertionError
end
else not exists
Utils->>Cluster: Upload cluster_config.yaml + nemo_skills_commit.json
Utils-->>Test: Return new config
end
else mode == "overwrite"
Utils->>Cluster: Upload cluster_config.yaml + nemo_skills_commit.json
Utils-->>Test: Return new config
end
Test->>NemoSkills: eval/run_cmd with modified cluster config
NemoSkills->>Cluster: Submit jobs to {workspace}/nemo-run-experiments
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
7 files reviewed, 1 comment
| cluster_config["job_dir"] = test_job_dir | ||
| job_dir = cluster_config["job_dir"] | ||
|
|
||
| cluster_config["job_dir"] = cluster_config.get("job_dir", test_job_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: Redundant line - job_dir was already set in lines 184-189. When ssh_tunnel exists, this adds job_dir at root level even though it was only set in ssh_tunnel. Consider removing this line.
| cluster_config["job_dir"] = cluster_config.get("job_dir", test_job_dir) | |
| # job_dir already set above, no need to set again |
Overview
prepare_cluster_config_for_testto slurm tests, a shared helper responsible for:job_dirinto{workspace}/nemo-run-experiments.cluster_config.yaml+nemo_skills_commit.json) at the workspace root so every test run records its launcher configuration and NeMo-Skills git metadata.add_common_args, removing duplicated argparse boilerplate for--workspace,--cluster,--expname_prefix,--wandb_project, and--cluster_config_mode.Snapshot Layout per Test Workspace
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.