
[Low priority feat] Add rerun rate-limit error resume support#1351

Draft
sgunasekar wants to merge 2 commits into main from feat_rerun_ratelimit_errors

Conversation

@sgunasekar
Collaborator

@sgunasekar sgunasekar commented Apr 8, 2026

Add --rerun-rate-limit-error as an optional flag in the generate and eval pipelines.

  • When generations/evals are run with enable-soft-fail=true, the errored rows are skipped during reruns unless the more aggressive --rerun-done and ++skip_filled=false options are used.
  • This PR adds a flag --rerun-rate-limit-error to rerun only the saved rate-limit soft failures.
  • This is done by rewriting the existing output / async-state files into -async files with the rate-limit error rows dropped.
  • Note: Currently does not support rerunning merged multi-chunk outputs: once the output[-rsX].jsonl file has been merged, there is no easy way to retrace the chunks and stage a selective rerun.
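The rewrite step described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the error predicate `is_ratelimit_error` and the helper name `drop_ratelimit_rows` are assumptions, and the real code uses `_is_rerunnable_ratelimit_error()` / `_rewrite_async_resume_file_if_needed()`.

```python
import json
from pathlib import Path


def is_ratelimit_error(row: dict) -> bool:
    # Hypothetical predicate; the real check lives in _is_rerunnable_ratelimit_error().
    return "rate limit" in str(row.get("error", "")).lower()


def drop_ratelimit_rows(output_file: str) -> int:
    """Rewrite output.jsonl into output.jsonl-async, dropping rate-limit rows.

    Returns the number of rows dropped. Each preserved row records its
    original line index so the resume logic can slot regenerated rows
    back into place.
    """
    src = Path(output_file)
    dst = Path(f"{output_file}-async")
    kept, dropped = [], 0
    with src.open("rt", encoding="utf-8") as fin:
        for idx, line in enumerate(fin):
            row = json.loads(line)
            if is_ratelimit_error(row):
                dropped += 1
                continue
            row.setdefault("_async_position", idx)
            kept.append(row)
    with dst.open("wt", encoding="utf-8") as fout:
        for row in kept:
            fout.write(json.dumps(row) + "\n")
    return dropped
```

On the next run, the generation pipeline would resume from the -async file and only regenerate the rows that were dropped.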

Summary by CodeRabbit

  • New Features
    • Added --rerun-rate-limit-error and --rerun-ratelimit-errors CLI options to the eval and generate commands, enabling retry of jobs that failed due to rate limit errors even when marked complete.

Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the feat_rerun_ratelimit_errors branch from 3f52982 to a6559b6 Compare April 8, 2026 09:03
@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3684bfc7-b4f4-46a3-8abc-8ddb11c9b683

📥 Commits

Reviewing files that changed from the base of the PR and between 3f52982 and a6559b6.

📒 Files selected for processing (5)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • tests/test_generation.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • nemo_skills/pipeline/utils/eval.py

📝 Walkthrough

Walkthrough

A new rerun_ratelimit_errors CLI flag is added across evaluation and generation pipelines to selectively re-run jobs that failed due to rate-limiting errors. The flag propagates through utility functions and modifies job selection logic to identify rate-limit failures, rewrite async resume files to remove those rows, and reschedule affected jobs for re-execution.

Changes

CLI Entry Points (nemo_skills/pipeline/eval.py, nemo_skills/pipeline/generate.py)
  Added the rerun_ratelimit_errors boolean CLI option with --rerun-rate-limit-error and --rerun-ratelimit-errors aliases, plumbed into downstream utility calls.

Core Re-run Logic (nemo_skills/pipeline/utils/generation.py)
  Implemented rate-limit error detection via the _is_rerunnable_ratelimit_error() helper and async resume file rewriting via the _rewrite_async_resume_file_if_needed() helper. Updated get_remaining_jobs() to accept rerun_ratelimit_errors, filter rate-limit failures from resume files, adjust in-memory job tracking, and handle merged-chunk edge cases.

Intermediary Layer (nemo_skills/pipeline/utils/eval.py)
  Updated the prepare_eval_commands() signature to accept and propagate the rerun_ratelimit_errors parameter to get_remaining_jobs().

Test Updates (tests/test_generation.py)
  Updated the fake_get_generation_cmd() stub to accept the rerun_ratelimit_errors parameter for test compatibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning
  Explanation: Docstring coverage is 66.67%, which is below the required threshold of 80.00%.
  Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Description Check — ✅ Passed
  Check skipped: CodeRabbit's high-level summary is enabled.

Title check — ✅ Passed
  The title clearly and specifically describes the main change: adding support for rerunning rate-limit error scenarios via a new CLI flag.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat_rerun_ratelimit_errors

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/pipeline/utils/generation.py (1)

185-191: ⚠️ Potential issue | 🟠 Major

Reject --rerun-done and --rerun-rate-limit-error together.

Line 190 returns before any selective rewrite happens, so passing both flags silently turns this into a full rerun and ignores the new option. Please fail fast here instead of accepting a user argument that never takes effect.

Suggested guard:

 def get_remaining_jobs(cluster_config, output_dir, random_seeds, chunk_ids, rerun_done, rerun_ratelimit_errors=False):
+    if rerun_done and rerun_ratelimit_errors:
+        raise ValueError("--rerun-done and --rerun-rate-limit-error are mutually exclusive")
     if rerun_done:
         return {seed: copy.deepcopy(chunk_ids) for seed in random_seeds}

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 185 - 191, The
function get_remaining_jobs currently returns early when rerun_done is True,
which silently ignores rerun_ratelimit_errors; update get_remaining_jobs to fail
fast when both rerun_done and rerun_ratelimit_errors are True by checking "if
rerun_done and rerun_ratelimit_errors:" near the top and raising a clear
exception (e.g., ValueError) that explains the flags are mutually exclusive, so
callers are informed instead of silently performing a full rerun; keep the
existing behavior for single flags unchanged.
nemo_skills/pipeline/eval.py (1)

399-420: ⚠️ Potential issue | 🟠 Major

Judge reruns don't see the new flag.

This only forwards rerun_ratelimit_errors into prepare_eval_commands(). The later judge paths still call _generate(...) / judge_step_fn(...) without it, so ns eval --rerun-rate-limit-error can regenerate benchmark rows while reusing stale judge outputs and stale summary metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 399 - 420, The judge rerun flag
rerun_ratelimit_errors is passed into prepare_eval_commands but not propagated
into the later judge/generation paths, so calls to _generate(...) and
judge_step_fn(...) must accept and be invoked with the rerun_ratelimit_errors
parameter; update the signatures of _generate and judge_step_fn (and any
intermediary functions invoked in the judge path) to add a
rerun_ratelimit_errors parameter and pass the existing rerun_ratelimit_errors
variable into those calls so that ns eval --rerun-rate-limit-error triggers
regeneration rather than reusing stale judge outputs and summary metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54bb34f5-f2ba-4418-8e00-5a8c6a5a46b9

📥 Commits

Reviewing files that changed from the base of the PR and between 74650dd and 3f52982.

📒 Files selected for processing (5)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • tests/test_generation.py

Comment on lines +262 to +286
if rerun_ratelimit_errors:
    for seed in random_seeds:
        for chunk_id in list(missing_jobs.get(seed, [])):
            # for a partially done seed/chunk_id combo, rewrite the current -async file by simply dropping rate-limit error rows
            output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=chunk_id)
            _rewrite_async_resume_file_if_needed(output_file)

    for seed in random_seeds:
        for chunk_id in list(done_jobs[seed]):
            # for a fully done seed/chunk_id combo - if the chunks have been merged, there is no easy way to rerun just the rate-limit error rows, so raise and ask the user to --rerun-done
            if chunk_id is not None and len(chunk_ids) > 1:
                merged_output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=None)
                if Path(merged_output_file).exists():
                    raise ValueError(
                        "Cannot use --rerun-rate-limit-error for completed chunked outputs because "
                        f"`{merged_output_file}` has already been merged. Use --rerun-done to fully rerun the seed."
                    )
            # for a fully done seed/chunk_id combo - if the chunks have not been merged, rewrite the .jsonl to a .jsonl-async file by dropping rate-limit error rows and remove .done files
            output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=chunk_id)
            if _rewrite_async_resume_file_if_needed(output_file):
                done_file = Path(expected_files[(seed, chunk_id)])
                if done_file.exists():
                    done_file.unlink()
                missing_jobs[seed].append(chunk_id)
                done_jobs[seed].remove(chunk_id)

⚠️ Potential issue | 🟠 Major

This selective-rerun branch bypasses the cluster file abstraction.

The normal status probing above goes through get_tunnel(...).run(...), but this path switches to local Path.exists(), open(), and unlink() against status_dir. On executor == "slurm", those artifacts are remote, so --rerun-rate-limit-error can miss merged outputs and leave the real *.done / *-async files untouched.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 262 - 286, The
selective-rerun branch is directly touching local filesystem (Path.exists,
unlink, open) and must use the cluster file abstraction instead; update the
rerun_ratelimit_errors handling so all checks/manipulations of status_dir
artifacts call through get_tunnel(...).run(...) (or the existing tunnel API)
rather than using Path/Path.unlink/open locally: when checking
merged_output_file existence use the tunnel to test the remote file, call
_rewrite_async_resume_file_if_needed via the tunnel or change it to accept/use a
tunnel so it rewrites the remote .jsonl-async correctly, and replace
done_file.unlink() with a remote remove via the tunnel (and append to
missing_jobs / remove from done_jobs afterward as now). Reference
functions/variables: rerun_ratelimit_errors block, get_chunked_rs_filename,
_rewrite_async_resume_file_if_needed, expected_files, done_jobs, missing_jobs,
and get_tunnel.
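The tunnel-based file operations this comment asks for could look roughly like the sketch below. All names here are illustrative assumptions: `LocalTunnel` stands in for whatever `get_tunnel(cluster_config)` returns in the project, and `remote_file_exists` / `remote_unlink` are hypothetical helpers. The point is that existence checks and deletions go through shell commands run by the tunnel, which behave identically whether the tunnel executes locally or over SSH to a Slurm cluster.

```python
import subprocess


class LocalTunnel:
    """Stand-in for a cluster tunnel. The real project presumably exposes
    something like get_tunnel(cluster_config).run(command); on a Slurm
    cluster the same shell commands would run over SSH instead of locally."""

    def run(self, cmd: str):
        return subprocess.run(cmd, shell=True, capture_output=True, text=True)


def remote_file_exists(tunnel, path: str) -> bool:
    # `test -f` exits 0 iff the file exists; same semantics local or remote.
    return tunnel.run(f"test -f '{path}'").returncode == 0


def remote_unlink(tunnel, path: str) -> None:
    # Remote equivalent of Path(path).unlink(missing_ok=True).
    tunnel.run(f"rm -f '{path}'")
```

With helpers like these, the merged-output check and the `.done`-file removal in the selective-rerun branch would work the same on `executor == "slurm"` as they do locally.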

Comment on lines +347 to +379
def _rewrite_async_resume_file_if_needed(output_file: str, async_position_key: str = "_async_position") -> bool:
    output_path = Path(output_file)
    async_output_path = Path(f"{output_file}-async")
    if output_path.exists():
        source_path = output_path
        rewriting_existing_async = False
    elif async_output_path.exists():
        source_path = async_output_path
        rewriting_existing_async = True
    else:
        return False

    preserved_rows = []
    rerunnable_found = False
    with open(source_path, "rt", encoding="utf-8") as fin:
        for idx, line in enumerate(fin):
            row = json.loads(line)
            if _is_rerunnable_ratelimit_error(row):
                rerunnable_found = True
                continue
            if not rewriting_existing_async or async_position_key not in row:
                row[async_position_key] = idx
            preserved_rows.append(row)

    if not rerunnable_found:
        return False

    with open(async_output_path, "wt", encoding="utf-8") as fout:
        for row in preserved_rows:
            fout.write(json.dumps(row) + "\n")
    if not rewriting_existing_async:
        output_path.unlink()
    LOG.info(

⚠️ Potential issue | 🟠 Major

Don't rewrite the only -async file in place.

When source_path == async_output_path, opening it with "wt" truncates the current resume state before the replacement is durable. Any write failure or interruption here loses the preserved-row metadata and makes the rerun unrecoverable; write to a temp file and replace() it atomically instead.

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 347 - 379, The
function _rewrite_async_resume_file_if_needed currently opens async_output_path
with "wt" which truncates the file when source_path == async_output_path;
instead compute preserved_rows and rerunnable_found first (as you already do),
then when writing, create a temporary file in the same directory (e.g.,
async_output_path.with_suffix(".tmp") or similar), open that temp for writing
and write all preserved_rows to it, fsync if desired, then atomically replace
the original using Path.replace(temp_path, async_output_path); only unlink
output_path if not rewriting_existing_async after the atomic replace — this
ensures no in-place truncation of async_output_path and preserves resume state
on failures.
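The temp-file-plus-atomic-replace pattern the prompt describes can be sketched as follows. This is illustrative only: `atomic_rewrite_jsonl` is a hypothetical helper, not the PR's function, but it shows the two properties the review asks for — all filtering completes before any write, and the destination file is never truncated in place.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_rewrite_jsonl(path: str, keep_row) -> None:
    """Filter the rows of a .jsonl file without ever truncating it in place.

    All filtering happens up front; the output goes to a temp file in the
    same directory and is swapped in with os.replace(), which is atomic on
    POSIX. If anything fails before the replace, the original file is
    untouched and the resume state survives.
    """
    target = Path(path)
    lines = target.read_text(encoding="utf-8").splitlines()
    kept = [row for row in map(json.loads, lines) if keep_row(row)]

    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for row in kept:
                fout.write(json.dumps(row) + "\n")
            fout.flush()
            os.fsync(fout.fileno())
        os.replace(tmp_name, target)  # atomic swap; readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_name)
        raise
```

The temp file must live in the same directory as the target, because `os.replace()` is only guaranteed atomic within a single filesystem.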

@sgunasekar sgunasekar marked this pull request as draft April 8, 2026 18:56
Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the feat_rerun_ratelimit_errors branch from 4abb16d to 4b10a3a Compare April 8, 2026 20:28