
[Low priority feat] Add rerun rate-limit error resume support#1351

Draft
sgunasekar wants to merge 2 commits into main from feat_rerun_ratelimit_errors

Conversation

@sgunasekar
Collaborator

@sgunasekar sgunasekar commented Apr 8, 2026

Add --rerun-rate-limit-error as an optional flag in the generate and eval pipelines.

  • When generations/evals are run with enable-soft-fail=true, the errored rows are skipped during reruns unless the more aggressive --rerun-done and ++skip_filled=false options are used.
  • This PR adds a flag --rerun-rate-limit-error to rerun only the saved rate-limit soft failures.
  • This is done by rewriting the existing output / async-state files into -async files with the rate-limit error rows dropped.
  • Note: Currently does not support rerunning merged multi-chunk outputs: once the output[-rsX].jsonl file has been merged, there is no easy way to retrace the chunks and stage a selective rerun.
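The rewrite step described in the bullets above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the error predicate `is_ratelimit_error` and the helper name `drop_ratelimit_rows` are assumptions, and the real code uses `_is_rerunnable_ratelimit_error()` / `_rewrite_async_resume_file_if_needed()`.

```python
import json
from pathlib import Path


def is_ratelimit_error(row: dict) -> bool:
    # Hypothetical predicate; the real check lives in _is_rerunnable_ratelimit_error().
    return "rate limit" in str(row.get("error", "")).lower()


def drop_ratelimit_rows(output_file: str) -> int:
    """Rewrite output.jsonl into output.jsonl-async, dropping rate-limit rows.

    Returns the number of rows dropped. Each preserved row records its
    original line index so the resume logic can slot regenerated rows
    back into place.
    """
    src = Path(output_file)
    dst = Path(f"{output_file}-async")
    kept, dropped = [], 0
    with src.open("rt", encoding="utf-8") as fin:
        for idx, line in enumerate(fin):
            row = json.loads(line)
            if is_ratelimit_error(row):
                dropped += 1
                continue
            row.setdefault("_async_position", idx)
            kept.append(row)
    with dst.open("wt", encoding="utf-8") as fout:
        for row in kept:
            fout.write(json.dumps(row) + "\n")
    return dropped
```

On the next run, the generation pipeline would resume from the -async file and only regenerate the rows that were dropped.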

Summary by CodeRabbit

  • New Features
    • Added --rerun-rate-limit-error and --rerun-ratelimit-errors CLI options to the eval and generate commands, enabling retry of jobs that failed due to rate limit errors even when marked complete.

Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the feat_rerun_ratelimit_errors branch from 3f52982 to a6559b6 Compare April 8, 2026 09:03
@coderabbitai
Contributor

coderabbitai Bot commented Apr 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3684bfc7-b4f4-46a3-8abc-8ddb11c9b683

📥 Commits

Reviewing files that changed from the base of the PR and between 3f52982 and a6559b6.

📒 Files selected for processing (5)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • tests/test_generation.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • nemo_skills/pipeline/utils/eval.py

📝 Walkthrough

Walkthrough

A new rerun_ratelimit_errors CLI flag is added across evaluation and generation pipelines to selectively re-run jobs that failed due to rate-limiting errors. The flag propagates through utility functions and modifies job selection logic to identify rate-limit failures, rewrite async resume files to remove those rows, and reschedule affected jobs for re-execution.

Changes

CLI Entry Points (nemo_skills/pipeline/eval.py, nemo_skills/pipeline/generate.py)
  Added the rerun_ratelimit_errors boolean CLI option with --rerun-rate-limit-error and --rerun-ratelimit-errors aliases, plumbed into downstream utility calls.

Core Re-run Logic (nemo_skills/pipeline/utils/generation.py)
  Implemented rate-limit error detection via the _is_rerunnable_ratelimit_error() helper and async resume file rewriting via the _rewrite_async_resume_file_if_needed() helper. Updated get_remaining_jobs() to accept rerun_ratelimit_errors, filter rate-limit failures from resume files, adjust in-memory job tracking, and handle merged-chunk edge cases.

Intermediary Layer (nemo_skills/pipeline/utils/eval.py)
  Updated the prepare_eval_commands() signature to accept and propagate the rerun_ratelimit_errors parameter to get_remaining_jobs().

Test Updates (tests/test_generation.py)
  Updated the fake_get_generation_cmd() stub to accept the rerun_ratelimit_errors parameter for test compatibility.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning
  Explanation: Docstring coverage is 66.67%, which is below the required threshold of 80.00%.
  Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Description Check — ✅ Passed
  Check skipped: CodeRabbit's high-level summary is enabled.

Title check — ✅ Passed
  The title clearly and specifically describes the main change: adding support for rerunning rate-limit error scenarios via a new CLI flag.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat_rerun_ratelimit_errors

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemo_skills/pipeline/utils/generation.py (1)

185-191: ⚠️ Potential issue | 🟠 Major

Reject --rerun-done and --rerun-rate-limit-error together.

Line 190 returns before any selective rewrite happens, so passing both flags silently turns this into a full rerun and ignores the new option. Please fail fast here instead of accepting a user argument that never takes effect.

Suggested guard:

 def get_remaining_jobs(cluster_config, output_dir, random_seeds, chunk_ids, rerun_done, rerun_ratelimit_errors=False):
+    if rerun_done and rerun_ratelimit_errors:
+        raise ValueError("--rerun-done and --rerun-rate-limit-error are mutually exclusive")
     if rerun_done:
         return {seed: copy.deepcopy(chunk_ids) for seed in random_seeds}

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing. Use dataclass or **kwargs syntax to handle this automatically."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 185 - 191, The
function get_remaining_jobs currently returns early when rerun_done is True,
which silently ignores rerun_ratelimit_errors; update get_remaining_jobs to fail
fast when both rerun_done and rerun_ratelimit_errors are True by checking "if
rerun_done and rerun_ratelimit_errors:" near the top and raising a clear
exception (e.g., ValueError) that explains the flags are mutually exclusive, so
callers are informed instead of silently performing a full rerun; keep the
existing behavior for single flags unchanged.
nemo_skills/pipeline/eval.py (1)

399-420: ⚠️ Potential issue | 🟠 Major

Judge reruns don't see the new flag.

This only forwards rerun_ratelimit_errors into prepare_eval_commands(). The later judge paths still call _generate(...) / judge_step_fn(...) without it, so ns eval --rerun-rate-limit-error can regenerate benchmark rows while reusing stale judge outputs and stale summary metrics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/eval.py` around lines 399 - 420, The judge rerun flag
rerun_ratelimit_errors is passed into prepare_eval_commands but not propagated
into the later judge/generation paths, so calls to _generate(...) and
judge_step_fn(...) must accept and be invoked with the rerun_ratelimit_errors
parameter; update the signatures of _generate and judge_step_fn (and any
intermediary functions invoked in the judge path) to add a
rerun_ratelimit_errors parameter and pass the existing rerun_ratelimit_errors
variable into those calls so that ns eval --rerun-rate-limit-error triggers
regeneration rather than reusing stale judge outputs and summary metrics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 54bb34f5-f2ba-4418-8e00-5a8c6a5a46b9

📥 Commits

Reviewing files that changed from the base of the PR and between 74650dd and 3f52982.

📒 Files selected for processing (5)
  • nemo_skills/pipeline/eval.py
  • nemo_skills/pipeline/generate.py
  • nemo_skills/pipeline/utils/eval.py
  • nemo_skills/pipeline/utils/generation.py
  • tests/test_generation.py

Comment on lines +262 to +286
if rerun_ratelimit_errors:
    for seed in random_seeds:
        for chunk_id in list(missing_jobs.get(seed, [])):
            # for a partially done seed/chunk_id combo, rewrite the current -async file by simply dropping rate-limit error rows
            output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=chunk_id)
            _rewrite_async_resume_file_if_needed(output_file)

    for seed in random_seeds:
        for chunk_id in list(done_jobs[seed]):
            # for a fully done seed/chunk_id combo - if the chunks have been merged, there is no easy way to rerun just the rate-limit error rows, so raise and ask the user to --rerun-done
            if chunk_id is not None and len(chunk_ids) > 1:
                merged_output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=None)
                if Path(merged_output_file).exists():
                    raise ValueError(
                        "Cannot use --rerun-rate-limit-error for completed chunked outputs because "
                        f"`{merged_output_file}` has already been merged. Use --rerun-done to fully rerun the seed."
                    )
            # for a fully done seed/chunk_id combo - if the chunks have not been merged, rewrite the .jsonl to a .jsonl-async file by dropping rate-limit error rows and remove .done files
            output_file = get_chunked_rs_filename(status_dir, random_seed=seed, chunk_id=chunk_id)
            if _rewrite_async_resume_file_if_needed(output_file):
                done_file = Path(expected_files[(seed, chunk_id)])
                if done_file.exists():
                    done_file.unlink()
                missing_jobs[seed].append(chunk_id)
                done_jobs[seed].remove(chunk_id)

⚠️ Potential issue | 🟠 Major

This selective-rerun branch bypasses the cluster file abstraction.

The normal status probing above goes through get_tunnel(...).run(...), but this path switches to local Path.exists(), open(), and unlink() against status_dir. On executor == "slurm", those artifacts are remote, so --rerun-rate-limit-error can miss merged outputs and leave the real *.done / *-async files untouched.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 262 - 286, The
selective-rerun branch is directly touching local filesystem (Path.exists,
unlink, open) and must use the cluster file abstraction instead; update the
rerun_ratelimit_errors handling so all checks/manipulations of status_dir
artifacts call through get_tunnel(...).run(...) (or the existing tunnel API)
rather than using Path/Path.unlink/open locally: when checking
merged_output_file existence use the tunnel to test the remote file, call
_rewrite_async_resume_file_if_needed via the tunnel or change it to accept/use a
tunnel so it rewrites the remote .jsonl-async correctly, and replace
done_file.unlink() with a remote remove via the tunnel (and append to
missing_jobs / remove from done_jobs afterward as now). Reference
functions/variables: rerun_ratelimit_errors block, get_chunked_rs_filename,
_rewrite_async_resume_file_if_needed, expected_files, done_jobs, missing_jobs,
and get_tunnel.
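The tunnel-based file operations this comment asks for could look roughly like the sketch below. All names here are illustrative assumptions: `LocalTunnel` stands in for whatever `get_tunnel(cluster_config)` returns in the project, and `remote_file_exists` / `remote_unlink` are hypothetical helpers. The point is that existence checks and deletions go through shell commands run by the tunnel, which behave identically whether the tunnel executes locally or over SSH to a Slurm cluster.

```python
import subprocess


class LocalTunnel:
    """Stand-in for a cluster tunnel. The real project presumably exposes
    something like get_tunnel(cluster_config).run(command); on a Slurm
    cluster the same shell commands would run over SSH instead of locally."""

    def run(self, cmd: str):
        return subprocess.run(cmd, shell=True, capture_output=True, text=True)


def remote_file_exists(tunnel, path: str) -> bool:
    # `test -f` exits 0 iff the file exists; same semantics local or remote.
    return tunnel.run(f"test -f '{path}'").returncode == 0


def remote_unlink(tunnel, path: str) -> None:
    # Remote equivalent of Path(path).unlink(missing_ok=True).
    tunnel.run(f"rm -f '{path}'")
```

With helpers like these, the merged-output check and the `.done`-file removal in the selective-rerun branch would work the same on `executor == "slurm"` as they do locally.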

Comment on lines +347 to +379
def _rewrite_async_resume_file_if_needed(output_file: str, async_position_key: str = "_async_position") -> bool:
    output_path = Path(output_file)
    async_output_path = Path(f"{output_file}-async")
    if output_path.exists():
        source_path = output_path
        rewriting_existing_async = False
    elif async_output_path.exists():
        source_path = async_output_path
        rewriting_existing_async = True
    else:
        return False

    preserved_rows = []
    rerunnable_found = False
    with open(source_path, "rt", encoding="utf-8") as fin:
        for idx, line in enumerate(fin):
            row = json.loads(line)
            if _is_rerunnable_ratelimit_error(row):
                rerunnable_found = True
                continue
            if not rewriting_existing_async or async_position_key not in row:
                row[async_position_key] = idx
            preserved_rows.append(row)

    if not rerunnable_found:
        return False

    with open(async_output_path, "wt", encoding="utf-8") as fout:
        for row in preserved_rows:
            fout.write(json.dumps(row) + "\n")
    if not rewriting_existing_async:
        output_path.unlink()
    LOG.info(

⚠️ Potential issue | 🟠 Major

Don't rewrite the only -async file in place.

When source_path == async_output_path, opening it with "wt" truncates the current resume state before the replacement is durable. Any write failure or interruption here loses the preserved-row metadata and makes the rerun unrecoverable; write to a temp file and replace() it atomically instead.

As per coding guidelines, "When adding new benchmarks, avoid data loss by doing all computation before re-opening files for writing; ensure computation completes before file writes to prevent accidental data loss if code fails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemo_skills/pipeline/utils/generation.py` around lines 347 - 379, The
function _rewrite_async_resume_file_if_needed currently opens async_output_path
with "wt" which truncates the file when source_path == async_output_path;
instead compute preserved_rows and rerunnable_found first (as you already do),
then when writing, create a temporary file in the same directory (e.g.,
async_output_path.with_suffix(".tmp") or similar), open that temp for writing
and write all preserved_rows to it, fsync if desired, then atomically replace
the original using Path.replace(temp_path, async_output_path); only unlink
output_path if not rewriting_existing_async after the atomic replace — this
ensures no in-place truncation of async_output_path and preserves resume state
on failures.
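The temp-file-plus-atomic-replace pattern the prompt describes can be sketched as follows. This is illustrative only: `atomic_rewrite_jsonl` is a hypothetical helper, not the PR's function, but it shows the two properties the review asks for — all filtering completes before any write, and the destination file is never truncated in place.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_rewrite_jsonl(path: str, keep_row) -> None:
    """Filter the rows of a .jsonl file without ever truncating it in place.

    All filtering happens up front; the output goes to a temp file in the
    same directory and is swapped in with os.replace(), which is atomic on
    POSIX. If anything fails before the replace, the original file is
    untouched and the resume state survives.
    """
    target = Path(path)
    lines = target.read_text(encoding="utf-8").splitlines()
    kept = [row for row in map(json.loads, lines) if keep_row(row)]

    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wt", encoding="utf-8") as fout:
            for row in kept:
                fout.write(json.dumps(row) + "\n")
            fout.flush()
            os.fsync(fout.fileno())
        os.replace(tmp_name, target)  # atomic swap; readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_name)
        raise
```

The temp file must live in the same directory as the target, because `os.replace()` is only guaranteed atomic within a single filesystem.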

@sgunasekar sgunasekar marked this pull request as draft April 8, 2026 18:56
Signed-off-by: suriya <sgunasekar@nvidia.com>
@sgunasekar sgunasekar force-pushed the feat_rerun_ratelimit_errors branch from 4abb16d to 4b10a3a Compare April 8, 2026 20:28