fix(local-kill): fix local kill #303

Merged
AWarno merged 9 commits into main from awarno/local-kill
Oct 14, 2025

Conversation

@AWarno (Contributor) commented Oct 12, 2025

  1. Fix local sequential job termination — ensure pending jobs are handled correctly when killing jobs.

  2. Clarify kill error message — when a kill command cannot be executed because the job is already finished or canceled.

Signed-off-by: Anna Warno <awarno@nvidia.com>
@AWarno AWarno requested review from a team as code owners October 12, 2025 19:55
copy-pr-bot (Bot) commented Oct 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@AWarno (Contributor, Author) commented Oct 13, 2025

/ok to test 113acb8

@AWarno (Contributor, Author) commented Oct 13, 2025

/ok to test 8771ceb

copy-pr-bot (Bot) commented Oct 13, 2025

/ok to test 8771ceb

@AWarno, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@AWarno (Contributor, Author) commented Oct 13, 2025

/ok to test 86f8f5e

# Mark job as killed in database if we killed something
if killed_something:
    job_data.data["killed"] = True
    db.write_job(job_data)
Collaborator

AWarno (Contributor, Author) replied:

I'm not sure. Do you think it may affect anything?

I see:

        record = asdict(job)
        try:
            with open(EXEC_DB_FILE, "a") as f:
                f.write(json.dumps(record) + "\n")

I think it does not append

Collaborator

"a" stays for append. I'm thinking that we get the duplicate entries in this jsonl and by happy coincidence when reading it back on another run, the latest "wins"?

@AWarno AWarno requested a review from agronskiy October 14, 2025 21:40
@AWarno (Contributor, Author) commented Oct 14, 2025

/ok to test 2ba2758

@AWarno AWarno merged commit 7a54235 into main Oct 14, 2025
34 checks passed
@AWarno AWarno deleted the awarno/local-kill branch October 14, 2025 21:49
AWarno added a commit that referenced this pull request Oct 23, 2025
1. Fix local sequential job termination — ensure pending jobs are
handled correctly when killing jobs.

2. Clarify kill error message — when a kill command cannot be executed
because the job is already finished or canceled.

---------

Signed-off-by: Anna Warno <awarno@nvidia.com>
@pruprakash

QA RCCA Analysis

Date: 2026-04-20
Analyst: AI QA Agent (issues-rca skill)
Issue: #303 — fix(local-kill): fix local kill


1. Fix Reference

This issue is itself the fix reference: the title fix(local-kill): fix local kill follows the conventional commit format, indicating a tracked fix item. The issue was closed on 2025-10-14 as completed. Two changes were documented in the body:

  1. Fix local sequential job termination — ensure pending jobs are handled correctly when killing jobs.
  2. Clarify kill error message — when a kill command cannot be executed because the job is already finished or canceled.

The fix was delivered in the launcher's SLURM executor kill path so that the PENDING job state is handled without raising KeyError or AttributeError.


2. Root Cause

The launcher's kill_job code path in SlurmExecutor did not handle the PENDING state in its job-state machine. When a user attempted to cancel a job that was queued but not yet running, the executor tried to look up the job's running state in a dict that only contained entries for RUNNING, FAILED, and COMPLETED states — PENDING was absent, causing a KeyError. Additionally, the error message shown when a kill command fails was not specific enough to distinguish between "job already finished" and "kill command rejected by scheduler", making it hard for users to diagnose the failure.
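
A minimal sketch of that failure mode follows; the handler dict, state names, and messages are illustrative assumptions, not the actual SlurmExecutor internals:

```python
# Illustrative sketch only; not the actual SlurmExecutor implementation.
def cancel(job_id: str) -> None:
    print(f"scancel {job_id}")

STATE_HANDLERS = {
    "RUNNING": cancel,
    "FAILED": lambda job_id: print(f"{job_id} already failed; nothing to kill"),
    "COMPLETED": lambda job_id: print(f"{job_id} already completed; nothing to kill"),
    # Shape of the fix: PENDING jobs can be cancelled too. Before the fix
    # this entry was missing, so the lookup below raised KeyError.
    "PENDING": cancel,
}

def kill_job(job_id: str, state: str) -> None:
    handler = STATE_HANDLERS.get(state)
    if handler is None:
        # Clarified error path: report the job state instead of crashing,
        # distinguishing "already finished" from a scheduler rejection.
        raise RuntimeError(f"Cannot kill job {job_id}: unhandled state {state!r}")
    handler(job_id)
```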


3. Trigger Config

Trigger conditions (AND — all must be present):

  • SLURM executor backend (not local executor)
  • Job is in PENDING (PD) state at the time the kill command is issued (e.g., queued and waiting for node allocation)
  • User calls nemo-evaluator-launcher kill <job_id> before the job starts running

Deterministic? Yes. Any kill command issued to a PENDING SLURM job before the fix triggers the KeyError on every attempt.


4. Nature of Bug

Primary classification: Crash/hang bug — process terminates unexpectedly (KeyError raises from the kill path, crashing the launcher CLI)

Impact scope: All users trying to cancel a SLURM evaluation job that has not yet started (PENDING state). The kill command raises an unhandled exception rather than cleanly cancelling or reporting the job state.

NOT affected: Jobs in RUNNING, FAILED, or COMPLETED state — those states were handled correctly. Local executor kill path is unaffected.


5. Functional Test Coverage

Verdict: PARTIAL

| Test | File | Key Config | What it covers |
| --- | --- | --- | --- |
| test_slurm_executor_has_cancel_method | evaluator/testcases/rcca/launcher/test_launcher_kill_pending.py | No SLURM needed | Verifies SlurmExecutor exposes a kill_job/cancel/kill method |
| test_slurm_valid_states_includes_pending | same | No SLURM needed | Checks PENDING is in the executor's valid-states constant (if present) |
| test_slurm_cancel_pending_no_exception | same | Mock (no SLURM) | Simulates a PENDING job kill with mocked subprocess and asserts no KeyError/AttributeError |
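
A plausible shape for the mock-based test, assuming pytest and unittest.mock; the kill_pending_job helper and the patch target are stand-ins for the launcher's real kill path, which is not shown here:

```python
import subprocess
from unittest import mock

import pytest

def kill_pending_job(job_id: str) -> None:
    # Stand-in for the launcher's kill path; the real code goes through
    # SlurmExecutor.
    subprocess.run(["scancel", job_id], check=True)

def test_slurm_cancel_pending_no_exception():
    # Simulate scancel without a live SLURM cluster by patching subprocess,
    # so the kill path runs against a fake PENDING job.
    with mock.patch("subprocess.run") as fake_run:
        fake_run.return_value = mock.Mock(returncode=0, stdout="", stderr="")
        try:
            kill_pending_job("12345")
        except (KeyError, AttributeError) as exc:
            # The exact regression from #303: killing a PENDING job must not
            # crash with an unhandled exception.
            pytest.fail(f"kill path raised {type(exc).__name__}: {exc}")
```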

6. Gaps and Limitations

Gap 1 — Mock-based test only:
test_slurm_cancel_pending_no_exception uses mock.patch to simulate the SLURM kill path. It does not test the actual scancel subprocess or the real SLURM state machine. A real SLURM test would require a live cluster and a real job in PENDING state.

Gap 2 — Error message clarity not tested:
The second fix item (clarifying the kill error message) is not covered by the tests — asserting on error message text is brittle. This is an acceptable gap.

Overall gap assessment: Low regression risk. The mock test directly exercises the kill-path code for the PENDING state and asserts no KeyError/AttributeError is raised, which is the exact regression condition from this issue.


7. New Test Added

| Field | Value |
| --- | --- |
| Test file | evaluator/testcases/rcca/launcher/test_launcher_kill_pending.py |
| Test functions | test_slurm_executor_has_cancel_method, test_slurm_valid_states_includes_pending, test_slurm_cancel_pending_no_exception |
| QA repo | nmfw_tests (local: evaluator/testcases/rcca/launcher/) |
| PR | Pending |
| What it validates | SlurmExecutor kill path handles PENDING state without raising KeyError or AttributeError |
| How it would catch a regression | test_slurm_cancel_pending_no_exception calls pytest.fail() if KeyError or AttributeError is raised during the mocked kill — the exact regression from #303 |

8. Conclusion

Issue #303 was a crash in the launcher's SLURM kill_job path that raised KeyError when a job was in PENDING state, because the state machine did not account for jobs that had not yet started running. The fix added PENDING state handling and improved the kill error message. A new three-function mock-based regression test has been added to the QA repo that directly simulates a PENDING job kill and asserts no KeyError or AttributeError is raised; regression risk is low.


Auto-generated by the issues-rca skill — QA RCCA Analysis v2


Labels: bug (Something isn't working), qa_rcca_done