Signed-off-by: Anna Warno <awarno@nvidia.com>
Signed-off-by: Anna Warno <awarno@nvidia.com>
/ok to test 113acb8
Signed-off-by: Anna Warno <awarno@nvidia.com>
…aluator into awarno/local-kill merge
/ok to test 8771ceb
@AWarno, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test 86f8f5e
# Mark job as killed in database if we killed something
if killed_something:
    job_data.data["killed"] = True
    db.write_job(job_data)
question: can you cross-check me? As far as I can tell, the current implementation of write_job() does not overwrite the record but appends a new one: https://github.com/NVIDIA-NeMo/Eval/blob/cebbd7a6343d258209f8ac5d15456e6dd51d2d56/packages/nemo-evaluator-launcher/src/nemo_evaluator_launcher/common/execdb.py#L125-L149
I'm not sure. Do you think it may affect anything?
I see:
    record = asdict(job)
    try:
        with open(EXEC_DB_FILE, "a") as f:
            f.write(json.dumps(record) + "\n")
I think it does not append
"a" stands for append. I'm thinking that we get duplicate entries in this JSONL file and, by happy coincidence, when it is read back on another run, the latest entry "wins"?
Signed-off-by: Anna Warno <awarno@nvidia.com>
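To make the append question from the review thread above concrete, here is a minimal, self-contained sketch of what opening a JSONL file in "a" mode does and how a reader could end up with "latest wins" behavior. It is not the launcher's actual execdb code: the `job_id` field, the `load_jobs` reader, and the temporary file path are assumptions made only for illustration.

```python
import json
import os
import tempfile


def write_job(path, record):
    # "a" opens the file in append mode: each call adds one new line
    # and never rewrites or removes earlier lines.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")


def load_jobs(path):
    # Hypothetical reader: keyed by job id, so a later line for the same
    # id replaces the earlier one in the dict ("latest wins").
    jobs = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                record = json.loads(line)
                jobs[record["job_id"]] = record
    return jobs


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "exec_db.jsonl")
        # First write: job as originally recorded.
        write_job(path, {"job_id": "abc.0", "data": {}})
        # Second write after a kill: same id, updated data.
        write_job(path, {"job_id": "abc.0", "data": {"killed": True}})
        with open(path) as f:
            line_count = sum(1 for _ in f)
        return line_count, load_jobs(path)
```

Under these assumptions the file ends up with two lines for the same job, while the reader returns a single record carrying the `killed` flag. Whether the real reader behaves this way depends on how it processes the file, which is worth checking in execdb.py.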
…aluator into awarno/local-kill
/ok to test 2ba2758
1. Fix local sequential job termination: ensure pending jobs are handled correctly when killing jobs.
2. Clarify the kill error message shown when a kill command cannot be executed because the job is already finished or canceled.
---------
Signed-off-by: Anna Warno <awarno@nvidia.com>
QA RCCA Analysis
Date: 2026-04-20
1. Fix Reference
This issue is itself the fix reference - the title
The fix was delivered in the launcher's SLURM executor kill path to handle
2. Root Cause
The launcher's
3. Trigger Config
Trigger conditions (AND - all must be present):
Deterministic? Yes. Any kill command issued to a PENDING SLURM job before the fix triggers the
4. Nature of Bug
Primary classification: Crash/hang bug - process terminates unexpectedly (
Impact scope: All users trying to cancel a SLURM evaluation job that has not yet started (PENDING state). The kill command raises an unhandled exception rather than cleanly cancelling or reporting the job state.
NOT affected: Jobs in
5. Functional Test Coverage
Verdict: PARTIAL
6. Gaps and Limitations
Gap 1 - Mock-based test only:
Gap 2 - Error message clarity not tested:
Overall gap assessment: Low regression risk. The mock test directly exercises the kill-path code for the PENDING state and asserts no
7. New Test Added
8. Conclusion
Issue #303 was a crash in the launcher's SLURM
Auto-generated by the issues-rca skill - QA RCCA Analysis v2
Fix local sequential job termination: ensure pending jobs are handled correctly when killing jobs.
Clarify the kill error message shown when a kill command cannot be executed because the job is already finished or canceled.