
Fix auxiliary pod cleanup #687

Merged
tdewanNvidia merged 1 commit into main from tdewan/fix_cleanup
Mar 11, 2026

Conversation

@tdewanNvidia
Contributor

@tdewanNvidia tdewanNvidia commented Mar 11, 2026

Description

Fix auxiliary pod cleanup and daemonset recreation race condition

When the DaemonSet manager cleaned up validator pods, it used grace_period_seconds=0, which sends SIGKILL immediately. This prevented Python finally blocks from executing, leaving auxiliary pods orphaned and consuming GPU resources indefinitely. This change sets grace_period_seconds=30 and adds a SIGTERM handler in NodeTestBase that converts SIGTERM into SystemExit, which triggers finally blocks for proper cleanup.
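The handler pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the technique (converting SIGTERM into SystemExit so finally blocks run), not the PR's exact code:

```python
import signal


def _sigterm_handler(signum, frame):
    # Python's default SIGTERM disposition terminates the process without
    # unwinding the stack, so `finally` blocks never run. Raising SystemExit
    # instead propagates like a normal exception and triggers cleanup.
    raise SystemExit(128 + signum)


def register_graceful_shutdown():
    # Install the handler process-wide; Kubernetes sends SIGTERM first,
    # then SIGKILL after the grace period expires.
    signal.signal(signal.SIGTERM, _sigterm_handler)


if __name__ == "__main__":
    register_graceful_shutdown()
    try:
        pass  # long-running validator work would go here
    finally:
        print("cleanup runs on SIGTERM too")
```

With a 30-second grace period, this gives the validator's finally blocks time to delete auxiliary pods before the kubelet escalates to SIGKILL.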

Issue - None

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes
    • Improved DaemonSet redeployment cleanup by waiting for full removal before recreation and increasing pod termination grace period to reduce orphaned pods.
    • Added graceful shutdown handling for termination signals so cleanup routines run reliably during shutdown.

@tdewanNvidia tdewanNvidia requested a review from a team as a code owner March 11, 2026 18:55
@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Adds polling to wait for DaemonSet removal before redeployment, increases pod deletion grace period to 30 seconds, and registers a SIGTERM handler that raises SystemExit during NodeTestBase initialization.

Changes

Cohort / File(s) — Summary

  • DaemonSet Manager Cleanup (src/operator/backend_test_runner/daemonset_manager.py)
    Added _wait_for_daemonset_deletion(timeout: int = 120) to poll until the DaemonSet is removed (2s intervals, 120s timeout); changed delete_namespaced_daemon_set pod deletion grace_period_seconds from 0 to 30; deploy_and_wait() now calls the new wait helper after deletion before recreating the DaemonSet.
  • Graceful Shutdown Support (src/operator/utils/node_validation_test/test_base.py)
    Added _sigterm_handler(signum, frame) that logs SIGTERM and raises SystemExit(128 + signum); added register_graceful_shutdown() to install the SIGTERM handler; registered this handler in NodeTestBase.__init__.
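Stripped of Kubernetes specifics, the new wait helper is a bounded poll. A minimal, framework-agnostic sketch (here `exists` is a stand-in callable; the real method presumably queries the API server for the DaemonSet and treats a not-found response as deleted):

```python
import time


def wait_for_deletion(exists, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `exists()` until the object is gone or `timeout` seconds elapse.

    Returns True if the object disappeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not exists():
            return True  # fully deleted; safe to recreate
        time.sleep(interval)
    return False  # still present after the timeout
```

On timeout the sketch returns False rather than raising, leaving the warn-versus-raise policy to the caller.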

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I watched the daemonset vanish one night,
I waited in hush as the API took flight,
A SIGTERM I heard, so I bowed with a sigh,
I raised SystemExit and let cleanup try,
Now pods sleep in order beneath the moonlight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix auxilary pod cleanup' directly addresses the main purpose of the PR, which is fixing cleanup of auxiliary pods by addressing a race condition with graceful shutdown.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/operator/backend_test_runner/daemonset_manager.py (1)

330-333: ⚠️ Potential issue | 🟡 Minor

Final cleanup doesn't wait for deletion to complete.

The finally block calls delete_daemonset() but not _wait_for_daemonset_deletion(). If deploy_and_wait() is called again quickly (e.g., in a retry loop), the old DaemonSet may still exist, causing a 409 Conflict despite the wait logic at line 318.

🛡️ Proposed fix: wait for deletion in finally block
         finally:
             # Clean up resources
             self.delete_daemonset()
+            self._wait_for_daemonset_deletion()
         return status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/backend_test_runner/daemonset_manager.py` around lines 330 -
333, The finally block in deploy_and_wait() currently calls delete_daemonset()
but doesn't wait for completion, so update the finally to call
self.delete_daemonset() followed by waiting for deletion (call
self._wait_for_daemonset_deletion() or the existing public wait helper) and
handle/ignore expected not-found errors; ensure you reference the
self.delete_daemonset() and self._wait_for_daemonset_deletion() methods so the
old DaemonSet is fully removed before any retry or subsequent deploy_and_wait()
invocation.
🧹 Nitpick comments (1)
src/operator/utils/node_validation_test/test_base.py (1)

159-160: Consider guarding against redundant signal handler registration.

register_graceful_shutdown() is called every time a NodeTestBase is instantiated. While signal handler re-registration is idempotent, if multiple NodeTestBase instances are created it results in redundant calls. Consider using a module-level flag to ensure one-time registration.

♻️ Optional: guard with a module-level flag
+_graceful_shutdown_registered = False
+
 def register_graceful_shutdown() -> None:
     """Register a SIGTERM handler that triggers finally-block cleanup.

     Kubernetes sends SIGTERM before SIGKILL during pod termination.
     Python's default SIGTERM handler terminates without running finally blocks.
     This converts SIGTERM into SystemExit, which does trigger finally blocks,
     allowing validators to clean up resources (e.g. benchmark pods) on shutdown.
     """
+    global _graceful_shutdown_registered
+    if _graceful_shutdown_registered:
+        return
     signal.signal(signal.SIGTERM, _sigterm_handler)
+    _graceful_shutdown_registered = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/utils/node_validation_test/test_base.py` around lines 159 - 160,
The call to register_graceful_shutdown() in NodeTestBase leads to repeated
registrations when multiple NodeTestBase instances are created; add a
module-level boolean flag (e.g., _graceful_shutdown_registered) and check it
before calling register_graceful_shutdown() inside the NodeTestBase
initialization (or wherever register_graceful_shutdown() is invoked), set the
flag to True after the first registration so subsequent instantiations skip
re-registering the signal handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/operator/backend_test_runner/daemonset_manager.py`:
- Around line 284-306: The _wait_for_daemonset_deletion method currently logs a
warning and returns on timeout which can lead to a 409 Conflict when
create_daemonset is called; change the behavior to raise a clear exception
(e.g., TimeoutError or a custom DaemonSetDeletionTimeout) when the loop times
out so callers (such as create_daemonset) fail fast and can handle the error;
update the end of _wait_for_daemonset_deletion to raise that exception with a
descriptive message including self.name and timeout.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ad1dfebe-7993-413b-bbca-84b3380da3f0

📥 Commits

Reviewing files that changed from the base of the PR and between a882e83 and 835403a.

📒 Files selected for processing (2)
  • src/operator/backend_test_runner/daemonset_manager.py
  • src/operator/utils/node_validation_test/test_base.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/operator/utils/node_validation_test/test_base.py`:
- Line 159: NodeTestBase currently calls register_graceful_shutdown() in its
__init__, which installs a process-wide SIGTERM handler and thus overrides
behavior for the whole process; change this by removing the registration from
NodeTestBase.__init__ and either (A) add an opt-in parameter to
NodeTestBase.__init__ (e.g., register_signal: bool = False) and only call
register_graceful_shutdown() when that flag is true, or (B) move the
register_graceful_shutdown() call out of NodeTestBase entirely and invoke it
explicitly from each validator entrypoint main() (ConnectionValidator main,
ResourceValidator main, LFSValidator main) before constructing NodeTestBase;
ensure DaemonSetManager codepath does not call register_graceful_shutdown()
unless explicitly intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1dea3fbe-7f74-4f74-88a3-349492f70b7c

📥 Commits

Reviewing files that changed from the base of the PR and between 835403a and 9fc8d2c.

📒 Files selected for processing (2)
  • src/operator/backend_test_runner/daemonset_manager.py
  • src/operator/utils/node_validation_test/test_base.py

@tdewanNvidia tdewanNvidia enabled auto-merge (squash) March 11, 2026 19:27
@tdewanNvidia tdewanNvidia merged commit bcb07fa into main Mar 11, 2026
9 checks passed
@tdewanNvidia tdewanNvidia deleted the tdewan/fix_cleanup branch March 11, 2026 20:28
