
Fix auxiliary pod cleanup #687

Merged
tdewanNvidia merged 1 commit into main from tdewan/fix_cleanup
Mar 11, 2026

Conversation

@tdewanNvidia
Contributor

@tdewanNvidia tdewanNvidia commented Mar 11, 2026

Description

Fix auxiliary pod cleanup and daemonset recreation race condition

When the DaemonSet manager cleaned up validator pods, it used grace_period_seconds=0, which sends SIGKILL immediately. This prevented Python finally blocks from executing, leaving auxiliary pods orphaned and consuming GPU resources indefinitely. This change sets grace_period_seconds=30 and adds a SIGTERM handler in NodeTestBase that converts SIGTERM into SystemExit, which triggers finally blocks for proper cleanup.
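The handler pattern described above can be sketched as follows. This is a minimal, self-contained illustration of the technique (converting SIGTERM into SystemExit so finally blocks run), not the PR's exact code:

```python
import signal


def _sigterm_handler(signum, frame):
    # Python's default SIGTERM disposition terminates the process without
    # unwinding the stack, so `finally` blocks never run. Raising SystemExit
    # instead propagates like a normal exception and triggers cleanup.
    raise SystemExit(128 + signum)


def register_graceful_shutdown():
    # Install the handler process-wide; Kubernetes sends SIGTERM first,
    # then SIGKILL after the grace period expires.
    signal.signal(signal.SIGTERM, _sigterm_handler)


if __name__ == "__main__":
    register_graceful_shutdown()
    try:
        pass  # long-running validator work would go here
    finally:
        print("cleanup runs on SIGTERM too")
```

With a 30-second grace period, this gives the validator's finally blocks time to delete auxiliary pods before the kubelet escalates to SIGKILL.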

Issue - None

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Summary by CodeRabbit

  • Bug Fixes
    • Improved DaemonSet redeployment cleanup by waiting for full removal before recreation and increasing pod termination grace period to reduce orphaned pods.
    • Added graceful shutdown handling for termination signals so cleanup routines run reliably during shutdown.

@tdewanNvidia tdewanNvidia requested a review from a team as a code owner March 11, 2026 18:55
@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Adds polling to wait for DaemonSet removal before redeployment, increases pod deletion grace period to 30 seconds, and registers a SIGTERM handler that raises SystemExit during NodeTestBase initialization.

Changes

Cohort / File(s) — Summary

  • DaemonSet Manager Cleanup (src/operator/backend_test_runner/daemonset_manager.py)
    Added _wait_for_daemonset_deletion(timeout: int = 120) to poll until the DaemonSet is removed (2s intervals, 120s timeout); changed delete_namespaced_daemon_set pod deletion grace_period_seconds from 0 to 30; deploy_and_wait() now calls the new wait helper after deletion before recreating the DaemonSet.
  • Graceful Shutdown Support (src/operator/utils/node_validation_test/test_base.py)
    Added _sigterm_handler(signum, frame) that logs SIGTERM and raises SystemExit(128 + signum); added register_graceful_shutdown() to install the SIGTERM handler; registered this handler in NodeTestBase.__init__.
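Stripped of Kubernetes specifics, the new wait helper is a bounded poll. A minimal, framework-agnostic sketch (here `exists` is a stand-in callable; the real method presumably queries the API server for the DaemonSet and treats a not-found response as deleted):

```python
import time


def wait_for_deletion(exists, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `exists()` until the object is gone or `timeout` seconds elapse.

    Returns True if the object disappeared in time, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not exists():
            return True  # fully deleted; safe to recreate
        time.sleep(interval)
    return False  # still present after the timeout
```

On timeout the sketch returns False rather than raising, leaving the warn-versus-raise policy to the caller.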

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I watched the daemonset vanish one night,
I waited in hush as the API took flight,
A SIGTERM I heard, so I bowed with a sigh,
I raised SystemExit and let cleanup try,
Now pods sleep in order beneath the moonlight.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix auxilary pod cleanup' directly addresses the main purpose of the PR, which is fixing cleanup of auxiliary pods by addressing a race condition with graceful shutdown.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/operator/backend_test_runner/daemonset_manager.py (1)

330-333: ⚠️ Potential issue | 🟡 Minor

Final cleanup doesn't wait for deletion to complete.

The finally block calls delete_daemonset() but not _wait_for_daemonset_deletion(). If deploy_and_wait() is called again quickly (e.g., in a retry loop), the old DaemonSet may still exist, causing a 409 Conflict despite the wait logic at line 318.

🛡️ Proposed fix: wait for deletion in finally block
         finally:
             # Clean up resources
             self.delete_daemonset()
+            self._wait_for_daemonset_deletion()
         return status
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/backend_test_runner/daemonset_manager.py` around lines 330 -
333, The finally block in deploy_and_wait() currently calls delete_daemonset()
but doesn't wait for completion, so update the finally to call
self.delete_daemonset() followed by waiting for deletion (call
self._wait_for_daemonset_deletion() or the existing public wait helper) and
handle/ignore expected not-found errors; ensure you reference the
self.delete_daemonset() and self._wait_for_daemonset_deletion() methods so the
old DaemonSet is fully removed before any retry or subsequent deploy_and_wait()
invocation.
🧹 Nitpick comments (1)
src/operator/utils/node_validation_test/test_base.py (1)

159-160: Consider guarding against redundant signal handler registration.

register_graceful_shutdown() is called every time a NodeTestBase is instantiated. While signal handler re-registration is idempotent, if multiple NodeTestBase instances are created it results in redundant calls. Consider using a module-level flag to ensure one-time registration.

♻️ Optional: guard with a module-level flag
+_graceful_shutdown_registered = False
+
 def register_graceful_shutdown() -> None:
     """Register a SIGTERM handler that triggers finally-block cleanup.

     Kubernetes sends SIGTERM before SIGKILL during pod termination.
     Python's default SIGTERM handler terminates without running finally blocks.
     This converts SIGTERM into SystemExit, which does trigger finally blocks,
     allowing validators to clean up resources (e.g. benchmark pods) on shutdown.
     """
+    global _graceful_shutdown_registered
+    if _graceful_shutdown_registered:
+        return
     signal.signal(signal.SIGTERM, _sigterm_handler)
+    _graceful_shutdown_registered = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/utils/node_validation_test/test_base.py` around lines 159 - 160,
The call to register_graceful_shutdown() in NodeTestBase leads to repeated
registrations when multiple NodeTestBase instances are created; add a
module-level boolean flag (e.g., _graceful_shutdown_registered) and check it
before calling register_graceful_shutdown() inside the NodeTestBase
initialization (or wherever register_graceful_shutdown() is invoked), set the
flag to True after the first registration so subsequent instantiations skip
re-registering the signal handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/operator/backend_test_runner/daemonset_manager.py`:
- Around line 284-306: The _wait_for_daemonset_deletion method currently logs a
warning and returns on timeout which can lead to a 409 Conflict when
create_daemonset is called; change the behavior to raise a clear exception
(e.g., TimeoutError or a custom DaemonSetDeletionTimeout) when the loop times
out so callers (such as create_daemonset) fail fast and can handle the error;
update the end of _wait_for_daemonset_deletion to raise that exception with a
descriptive message including self.name and timeout.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ad1dfebe-7993-413b-bbca-84b3380da3f0

📥 Commits

Reviewing files that changed from the base of the PR and between a882e83 and 835403a.

📒 Files selected for processing (2)
  • src/operator/backend_test_runner/daemonset_manager.py
  • src/operator/utils/node_validation_test/test_base.py


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/operator/utils/node_validation_test/test_base.py`:
- Line 159: NodeTestBase currently calls register_graceful_shutdown() in its
__init__, which installs a process-wide SIGTERM handler and thus overrides
behavior for the whole process; change this by removing the registration from
NodeTestBase.__init__ and either (A) add an opt-in parameter to
NodeTestBase.__init__ (e.g., register_signal: bool = False) and only call
register_graceful_shutdown() when that flag is true, or (B) move the
register_graceful_shutdown() call out of NodeTestBase entirely and invoke it
explicitly from each validator entrypoint main() (ConnectionValidator main,
ResourceValidator main, LFSValidator main) before constructing NodeTestBase;
ensure DaemonSetManager codepath does not call register_graceful_shutdown()
unless explicitly intended.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 1dea3fbe-7f74-4f74-88a3-349492f70b7c

📥 Commits

Reviewing files that changed from the base of the PR and between 835403a and 9fc8d2c.

📒 Files selected for processing (2)
  • src/operator/backend_test_runner/daemonset_manager.py
  • src/operator/utils/node_validation_test/test_base.py

@tdewanNvidia tdewanNvidia enabled auto-merge (squash) March 11, 2026 19:27
@tdewanNvidia tdewanNvidia merged commit bcb07fa into main Mar 11, 2026
9 checks passed
@tdewanNvidia tdewanNvidia deleted the tdewan/fix_cleanup branch March 11, 2026 20:28
