Conversation
📝 Walkthrough

Adds polling to wait for DaemonSet removal before redeployment, increases the pod deletion grace period to 30 seconds, and registers a SIGTERM handler that raises SystemExit during NodeTestBase initialization.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
Caution: Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/operator/backend_test_runner/daemonset_manager.py (1)
Lines 330-333: ⚠️ Potential issue | 🟡 Minor: Final cleanup doesn't wait for deletion to complete.
The `finally` block calls `delete_daemonset()` but not `_wait_for_daemonset_deletion()`. If `deploy_and_wait()` is called again quickly (e.g., in a retry loop), the old DaemonSet may still exist, causing a 409 Conflict despite the wait logic at line 318.

🛡️ Proposed fix: wait for deletion in the finally block
```diff
         finally:
             # Clean up resources
             self.delete_daemonset()
+            self._wait_for_daemonset_deletion()
         return status
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/operator/backend_test_runner/daemonset_manager.py` around lines 330 - 333, The finally block in deploy_and_wait() currently calls delete_daemonset() but doesn't wait for completion, so update the finally to call self.delete_daemonset() followed by waiting for deletion (call self._wait_for_daemonset_deletion() or the existing public wait helper) and handle/ignore expected not-found errors; ensure you reference the self.delete_daemonset() and self._wait_for_daemonset_deletion() methods so the old DaemonSet is fully removed before any retry or subsequent deploy_and_wait() invocation.
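The combined fix (delete, then wait before any retry) boils down to a bounded polling loop. Below is a minimal, library-free sketch of such a wait helper; the name `wait_for_deletion`, the timings, and the simulated resource are all illustrative, not the project's actual API:

```python
import time

def wait_for_deletion(resource_exists, timeout=60.0, interval=2.0):
    """Poll until resource_exists() returns False, then return.

    Raises TimeoutError if the resource is still present at the deadline,
    so callers fail fast instead of hitting a 409 Conflict on re-create.
    """
    deadline = time.monotonic() + timeout
    while True:
        if not resource_exists():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"resource still present after {timeout}s")
        time.sleep(interval)

# Simulate a DaemonSet whose API object disappears on the third poll.
state = {"polls": 0}

def fake_exists():
    state["polls"] += 1
    return state["polls"] < 3

wait_for_deletion(fake_exists, timeout=5.0, interval=0.01)
```

In the real manager this callable would wrap a Kubernetes API read and treat a not-found response as "deleted".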
🧹 Nitpick comments (1)
src/operator/utils/node_validation_test/test_base.py (1)
159-160: Consider guarding against redundant signal handler registration.
`register_graceful_shutdown()` is called every time a `NodeTestBase` is instantiated. While signal handler re-registration is idempotent, creating multiple `NodeTestBase` instances results in redundant calls. Consider using a module-level flag to ensure one-time registration.

♻️ Optional: guard with a module-level flag
```diff
+_graceful_shutdown_registered = False
+
 def register_graceful_shutdown() -> None:
     """Register a SIGTERM handler that triggers finally-block cleanup.

     Kubernetes sends SIGTERM before SIGKILL during pod termination.
     Python's default SIGTERM handler terminates without running finally
     blocks. This converts SIGTERM into SystemExit, which does trigger
     finally blocks, allowing validators to clean up resources
     (e.g. benchmark pods) on shutdown.
     """
+    global _graceful_shutdown_registered
+    if _graceful_shutdown_registered:
+        return
     signal.signal(signal.SIGTERM, _sigterm_handler)
+    _graceful_shutdown_registered = True
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/operator/utils/node_validation_test/test_base.py` around lines 159 - 160, The call to register_graceful_shutdown() in NodeTestBase leads to repeated registrations when multiple NodeTestBase instances are created; add a module-level boolean flag (e.g., _graceful_shutdown_registered) and check it before calling register_graceful_shutdown() inside the NodeTestBase initialization (or wherever register_graceful_shutdown() is invoked), set the flag to True after the first registration so subsequent instantiations skip re-registering the signal handlers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/operator/backend_test_runner/daemonset_manager.py`:
- Around line 284-306: The _wait_for_daemonset_deletion method currently logs a
warning and returns on timeout which can lead to a 409 Conflict when
create_daemonset is called; change the behavior to raise a clear exception
(e.g., TimeoutError or a custom DaemonSetDeletionTimeout) when the loop times
out so callers (such as create_daemonset) fail fast and can handle the error;
update the end of _wait_for_daemonset_deletion to raise that exception with a
descriptive message including self.name and timeout.
---
Outside diff comments:
In `@src/operator/backend_test_runner/daemonset_manager.py`:
- Around line 330-333: The finally block in deploy_and_wait() currently calls
delete_daemonset() but doesn't wait for completion, so update the finally to
call self.delete_daemonset() followed by waiting for deletion (call
self._wait_for_daemonset_deletion() or the existing public wait helper) and
handle/ignore expected not-found errors; ensure you reference the
self.delete_daemonset() and self._wait_for_daemonset_deletion() methods so the
old DaemonSet is fully removed before any retry or subsequent deploy_and_wait()
invocation.
---
Nitpick comments:
In `@src/operator/utils/node_validation_test/test_base.py`:
- Around line 159-160: The call to register_graceful_shutdown() in NodeTestBase
leads to repeated registrations when multiple NodeTestBase instances are
created; add a module-level boolean flag (e.g., _graceful_shutdown_registered)
and check it before calling register_graceful_shutdown() inside the NodeTestBase
initialization (or wherever register_graceful_shutdown() is invoked), set the
flag to True after the first registration so subsequent instantiations skip
re-registering the signal handlers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ad1dfebe-7993-413b-bbca-84b3380da3f0
📒 Files selected for processing (2)
- src/operator/backend_test_runner/daemonset_manager.py
- src/operator/utils/node_validation_test/test_base.py
Force-pushed from 835403a to 9fc8d2c
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/operator/utils/node_validation_test/test_base.py`:
- Line 159: NodeTestBase currently calls register_graceful_shutdown() in its
__init__, which installs a process-wide SIGTERM handler and thus overrides
behavior for the whole process; change this by removing the registration from
NodeTestBase.__init__ and either (A) add an opt-in parameter to
NodeTestBase.__init__ (e.g., register_signal: bool = False) and only call
register_graceful_shutdown() when that flag is true, or (B) move the
register_graceful_shutdown() call out of NodeTestBase entirely and invoke it
explicitly from each validator entrypoint main() (ConnectionValidator main,
ResourceValidator main, LFSValidator main) before constructing NodeTestBase;
ensure DaemonSetManager codepath does not call register_graceful_shutdown()
unless explicitly intended.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 1dea3fbe-7f74-4f74-88a3-349492f70b7c
📒 Files selected for processing (2)
- src/operator/backend_test_runner/daemonset_manager.py
- src/operator/utils/node_validation_test/test_base.py
Description
Fix auxiliary pod cleanup and daemonset recreation race condition
When the DaemonSet manager cleans up validator pods, it was using `grace_period_seconds=0`, which sends SIGKILL immediately. This prevented Python finally blocks from executing, leaving auxiliary pods orphaned and consuming GPU resources indefinitely. Changed to `grace_period_seconds=30` and added a SIGTERM handler in NodeTestBase that converts SIGTERM into SystemExit, which triggers finally blocks for proper cleanup.
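The SIGTERM-to-SystemExit conversion can be demonstrated end to end on POSIX. This is a standalone sketch, not the project's code; the handler name and the cleanup action are illustrative:

```python
import os
import signal
import time

cleanup_ran = []

def _sigterm_handler(signum, frame):
    # SystemExit propagates like a normal exception, so finally runs.
    raise SystemExit(0)

signal.signal(signal.SIGTERM, _sigterm_handler)

try:
    try:
        # The kubelet sends SIGTERM first, then SIGKILL after the grace
        # period expires; simulate the SIGTERM here.
        os.kill(os.getpid(), signal.SIGTERM)
        time.sleep(0.2)  # give the handler a chance to fire inside the try
    finally:
        cleanup_ran.append("deleted benchmark pod")  # cleanup still runs
except SystemExit:
    pass
```

With the default SIGTERM disposition the process would die before reaching the finally block, which is exactly the orphaned-pod failure mode this PR fixes; the 30-second grace period gives that cleanup time to finish before SIGKILL arrives.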
Issue - None
Checklist