-
Notifications
You must be signed in to change notification settings - Fork 12
fix: delete k8s jobs after failure (important in setups where the system attaches sidecar pods, which can otherwise remain after failure) #49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…tem attaches sidecar pods, which can otherwise remain after failure)
|
Warning Rate limit exceeded@johanneskoester has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 14 minutes and 51 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe changes update the Changes
Sequence Diagram(s)sequenceDiagram
participant Executor
participant KubernetesAPI
participant Logger
Executor->>KubernetesAPI: Read job status (with retry)
alt Job status not found (404)
Executor->>Logger: Log error
Executor->>Executor: Report job error
else Job status found
alt Job failed
Executor->>KubernetesAPI: Get pod name
alt Pod name available
Executor->>KubernetesAPI: Read pod logs
Executor->>Logger: Log error with aux_logs
else Pod name missing
Executor->>Logger: Log error without aux_logs
end
Executor->>KubernetesAPI: Delete job (with retry)
else Job succeeded
Executor->>KubernetesAPI: Delete job (with retry)
end
end
Possibly related PRs
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
7-7: Remove the unusedGeneratorimport.Since the method signature has been changed to use
AsyncGenerator, theGeneratorimport is no longer needed.-from typing import AsyncGenerator, List, Generator, Optional, Self +from typing import AsyncGenerator, List, Optional, Self🧰 Tools
🪛 Ruff (0.11.9)
7-7:
typing.Generatorimported but unused(F401)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake_executor_plugin_kubernetes/__init__.py
🪛 Ruff (0.11.9)
snakemake_executor_plugin_kubernetes/__init__.py
7-7: typing.Generator imported but unused
(F401)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (8)
snakemake_executor_plugin_kubernetes/__init__.py (8)
397-399: Good fix for the async generator type annotation!The change from
GeneratortoAsyncGeneratorcorrectly reflects that this is an async method that yields values.
418-422: Excellent fix for the lambda late binding issue!The explicit capture
lambda j=j:ensures that each lambda closure captures the correct job instance from the loop iteration, preventing potential bugs where all lambdas would reference the last job.
423-426: Consistent error handling for all API exceptions.The simplified approach of treating all
ApiExceptionerrors uniformly is appropriate, as any API error (including 404) indicates a problem that should be reported.
432-439: Good defensive programming with the None check.Adding an explicit check for
res is Noneprevents potential AttributeError and provides a clear error message when the job status cannot be retrieved.
463-467: Clean refactoring to extract pod_name.Storing the pod name in a variable improves readability and avoids repeated attribute access.
469-499: Excellent improvements to job failure handling!The changes introduce several important improvements:
- Conditional log collection only when pod exists (preventing errors)
- Proper cleanup of failed jobs (crucial for environments with sidecar containers)
- Consistent lambda capture pattern
This addresses the PR objective of cleaning up jobs after failure.
505-509: Consistent lambda capture pattern maintained.Good to see the same lambda capture fix applied here for successful job cleanup.
576-579: Appropriate log level adjustment.Changing from warning to debug for expected 404 errors (when
ignore_not_found=True) reduces log noise while still capturing the information for debugging purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
478-491: Address the encoding warning from static analysis.The conditional log reading logic is sound, but there's a static analysis warning about not specifying encoding when opening the file.
- with open(kube_log, "w") as f: + with open(kube_log, "w", encoding="utf-8") as f:🧰 Tools
🪛 Pylint (3.3.7)
[warning] 481-481: Using open without explicitly specifying an encoding
(W1514)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake_executor_plugin_kubernetes/__init__.py
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py
[convention] 397-397: Missing function or method docstring
(C0116)
[refactor] 397-397: Too many statements (51/50)
(R0915)
[convention] 434-434: Formatting a regular string which could be an f-string
(C0209)
[warning] 481-481: Using open without explicitly specifying an encoding
(W1514)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (11)
snakemake_executor_plugin_kubernetes/__init__.py (11)
7-7: LGTM! Correct import for async generator.The import change from
GeneratortoAsyncGeneratoris correct and aligns with the method signature change.
391-391: LGTM! Proper f-string usage.Good fix to use f-string interpolation for the job ID, which is more efficient and readable than string concatenation.
399-399: LGTM! Correct async generator type annotation.The method signature change from
GeneratortoAsyncGeneratoris correct for an async method that yields values.
419-422: LGTM! Fixed lambda late binding issue.The explicit variable capture
lambda j=jcorrectly fixes the late binding problem where the lambda would capture the loop variable by reference rather than by value.
432-438: LGTM! Added defensive programming for None response.Good defensive programming to explicitly check for
res is Noneand handle the case where the job status response is missing. The error message is clear and informative.🧰 Tools
🪛 Pylint (3.3.7)
[convention] 434-434: Formatting a regular string which could be an f-string
(C0209)
463-463: LGTM! Efficient pod name extraction.Good optimization to extract the pod name into a variable to avoid repeated attribute access and make the code more readable.
Also applies to: 467-467
493-493: LGTM! Proper aux_logs integration.Good integration of the auxiliary logs into the error reporting, which will help with debugging failed jobs.
495-499: LGTM! Implements the PR objective perfectly.This is the core change that implements the PR objective - deleting Kubernetes jobs after failure to prevent sidecar pods from remaining. The lambda capture pattern is correctly applied.
506-506: LGTM! Consistent lambda capture pattern.Good consistency in applying the lambda capture pattern to the success case as well, even though it was already working correctly.
576-576: LGTM! Appropriate log level change.Changing from warning to debug level for 404 "not found" errors is appropriate since these are expected in normal operation when jobs are already deleted.
424-426:Details
❓ Verification inconclusive
Verify the removal of special 404 handling.
The special handling for 404 ApiException has been removed. Previously, 404 errors were handled differently, but now all ApiExceptions are treated as errors and reported.
🏁 Script executed:
#!/bin/bash # Description: Check if there are other places in the codebase that handle 404 ApiExceptions differently # Expected: Find how 404 errors are handled elsewhere to ensure consistency rg -A 3 -B 3 "ApiException.*404|404.*ApiException" --type pyLength of output: 62
To ensure we didn’t miss any special‐case handling for 404 errors, let’s search for all ApiException usages and any explicit status checks or HTTPStatus references:
🏁 Script executed:
#!/usr/bin/env bash set -e # 1. List every ApiException occurrence to review its context rg -n "ApiException" --type py # 2. Look for any explicit 404 status checks rg -n -C2 "status.*404" --type py # 3. Search for HTTPStatus.NOT_FOUND usage rg -n "NOT_FOUND" --type pyLength of output: 1440
Confirm intentional 404‐ignore removal in pod-status checks
I only found one special 404 handler in the delete-job path (around lines 573–577). In the pod-status loop (lines 423–426), every
ApiException—including 404—is now logged as an error and reported:• snakemake_executor_plugin_kubernetes/init.py:423–426
If you intended to stop ignoring missing pods when polling status, this change is fine. Otherwise, you may want to reintroduce something like:
- except ApiException as e: - self.logger.error(f"ApiException when checking pod status: {e}") - self.report_job_error(j, msg=str(e)) - continue + except ApiException as e: + if e.status == 404 and ignore_not_found: + self.logger.debug(f"[WARNING] Pod not found for job {j.id}, ignoring.") + continue + self.logger.error(f"ApiException when checking pod status: {e}") + self.report_job_error(j, msg=str(e)) + continuePlease verify that treating 404 here as a hard error matches your intent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
453-462:⚠️ Potential issuePossible
IndexErrorwhen thesnakemakecontainer is missing
pod.status.container_statusescan beNone, and even if it exists the
list-comprehension may be empty (e.g. init-/sidecar-only pods).
[0]will then raise anIndexError, breaking the whole status loop.- snakemake_container = [ - container - for container in pod.status.container_statuses - if container.name == "snakemake" - ][0] - snakemake_container_exit_code = ( - snakemake_container.state.terminated.exit_code - if snakemake_container.state.terminated is not None - else None - ) + container_statuses = pod.status.container_statuses or [] + snakemake_candidates = [ + c for c in container_statuses if c.name == "snakemake" + ] + snakemake_container = snakemake_candidates[0] if snakemake_candidates else None + snakemake_container_exit_code = ( + snakemake_container.state.terminated.exit_code + if snakemake_container + and snakemake_container.state + and snakemake_container.state.terminated is not None + else None + )
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
572-584: Pod deletion bypasses retry / re-authentication logicEach pod deletion can hit the same
401token-expiry issue as any other
call. Wrap both listing and deletion in_kubernetes_retryto keep the
behaviour consistent.- pods = self.kubeapi.list_namespaced_pod( + pods = self._kubernetes_retry( + lambda: self.kubeapi.list_namespaced_pod( namespace=self.namespace, label_selector=f"job-name={jobid}", - ) + ) + ) for pod in pods.items: self.logger.debug(f"Deleting pod {pod.metadata.name} for job {jobid}") - self.kubeapi.delete_namespaced_pod( - pod.metadata.name, self.namespace, body=body - ) + self._kubernetes_retry( + lambda n=pod.metadata.name: self.kubeapi.delete_namespaced_pod( + n, self.namespace, body=body + ) + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py(5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake_executor_plugin_kubernetes/__init__.py
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py
[refactor] 397-397: Too many statements (51/50)
(R0915)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
424-426:⚠️ Potential issueCleanup still missing on
ApiExceptionpath
The earlier review (see legacy comment on lines 424-427) already pointed out that we leak Jobs/Pods when the status request itself fails.
The new code still exits aftercontinuewithout callingsafe_delete_job, so the resource-leak persists.- self.report_job_error(j, msg=str(e)) - continue + self.report_job_error(j, msg=str(e)) + # ensure cleanup even when the status call fails + self._kubernetes_retry( + lambda j=j: self.safe_delete_job( + j.external_jobid, ignore_not_found=True + ) + ) + continue
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
574-588: Consider wrapping pod deletion in_kubernetes_retry
safe_delete_job()is usually called via_kubernetes_retry, but nothing prevents future direct calls.
Wrapping the internal pod-list / delete sequence in its own retry would makesafe_delete_jobrobust when used standalone.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py(6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
snakemake_executor_plugin_kubernetes/__init__.py
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py
[refactor] 397-397: Too many statements (51/50)
(R0915)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)
419-423: Good catch – late-binding bug removed
Capturingjinside the lambda (lambda j=j: …) prevents the classic late-binding pitfall when the loop continues. 👍
🤖 I have created a release *beep* *boop* --- ## [0.4.2](v0.4.1...v0.4.2) (2025-06-13) ### Bug Fixes * delete k8s jobs after failure (important in setups where the system attaches sidecar pods, which can otherwise remain after failure) ([#49](#49)) ([1b24e6f](1b24e6f)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Summary by CodeRabbit
Bug Fixes
Refactor