Contributor

@johanneskoester johanneskoester commented May 27, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Improved error handling and logging for job status checks and failures.
    • Enhanced robustness when dealing with missing or failed Kubernetes jobs.
    • Refined pod log retrieval and job/pod cleanup processes for better reliability.
  • Refactor

    • Updated job status checking to use asynchronous processing for better performance and reliability.

…tem attaches sidecar pods, which can otherwise remain after failure)
Contributor

coderabbitai bot commented May 27, 2025

📥 Commits

Reviewing files that changed from the base of the PR and between 978fc99 and 7e1c9a6.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (7 hunks)
📝 Walkthrough

Walkthrough

The changes update the check_active_jobs method in the Kubernetes Snakemake executor plugin to use asynchronous generators, refine error handling for job and pod status checks, and improve resource cleanup logic. Adjustments include explicit variable captures in lambdas, improved logging, and more robust handling of missing or failed Kubernetes jobs.

Changes

File(s): snakemake_executor_plugin_kubernetes/__init__.py

Change Summary:

  • Changed the check_active_jobs return type to AsyncGenerator.
  • Added explicit lambda captures.
  • Removed special 404 handling in the job status check.
  • Added a res is None check.
  • Stored the pod name in a variable.
  • Made pod log retrieval conditional.
  • Improved job failure cleanup with retry and safe deletion.
  • Changed the 404 log level in safe_delete_job from warning to debug.
  • Updated the _kubernetes_retry return type annotation.
  • Fixed log message formatting in run_job.

Sequence Diagram(s)

sequenceDiagram
    participant Executor
    participant KubernetesAPI
    participant Logger

    Executor->>KubernetesAPI: Read job status (with retry)
    alt Job status not found (404)
        Executor->>Logger: Log error
        Executor->>Executor: Report job error
    else Job status found
        alt Job failed
            Executor->>KubernetesAPI: Get pod name
            alt Pod name available
                Executor->>KubernetesAPI: Read pod logs
                Executor->>Logger: Log error with aux_logs
            else Pod name missing
                Executor->>Logger: Log error without aux_logs
            end
            Executor->>KubernetesAPI: Delete job (with retry)
        else Job succeeded
            Executor->>KubernetesAPI: Delete job (with retry)
        end
    end

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

7-7: Remove the unused Generator import.

Since the method signature has been changed to use AsyncGenerator, the Generator import is no longer needed.

-from typing import AsyncGenerator, List, Generator, Optional, Self
+from typing import AsyncGenerator, List, Optional, Self
🧰 Tools
🪛 Ruff (0.11.9)

7-7: typing.Generator imported but unused

(F401)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb5ccca and bc04b38.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake_executor_plugin_kubernetes/__init__.py
🪛 Ruff (0.11.9)
snakemake_executor_plugin_kubernetes/__init__.py

7-7: typing.Generator imported but unused

(F401)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (8)
snakemake_executor_plugin_kubernetes/__init__.py (8)

397-399: Good fix for the async generator type annotation!

The change from Generator to AsyncGenerator correctly reflects that this is an async method that yields values.
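
As a minimal sketch of the annotation discussed here (not the plugin's actual method): an `async def` function that contains `yield` is an async generator, so `AsyncGenerator[YieldType, SendType]` is the matching return annotation. The names `check_jobs` and `collect`, and the `done-` prefix convention, are hypothetical and exist only for illustration.

```python
import asyncio
from typing import AsyncGenerator, List


async def check_jobs(job_ids: List[str]) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for a status-polling method; yields jobs still running."""
    for job_id in job_ids:
        await asyncio.sleep(0)  # placeholder for an awaited status request
        if not job_id.startswith("done-"):
            yield job_id


async def collect(job_ids: List[str]) -> List[str]:
    # An async generator has to be consumed with `async for`, not a plain `for`.
    return [job_id async for job_id in check_jobs(job_ids)]


still_running = asyncio.run(collect(["done-1", "job-2", "job-3"]))
```

Annotating such a function with plain `Generator` would tell type checkers that callers can iterate it synchronously, which is not the case.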


418-422: Excellent fix for the lambda late binding issue!

The explicit capture lambda j=j: binds the current job as a default argument when the lambda is defined, so each lambda keeps the job from its own loop iteration. Without it, every lambda would look up j only when called and could reference the last job.
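
The late-binding behaviour can be reproduced in isolation. The following sketch uses plain strings instead of job objects (the job names are made up) and compares a closure that reads the loop variable at call time with one that binds it as a default argument.

```python
jobs = ["job-a", "job-b", "job-c"]

late = []
bound = []
for j in jobs:
    # Looks up `j` when the lambda is called, i.e. after the loop has moved on.
    late.append(lambda: j)
    # Evaluates `j` now and stores it as the default value of the parameter.
    bound.append(lambda j=j: j)

late_results = [f() for f in late]
bound_results = [f() for f in bound]
```

Here `late_results` contains the last job three times, while `bound_results` preserves one job per iteration. The difference only matters when the callable runs after the loop variable has changed, for example when a retry helper invokes it later.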


423-426: Consistent error handling for all API exceptions.

The simplified approach of treating all ApiException errors uniformly is appropriate, as any API error (including 404) indicates a problem that should be reported.


432-439: Good defensive programming with the None check.

Adding an explicit check for res is None prevents potential AttributeError and provides a clear error message when the job status cannot be retrieved.


463-467: Clean refactoring to extract pod_name.

Storing the pod name in a variable improves readability and avoids repeated attribute access.


469-499: Excellent improvements to job failure handling!

The changes introduce several important improvements:

  1. Conditional log collection only when pod exists (preventing errors)
  2. Proper cleanup of failed jobs (crucial for environments with sidecar containers)
  3. Consistent lambda capture pattern

This addresses the PR objective of cleaning up jobs after failure.


505-509: Consistent lambda capture pattern maintained.

Good to see the same lambda capture fix applied here for successful job cleanup.


576-579: Appropriate log level adjustment.

Changing from warning to debug for expected 404 errors (when ignore_not_found=True) reduces log noise while still capturing the information for debugging purposes.
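
A rough sketch of the pattern described in this comment, assuming a delete helper that may be asked to ignore missing resources. `FakeApiException` and `safe_delete` are hypothetical stand-ins written for this example; they are not the Kubernetes client's exception class or the plugin's `safe_delete_job`.

```python
import logging

logger = logging.getLogger("k8s-sketch")


class FakeApiException(Exception):
    """Hypothetical stand-in for an API client exception carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"status {status}")
        self.status = status


def safe_delete(delete_call, ignore_not_found: bool = True) -> bool:
    """Run delete_call; return True if the resource is gone afterwards."""
    try:
        delete_call()
        return True
    except FakeApiException as e:
        if e.status == 404 and ignore_not_found:
            # Already deleted: expected during cleanup, so log quietly.
            logger.debug("Resource not found, nothing to delete: %s", e)
            return True
        logger.warning("Deletion failed: %s", e)
        raise
```

With `ignore_not_found=True`, a 404 is treated as success and only shows up at debug level; any other status is still logged as a warning and re-raised.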

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

478-491: Address the encoding warning from static analysis.

The conditional log reading logic is sound, but there's a static analysis warning about not specifying encoding when opening the file.

-                        with open(kube_log, "w") as f:
+                        with open(kube_log, "w", encoding="utf-8") as f:
🧰 Tools
🪛 Pylint (3.3.7)

[warning] 481-481: Using open without explicitly specifying an encoding

(W1514)
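
For context on W1514, the sketch below writes and re-reads a log file with an explicit encoding. Without the `encoding` argument, `open` falls back to a locale-dependent default, so non-ASCII log content can fail or be decoded differently across machines. The file name and log text are made up for this example.

```python
import tempfile
from pathlib import Path

log_text = "pod log: Fehler beim Ausführen\n"

with tempfile.TemporaryDirectory() as tmpdir:
    kube_log = Path(tmpdir) / "job.log"  # hypothetical log path
    # Explicit encoding makes the bytes on disk independent of the locale default.
    with open(kube_log, "w", encoding="utf-8") as f:
        f.write(log_text)
    with open(kube_log, "r", encoding="utf-8") as f:
        round_trip = f.read()
```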

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 5c51039 and c14da96.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (5 hunks)
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py

[convention] 397-397: Missing function or method docstring

(C0116)


[refactor] 397-397: Too many statements (51/50)

(R0915)


[convention] 434-434: Formatting a regular string which could be an f-string

(C0209)


[warning] 481-481: Using open without explicitly specifying an encoding

(W1514)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (11)
snakemake_executor_plugin_kubernetes/__init__.py (11)

7-7: LGTM! Correct import for async generator.

The import change from Generator to AsyncGenerator is correct and aligns with the method signature change.


391-391: LGTM! Proper f-string usage.

Good fix to use f-string interpolation for the job ID, which is more efficient and readable than string concatenation.


399-399: LGTM! Correct async generator type annotation.

The method signature change from Generator to AsyncGenerator is correct for an async method that yields values.


419-422: LGTM! Fixed lambda late binding issue.

The explicit variable capture lambda j=j correctly fixes the late binding problem, where the lambda would look up the loop variable when it is called rather than keep the value it had when the lambda was defined.


432-438: LGTM! Added defensive programming for None response.

Good defensive programming to explicitly check for res is None and handle the case where the job status response is missing. The error message is clear and informative.

🧰 Tools
🪛 Pylint (3.3.7)

[convention] 434-434: Formatting a regular string which could be an f-string

(C0209)


463-463: LGTM! Efficient pod name extraction.

Good optimization to extract the pod name into a variable to avoid repeated attribute access and make the code more readable.

Also applies to: 467-467


493-493: LGTM! Proper aux_logs integration.

Good integration of the auxiliary logs into the error reporting, which will help with debugging failed jobs.


495-499: LGTM! Implements the PR objective perfectly.

This is the core change that implements the PR objective - deleting Kubernetes jobs after failure to prevent sidecar pods from remaining. The lambda capture pattern is correctly applied.


506-506: LGTM! Consistent lambda capture pattern.

Good consistency in applying the lambda capture pattern to the success case as well, even though it was already working correctly.


576-576: LGTM! Appropriate log level change.

Changing from warning to debug level for 404 "not found" errors is appropriate since these are expected in normal operation when jobs are already deleted.


424-426:

❓ Verification inconclusive

Verify the removal of special 404 handling.

The special handling for 404 ApiException has been removed. Previously, 404 errors were handled differently, but now all ApiExceptions are treated as errors and reported.


🏁 Script executed:

#!/bin/bash
# Description: Check if there are other places in the codebase that handle 404 ApiExceptions differently
# Expected: Find how 404 errors are handled elsewhere to ensure consistency

rg -A 3 -B 3 "ApiException.*404|404.*ApiException" --type py

Length of output: 62


To ensure we didn’t miss any special‐case handling for 404 errors, let’s search for all ApiException usages and any explicit status checks or HTTPStatus references:


🏁 Script executed:

#!/usr/bin/env bash
set -e
# 1. List every ApiException occurrence to review its context
rg -n "ApiException" --type py

# 2. Look for any explicit 404 status checks
rg -n -C2 "status.*404" --type py

# 3. Search for HTTPStatus.NOT_FOUND usage
rg -n "NOT_FOUND" --type py

Length of output: 1440


Confirm intentional 404‐ignore removal in pod-status checks

I only found one special 404 handler in the delete-job path (around lines 573–577). In the pod-status loop (lines 423–426), every ApiException—including 404—is now logged as an error and reported:

• snakemake_executor_plugin_kubernetes/__init__.py:423–426

If you intended to stop ignoring missing pods when polling status, this change is fine. Otherwise, you may want to reintroduce something like:

-   except ApiException as e:
-       self.logger.error(f"ApiException when checking pod status: {e}")
-       self.report_job_error(j, msg=str(e))
-       continue
+   except ApiException as e:
+       if e.status == 404 and ignore_not_found:
+           self.logger.debug(f"Pod not found for job {j.id}, ignoring.")
+           continue
+       self.logger.error(f"ApiException when checking pod status: {e}")
+       self.report_job_error(j, msg=str(e))
+       continue

Please verify that treating 404 here as a hard error matches your intent.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

453-462: ⚠️ Potential issue

Possible IndexError when the snakemake container is missing

pod.status.container_statuses can be None, in which case iterating over it
raises a TypeError. Even if it exists, the list comprehension may be empty
(e.g. init-/sidecar-only pods), and [0] will then raise an IndexError. Either
exception breaks the whole status loop.

-                    snakemake_container = [
-                        container
-                        for container in pod.status.container_statuses
-                        if container.name == "snakemake"
-                    ][0]
-                    snakemake_container_exit_code = (
-                        snakemake_container.state.terminated.exit_code
-                        if snakemake_container.state.terminated is not None
-                        else None
-                    )
+                    container_statuses = pod.status.container_statuses or []
+                    snakemake_candidates = [
+                        c for c in container_statuses if c.name == "snakemake"
+                    ]
+                    snakemake_container = snakemake_candidates[0] if snakemake_candidates else None
+                    snakemake_container_exit_code = (
+                        snakemake_container.state.terminated.exit_code
+                        if snakemake_container
+                        and snakemake_container.state
+                        and snakemake_container.state.terminated is not None
+                        else None
+                    )
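
The defensive lookup suggested above can be exercised without a cluster. The sketch below is an illustration, not the plugin's code: it uses `types.SimpleNamespace` objects as hypothetical stand-ins for pod objects, and the helper names `snakemake_exit_code` and `fake_pod` are invented for this example.

```python
from types import SimpleNamespace
from typing import Optional


def snakemake_exit_code(pod) -> Optional[int]:
    """Return the exit code of the 'snakemake' container, or None if unavailable."""
    statuses = pod.status.container_statuses or []  # tolerate a missing status list
    container = next((c for c in statuses if c.name == "snakemake"), None)
    if container is None or container.state is None:
        return None
    terminated = container.state.terminated
    return terminated.exit_code if terminated is not None else None


def fake_pod(statuses):
    # Hypothetical stand-in for a pod object returned by a Kubernetes client.
    return SimpleNamespace(status=SimpleNamespace(container_statuses=statuses))


finished = SimpleNamespace(
    name="snakemake",
    state=SimpleNamespace(terminated=SimpleNamespace(exit_code=1)),
)
sidecar = SimpleNamespace(name="sidecar", state=SimpleNamespace(terminated=None))
```

Using `next(..., None)` instead of indexing with `[0]` means a pod that has only sidecar containers yields `None` rather than raising.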
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

572-584: Pod deletion bypasses retry / re-authentication logic

Each pod deletion can hit the same 401 token-expiry issue as any other
call. Wrap both listing and deletion in _kubernetes_retry to keep the
behaviour consistent.

-            pods = self.kubeapi.list_namespaced_pod(
+            pods = self._kubernetes_retry(
+                lambda: self.kubeapi.list_namespaced_pod(
                     namespace=self.namespace,
                     label_selector=f"job-name={jobid}",
-            )
+                )
+            )
             for pod in pods.items:
                 self.logger.debug(f"Deleting pod {pod.metadata.name} for job {jobid}")
-                self.kubeapi.delete_namespaced_pod(
-                    pod.metadata.name, self.namespace, body=body
-                )
+                self._kubernetes_retry(
+                    lambda n=pod.metadata.name: self.kubeapi.delete_namespaced_pod(
+                        n, self.namespace, body=body
+                    )
+                )
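
To illustrate why every API call benefits from going through the same retry path, here is a simplified sketch of a retry helper that refreshes credentials on an authorization failure. How the plugin's `_kubernetes_retry` is actually implemented is not shown in this conversation; `retry_with_reauth`, `FakeUnauthorized`, and the simulated client below are assumptions made purely for this example.

```python
class FakeUnauthorized(Exception):
    """Hypothetical stand-in for an API error with HTTP status 401."""


def retry_with_reauth(call, reauthenticate, max_attempts: int = 3):
    """Invoke call(); on an authorization failure, refresh credentials and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except FakeUnauthorized:
            if attempt == max_attempts:
                raise
            reauthenticate()


# Simulated client whose token is expired until reauthenticate() runs.
state = {"token_valid": False, "deleted": []}


def delete_pod(name: str) -> str:
    if not state["token_valid"]:
        raise FakeUnauthorized(name)
    state["deleted"].append(name)
    return name


def reauthenticate() -> None:
    state["token_valid"] = True


pod_names = ["pod-a", "pod-b"]
# n=name binds each pod name when the lambda is created.
results = [
    retry_with_reauth(lambda n=name: delete_pod(n), reauthenticate)
    for name in pod_names
]
```

In this simulation the first deletion fails once, triggers re-authentication, and then succeeds. A call made outside the helper would have surfaced the authorization error directly.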
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between c14da96 and 8f2ee17.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (5 hunks)
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py

[refactor] 397-397: Too many statements (51/50)

(R0915)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

424-426: ⚠️ Potential issue

Cleanup still missing on ApiException path
The earlier review (see legacy comment on lines 424-427) already pointed out that we leak Jobs/Pods when the status request itself fails.
The new code still hits continue without calling safe_delete_job, so the resource leak persists.

-                    self.report_job_error(j, msg=str(e))
-                    continue
+                    self.report_job_error(j, msg=str(e))
+                    # ensure cleanup even when the status call fails
+                    self._kubernetes_retry(
+                        lambda j=j: self.safe_delete_job(
+                            j.external_jobid, ignore_not_found=True
+                        )
+                    )
+                    continue
🧹 Nitpick comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

574-588: Consider wrapping pod deletion in _kubernetes_retry
safe_delete_job() is usually called via _kubernetes_retry, but nothing prevents future direct calls.
Wrapping the internal pod-list / delete sequence in its own retry would make safe_delete_job robust when used standalone.

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 8f2ee17 and 978fc99.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (6 hunks)
🪛 Pylint (3.3.7)
snakemake_executor_plugin_kubernetes/__init__.py

[refactor] 397-397: Too many statements (51/50)

(R0915)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (1)
snakemake_executor_plugin_kubernetes/__init__.py (1)

419-423: Good catch – late-binding bug removed
Capturing j inside the lambda (lambda j=j: …) prevents the classic late-binding pitfall when the loop continues. 👍

@johanneskoester johanneskoester merged commit 1b24e6f into main Jun 13, 2025
5 checks passed
@johanneskoester johanneskoester deleted the fix/delete-failed-jobs branch June 13, 2025 11:52
johanneskoester pushed a commit that referenced this pull request Jun 13, 2025
🤖 I have created a release *beep* *boop*
---


## [0.4.2](v0.4.1...v0.4.2) (2025-06-13)


### Bug Fixes

* delete k8s jobs after failure (important in setups where the system
attaches sidecar pods, which can otherwise remain after failure)
([#49](#49))
([1b24e6f](1b24e6f))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>