
Conversation

@mrburke00 (Collaborator) commented Mar 6, 2025

Hey @johanneskoester, I found a bug with my GPU code. It turns out that in many default Kubernetes cluster configurations there is a LimitRange or some other admission controller that requires both resource requests and resource limits when scaling to very large jobs; this ultimately prevents unbounded resource use. In some scenarios the admission controller rejects the pod at admission time, and in others the pod dies when it is auto-assigned a default limit that is insufficient. From what I can tell, these errors are fairly difficult to catch - sometimes the pods die silently, or it appears that the job never started. The other danger is that if the cluster doesn't have the required limit ranges, the configuration may be interpreted as permission to use unlimited resources, leaving you with an uncomfortable compute bill.

So what I added is a new resource type called scale. This resource lets us conditionally include resource limits, with those limits set equal to the resource requests.

  • If scale=True (the default), we omit the limits entirely. This is how the plugin currently operates, and it allows pods to scale up as needed.
  • If scale=False, we explicitly set a resource limit for each requested resource type.

Hopefully this logic gives enough control over larger/specialized workloads to prevent unintended behavior.
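To make the two modes concrete, here is a minimal sketch of the behavior described above. Note that `build_resources` and the plain-dict spec are illustrative stand-ins, not the plugin's actual API:

```python
# Illustrative sketch of the scale behavior; build_resources and the
# plain-dict spec layout are hypothetical, not the plugin's actual API.
def build_resources(requests, scale=True):
    """Build a Kubernetes-style resources spec from requested amounts.

    scale=True (default): omit limits entirely so pods can scale up.
    scale=False: pin limits to the requests, which satisfies admission
    controllers (e.g. a LimitRange) that require explicit limits.
    """
    spec = {"requests": dict(requests)}
    if not scale:
        # Mirror every requested resource as an explicit limit.
        spec["limits"] = dict(requests)
    return spec


# Strict cluster: limits are pinned to the requests.
strict = build_resources({"cpu": "2", "memory": "4G"}, scale=False)
# Default: no "limits" key at all, matching current plugin behavior.
default = build_resources({"cpu": "2", "memory": "4G"})
```

Setting the limits equal to the requests gives the pod a Guaranteed-style shape that strict admission policies accept, at the cost of bursting headroom.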

Summary by CodeRabbit

  • New Features
    • Improved resource allocation for Kubernetes container deployments. Resource limits for CPU, memory, ephemeral storage, and GPU are now applied only when scaling is not enabled, offering more flexibility in managing container workloads.

mrburke00 and others added 9 commits March 5, 2025 16:07
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai bot (Contributor) commented Mar 6, 2025

📝 Walkthrough

Walkthrough

The pull request introduces a new variable scale_value that retrieves the "scale" parameter from resources_dict, defaulting to 1. The logic for setting resource limits in the Kubernetes container configuration is modified so that limits (CPU, memory, ephemeral storage, GPU) are created only when scaling is not required (i.e., when scale_value is falsy). Additionally, comments are added to clarify this conditional behavior.

Changes

File Change Summary
  • snakemake_executor_plugin_kubernetes/__init__.py: Introduced scale_value from resources_dict and modified the conditional logic to set resource limits (CPU, memory, ephemeral storage, GPU) only when scaling is not required; added clarifying comments.
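One subtlety worth noting, sketched below under the assumption that the walkthrough's description is accurate (`limits_for` is an illustrative stand-in, not the plugin's code): because the default for "scale" is 1, which is truthy, `if not scale_value` leaves limits unset unless the user passes a falsy value such as False or 0.

```python
def limits_for(resources_dict):
    # The default of 1 is truthy, so limits stay empty unless the user
    # explicitly passes a falsy "scale" (False or 0).
    scale_value = resources_dict.get("scale", 1)
    limits = {}
    if not scale_value:
        # Copy each requested resource over as an explicit limit.
        for key in ("cpu", "memory"):
            if key in resources_dict:
                limits[key] = resources_dict[key]
    return limits


no_limits = limits_for({"cpu": "2", "memory": "4G"})
with_limits = limits_for({"cpu": "2", "memory": "4G", "scale": False})
```

This truthy default is why the review below suggests a boolean default would be clearer than a numeric one.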

Sequence Diagram(s)

sequenceDiagram
    participant ContainerConfig as Container Configuration Builder
    participant Resources as resources_dict
    participant Logic as Scaling Logic

    ContainerConfig->>Resources: Retrieve resources_dict
    Resources-->>ContainerConfig: Provide "scale" parameter (scale_value)
    ContainerConfig->>Logic: Check if scale_value is False
    alt scale_value is False
        Logic->>ContainerConfig: Set resource limits (CPU, memory, ephemeral storage, GPU)
    else scale_value is True
        Logic->>ContainerConfig: Skip setting resource limits
    end

Suggested Reviewers

  • johanneskoester

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e3b5ac and 0b7cae6.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)

310-314: Implementation of scale parameter for resource limit control

The addition of the scale_value parameter and conditional creation of resource limits is a good approach to address the issue with Kubernetes clusters that have strict admission policies or resource constraints.

However, there are a few improvements that could be made:

  1. A boolean default value (True/False) would be more appropriate than a numeric value (1) for the scale parameter since it's being used in a boolean context.
  2. The parameter name scale is somewhat ambiguous - a more descriptive name like skip_resource_limits or omit_resource_limits might be clearer.
-        scale_value = resources_dict.get("scale", 1)
+        # If skip_limits is True, resource limits will be omitted, allowing pods to scale
+        # If skip_limits is False, explicit resource limits will be set equal to requests
+        skip_limits = resources_dict.get("skip_limits", True)

-        # Only create container.resources.limits if scale is False
-        if not scale_value:
+        # Only create container.resources.limits if skip_limits is False
+        if not skip_limits:
             container.resources.limits = {}

333-334: Fix indentation in ephemeral storage limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
+            container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"

345-346: Fix indentation in NVIDIA GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["nvidia.com/gpu"] = gpu_count
+            container.resources.limits["nvidia.com/gpu"] = gpu_count

350-351: Fix indentation in AMD GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["amd.com/gpu"] = gpu_count
+            container.resources.limits["amd.com/gpu"] = gpu_count

357-358: Fix indentation in fallback GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["nvidia.com/gpu"] = gpu_count
+            container.resources.limits["nvidia.com/gpu"] = gpu_count

310-359: Consider adding documentation for the scale parameter

The implementation of the scale parameter is a significant feature that allows for more flexibility in resource allocation. However, there is limited documentation explaining its purpose and usage.

Consider adding more comprehensive documentation for this feature, perhaps in a docstring or in the class-level documentation. This would help users understand when and how to use this parameter.

Example documentation:

# Resource scaling
# The 'scale' parameter in resources_dict controls whether resource limits are set:
# - When scale=True (default): Resource limits are omitted, allowing pods to scale as needed
# - When scale=False: Explicit resource limits are set equal to resource requests
# This helps with clusters that have strict admission policies or resource constraints
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47f16ee and 6e3b5ac.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (3)
snakemake_executor_plugin_kubernetes/__init__.py (3)

321-322: Conditional CPU limits implementation looks good

The conditional application of CPU limits based on the scale parameter is implemented correctly.


327-328: Conditional memory limits implementation looks good

The conditional application of memory limits based on the scale parameter is implemented correctly.


340-340: Good comment clarification for GPU handling

This clarifying comment helps explain why the code keeps nvidia.com/gpu for both NVIDIA and AMD when the cluster doesn't differentiate between manufacturers.

@johanneskoester johanneskoester merged commit 25819c5 into snakemake:main Mar 6, 2025
6 checks passed
@johanneskoester (Contributor) commented:

Sorry, I forgot to mention this before merging. Could you please document this here in a follow-up PR?
https://github.com/snakemake/snakemake-executor-plugin-kubernetes/blob/main/docs/further.md

johanneskoester pushed a commit that referenced this pull request Mar 6, 2025
🤖 I have created a release *beep* *boop*
---


## [0.3.2](v0.3.1...v0.3.2) (2025-03-06)


### Bug Fixes

* Adding additional logic to handle resource limit requirements ([#38](#38)) ([25819c5](25819c5))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@mrburke00 (Collaborator, Author) commented:

Sure thing

mrburke00 added a commit that referenced this pull request Mar 14, 2025
Added documentation relevant to PR [#38](#38)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **New Features**
  - Introduced an optional configuration parameter to toggle between explicit resource limits and automatic scaling for resources such as GPUs, threads, and memory.

- **Documentation**
  - Updated guidance on resource scaling with detailed instructions.
  - Added a debugging tip to help resolve issues with large job scheduling.
  - Improved formatting for enhanced clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>