
Conversation

@mrburke00 (Collaborator) commented Mar 6, 2025

Hey @johanneskoester, I found a bug with my GPU code. It turns out that in many default Kubernetes cluster configurations there is a LimitRange or some other admission controller that requires both resource requests and resource limits when scaling to very large jobs; this ultimately prevents unbounded resource use. In some scenarios the admission controller rejects the pod at admission time, and in others the pod dies when it is auto-assigned a default limit that is insufficient. From what I can tell, these errors are fairly difficult to catch - sometimes the pods die silently, or it appears that the job never started. The other danger is that if the cluster doesn't have the required limit ranges, the configuration may be interpreted as permission to use unlimited resources, leaving you with an uncomfortable compute bill.

So what I added is a new resource type called scale. This resource lets us conditionally include resource limits, with those limits set equal to the resource requests.

  • If scale=True (the default), we omit the limits entirely. This is how the plugin currently operates, and it allows pods to scale up as needed.
  • If scale=False, we explicitly set a resource limit for each requested resource type.

Hopefully this logic gives enough control over larger/specialized workloads to prevent unintended behavior.
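To make the two modes concrete, here is a minimal sketch of the behavior described above. Note that `build_resources` and the plain-dict spec are illustrative stand-ins, not the plugin's actual API:

```python
# Illustrative sketch of the scale behavior; build_resources and the
# plain-dict spec layout are hypothetical, not the plugin's actual API.
def build_resources(requests, scale=True):
    """Build a Kubernetes-style resources spec from requested amounts.

    scale=True (default): omit limits entirely so pods can scale up.
    scale=False: pin limits to the requests, which satisfies admission
    controllers (e.g. a LimitRange) that require explicit limits.
    """
    spec = {"requests": dict(requests)}
    if not scale:
        # Mirror every requested resource as an explicit limit.
        spec["limits"] = dict(requests)
    return spec


# Strict cluster: limits are pinned to the requests.
strict = build_resources({"cpu": "2", "memory": "4G"}, scale=False)
# Default: no "limits" key at all, matching current plugin behavior.
default = build_resources({"cpu": "2", "memory": "4G"})
```

Setting the limits equal to the requests gives the pod a Guaranteed-style shape that strict admission policies accept, at the cost of bursting headroom.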

Summary by CodeRabbit

  • New Features
    • Improved resource allocation for Kubernetes container deployments. Resource limits for CPU, memory, ephemeral storage, and GPU are now applied only when scaling is not enabled, offering more flexibility in managing container workloads.

mrburke00 and others added 9 commits March 5, 2025 16:07
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@coderabbitai bot (Contributor) commented Mar 6, 2025

📝 Walkthrough

Walkthrough

The pull request introduces a new variable scale_value that retrieves the "scale" parameter from resources_dict, defaulting to 1. The logic for setting resource limits in the Kubernetes container configuration is modified so that limits (CPU, memory, ephemeral storage, GPU) are created only when scaling is not required (i.e., when scale_value is falsy). Additionally, comments are added to clarify this conditional behavior.

Changes

File Change Summary
  • snakemake_executor_plugin_kubernetes/__init__.py: Introduced scale_value from resources_dict and modified the conditional logic to set resource limits (CPU, memory, ephemeral storage, GPU) only when scaling is not required; added clarifying comments.
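One subtlety worth noting, sketched below under the assumption that the walkthrough's description is accurate (`limits_for` is an illustrative stand-in, not the plugin's code): because the default for "scale" is 1, which is truthy, `if not scale_value` leaves limits unset unless the user passes a falsy value such as False or 0.

```python
def limits_for(resources_dict):
    # The default of 1 is truthy, so limits stay empty unless the user
    # explicitly passes a falsy "scale" (False or 0).
    scale_value = resources_dict.get("scale", 1)
    limits = {}
    if not scale_value:
        # Copy each requested resource over as an explicit limit.
        for key in ("cpu", "memory"):
            if key in resources_dict:
                limits[key] = resources_dict[key]
    return limits


no_limits = limits_for({"cpu": "2", "memory": "4G"})
with_limits = limits_for({"cpu": "2", "memory": "4G", "scale": False})
```

This truthy default is why the review below suggests a boolean default would be clearer than a numeric one.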

Sequence Diagram(s)

sequenceDiagram
    participant ContainerConfig as Container Configuration Builder
    participant Resources as resources_dict
    participant Logic as Scaling Logic

    ContainerConfig->>Resources: Retrieve resources_dict
    Resources-->>ContainerConfig: Provide "scale" parameter (scale_value)
    ContainerConfig->>Logic: Check if scale_value is False
    alt scale_value is False
        Logic->>ContainerConfig: Set resource limits (CPU, memory, ephemeral storage, GPU)
    else scale_value is True
        Logic->>ContainerConfig: Skip setting resource limits
    end

Suggested Reviewers

  • johanneskoester

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e3b5ac and 0b7cae6.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing


@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)

310-314: Implementation of scale parameter for resource limit control

The addition of the scale_value parameter and conditional creation of resource limits is a good approach to address the issue with Kubernetes clusters that have strict admission policies or resource constraints.

However, there are a few improvements that could be made:

  1. A boolean default value (True/False) would be more appropriate than a numeric value (1) for the scale parameter since it's being used in a boolean context.
  2. The parameter name scale is somewhat ambiguous - a more descriptive name like skip_resource_limits or omit_resource_limits might be clearer.
-        scale_value = resources_dict.get("scale", 1)
+        # If skip_limits is True, resource limits will be omitted, allowing pods to scale
+        # If skip_limits is False, explicit resource limits will be set equal to requests
+        skip_limits = resources_dict.get("skip_limits", True)

-        # Only create container.resources.limits if scale is False
-        if not scale_value:
+        # Only create container.resources.limits if skip_limits is False
+        if not skip_limits:
             container.resources.limits = {}

333-334: Fix indentation in ephemeral storage limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
+            container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"

345-346: Fix indentation in NVIDIA GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["nvidia.com/gpu"] = gpu_count
+            container.resources.limits["nvidia.com/gpu"] = gpu_count

350-351: Fix indentation in AMD GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["amd.com/gpu"] = gpu_count
+            container.resources.limits["amd.com/gpu"] = gpu_count

357-358: Fix indentation in fallback GPU limits block

The indentation in this block is inconsistent with the rest of the code - there's an extra space before "container.resources.limits".

         if not scale_value:
-                 container.resources.limits["nvidia.com/gpu"] = gpu_count
+            container.resources.limits["nvidia.com/gpu"] = gpu_count

310-359: Consider adding documentation for the scale parameter

The implementation of the scale parameter is a significant feature that allows for more flexibility in resource allocation. However, there is limited documentation explaining its purpose and usage.

Consider adding more comprehensive documentation for this feature, perhaps in a docstring or in the class-level documentation. This would help users understand when and how to use this parameter.

Example documentation:

# Resource scaling
# The 'scale' parameter in resources_dict controls whether resource limits are set:
# - When scale=True (default): Resource limits are omitted, allowing pods to scale as needed
# - When scale=False: Explicit resource limits are set equal to resource requests
# This helps with clusters that have strict admission policies or resource constraints
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47f16ee and 6e3b5ac.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_kubernetes/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

  • snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (3)
snakemake_executor_plugin_kubernetes/__init__.py (3)

321-322: Conditional CPU limits implementation looks good

The conditional application of CPU limits based on the scale parameter is implemented correctly.


327-328: Conditional memory limits implementation looks good

The conditional application of memory limits based on the scale parameter is implemented correctly.


340-340: Good comment clarification for GPU handling

This clarifying comment helps explain why the code keeps nvidia.com/gpu for both NVIDIA and AMD when the cluster doesn't differentiate between manufacturers.

@johanneskoester johanneskoester merged commit 25819c5 into snakemake:main Mar 6, 2025
6 checks passed
@johanneskoester (Contributor) commented:

Sorry, I forgot to mention this before merging. Could you please document this here in a follow-up PR?
https://github.com/snakemake/snakemake-executor-plugin-kubernetes/blob/main/docs/further.md

johanneskoester pushed a commit that referenced this pull request Mar 6, 2025
🤖 I have created a release *beep* *boop*
---


## [0.3.2](v0.3.1...v0.3.2) (2025-03-06)


### Bug Fixes

* Adding additional logic to handle resource limit requirements ([#38](#38)) ([25819c5](25819c5))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@mrburke00 (Collaborator, Author) commented:

Sure thing

mrburke00 added a commit that referenced this pull request Mar 14, 2025
Added documentation relevant to PR [#38](#38)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- **New Features**
  - Introduced an optional configuration parameter to toggle between explicit resource limits and automatic scaling for resources such as GPUs, threads, and memory.

- **Documentation**
  - Updated guidance on resource scaling with detailed instructions.
  - Added a debugging tip to help resolve issues with large job scheduling.
  - Improved formatting for enhanced clarity.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>