fix: Adding additional logic to handle resource limit requirements #38
Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
📝 Walkthrough

The pull request introduces a new variable, `scale_value`, read from `resources_dict` via the `scale` resource. When `scale` is False, explicit resource limits (CPU, memory, ephemeral storage, GPU) are set on the container; when it is True (the default), the limits are omitted so pods can scale as needed.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant ContainerConfig as Container Configuration Builder
    participant Resources as resources_dict
    participant Logic as Scaling Logic
    ContainerConfig->>Resources: Retrieve resources_dict
    Resources-->>ContainerConfig: Provide "scale" parameter (scale_value)
    ContainerConfig->>Logic: Check if scale_value is False
    alt scale_value is False
        Logic->>ContainerConfig: Set resource limits (CPU, memory, ephemeral storage, GPU)
    else scale_value is True
        Logic->>ContainerConfig: Skip setting resource limits
    end
```
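To make the flow above concrete, here is a minimal, hypothetical sketch of the executor-side logic. It assumes the official kubernetes Python client; the helper name, resource keys, and value formats are illustrative, not the plugin's exact code.

```python
# Minimal sketch of the conditional limit handling, assuming the kubernetes
# Python client. Names ("apply_resources", "_cores", "mem_mb") and value
# formats are assumptions, not taken verbatim from the plugin.
import kubernetes.client as k8s


def apply_resources(container: k8s.V1Container, resources_dict: dict):
    cores = resources_dict.get("_cores", 1)
    mem_mb = resources_dict.get("mem_mb", 1000)
    scale_value = resources_dict.get("scale", True)

    # Requests are always set.
    container.resources = k8s.V1ResourceRequirements(requests={})
    container.resources.requests["cpu"] = f"{int(cores * 1000)}m"
    container.resources.requests["memory"] = f"{mem_mb}M"

    # Limits are only attached when scaling is disabled; they mirror the
    # requests so admission policies requiring both are satisfied.
    if not scale_value:
        container.resources.limits = {
            "cpu": container.resources.requests["cpu"],
            "memory": container.resources.requests["memory"],
        }
```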
Actionable comments posted: 0
🧹 Nitpick comments (6)
snakemake_executor_plugin_kubernetes/__init__.py (6)
310-314: Implementation of scale parameter for resource limit control

The addition of the `scale_value` parameter and conditional creation of resource limits is a good approach to address the issue with Kubernetes clusters that have strict admission policies or resource constraints. However, there are a few improvements that could be made:

- A boolean default value (True/False) would be more appropriate than a numeric value (1) for the `scale` parameter, since it's being used in a boolean context.
- The parameter name `scale` is somewhat ambiguous; a more descriptive name like `skip_resource_limits` or `omit_resource_limits` might be clearer.

```diff
-    scale_value = resources_dict.get("scale", 1)
+    # If skip_limits is True, resource limits will be omitted, allowing pods to scale
+    # If skip_limits is False, explicit resource limits will be set equal to requests
+    skip_limits = resources_dict.get("skip_limits", True)

-    # Only create container.resources.limits if scale is False
-    if not scale_value:
+    # Only create container.resources.limits if skip_limits is False
+    if not skip_limits:
         container.resources.limits = {}
```
333-334: Fix indentation in ephemeral storage limits block

The indentation in this block is inconsistent with the rest of the code; there's an extra space before "container.resources.limits".

```diff
 if not scale_value:
-     container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
+    container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
```
345-346: Fix indentation in NVIDIA GPU limits block

The indentation in this block is inconsistent with the rest of the code; there's an extra space before "container.resources.limits".

```diff
 if not scale_value:
-     container.resources.limits["nvidia.com/gpu"] = gpu_count
+    container.resources.limits["nvidia.com/gpu"] = gpu_count
```
350-351: Fix indentation in AMD GPU limits block

The indentation in this block is inconsistent with the rest of the code; there's an extra space before "container.resources.limits".

```diff
 if not scale_value:
-     container.resources.limits["amd.com/gpu"] = gpu_count
+    container.resources.limits["amd.com/gpu"] = gpu_count
```
357-358: Fix indentation in fallback GPU limits block

The indentation in this block is inconsistent with the rest of the code; there's an extra space before "container.resources.limits".

```diff
 if not scale_value:
-     container.resources.limits["nvidia.com/gpu"] = gpu_count
+    container.resources.limits["nvidia.com/gpu"] = gpu_count
```
310-359: Consider adding documentation for the scale parameter

The implementation of the `scale` parameter is a significant feature that allows for more flexibility in resource allocation. However, there is limited documentation explaining its purpose and usage. Consider adding more comprehensive documentation for this feature, perhaps in a docstring or in the class-level documentation. This would help users understand when and how to use this parameter.

Example documentation:

```python
# Resource scaling
# The 'scale' parameter in resources_dict controls whether resource limits are set:
# - When scale=True (default): Resource limits are omitted, allowing pods to scale as needed
# - When scale=False: Explicit resource limits are set equal to resource requests
# This helps with clusters that have strict admission policies or resource constraints
```
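For illustration, workflow-side usage could look like the following hypothetical Snakefile rule; the rule name, command, and resource values are invented, and only the `scale` resource relates to this PR.

```python
# Hypothetical Snakefile rule; everything except the `scale` resource is an
# ordinary, made-up Snakemake rule definition.
rule big_alignment:
    output:
        "results/aligned.bam"
    resources:
        mem_mb=32000,
        disk_mb=20000,
        scale=False  # pin limits to the requests for strict admission policies
    shell:
        "run_alignment --out {output}"
```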
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (3)
snakemake_executor_plugin_kubernetes/__init__.py (3)
321-322: Conditional CPU limits implementation looks good
The conditional application of CPU limits based on the scale parameter is implemented correctly.

327-328: Conditional memory limits implementation looks good
The conditional application of memory limits based on the scale parameter is implemented correctly.

340-340: Good comment clarification for GPU handling
This clarifying comment helps explain why the code keeps nvidia.com/gpu for both NVIDIA and AMD when the cluster doesn't differentiate between manufacturers.
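As a rough, hypothetical illustration of that fallback (the function name and manufacturer string are assumptions, not the plugin's actual interface):

```python
# Hypothetical helper choosing the Kubernetes GPU resource key; the real
# plugin code may differ.
def gpu_resource_key(manufacturer: str) -> str:
    if manufacturer.lower() == "amd":
        return "amd.com/gpu"
    # Default to nvidia.com/gpu, which is also the key used when the cluster
    # does not differentiate between GPU manufacturers.
    return "nvidia.com/gpu"


print(gpu_resource_key("nvidia"))  # nvidia.com/gpu
print(gpu_resource_key("amd"))     # amd.com/gpu
```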
Sorry, I forgot to mention this before merging. Could you please document this here in a follow-up PR?
🤖 I have created a release *beep* *boop*

---

## [0.3.2](v0.3.1...v0.3.2) (2025-03-06)

### Bug Fixes

* Adding additional logic to handle resource limit requirements ([#38](#38)) ([25819c5](25819c5))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sure thing
Added documentation relevant to PR [#38](#38)

## Summary by CodeRabbit

- **New Features**
  - Introduced an optional configuration parameter to toggle between explicit resource limits and automatic scaling for resources such as GPUs, threads, and memory.
- **Documentation**
  - Updated guidance on resource scaling with detailed instructions.
  - Added a debugging tip to help resolve issues with large job scheduling.
  - Improved formatting for enhanced clarity.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Hey @johanneskoester I found a bug with my GPU code. It turns out that in many default configurations for Kubernetes clusters there is a limit range or some other admission controller requiring both resource requests and resource limits when scaling to very large jobs. This ultimately prevents jobs from unbounded resource use. In some scenarios the admission controller will reject the pod at admission time, and in others the pod dies when an insufficient default limit is auto-assigned. From what I can tell it's actually fairly difficult to catch these errors - sometimes the pods die silently or it appears that the job never started. The other danger is that if the cluster doesn't have required limit ranges, the configuration may interpret this as permission to use infinite resources - leaving you with an uncomfortable compute bill.
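For anyone debugging this, one way to see whether a namespace enforces such defaults is to list its LimitRange objects with the official kubernetes Python client; a hypothetical snippet (the namespace name is made up):

```python
# Hypothetical debugging snippet: list LimitRange objects in a namespace to
# see which default limits an admission controller would inject.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
v1 = client.CoreV1Api()

for lr in v1.list_namespaced_limit_range("snakemake-jobs").items:
    for item in lr.spec.limits:
        print(lr.metadata.name, item.type,
              "default:", item.default,
              "defaultRequest:", item.default_request)
```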
So what I added is a new resource type called `scale`. This variable allows us to conditionally include resource limits, those limits being equal to the resource requests:

- With `scale=True` (the default), we omit the limits entirely. This is how the plugin currently operates and will allow the pods to scale up as needed.
- With `scale=False`, we explicitly set the resource limits for each requested resource type.

Hopefully this logic gives enough control to handle larger/specialized workloads and to prevent unintended behavior.
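To spell out the difference, the effective container resources in the two modes would look roughly like this (values and string formats are invented for illustration):

```python
# Rough illustration only; the keys follow the Kubernetes resources schema,
# but the concrete values/formats are assumptions.

# scale=True (default): requests only, no limits, so the pod may use more
# than it requested if the cluster allows it.
scale_true = {
    "requests": {"cpu": "4000m", "memory": "16000M"},
}

# scale=False: limits pinned to the requests, which satisfies LimitRange-style
# admission policies that require both requests and limits.
scale_false = {
    "requests": {"cpu": "4000m", "memory": "16000M"},
    "limits": {"cpu": "4000m", "memory": "16000M"},
}
```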