fix: Adding additional logic to handle resource limit requirements #37
Conversation
📝 Walkthrough
This change introduces a new scaling logic in the Kubernetes executor's resource assignment: a `scale` resource value controls whether resource limits are set alongside resource requests.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Executor
    participant Resources as Resource Config
    participant ResourceAssignment
    Executor->>Resources: Retrieve scale_value (default = 1)
    alt scale_value is False (or equivalent)
        Executor->>ResourceAssignment: Apply limits for CPU, memory, ephemeral storage, GPU
    else scale_value is True (or 1)
        Executor->>ResourceAssignment: Skip setting resource limits
    end
```
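In plain Python, the decision the diagram describes amounts to the following minimal sketch (names follow the review excerpts below; this is an illustration, not the merged code):

```python
import kubernetes.client

def build_resources(resources_dict):
    # Illustration only: mirrors the flow in the sequence diagram above.
    scale_value = resources_dict.get("scale", 1)  # default 1/True: no limits
    resources = kubernetes.client.V1ResourceRequirements(requests={})

    cores = resources_dict.get("_cores", 1)
    resources.requests["cpu"] = f"{int(cores * 1000)}m"
    if "mem_mb" in resources_dict:
        resources.requests["memory"] = f"{resources_dict['mem_mb']}M"

    if not scale_value:
        # scale is False: cap the pod by mirroring every request as a limit
        resources.limits = dict(resources.requests)
    return resources
```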
Actionable comments posted: 2
🧹 Nitpick comments (5)
snakemake_executor_plugin_kubernetes/__init__.py (5)
310-317: Use a boolean default value for clarity

The variable `scale_value` is using `1` as a default, which acts as `True` in boolean contexts. For better readability and consistency with the variable's purpose, consider using `True` as the default value.

```diff
 # NEW SCALE LOGIC: Default is True - do not set any resource limits
-scale_value = resources_dict.get("scale", 1)
+scale_value = resources_dict.get("scale", True)
 container.resources = kubernetes.client.V1ResourceRequirements()
 container.resources.requests = {}
 # Only create container.resources.limits if scale is False
 if not scale_value:
     container.resources.limits = {}
```

Additionally, the initialization of `container.resources` and `container.resources.requests` is repeated here. You already have these lines at 307-308, so you can remove the duplicated initialization.
342-343: Clarify commented guidance about GPU resource naming

The comment about keeping nvidia.com/gpu for both NVIDIA and AMD could be confusing. It suggests using nvidia.com/gpu for AMD GPUs, which contradicts the subsequent code that uses amd.com/gpu for AMD GPUs.

```diff
-# But let's keep nvidia.com/gpu for both if the cluster doesn't differentiate.
+# Some clusters might not differentiate between manufacturers and only recognize nvidia.com/gpu.
 # If your AMD plugin uses a different name, update accordingly:
```
310-361: Consider refactoring resource limit logic for better maintainability

The same pattern of checking `if not scale_value:` is repeated for each resource type (CPU, memory, ephemeral storage, and different GPU types). This makes the code more verbose and harder to maintain. Consider refactoring this with a helper function. Here's an example:

```diff
 # NEW SCALE LOGIC: Default is True - do not set any resource limits
-scale_value = resources_dict.get("scale", 1)
+scale_value = resources_dict.get("scale", True)
 container.resources = kubernetes.client.V1ResourceRequirements()
 container.resources.requests = {}
 # Only create container.resources.limits if scale is False
 if not scale_value:
     container.resources.limits = {}

+def set_resource(resource_type, value):
+    """Helper function to set resource requests and conditionally set limits."""
+    container.resources.requests[resource_type] = value
+    if not scale_value:
+        container.resources.limits[resource_type] = value

 # CPU and memory requests
 cores = resources_dict.get("_cores", 1)
-container.resources.requests["cpu"] = "{}m".format(
-    int(cores * self.k8s_cpu_scalar * 1000)
-)
-if not scale_value:
-    container.resources.limits["cpu"] = "{}m".format(int(cores * 1000))
+cpu_request = "{}m".format(int(cores * self.k8s_cpu_scalar * 1000))
+cpu_limit = "{}m".format(int(cores * 1000))
+container.resources.requests["cpu"] = cpu_request
+if not scale_value:
+    container.resources.limits["cpu"] = cpu_limit

 if "mem_mb" in resources_dict:
     mem_mb = resources_dict["mem_mb"]
-    container.resources.requests["memory"] = "{}M".format(mem_mb)
-    if not scale_value:
-        container.resources.limits["memory"] = "{}M".format(mem_mb)
+    memory_value = "{}M".format(mem_mb)
+    set_resource("memory", memory_value)

 # Disk
 if "disk_mb" in resources_dict:
     disk_mb = int(resources_dict.get("disk_mb", 1024))
-    container.resources.requests["ephemeral-storage"] = f"{disk_mb}M"
-    if not scale_value:
-        container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
+    storage_value = f"{disk_mb}M"
+    set_resource("ephemeral-storage", storage_value)

 # Request GPU resources if specified
 if "gpu" in resources_dict:
     gpu_count = str(resources_dict["gpu"])
     # For nvidia, K8s expects nvidia.com/gpu; for amd, we use amd.com/gpu.
     # Some clusters might not differentiate between manufacturers and only recognize nvidia.com/gpu.
     # If your AMD plugin uses a different name, update accordingly:
     manufacturer = resources_dict.get("gpu_manufacturer", "").lower()
     if manufacturer == "nvidia":
-        container.resources.requests["nvidia.com/gpu"] = gpu_count
-        if not scale_value:
-            container.resources.limits["nvidia.com/gpu"] = gpu_count
+        set_resource("nvidia.com/gpu", gpu_count)
         self.logger.debug(f"Requested NVIDIA GPU resources: {gpu_count}")
     elif manufacturer == "amd":
-        container.resources.requests["amd.com/gpu"] = gpu_count
-        if not scale_value:
-            container.resources.limits["amd.com/gpu"] = gpu_count
+        set_resource("amd.com/gpu", gpu_count)
         self.logger.debug(f"Requested AMD GPU resources: {gpu_count}")
     else:
         # fallback if we never see a recognized manufacturer
         # (the code above raises an error first, so we might never get here)
-        container.resources.requests["nvidia.com/gpu"] = gpu_count
-        if not scale_value:
-            container.resources.limits["nvidia.com/gpu"] = gpu_count
+        set_resource("nvidia.com/gpu", gpu_count)
```

Note: This is a substantial change that would require careful testing. The current implementation is functional, but this refactoring would make it more maintainable.
310-317: Enhance documentation for the scale feature

The current comment explains what the scale feature does (not setting resource limits) but doesn't provide context on why this might be useful or when to use it.

```diff
-# NEW SCALE LOGIC: Default is True - do not set any resource limits
+# SCALE LOGIC: Controls resource limit behavior
+# When scale=True (default): Resource limits are omitted, allowing pods to potentially use more resources than requested.
+# When scale=False: Resource limits are set equal to resource requests, ensuring pods don't exceed their requested resources.
+# This helps when working with clusters that have LimitRange restrictions or when precise resource control is needed.
 scale_value = resources_dict.get("scale", 1)
 container.resources = kubernetes.client.V1ResourceRequirements()
 container.resources.requests = {}
 # Only create container.resources.limits if scale is False
 if not scale_value:
     container.resources.limits = {}
```
337-337: Fix indentation for consistency

There's inconsistent indentation after the conditional block for ephemeral storage limits.

```diff
     if not scale_value:
         container.resources.limits["ephemeral-storage"] = f"{disk_mb}M"
-
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
snakemake_executor_plugin_kubernetes/__init__.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
`**/*.py`: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the `self` argument of methods.
Do not suggest type annotation of the `cls` argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a `return` statement.
snakemake_executor_plugin_kubernetes/__init__.py
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Hey @johanneskoester I found a bug with my GPU code. It turns out that in many default Kubernetes cluster configurations there is a LimitRange or some other admission controller requiring both resource requests and resource limits when scaling to very large jobs; this ultimately prevents jobs from unbounded resource use. In some scenarios the admission controller rejects the pod at admission time, and in others the pod dies when the cluster auto-assigns a default limit that is insufficient. From what I can tell, it's actually fairly difficult to catch these errors - sometimes the pods die silently, or it appears that the job never started. The other danger is that if the cluster doesn't have the required limit ranges, the configuration may be interpreted as permission to use infinite resources - leaving you with an uncomfortable compute bill.
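For context, the two pod resource shapes described above look roughly like this with the official `kubernetes` Python client (values are invented for illustration):

```python
import kubernetes.client

# Requests only (current behavior): the pod can burst above its requests, but an
# admission controller or LimitRange that demands explicit limits may reject it
# or inject an insufficient default limit.
requests_only = kubernetes.client.V1ResourceRequirements(
    requests={"cpu": "2000m", "memory": "4096M"}
)

# Requests with matching limits: satisfies clusters that require both, and caps
# the pod at exactly what it asked for.
bounded = kubernetes.client.V1ResourceRequirements(
    requests={"cpu": "2000m", "memory": "4096M"},
    limits={"cpu": "2000m", "memory": "4096M"},
)
```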
So what I added is a new resource type called `scale`. This variable allows us to conditionally include resource limits, with those limits set equal to the resource requests. When `scale=True` (the default), we omit the limits entirely; this is how the plugin currently operates and allows the pods to scale up as needed. When `scale=False`, we explicitly set the resource limits for each requested resource type. Hopefully this logic gives enough control to handle larger/specialized workloads and prevent unintended behavior.
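For example, a rule could opt into explicit limits like this (hypothetical Snakefile snippet; the rule name, shell command, and values are made up):

```
rule big_gpu_job:
    output:
        "results/big.out"
    resources:
        mem_mb=64000,
        gpu=1,
        gpu_manufacturer="nvidia",
        scale=False,  # set limits equal to requests for this job
    shell:
        "run_big_job > {output}"
```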
Let me know what you think.