WIP: initial proposals of supporting in-place cpu resize #81
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,231 @@ | ||
| --- | ||
| title: sandbox inplace cpu resize | ||
| authors: | ||
| - "@sivanzcw" | ||
| reviewers: | ||
| - "@furykerry" | ||
| creation-date: 2026-01-13 | ||
| last-updated: 2026-01-13 | ||
| status: implementable | ||
| see-also: | ||
| replaces: | ||
| superseded-by: | ||
| --- | ||
|
|
||
| # Sandbox Inplace CPU Resize | ||
|
|
||
| ## Table of Contents | ||
|
|
||
| - [Sandbox Inplace CPU Resize](#sandbox-inplace-cpu-resize) | |
| - [Table of Contents](#table-of-contents) | |
| - [Summary](#summary) | |
| - [Motivation](#motivation) | |
| - [Problem Statement](#problem-statement) | |
| - [Goals](#goals) | |
| - [Non-Goals/Future Work](#non-goalsfuture-work) | |
| - [Proposal](#proposal) | |
| - [API Changes](#api-changes) | |
| - [E2B API Metadata Extension](#e2b-api-metadata-extension) | |
| - [Design Details](#design-details) | |
| - [Metadata-based CPU Scale Factor](#metadata-based-cpu-scale-factor) | |
| - [Sandbox Manager Resize Logic](#sandbox-manager-resize-logic) | |
| - [Early Return on Resize Feasibility](#early-return-on-resize-feasibility) | |
| - [Flow Diagram](#flow-diagram) | |
| - [User Stories](#user-stories) | |
| - [Cost-Optimized Warm Pool](#cost-optimized-warm-pool) | |
| - [Task-Based Resource Allocation](#task-based-resource-allocation) | |
| - [Immediate Sandbox Availability](#immediate-sandbox-availability) | |
| - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints) | |
| - [Risks and Mitigations](#risks-and-mitigations) | |
| - [Alternatives](#alternatives) | |
| - [Upgrade Strategy](#upgrade-strategy) | |
| - [Additional Details](#additional-details) | |
| - [Test Plan [optional]](#test-plan-optional) | |
| - [Implementation History](#implementation-history) | |
|
|
||
| ## Summary | ||
|
|
||
| This enhancement proposes enabling in-place CPU resizing for sandboxes | ||
| allocated from the warm pool through a metadata-based approach. | ||
| When a sandbox is claimed via the E2B API, users can specify a CPU scale factor in the metadata | |
| (e.g., `e2b.agents.kruise.io/cpu-scale-factor: "2"`). | |
| The sandbox manager then resizes the allocated sandbox's CPU resources | |
| in place through the Kubernetes pod `resize` subresource, so the warm pool can keep | |
| minimal resource configurations while claimed sandboxes scale CPU on demand. | |
|
|
||
| **Key Benefits**: | ||
| - **Cost Optimization**: Maintain warm pools with minimal CPU resources, | ||
| scaling up only when sandboxes are actually claimed | ||
| - **Zero Downtime**: In-place CPU resizing without pod restart or recreation | ||
|
|
||
| ## Motivation | ||
|
|
||
| ### Problem Statement | ||
|
|
||
| Currently, the warm pool management strategy requires maintaining sandboxes | ||
| with sufficient resources to handle peak workloads. This leads to: | ||
|
|
||
| 1. **High Resource Costs**: Warm pools must be provisioned with resources sufficient | ||
| for the maximum expected workload, even though most sandboxes may not need peak resources immediately | ||
| 2. **Inefficient Resource Utilization**: Sandboxes sit idle in the warm pool consuming resources | ||
| that may never be fully utilized | ||
| 3. **Limited Flexibility**: Once a sandbox is allocated, its resources cannot be adjusted without recreation, | ||
| which causes downtime | ||
|
|
||
| ### Goals | ||
|
|
||
| 1. **Enable Metadata-Based CPU Scaling**: Allow users to specify CPU scale factor | ||
| via E2B API metadata when creating sandboxes | ||
| 2. **In-Place Resize**: Use the Kubernetes pod `resize` subresource to resize CPU without a pod restart | |
| 3. **Early Return Support**: Optionally return the sandbox immediately | |
| once resize feasibility is confirmed | |
|
|
||
| ### Non-Goals/Future Work | ||
|
|
||
| 1. **Automatic Scaling**: This does not implement automatic CPU scaling based on workload metrics | ||
| 2. **Resize Policy Configuration**: Users cannot configure resize policies | |
| (the CPU resize policy always uses `restartPolicy: NotRequired`) | |
|
|
||
| ## Proposal | ||
|
|
||
| ### API Changes | ||
|
|
||
| #### E2B API Metadata Extension | ||
|
|
||
| The existing E2B `CreateSandbox` API already accepts a `metadata` field. | ||
| This enhancement adds support for a new metadata key: | ||
|
|
||
| ```yaml | ||
| metadata: | ||
| e2b.agents.kruise.io/cpu-scale-factor: "2" # String representation of a positive number | ||
| ``` | ||
|
|
||
| **Metadata Key**: `e2b.agents.kruise.io/cpu-scale-factor` | ||
| - **Type**: String (must be parseable as a positive float64) | ||
| - **Validation**: Must be > 0, typically in range [1, 10] for practical use | ||
| - **Default**: If not specified, no resize is performed (backward compatible) | ||
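
The validation rules above can be sketched as follows. This is a minimal illustration, not the sandbox manager's actual code: the function name `parse_cpu_scale_factor` is hypothetical, and rejecting `nan` and `inf` is an assumption that goes slightly beyond "parseable as a positive float64".

```python
import math

CPU_SCALE_FACTOR_KEY = "e2b.agents.kruise.io/cpu-scale-factor"


def parse_cpu_scale_factor(metadata):
    """Return the scale factor as a float, or None when the key is absent."""
    raw = metadata.get(CPU_SCALE_FACTOR_KEY)
    if raw is None:
        # Key not set: no resize is performed (backward compatible).
        return None
    try:
        factor = float(raw)
    except (TypeError, ValueError):
        raise ValueError(f"{CPU_SCALE_FACTOR_KEY} must be a number, got {raw!r}")
    # Assumption: non-finite values are rejected along with zero and negatives.
    if not math.isfinite(factor) or factor <= 0:
        raise ValueError(f"{CPU_SCALE_FACTOR_KEY} must be a finite number > 0, got {raw!r}")
    return factor
```

Returning `None` for a missing key keeps the "no metadata, no resize" path separate from the error path, so existing callers are unaffected.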
|
|
||
| ### Design Details | ||
|
|
||
| #### Metadata-based CPU Scale Factor | ||
|
|
||
| When a sandbox is claimed via the `CreateSandbox` API: | |
|
|
||
| 1. **Metadata Parsing**: Sandbox manager checks for `e2b.agents.kruise.io/cpu-scale-factor` in the request metadata | ||
| 2. **CPU Calculation**: If present, calculate target CPU as `originalCPU * scaleFactor` | ||
| 3. **Validation**: Validate that the target CPU is within acceptable bounds (respects pod limits, resource quotas, etc.) | ||
| 4. **Resize Trigger**: If validation passes, trigger pod resize via Kubernetes `/resize` subResource | ||
|
|
||
| **Example Flow**: | ||
| ``` | ||
| Original Sandbox CPU: 1 core | ||
| Metadata: e2b.agents.kruise.io/cpu-scale-factor: "2" | ||
| Target CPU: 1 * 2 = 2 cores | ||
| Action: Resize pod from 1 core to 2 cores | ||
| ``` | ||
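
The arithmetic in the example flow can be sketched in millicores, which avoids fractional cores. The helper names are hypothetical, only the plain decimal and `m`-suffixed CPU notations are handled, and rounding up to the next millicore is an assumption rather than something the proposal specifies.

```python
import math


def cpu_to_millicores(quantity):
    """Convert a CPU quantity such as "500m", "1", or "0.5" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    return round(float(q) * 1000)


def scale_cpu(quantity, factor):
    """Scale a CPU quantity and return it in millicore notation."""
    scaled = cpu_to_millicores(quantity) * factor
    # round() first to absorb float noise, then round up (assumed policy).
    return f"{math.ceil(round(scaled, 6))}m"
```

With these helpers, `scale_cpu("1", 2.0)` yields `"2000m"`, matching the 1 core to 2 cores example above.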
|
|
||
| #### Sandbox Manager Resize Logic | ||
|
|
||
| The resize logic is implemented in the sandbox manager's `ClaimSandbox` flow: | ||
|
|
||
| 1. **After Sandbox Claim**: Start once a sandbox has been successfully claimed from the pool | |
| 2. **Metadata Check**: Check whether the `e2b.agents.kruise.io/cpu-scale-factor` metadata key exists | |
| 3. **Current CPU Detection**: Read the current CPU from the pod spec or status | |
| 4. **Target Calculation**: Calculate target CPU = current CPU * scaleFactor | |
|
Member
Please elaborate on the calculation for multi-container pods and Burstable QoS.
|
||
| 5. **Resize Execution**: Call Kubernetes pod `/resize` subresource | |
| 6. **Status Monitoring**: Monitor pod conditions for resize progress | |
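
One possible shape for the target calculation and the resize request body is sketched below. It assumes, without the proposal having decided this, that the scale factor is applied to the CPU requests and limits of every container that declares them, which keeps requests and limits in the same ratio and so leaves the pod's QoS class unchanged. `build_resize_patch` and `_scale` are hypothetical names, and the pod is represented as a plain dictionary.

```python
import math


def _scale(quantity, factor):
    """Scale a CPU quantity ("500m" or "0.5") and return millicore notation."""
    q = quantity.strip()
    milli = int(q[:-1]) if q.endswith("m") else round(float(q) * 1000)
    return f"{math.ceil(round(milli * factor, 6))}m"


def build_resize_patch(pod, factor):
    """Build a patch body that scales CPU requests and limits per container."""
    containers = []
    for c in pod["spec"]["containers"]:
        resources = c.get("resources", {})
        scaled = {}
        for kind in ("requests", "limits"):
            cpu = resources.get(kind, {}).get("cpu")
            if cpu is not None:
                scaled[kind] = {"cpu": _scale(cpu, factor)}
        if scaled:
            # Containers without any CPU setting are left out of the patch.
            containers.append({"name": c["name"], "resources": scaled})
    return {"spec": {"containers": containers}}
```

The resulting dictionary is the kind of body that would be sent to the pod's `/resize` subresource as a patch; memory and other resources are deliberately not touched.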
|
|
||
| #### Early Return on Resize Feasibility | ||
|
|
||
| **Optional Feature**: Once the system confirms that the resize is feasible | |
| (the `PodResizeInProgress` condition is set), the sandbox can be returned to the user immediately, | |
| even if the resize is still in progress. | |
|
|
||
| **Condition Check**: | ||
| - Monitor for the `PodResizeInProgress` condition in the pod status | |
| - Once the condition is `True`, the kubelet has confirmed that the resize is feasible | |
| - The condition indicates that: | |
|   - The kubelet has accepted the resize request | |
|   - The resource allocation has been updated | |
|   - The resize is being actuated (it may still be in progress) | |
| - Return the sandbox to the user with a status indicating that the resize is in progress | |
| - The user can start using the sandbox while the CPU resize completes asynchronously | |
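
The condition check can be sketched as a small classifier over the pod's status conditions. It uses the condition types that upstream Kubernetes defines for in-place resize, `PodResizePending` (with reason `Infeasible` or `Deferred`) and `PodResizeInProgress`. The function name and the returned state strings are hypothetical, and the pod is again a plain dictionary.

```python
def resize_state(pod):
    """Classify resize progress from the pod's status conditions."""
    active = {
        c["type"]: c
        for c in pod.get("status", {}).get("conditions", [])
        if c.get("status") == "True"
    }
    pending = active.get("PodResizePending")
    if pending is not None:
        # "Infeasible": the node can never satisfy the request.
        # Any other reason (e.g. "Deferred"): it may succeed later.
        return "infeasible" if pending.get("reason") == "Infeasible" else "deferred"
    if "PodResizeInProgress" in active:
        return "in_progress"
    # Neither condition is set: the resize finished or was not yet observed.
    return "settled"
```

The `settled` result is ambiguous on its own, so a real implementation would likely also compare the resources reported in the container statuses with the requested values before treating the resize as complete.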
|
|
||
| #### Flow Diagram | ||
|
|
||
| ``` | ||
| User Request (CreateSandbox) | |
|              | | |
|              v | |
|       [Parse Metadata] | |
|              | | |
|              v | |
| [cpu-scale-factor present?] | |
|     No |           | Yes | |
|        v           v | |
| [Return Sandbox]  [Calculate Target CPU] | |
|                            | | |
|                            v | |
|                   [Validate Feasibility] | |
|                            | | |
|                      [Infeasible?] | |
|                     Yes /     \ No | |
|                        v       v | |
|           [Return Error]     [Call Pod /resize] | |
|
Member
If the resize is infeasible, should we continue by claiming another sandbox? Please elaborate on the error-handling logic.
||
|                                   | | |
|                                   v | |
|                         [Monitor Conditions] | |
|                                   | | |
|                        [PodResizeInProgress?] | |
|                         Yes /          \ No | |
|                            v            v | |
|                [Early Return?]      [Wait for Completion] | |
|                 Yes /     \ No               | | |
|                    v       v                 v | |
|      [Return Sandbox]    [Wait]       [Return Sandbox] | |
|                    |       |                 | | |
|                    +-------+-----------------+ | |
|                            | | |
|                            v | |
|                 [Resize Completes Async] | |
| ``` | ||
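
The branching in the diagram can be condensed into one decision function. This is a sketch only: `next_action`, the state strings (`"infeasible"`, `"in_progress"`, `"settled"`, and anything else meaning "not started or deferred"), and the returned action names are all hypothetical, and the retry-with-another-sandbox question raised in review is left open.

```python
def next_action(has_scale_factor, state, early_return):
    """Map the current situation to the next step of the claim flow."""
    if not has_scale_factor:
        return "return_sandbox"  # no metadata key: nothing to resize
    if state == "infeasible":
        return "return_error"
    if state == "in_progress":
        # Early return hands the sandbox over while the resize continues.
        return "return_sandbox" if early_return else "wait"
    if state == "settled":
        return "return_sandbox"  # resize completed
    return "wait"  # not yet accepted or deferred: keep monitoring
```

Keeping the decision pure, with no Kubernetes calls inside it, makes each branch of the diagram straightforward to unit test.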
|
|
||
| ### User Stories | ||
|
|
||
| #### Cost-Optimized Warm Pool | ||
|
|
||
| As a platform operator, I want to maintain warm pools with minimal CPU resources (0.5 cores) to reduce costs. | ||
| When an agent claims a sandbox for a compute-intensive task, | ||
| I want the sandbox to automatically scale to 4 cores in-place without downtime. | ||
|
|
||
| #### Task-Based Resource Allocation | ||
|
|
||
| As an agent developer, I want to specify CPU requirements | ||
| when claiming a sandbox based on my task's computational needs, | ||
| so that I get appropriate resources without over-provisioning. | ||
|
|
||
| #### Immediate Sandbox Availability | ||
|
|
||
| As an agent developer, I want to receive the sandbox immediately once | ||
| the system confirms that CPU resize is feasible, even if the resize is still in progress, | ||
| so that I can start using the sandbox without waiting for resize completion. | ||
|
|
||
| ### Implementation Details/Notes/Constraints | ||
|
|
||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| ## Alternatives | ||
|
|
||
| ## Upgrade Strategy | ||
|
|
||
| ## Additional Details | ||
|
|
||
| ### Test Plan [optional] | ||
|
|
||
| ## Implementation History | ||
|
|
||
| - [ ] 2026-01-13: Initial proposal draft created | |
Shall we introduce an annotation or feature gate to enable this support?