Sandbox Inplace CPU Resize

Summary

This enhancement proposes in-place CPU resizing for sandboxes allocated from the warm pool, driven by request metadata. When a sandbox is claimed via the E2B API, users can specify a CPU scale factor in the metadata (e.g., e2b.agents.kruise.io/cpu-scale-factor: "2"). The sandbox manager then resizes the allocated sandbox's CPU resources in place through the Kubernetes pod /resize subresource. The warm pool can therefore keep minimal resource configurations while claimed sandboxes scale their CPU on demand.

Key Benefits:

Cost Optimization: Maintain warm pools with minimal CPU resources, scaling up only when sandboxes are actually claimed
Zero Downtime: In-place CPU resizing without pod restart or recreation

Motivation

Problem Statement

Currently, the warm pool management strategy requires maintaining sandboxes with sufficient resources to handle peak workloads. This leads to:

High Resource Costs: Warm pools must be provisioned with resources sufficient for the maximum expected workload, even though most sandboxes may not need peak resources immediately
Inefficient Resource Utilization: Sandboxes sit idle in the warm pool consuming resources that may never be fully utilized
Limited Flexibility: Once a sandbox is allocated, its resources cannot be adjusted without recreation, which causes downtime

Goals

Enable Metadata-Based CPU Scaling: Allow users to specify CPU scale factor via E2B API metadata when creating sandboxes
In-Place Resize: Leverage the Kubernetes pod /resize subresource to resize CPU without restarting the pod
Early Return Support: Optionally return the sandbox as soon as resize feasibility is confirmed

Non-Goals/Future Work

Automatic Scaling: This does not implement automatic CPU scaling based on workload metrics
Resize Policy Configuration: Users cannot configure resize policies (the NotRequired restart policy is always used)

Proposal

API Changes

E2B API Metadata Extension

The existing E2B CreateSandbox API already accepts a metadata field. This enhancement adds support for a new metadata key:

metadata:
  e2b.agents.kruise.io/cpu-scale-factor: "2"  # String representation of a positive number

Metadata Key: e2b.agents.kruise.io/cpu-scale-factor

Type: String (must be parseable as a positive float64)
Validation: Must be > 0, typically in range [1, 10] for practical use
Default: If not specified, no resize is performed (backward compatible)
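
The parsing and validation rules above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function name parse_cpu_scale_factor is hypothetical, and rejecting NaN and infinity is an assumption that the proposal does not spell out.

```python
import math
from typing import Mapping, Optional

CPU_SCALE_FACTOR_KEY = "e2b.agents.kruise.io/cpu-scale-factor"


def parse_cpu_scale_factor(metadata: Mapping[str, str]) -> Optional[float]:
    """Return the requested scale factor, or None when the key is absent.

    Raises ValueError when the value is not a finite number greater than zero.
    """
    raw = metadata.get(CPU_SCALE_FACTOR_KEY)
    if raw is None:
        # Key not specified: no resize is performed (backward compatible).
        return None
    try:
        factor = float(raw)
    except ValueError as exc:
        raise ValueError(f"{CPU_SCALE_FACTOR_KEY} is not a number: {raw!r}") from exc
    # Assumption: NaN and +/-inf are rejected in addition to values <= 0.
    if not math.isfinite(factor) or factor <= 0:
        raise ValueError(f"{CPU_SCALE_FACTOR_KEY} must be > 0, got {raw!r}")
    return factor
```

Returning None for a missing key keeps the "no metadata, no resize" path distinct from a validation failure, which surfaces as an exception.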

Design Details

Metadata-based CPU Scale Factor

When a sandbox is claimed via the CreateSandbox API:

Metadata Parsing: The sandbox manager checks for e2b.agents.kruise.io/cpu-scale-factor in the request metadata
CPU Calculation: If the key is present, calculate the target CPU as originalCPU * scaleFactor
Validation: Validate that the target CPU is within acceptable bounds (pod limits, resource quotas, etc.)
Resize Trigger: If validation passes, trigger the pod resize via the Kubernetes /resize subresource

Example Flow:

Original Sandbox CPU: 1 core
Metadata: e2b.agents.kruise.io/cpu-scale-factor: "2"
Target CPU: 1 * 2 = 2 cores
Action: Resize pod from 1 core to 2 cores
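
The arithmetic in the example flow can be sketched in Python, working in millicores so that fractional cores stay exact. The helper names are hypothetical, only plain decimal and "m"-suffixed CPU quantities are handled, and rounding up to a whole millicore is an assumption rather than something the proposal specifies.

```python
from decimal import Decimal, ROUND_CEILING


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "500m", "1" or "0.5" to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        value = Decimal(q[:-1])          # already expressed in millicores
    else:
        value = Decimal(q) * 1000        # whole or fractional cores
    return int(value.to_integral_value(rounding=ROUND_CEILING))


def scale_cpu(quantity: str, factor: float) -> str:
    """Return originalCPU * scaleFactor as a millicore quantity string."""
    original = Decimal(cpu_to_millicores(quantity))
    # str(factor) avoids carrying binary floating-point noise into Decimal.
    target = original * Decimal(str(factor))
    # Assumption: round up so the sandbox never gets less than requested.
    return f"{int(target.to_integral_value(rounding=ROUND_CEILING))}m"
```

With these helpers, scaling "1" by 2 gives "2000m", which is the 2-core target from the example flow.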

Sandbox Manager Resize Logic

The resize logic is implemented in the sandbox manager's ClaimSandbox flow:

After Sandbox Claim: Once a sandbox is successfully claimed from the pool
Metadata Check: Check if cpu-scale-factor metadata exists
Current CPU Detection: Read current CPU from pod spec or status
Target Calculation: Calculate target CPU = current * scaleFactor
Resize Execution: Call Kubernetes pod /resize subresource
Status Monitoring: Monitor pod conditions for resize progress
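
As a rough illustration of the resize execution step, the request body sent to the pod /resize subresource could be built as follows. The function name, the container name "sandbox", and the choice to raise the CPU limit together with the request are all assumptions made for this sketch; the proposal does not state how limits are handled.

```python
import json
from typing import Any, Dict


def build_cpu_resize_patch(container_name: str, target_cpu: str,
                           update_limit: bool = True) -> Dict[str, Any]:
    """Build a patch body that changes only the CPU of one container."""
    resources: Dict[str, Dict[str, str]] = {"requests": {"cpu": target_cpu}}
    if update_limit:
        # Assumption: the limit follows the request so the request never exceeds it.
        resources["limits"] = {"cpu": target_cpu}
    return {
        "spec": {
            "containers": [
                {"name": container_name, "resources": resources},
            ]
        }
    }


if __name__ == "__main__":
    # "sandbox" is a hypothetical container name used only for illustration.
    print(json.dumps(build_cpu_resize_patch("sandbox", "2000m"), indent=2))
```

A body of this shape is what a client would submit as a patch against the pod's resize subresource, for example with a recent kubectl via `kubectl patch pod <pod> --subresource resize --patch '<json>'`.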

Early Return on Resize Feasibility

Optional Feature: Once the system confirms that the resize is feasible (the PodResizeInProgress condition is set), the sandbox can be returned to the user immediately, even if the resize is still in progress.

Condition Check:

Monitor the pod status for the PodResizeInProgress condition
Once the condition is True, the kubelet has confirmed that the resize is feasible
The condition indicates that:
- Kubelet has accepted the resize request
- Resource allocation has been updated
- Resize is being actuated (may still be in progress)
Return the sandbox to the user with a status indicating that the resize is in progress
The user can start using the sandbox while the CPU resize completes asynchronously
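
The condition check can be sketched as a pure function over the pod status as returned by the Kubernetes API (a mapping with a "conditions" list). The helper names and the early_return_enabled flag are hypothetical. The condition type string used here, PodResizeInProgress, is the name upstream Kubernetes gives to an accepted, in-flight resize; treat it as an assumption to verify against the cluster version in use.

```python
from typing import Any, Mapping

# Condition type reported by the kubelet for an accepted, in-flight resize.
RESIZE_IN_PROGRESS = "PodResizeInProgress"


def condition_is_true(pod_status: Mapping[str, Any], cond_type: str) -> bool:
    """Return True when pod_status carries cond_type with status "True"."""
    for cond in pod_status.get("conditions") or []:
        if cond.get("type") == cond_type and cond.get("status") == "True":
            return True
    return False


def can_return_early(pod_status: Mapping[str, Any],
                     early_return_enabled: bool) -> bool:
    """Decide whether the sandbox may be handed back before the resize ends."""
    if not early_return_enabled:
        return False  # the optional feature is switched off
    return condition_is_true(pod_status, RESIZE_IN_PROGRESS)
```

Keeping the check free of API calls makes it easy to unit test with hand-written status fragments, while the surrounding watch or poll loop stays in the sandbox manager.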

Flow Diagram

User Request (CreateSandbox)
    |
    v
[Parse Metadata]
    |
    v
[cpu-scale-factor present?]
    | No                    Yes
    |  |                     |
    |  v                     v
    |  [Return Sandbox]  [Calculate Target CPU]
    |                          |
    |                          v
    |                     [Validate Feasibility]
    |                          |
    |                    [Infeasible?]
    |                    Yes /  \ No
    |                      |     |
    |                      v     v
    |              [Return Error] [Call Pod /resize]
    |                                 |
    |                                 v
    |                          [Monitor Conditions]
    |                                 |
    |                     [PodResizeInProgress?]
    |                    Yes /          \ No
    |                      |             |
    |                      v             v
    |          [Early Return?]    [Wait for Completion]
    |          Yes /    \ No            |
    |            |       |              |
    |            v       v              v
    |    [Return Sandbox] [Wait]  [Return Sandbox]
    |            |       |              |
    |            +-------+--------------+
    |                     |
    |                     v
    |            [Resize Completes Async]
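
The branches of the diagram can also be read as a small decision table. The sketch below encodes that table in Python; the Action enum and next_action function are hypothetical names, and the boolean inputs stand in for the checks that the sandbox manager would perform.

```python
from enum import Enum


class Action(Enum):
    RETURN_SANDBOX = "return sandbox"
    RETURN_ERROR = "return error"
    WAIT_FOR_COMPLETION = "wait for completion"


def next_action(has_scale_factor: bool, feasible: bool,
                resize_in_progress: bool, early_return: bool) -> Action:
    """Map the diagram's decision points to the next step."""
    if not has_scale_factor:
        return Action.RETURN_SANDBOX       # no metadata: nothing to resize
    if not feasible:
        return Action.RETURN_ERROR         # validation rejected the target CPU
    if resize_in_progress and early_return:
        return Action.RETURN_SANDBOX       # resize finishes asynchronously
    return Action.WAIT_FOR_COMPLETION      # block until the resize is done
```

After WAIT_FOR_COMPLETION resolves, the sandbox is returned as in the other success paths.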

User Stories

Cost-Optimized Warm Pool

As a platform operator, I want to maintain warm pools with minimal CPU resources (0.5 cores) to reduce costs. When an agent claims a sandbox for a compute-intensive task, I want the sandbox to automatically scale to 4 cores in-place without downtime.

Task-Based Resource Allocation

As an agent developer, I want to specify CPU requirements when claiming a sandbox based on my task's computational needs, so that I get appropriate resources without over-provisioning.

Immediate Sandbox Availability

As an agent developer, I want to receive the sandbox immediately once the system confirms that CPU resize is feasible, even if the resize is still in progress, so that I can start using the sandbox without waiting for resize completion.

Table of Contents

Summary

Motivation

Problem Statement

Goals

Non-Goals/Future Work

Proposal

API Changes

E2B API Metadata Extension

Design Details

Metadata-based CPU Scale Factor

Sandbox Manager Resize Logic

Early Return on Resize Feasibility

Flow Diagram

User Stories

Cost-Optimized Warm Pool

Task-Based Resource Allocation

Immediate Sandbox Availability

Implementation Details/Notes/Constraints

Risks and Mitigations

Alternatives

Upgrade Strategy

Additional Details

Test Plan [optional]

Implementation History
