CubeSandbox bench methodology

Host (read this first if you suspect nested virtualisation)

Both forkd and CubeSandbox were measured on the same bare-metal host. There is no nested virtualisation in this setup:

$ systemd-detect-virt
none
$ grep "model name" /proc/cpuinfo | head -1
model name : 12th Gen Intel(R) Core(TM) i7-12700
$ grep -o vmx /proc/cpuinfo | head -1
vmx

12th-gen Intel Core, VT-x available directly, Ubuntu 24.04 / Linux 6.14 running on the metal. Every microVM in either project is host → L1 KVM guest, same level for both. CubeSandbox was not run inside a dev-env VM or any other intermediate hypervisor; the one-click install script targets the host directly (see "Setup" below).

TL;DR

| Path | N=100 wall-clock | Success | Per-sandbox |
|---|---|---|---|
| Fast (pool entry reused) | 1,056 ± 14 ms (5-run mean) | 100 % | 10.6 ms |
| Slow (live mkfs.ext4 + reflink-copy) | 20,304 ms | 77 % | 263 ms per successful sandbox (203 ms per attempt) |

Same bare-metal host for both (i7-12700, 20 vCPU, no nested virt). The slow-path row is what shipped first because the bench template used a 2 GiB writable-layer size that didn't match pool_default_format_size_list (default ["1Gi"]); the maintainer clarified the distinction at #235 and we re-ran on 2026-05-14 with ["1Gi", "2Gi"]. The "Fast-path replay" section at the bottom has the full small-N curve and the config tweak required to stabilise pool warm-up on this host.

Result (slow path)

CubeSandbox N=100 spawn measured at 20,304 ms on the same dev box forkd was measured on (Ubuntu 24.04 / Linux 6.14 / 20 vCPU / 30 GiB / KVM). 77 of 100 sandboxes spawned cleanly; the rest hit newExt4RawByReflinkCopy failed: e2fsck 1.47.0 (5-Feb-2023): bad magic number in superblock under concurrent load. The wall-clock figure is the full N=100 run including the failed-spawn rollbacks.

Setup

# CubeSandbox v0.2.0 one-click install with custom ports.
# Patches applied on this host (1Panel-occupied default ports):
#   CubeMaster/conf.yaml — replace 127.0.0.1:3306 → :13306
#   CubeMaster/conf.yaml — replace 127.0.0.1:6379 → :16379
sudo bash /opt/cube-stage/cube-sandbox-one-click-9c16021/install.sh
# After install, apply the port + service patches above, then:
sudo /usr/local/services/cubetoolbox/scripts/one-click/up.sh

# Build a template once (cached afterwards):
cubemastercli template create-from-image \
    --image python:3.12-slim \
    --template-id forkd-bench-pynp \
    --writable-layer-size 2Gi \
    --allow-internet-access

The cube-api listens on port 6000 (we overrode CUBE_API_BIND).

Workload

bench/cube-bench.py (see compare-all.py) issues N concurrent POST /sandboxes {"templateID":"forkd-bench-pynp"} requests against the cube-api REST endpoint, then one DELETE /sandboxes/:id per successful spawn. The numpy import workload runs inside each sandbox, but on the slow path many sandboxes (23 to 64 of 100, depending on the run) fail before reaching it because of the storage issue described below.
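
The fan-out pattern described above can be sketched as follows. This is a minimal illustration, not the contents of bench/cube-bench.py: the function name `run_batch`, the injected `spawn` and `delete` callables, and the choice to keep cleanup outside the timed region are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def run_batch(spawn: Callable[[], str],
              delete: Callable[[str], None],
              n: int) -> dict:
    """Issue n concurrent spawns, then delete every sandbox that came up.

    `spawn` and `delete` are hypothetical stand-ins for the POST /sandboxes
    and DELETE /sandboxes/:id calls.
    """
    def attempt(_: int) -> Optional[str]:
        try:
            return spawn()      # expected to return a sandbox id
        except Exception:
            return None         # a failed spawn counts against the success rate

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        ids = list(pool.map(attempt, range(n)))
    wall_ms = (time.perf_counter() - start) * 1000.0

    succeeded = [sandbox_id for sandbox_id in ids if sandbox_id is not None]
    for sandbox_id in succeeded:
        delete(sandbox_id)      # assumption: cleanup is not part of the timing
    return {"n": n, "succeeded": len(succeeded), "wall_ms": wall_ms}
```

In a real run, `spawn` would wrap the HTTP POST and return the sandbox id from the response, and `delete` would wrap the matching DELETE.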

Why success rate is < 100 % on this host (slow path)

Under concurrent load, newExt4RawByReflinkCopy reports a corrupt ext4 superblock on the per-sandbox writable layer. The XFS filesystem hosting /data/cubelet has reflink=1 enabled (it's a loop-mounted /var/cube-xfs.img; xfs_info confirms) and the host has plenty of free space, so this isn't filesystem or capacity-driven.

Subsequent investigation (see "Fast-path replay" below) traced the real cause to mkfs.ext4 timing out under cubelet's default pool_worker_num = 8 against the hard-coded cmdTimeout = 3 s in storage/shell.go. Two cubelet instances both formatting 2 GiB images concurrently can push individual mkfs.ext4 invocations past 3 s, the ExecV context cancels the command mid-write, and the next reader sees a half-baked superblock. The "bad magic number" message is the visible symptom; the timeout race is the cause. PRs #236 (make cmd_timeout configurable) and #237 (diagnostic context on failure) target this directly.
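
To make the "bad magic number" symptom concrete: ext2/3/4 images carry the 16-bit magic value 0xEF53 in the primary superblock, which starts 1024 bytes into the image, with the magic field 56 bytes into the superblock. An image whose mkfs.ext4 run was cut short before that field was written would fail this check. The sketch below tests only that one field; it is a diagnostic aid, not a substitute for e2fsck, and the helper name `has_ext4_magic` is made up for the example.

```python
import struct

EXT4_SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1 KiB into the image
EXT4_MAGIC_FIELD_OFFSET = 56    # offset of s_magic inside the superblock
EXT4_MAGIC = 0xEF53


def has_ext4_magic(path: str) -> bool:
    """Return True if the image at `path` has the ext2/3/4 superblock magic."""
    with open(path, "rb") as image:
        image.seek(EXT4_SUPERBLOCK_OFFSET + EXT4_MAGIC_FIELD_OFFSET)
        raw = image.read(2)
    if len(raw) < 2:
        return False            # image too short to contain a superblock
    (magic,) = struct.unpack("<H", raw)   # the field is little-endian
    return magic == EXT4_MAGIC
```

Running a check like this over the writable-layer images after a failed batch would show how many were left without a valid superblock.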

A second N=100 run measured 20,304 ms / 77 succeeded; the first run measured 19,788 ms / 36 succeeded. Wall-clock is stable; success rate is variable. The chart row uses the more recent figure.

Notes

Tencent's published numbers ("<60 ms" cold-start, "<150 ms under concurrent") would put CubeSandbox ahead of forkd on raw cold-start. On the specific Ubuntu 24.04 / Linux 6.14 / 20-vCPU host we tested, the storage path was the bottleneck, not VM boot. A cleaner host (no 1Panel co-tenancy, dedicated XFS partition for /data/cubelet) is likely to give CubeSandbox a substantially better number.

Upstream response (2026-05-14)

We filed the methodology + the reflink-copy race upstream: TencentCloud/CubeSandbox#235. The maintainer's response confirmed two things that recontextualise the numbers above:

  1. The race is on a slow code path the original template inadvertently selected. CubeSandbox pre-formats a pool of writable-layer ext4 images at sizes listed in pool_default_format_size_list (default ["1Gi"]). A sandbox whose writable_layer_size matches one of those sizes reuses a pool entry — fast path, no mkfs.ext4 or reflink-copy per sandbox. We passed --writable-layer-size 2Gi, which doesn't match the default pool, so every sandbox went through the live mkfs.ext4 + reflink-copy slow path. That's where the bad-magic race lives.
  2. Cube's published N=50/N=100 numbers are measured on a 96 vCPU server. A 20 vCPU host (this dev box) is outside their tested matrix. Per the maintainer: P99 under 200 ms at N=100 on a 96-vCPU node.

Cube also accepted the first two improvements from our issue (a configurable cmdTimeout, and richer diagnostic info on newExt4RawByReflinkCopy failures) and is reviewing the third (drop per-clone e2fsck).

Small-N replay on the same (slow-path) configuration

After the upstream exchange we re-ran with the same 2 GiB template at smaller N — staying on the slow path so the comparison is apples-to-apples with the N=100 row, but small enough to fit the 30 GiB host RAM budget (template spec = 2 GiB per sandbox → max ~14 concurrent).

Script: bench/cube-replay.sh.

| N | Succeeded | Wall-clock | Per-sandbox |
|---|---|---|---|
| 1 | 1/1 | 924 ms | 924 ms |
| 5 | 5/5 | 2,207 ms | 441 ms |
| 10 | 10/10 | 2,567 ms | 257 ms |

Observations:

  • 100 % success rate at every size we measured. The reflink-copy race only fired at N=100 with the 2 GiB writable layer; smaller N hit no failures.
  • Single-instance cold start ≈ 924 ms here, vs Cube's published fast-path <60 ms. The ~15× gap is the combined cost of the slow path (live mkfs.ext4 plus reflink-copy of a 2 GiB image) and the host being well outside their 96 vCPU testing matrix.
  • Per-sandbox cost shrinks substantially with concurrency (924 → 441 → 257 ms / sandbox), which suggests the work is pipelined. The original N=100 figure (20.3 s / 100 = 203 ms per attempted sandbox) is consistent with that trend.
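
The per-sandbox figures are plain wall-clock divided by sandbox count. The short sketch below reproduces them from the measured values in this document; the helper name `per_sandbox_ms` is illustrative only. It also shows that the N=100 slow-path run yields two different figures depending on the divisor: about 203 ms per attempted sandbox (÷ 100) and about 263.7 ms per successful sandbox (÷ 77).

```python
def per_sandbox_ms(wall_ms: float, count: int) -> float:
    """Amortised cost: total wall-clock divided by the number of sandboxes."""
    return wall_ms / count


# Small-N slow-path replay (wall-clock in ms, taken from the table above).
small_n = {1: 924, 5: 2207, 10: 2567}
for n, wall_ms in small_n.items():
    print(f"N={n}: {per_sandbox_ms(wall_ms, n):.1f} ms per sandbox")

# N=100 slow-path run: 100 attempted, 77 succeeded.
print(f"per attempt: {per_sandbox_ms(20304, 100):.1f} ms")
print(f"per success: {per_sandbox_ms(20304, 77):.1f} ms")
```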

What this replay did not measure: the fast path (writable_layer_size matching pool_default_format_size_list). Doing so requires either a new template with a 1 GiB writable layer or reconfiguring the pool for 2 GiB; the "Fast-path replay" section below takes the second route.

Fast-path replay (2026-05-14)

After the upstream exchange we reconfigured the pool to include the template's writable-layer size and re-ran the bench. Two config edits in /usr/local/services/cubetoolbox/Cubelet/config/config.toml under [plugins."io.cubelet.internal.v1.storage"]:

pool_default_format_size_list = ["1Gi", "2Gi"]   # was ["1Gi"]
pool_worker_num               = 1                # was 8

The first edit is what the maintainer was pointing at — 2Gi now takes the fast path (no per-sandbox mkfs.ext4 or reflink-copy). The second is a workaround for the cmdTimeout race described above: with 8 workers, pool warm-up at ~2 GiB images races itself into corruption before the bench even starts. With one worker, each mkfs.ext4 runs alone and finishes well inside the 3 s budget. PR #236 makes the timeout itself configurable, which is the right long-term fix.

After a restart and pool warm-up to 100 entries, we made five consecutive runs of an improved bench/cube-bench.py against forkd-bench-pynp. The improved script pre-warms Python's default ThreadPoolExecutor (so its lazy initialisation isn't charged to N=1) and reports per-call latency in addition to wall-clock:

| Phase | N | Wall-clock (mean ± σ over 5 runs) | Notes |
|---|---|---|---|
| cold-server | 1 | 184 ± 17 ms | first call after cubelet restart |
| warm-server | 1 | 156 ± 7 ms | repeated single-call |
| ramp | 10 | 212 ± 3 ms | ≈ cold N=1; 20 vCPUs still have headroom |
| ramp | 50 | 542 ± 11 ms | 20-vCPU ceiling starts to bind |
| ramp | 100 | 1,056 ± 14 ms | per-sandbox amortised ≈ 10.6 ms |

100 % success at every N, every run.
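
The pre-warming and per-call timing described for the improved script might look like the sketch below. It is an illustration under assumptions, not the actual bench code: `timed_batch` and its `call` parameter are hypothetical names, and `call` stands in for one blocking spawn request. asyncio creates its default executor lazily on the first `run_in_executor(None, ...)` call, so a throwaway no-op is submitted before the clock starts.

```python
import asyncio
import time
from typing import Callable


async def timed_batch(call: Callable[[], object], n: int) -> dict:
    """Run n blocking calls concurrently; report wall-clock and per-call latency."""
    loop = asyncio.get_running_loop()

    # Pre-warm: the first use of the default executor creates the pool and its
    # first worker thread. Pay that cost before timing begins.
    await loop.run_in_executor(None, lambda: None)

    def timed_call() -> float:
        t0 = time.perf_counter()
        call()
        return (time.perf_counter() - t0) * 1000.0

    start = time.perf_counter()
    latencies = await asyncio.gather(
        *(loop.run_in_executor(None, timed_call) for _ in range(n))
    )
    wall_ms = (time.perf_counter() - start) * 1000.0
    return {"n": n, "wall_ms": wall_ms, "latencies_ms": sorted(latencies)}
```

One caveat for this pattern: the default executor caps its worker count, so at large N some calls queue behind others. Whether the real script installs a larger executor is not stated in this document.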

Observations:

  • N=1 ≈ N=10 wall-clock. Below the 20-vCPU ceiling the wall-clock is dominated by the slowest single sandbox-boot, not by the number of concurrent boots. Once N saturates the cores (≥ 50), per-sandbox amortised cost stabilises around 10–11 ms — close to the wall-time of one warm VM boot divided across the available parallelism.
  • ~28 ms cold-start delta on the first request after a quiet cubelet (184 → 156 ms). The CubeSandbox maintainer noted at #235 that cube v0.2.0 shipped with a ~50 ms latency regression that PR #234 fixes in v0.2.1. Our observed delta is of the same order of magnitude, but we cannot attribute it to that regression without retesting. Numbers in the table are valid for the v0.2.0 baseline we tested; v0.2.1 would presumably shift each row down by roughly the size of that regression. We did not retest on v0.2.1.
  • N=100 wall-clock 1.04–1.07 s — about 19× faster than the slow-path run on the same host, well inside Cube's published "<150 ms under concurrent" envelope at ~10.6 ms / sandbox amortised.
  • The N=1 figures here are still well above Cube's advertised "<60 ms" single-instance cold-start — that number was measured on a 96 vCPU host with the snapshot/CoW path warm. We didn't retest that shape.
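
The saturation and speedup claims in the observations above follow directly from the table values, as this short check shows. All inputs are measured figures from this document; nothing here is new data.

```python
# Fast-path ramp: mean wall-clock in ms, keyed by N.
fast_path_wall_ms = {10: 212, 50: 542, 100: 1056}

# Amortised cost per sandbox at each N.
amortised = {n: wall_ms / n for n, wall_ms in fast_path_wall_ms.items()}
for n, cost in amortised.items():
    print(f"N={n}: {cost:.2f} ms per sandbox")

# Slow-path N=100 wall-clock versus fast-path N=100 mean on the same host.
speedup = 20304 / fast_path_wall_ms[100]
print(f"fast path is {speedup:.1f}x faster at N=100")
```

At N=10 the amortised figure is still about 21 ms because the cores are not yet saturated; at N=50 and N=100 it settles between 10 and 11 ms.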

A first round of measurements posted earlier in #235 reported N=100 = 1,439–1,480 ms at 100 % success and N=1 = 385 ms. Both figures were inflated by two artifacts:

  • The original bench script lazy-initialized Python's default ThreadPoolExecutor on the first run_in_executor call, charging ~50–100 ms to the N=1 measurement.
  • A stale cubemaster reconcile-retry loop was burning CPU during the first batch of runs (we'd previously killed cubelet for debugging without taking down cubemaster), adding background contention to every measurement.

The numbers in the table above remove both biases.
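
One way to guard against the second artifact in later runs is to refuse to start a bench while the host is visibly busy. The helper below is a hypothetical pre-flight check, not something the bench scripts are known to do. The name `host_is_quiet` and the default threshold are assumptions chosen for the example, and `os.getloadavg` is only available on Unix-like systems.

```python
import os


def host_is_quiet(max_load_per_cpu: float = 0.1) -> bool:
    """Return True if the 1-minute load average per CPU is under the threshold.

    The 0.1 default is an arbitrary illustrative value, not a tuned one.
    """
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return load_1min / cpus <= max_load_per_cpu
```

A stale reconcile-retry loop burning CPU would push the load average up and make this check fail before any timing is recorded.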