
Commit c051060

Merge pull request #364 from diggerhq/fix/upsize-default-builder
make build multistage, add adjustable mb
2 parents a6e60b1 + 860728e commit c051060

10 files changed

Lines changed: 299 additions & 73 deletions

docs/reference/python-sdk/image.mdx

Lines changed: 12 additions & 0 deletions
@@ -103,6 +103,18 @@ image = (

 **Returns:** `Image`

+### `image.builder_memory(mb)`
+
+Sets the RAM (in MB) used during the build phase. Raise this when a build OOMs
+(heavy `apt`/`pip`/`npm`). Defaults to **4096**. Does **not** affect the resulting
+sandbox's memory — size that at create time via `memory_mb`.
+
+<ParamField body="mb" type="int" required>
+  Build-phase memory in MB
+</ParamField>
+
+**Returns:** `Image`
+
 ---

 ## Utility Methods

docs/reference/typescript-sdk/image.mdx

Lines changed: 12 additions & 0 deletions
@@ -115,6 +115,18 @@ const image = Image.base()

 **Returns:** `Image`

+### `image.builderMemory(mb)`
+
+Sets the RAM (MB) used during the build phase. Raise this when a build OOMs
+(heavy `apt`/`pip`/`npm`). Defaults to **4096**. Does **not** affect the resulting
+sandbox's memory — size that at create time via `memoryMB`.
+
+<ParamField body="mb" type="number" required>
+  Build-phase memory in MB
+</ParamField>
+
+**Returns:** `Image`
+
 ---

 ## Utility Methods

docs/sandboxes/templates.mdx

Lines changed: 34 additions & 0 deletions
@@ -90,6 +90,39 @@ When you pass an `image` to `Sandbox.create()`, the server:
 2. If cached, creates the sandbox from the existing checkpoint instantly
 3. If not cached, boots a build sandbox, executes each step, checkpoints the result, then creates your sandbox from it

+### Build memory
+
+Images build in a **4 GB** sandbox by default. If a build runs out of memory (heavy `apt`/`pip`/`npm`, compiling a large toolchain), raise the build-phase RAM with **`.builderMemory(mb)` / `.builder_memory(mb)`**.
+
+This only affects the build. The resulting image is unchanged — you size the actual sandbox when you create it, via `memoryMB`:
+
+<CodeGroup>
+
+```typescript TypeScript
+// 8 GB to build…
+const image = Image.base()
+  .aptInstall(['build-essential', 'cmake'])
+  .runCommands('make -j')
+  .builderMemory(8192);
+
+// …but the sandbox runs at whatever you ask for
+const sandbox = await Sandbox.create({ image, memoryMB: 4096 });
+```
+
+```python Python
+image = (
+    Image.base()
+    .apt_install(["build-essential", "cmake"])
+    .run_commands("make -j")
+    .builder_memory(8192)
+)
+sandbox = await Sandbox.create(image=image)  # size via the HTTP API's memoryMB
+```
+
+</CodeGroup>
+
+`builderMemory` doesn't change the cache key — it's a build resource, not image content.
+
 ## Creating pre-built snapshots

 Create named snapshots that persist permanently and can be shared across sandboxes. Snapshots are visible in the dashboard and don't need to be rebuilt.
@@ -285,6 +318,7 @@ The `Image` class provides a fluent, immutable API for defining sandbox environm
 | `.addFile(path, content)` / `.add_file(path, content)` | Embed a file with inline content |
 | `.addLocalFile(local, remote)` / `.add_local_file(local, remote)` | Read a local file into the image |
 | `.addLocalDir(local, remote)` / `.add_local_dir(local, remote)` | Read a local directory into the image |
+| `.builderMemory(mb)` / `.builder_memory(mb)` | RAM for the build phase (default 4 GB; doesn't affect the resulting sandbox) |
 | `.toJSON()` / `.to_dict()` | Return the image manifest |
 | `.cacheKey()` / `.cache_key()` | Compute SHA-256 content hash |
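The docs above state that builder memory is a build resource and stays out of the cache key. A minimal Go sketch of that idea follows, assuming a hash over only the base and the steps. The `manifest` and `step` types are trimmed, hypothetical mirrors of the real structs, the `apt_install` step type is made up for the example, and the hashing scheme itself is an assumption for illustration, not the SDK's actual algorithm.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// step and manifest are trimmed, hypothetical mirrors of ImageStep and
// ImageManifest; only the fields needed for the illustration are kept.
type step struct {
	Type string `json:"type"`
}

type manifest struct {
	Base            string `json:"base"`
	Steps           []step `json:"steps"`
	BuilderMemoryMB int    `json:"builderMemoryMB,omitempty"`
}

// cacheKey hashes only the fields that describe image content (base and
// steps). BuilderMemoryMB is deliberately left out. This scheme is an
// assumption for illustration, not the real cache-key algorithm.
func cacheKey(m manifest) (string, error) {
	content := struct {
		Base  string `json:"base"`
		Steps []step `json:"steps"`
	}{m.Base, m.Steps}
	raw, err := json.Marshal(content)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	small := manifest{Base: "base", Steps: []step{{Type: "apt_install"}}}
	large := small
	large.BuilderMemoryMB = 8192 // more build RAM, same image content

	k1, _ := cacheKey(small)
	k2, _ := cacheKey(large)
	fmt.Println("same cache key:", k1 == k2)
}
```

Under this assumption, raising the build RAM after an OOM reuses any existing cached image for the same steps instead of forcing a rebuild.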

internal/api/image_builder.go

Lines changed: 68 additions & 10 deletions
@@ -23,8 +23,34 @@ type ImageManifest struct {
 	Base string `json:"base"`
 	Steps []ImageStep `json:"steps"`
 	Name string `json:"name,omitempty"` // optional — makes image addressable as a snapshot (for patches, etc.)
+
+	// BuilderMemoryMB is the RAM given to the build VM while it runs the steps
+	// (apt/pip can be memory-hungry; the old 1 GB default OOM'd large builds).
+	// It does NOT pin the output: the finalize pass re-snapshots at
+	// RuntimeMemoryMB. Default defaultBuilderMemoryMB. Exposed in the SDKs as
+	// .builderMemory().
+	BuilderMemoryMB int `json:"builderMemoryMB,omitempty"`
+
+	// RuntimeMemoryMB is the memory floor of the OUTPUT image — the RAM the
+	// finalize VM is cold-booted + savevm'd at. Forks can't run below it (a warm
+	// savevm can't shrink past its running size) but can hotplug up. Defaults to
+	// defaultRuntimeMemoryMB (1 GB); callers size the actual sandbox at create
+	// time via memoryMB. NOT exposed in the SDKs — an advanced/raw-manifest knob
+	// for images whose services auto-start heavy at boot.
+	RuntimeMemoryMB int `json:"runtimeMemoryMB,omitempty"`
 }

+const (
+	// defaultBuilderMemoryMB is the build-phase RAM. Enough that common
+	// apt/pip/npm builds don't OOM; safe because it never pins the output (see
+	// the finalize pass in buildImage).
+	defaultBuilderMemoryMB = 4096
+
+	// defaultRuntimeMemoryMB is the output image's memory floor — the historical
+	// 1 GB, so every image forks down to 1 GB and scales up on demand.
+	defaultRuntimeMemoryMB = 1024
+)
+
 // ImageStep is a single build step in an image manifest.
 type ImageStep struct {
 	Type string `json:"type"`
@@ -380,15 +406,36 @@ func (s *Server) buildImage(ctx context.Context, orgID uuid.UUID, manifest *Imag
 		base = "base"
 	}

+	// Build-phase RAM (apt/pip), and the output image's memory floor. Both
+	// settable per-manifest; defaults preserve "builds get headroom, floor stays
+	// at the historical 1 GB" (the finalize pass below re-snapshots at the floor).
+	builderMem := manifest.BuilderMemoryMB
+	if builderMem <= 0 {
+		builderMem = defaultBuilderMemoryMB
+	}
+	runtimeMem := manifest.RuntimeMemoryMB
+	if runtimeMem <= 0 {
+		runtimeMem = defaultRuntimeMemoryMB
+	}
+	// Only run the cold-boot finalize pass when it actually lowers the floor.
+	// If runtime >= builder there's nothing to gain (the in-place savevm already
+	// floors at builderMem ≤ runtimeMem), so snapshot in place (finalizeMem = 0).
+	finalizeMem := 0
+	if runtimeMem < builderMem {
+		finalizeMem = runtimeMem
+	}
+
 	// Create a throwaway sandbox
 	buildSandboxID := "sb-build-" + uuid.New().String()[:8]
 	cfg := types.SandboxConfig{
 		Template: base,
 		Timeout: 600, // 10 min max for builds
 		SandboxID: buildSandboxID,
+		MemoryMB: builderMem,
 	}

-	log.Printf("image-builder: creating build sandbox %s (base=%s, steps=%d)", buildSandboxID, base, len(manifest.Steps))
+	log.Printf("image-builder: creating build sandbox %s (base=%s, steps=%d, builderMem=%dMB, runtimeMem=%dMB)",
+		buildSandboxID, base, len(manifest.Steps), builderMem, runtimeMem)

 	var grpcClient pb.SandboxWorkerClient
 	var workerID string
@@ -414,6 +461,7 @@ func (s *Server) buildImage(ctx context.Context, orgID uuid.UUID, manifest *Imag
 			Timeout: int32(cfg.Timeout),
 			NetworkEnabled: true, // Need network for apt/pip
 			SandboxId: buildSandboxID,
+			MemoryMb: int32(builderMem),
 		})
 		if err != nil {
 			return uuid.Nil, fmt.Errorf("failed to create build sandbox: %w", err)
@@ -512,10 +560,10 @@ func (s *Server) buildImage(ctx context.Context, orgID uuid.UUID, manifest *Imag
 	if s.store != nil {
 		cfgJSON, _ := json.Marshal(cfg)
 		cp := &db.Checkpoint{
-			ID: checkpointID,
-			SandboxID: buildSandboxID,
-			OrgID: orgID,
-			Name: fmt.Sprintf("_image_build_%s", checkpointID.String()[:8]),
+			ID: checkpointID,
+			SandboxID: buildSandboxID,
+			OrgID: orgID,
+			Name: fmt.Sprintf("_image_build_%s", checkpointID.String()[:8]),
 			SandboxConfig: cfgJSON,
 		}
 		if err := s.store.CreateCheckpoint(ctx, cp); err != nil {
@@ -533,9 +581,10 @@ func (s *Server) buildImage(ctx context.Context, orgID uuid.UUID, manifest *Imag
 		defer cancel()

 		resp, err := grpcClient.CreateCheckpoint(cpCtx, &pb.CreateCheckpointRequest{
-			SandboxId: buildSandboxID,
-			CheckpointId: checkpointID.String(),
-			PrepareGolden: true, // prepare golden snapshot for instant template creates
+			SandboxId: buildSandboxID,
+			CheckpointId: checkpointID.String(),
+			PrepareGolden: true, // prepare golden snapshot for instant template creates
+			FinalizeMemoryMb: int32(finalizeMem), // 0 = snapshot in place; >0 = cold-boot finalize at this floor
 		})
 		if err != nil {
 			return uuid.Nil, fmt.Errorf("failed to checkpoint build sandbox: %w", err)
@@ -557,7 +606,17 @@ func (s *Server) buildImage(ctx context.Context, orgID uuid.UUID, manifest *Imag
 			return uuid.Nil, fmt.Errorf("manager does not support checkpoints")
 		}

-		rootfsKey, workspaceKey, sizeBytes, err := cpMgr.CreateCheckpoint(ctx, buildSandboxID, checkpointID.String(), s.checkpointStore, func() {})
+		var rootfsKey, workspaceKey string
+		var sizeBytes int64
+		var err error
+		type finalizer interface {
+			CreateCheckpointFinalized(ctx context.Context, buildSandboxID, checkpointID string, store *storage.CheckpointStore, finalizeMemMB int, onReady func()) (string, string, int64, error)
+		}
+		if fz, ok := s.manager.(finalizer); ok && finalizeMem > 0 {
+			rootfsKey, workspaceKey, sizeBytes, err = fz.CreateCheckpointFinalized(ctx, buildSandboxID, checkpointID.String(), s.checkpointStore, finalizeMem, func() {})
+		} else {
+			rootfsKey, workspaceKey, sizeBytes, err = cpMgr.CreateCheckpoint(ctx, buildSandboxID, checkpointID.String(), s.checkpointStore, func() {})
+		}
 		if err != nil {
 			return uuid.Nil, fmt.Errorf("failed to checkpoint build sandbox: %w", err)
 		}
@@ -693,4 +752,3 @@ func stepDescription(step ImageStep) string {
 		return step.Type
 	}
 }
-
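The default resolution and the finalize decision in `buildImage` can be read in isolation. The sketch below restates that logic as a standalone function; `resolveMemory` is a hypothetical helper name introduced here for illustration, while the constants and comparisons follow the diff.

```go
package main

import "fmt"

const (
	defaultBuilderMemoryMB = 4096 // build-phase RAM default from the diff
	defaultRuntimeMemoryMB = 1024 // output image memory floor default from the diff
)

// resolveMemory is a hypothetical helper that mirrors the inline logic in
// buildImage: apply defaults for non-positive values, then request a
// cold-boot finalize pass only when it would lower the image's memory floor.
// finalizeMem == 0 means "snapshot the build VM in place".
func resolveMemory(builderMB, runtimeMB int) (builderMem, runtimeMem, finalizeMem int) {
	builderMem = builderMB
	if builderMem <= 0 {
		builderMem = defaultBuilderMemoryMB
	}
	runtimeMem = runtimeMB
	if runtimeMem <= 0 {
		runtimeMem = defaultRuntimeMemoryMB
	}
	if runtimeMem < builderMem {
		finalizeMem = runtimeMem
	}
	return builderMem, runtimeMem, finalizeMem
}

func main() {
	inputs := [][2]int{{0, 0}, {8192, 0}, {512, 0}, {2048, 4096}}
	for _, in := range inputs {
		b, r, f := resolveMemory(in[0], in[1])
		fmt.Printf("manifest(builder=%d, runtime=%d) -> builderMem=%d runtimeMem=%d finalizeMem=%d\n",
			in[0], in[1], b, r, f)
	}
}
```

Note the third input: a builder size below the runtime floor yields `finalizeMem = 0`, so the image is snapshotted in place and its floor is the build VM's 512 MB, not the 1 GB runtime default.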

internal/qemu/manager.go

Lines changed: 62 additions & 1 deletion
@@ -332,7 +332,7 @@ type Manager struct {
 	goldenVersion string // hash of base image — used for overlay-based migration

 	// Metadata service callbacks (set via SetMetadataCallbacks)
-	onSandboxReady func(sandboxID, guestIP, template string, startedAt time.Time)
+	onSandboxReady func(sandboxID, guestIP, template string, startedAt time.Time)
 	onSandboxDestroy func(sandboxID string)
 	onMigrationOutgoing func(sandboxID string)

@@ -2845,6 +2845,67 @@ func (m *Manager) CheckpointCachePath(checkpointID, filename string) string {
 // checkpoint actually is. Upload failures now propagate as an error rather
 // than being silently logged — the control plane gets the reason and
 // persists it via SetCheckpointFailed (migration 039 added error_msg).
+// CreateCheckpointFinalized produces an image checkpoint whose memory floor is
+// finalizeMemMB, decoupled from the build VM's (larger) build-phase RAM. The
+// image builder uses this so a build can run at, say, 8 GB (apt/pip don't OOM)
+// while the resulting image still forks down to a 1 GB floor.
+//
+// Why two VMs: a savevm captures the *running* memory, and ForkFromCheckpoint
+// floors every fork at that size (shrinking past it would OOM restored
+// processes). So the only way to get a low floor from a high-memory build is to
+// cold-boot a fresh VM from the built disks at the target floor and snapshot
+// THAT. The cold-boot machinery already exists (Create with TemplateRootfsKey).
+//
+// Sequence: sync the build guest's FS → cold-boot a finalize VM from a copy of
+// the build disks at finalizeMemMB → snapshot the finalize VM (the output) →
+// tear the finalize VM down. The build VM is left running for its caller to
+// destroy (Kill would delete the disks we copy from). Cold-boot replays the
+// ext4 journal, so a post-sync disk copy restores cleanly.
+//
+// NOTE: the disk-consistency of the live-disk copy and the finalize cold-boot's
+// agent bring-up are the things to validate on dev before prod.
+func (m *Manager) CreateCheckpointFinalized(ctx context.Context, buildSandboxID, checkpointID string, checkpointStore *storage.CheckpointStore, finalizeMemMB int, onReady func()) (rootfsKey, workspaceKey string, sizeBytes int64, err error) {
+	// 1. Flush the build guest's filesystem so the on-disk qcow2 is consistent
+	// before we copy it. Best-effort: the cold-boot journal-replays regardless.
+	if syncErr := m.SyncFS(ctx, buildSandboxID); syncErr != nil {
+		log.Printf("qemu: finalize %s: SyncFS warning: %v (continuing)", buildSandboxID, syncErr)
+	}
+
+	buildDir := filepath.Join(m.cfg.DataDir, "sandboxes", buildSandboxID)
+	buildRootfs := filepath.Join(buildDir, "rootfs.qcow2")
+	buildWorkspace := filepath.Join(buildDir, "workspace.qcow2")
+	if !fileExists(buildRootfs) || !fileExists(buildWorkspace) {
+		return "", "", 0, fmt.Errorf("finalize: build disks not found for %s", buildSandboxID)
+	}
+
+	// 2. Cold-boot a fresh finalize VM from a copy of the build disks at the
+	// target floor. Create() copies the disks (reflink) via TemplateRootfsKey
+	// and cold-boots — no savevm restore, so no memory floor inherited.
+	finalizeID := buildSandboxID + "-fin"
+	netEnabled := true
+	finCfg := types.SandboxConfig{
+		SandboxID: finalizeID,
+		MemoryMB: finalizeMemMB,
+		NetworkEnabled: &netEnabled,
+		TemplateRootfsKey: "local://" + buildRootfs,
+		TemplateWorkspaceKey: "local://" + buildWorkspace,
+	}
+	log.Printf("qemu: finalize: cold-booting %s from %s disks at %dMB", finalizeID, buildSandboxID, finalizeMemMB)
+	if _, err := m.Create(ctx, finCfg); err != nil {
+		return "", "", 0, fmt.Errorf("finalize: cold-boot at %dMB: %w", finalizeMemMB, err)
+	}
+	// Tear down the ephemeral finalize VM after we snapshot it.
+	defer func() {
+		if kErr := m.Kill(context.Background(), finalizeID); kErr != nil {
+			log.Printf("qemu: finalize %s: cleanup Kill failed: %v", finalizeID, kErr)
+		}
+	}()
+
+	// 3. Snapshot the finalize VM → the output checkpoint (floor = finalizeMemMB).
+	log.Printf("qemu: finalize %s → checkpoint %s (floor %dMB)", finalizeID, checkpointID, finalizeMemMB)
+	return m.CreateCheckpoint(ctx, finalizeID, checkpointID, checkpointStore, onReady)
+}
+
 func (m *Manager) CreateCheckpoint(ctx context.Context, sandboxID, checkpointID string, checkpointStore *storage.CheckpointStore, onReady func()) (rootfsKey, workspaceKey string, sizeBytes int64, err error) {
 	tStart := time.Now()
 	// failureReason is updated at each error site below so the defer can attribute

internal/worker/grpc_server.go

Lines changed: 19 additions & 1 deletion
@@ -882,7 +882,25 @@ func (s *GRPCServer) CreateCheckpoint(ctx context.Context, req *pb.CreateCheckpo
 	// it's marked "ready" — forks poll for "ready" before downloading.
 	// The gRPC call returns immediately with the S3 keys — the CP's fork path
 	// polls for checkpoint readiness and blocks until onReady fires.
-	rootfsKey, workspaceKey, sizeBytes, err := s.manager.CreateCheckpoint(ctx, req.SandboxId, checkpointID, s.checkpointStore, onReady)
+	// Finalize: when finalize_memory_mb is set (image builder), produce the
+	// checkpoint at that memory floor by cold-booting a fresh VM from the build
+	// disks and snapshotting it — decoupling the image's floor from the build's
+	// (larger) RAM. Otherwise snapshot the sandbox in place.
+	var rootfsKey, workspaceKey string
+	var sizeBytes int64
+	var err error
+	if req.FinalizeMemoryMb > 0 {
+		type finalizer interface {
+			CreateCheckpointFinalized(ctx context.Context, buildSandboxID, checkpointID string, store *storage.CheckpointStore, finalizeMemMB int, onReady func()) (string, string, int64, error)
+		}
+		fz, ok := s.manager.(finalizer)
+		if !ok {
+			return nil, fmt.Errorf("manager does not support finalized checkpoints")
+		}
+		rootfsKey, workspaceKey, sizeBytes, err = fz.CreateCheckpointFinalized(ctx, req.SandboxId, checkpointID, s.checkpointStore, int(req.FinalizeMemoryMb), onReady)
+	} else {
+		rootfsKey, workspaceKey, sizeBytes, err = s.manager.CreateCheckpoint(ctx, req.SandboxId, checkpointID, s.checkpointStore, onReady)
+	}
 	if err != nil {
 		return nil, fmt.Errorf("create checkpoint failed: %w", err)
 	}
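The worker detects finalize support with a locally declared interface and a type assertion, so the manager type does not need to change its declared interface. A minimal sketch of that dispatch pattern follows; the simplified signatures and the `basicManager` and `qemuManager` types are hypothetical and exist only to make the example runnable.

```go
package main

import (
	"errors"
	"fmt"
)

// checkpointer is the capability every manager has in this sketch.
type checkpointer interface {
	CreateCheckpoint(sandboxID string) (string, error)
}

// finalizer is the optional capability, discovered at run time.
type finalizer interface {
	CreateCheckpointFinalized(sandboxID string, finalizeMemMB int) (string, error)
}

var errNoFinalize = errors.New("manager does not support finalized checkpoints")

// checkpoint snapshots in place when finalizeMemMB is 0 and otherwise
// requires the manager to implement finalizer, returning an error if not.
func checkpoint(m checkpointer, sandboxID string, finalizeMemMB int) (string, error) {
	if finalizeMemMB <= 0 {
		return m.CreateCheckpoint(sandboxID)
	}
	fz, ok := m.(finalizer)
	if !ok {
		return "", errNoFinalize
	}
	return fz.CreateCheckpointFinalized(sandboxID, finalizeMemMB)
}

// basicManager only supports in-place snapshots.
type basicManager struct{}

func (basicManager) CreateCheckpoint(id string) (string, error) {
	return "in-place:" + id, nil
}

// qemuManager embeds basicManager and adds the finalize capability.
type qemuManager struct{ basicManager }

func (qemuManager) CreateCheckpointFinalized(id string, mb int) (string, error) {
	return fmt.Sprintf("finalized:%s@%dMB", id, mb), nil
}

func main() {
	fmt.Println(checkpoint(basicManager{}, "sb-build-1", 0))
	fmt.Println(checkpoint(basicManager{}, "sb-build-1", 1024))
	fmt.Println(checkpoint(qemuManager{}, "sb-build-1", 1024))
}
```

This follows the gRPC handler's choice to fail loudly when a finalize is requested but unsupported. The in-process path in `image_builder.go` instead falls back to an in-place snapshot in that case, which would leave the image floor at the build VM's size.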

proto/worker/worker.pb.go

Lines changed: 17 additions & 4 deletions
Some generated files are not rendered by default.

proto/worker/worker.proto

Lines changed: 5 additions & 0 deletions
@@ -338,6 +338,11 @@ message CreateCheckpointRequest {
   string sandbox_id = 1;
   string checkpoint_id = 2; // UUID assigned by control plane
   bool prepare_golden = 3; // If true, prepare a golden snapshot from this checkpoint
+  // If > 0, finalize the checkpoint at this memory: the worker syncs the build
+  // VM's filesystem, cold-boots a fresh VM from a copy of its disks at
+  // finalize_memory_mb, and snapshots THAT, so the image's memory floor is
+  // decoupled from the (larger) build-phase RAM. The build VM is left running. 0 = snapshot in place.
+  int32 finalize_memory_mb = 4;
 }

 message CreateCheckpointResponse {
