scaler fixes by breardon2011 · Pull Request #209 · diggerhq/opencomputer

breardon2011 · 2026-04-30T23:45:25Z

Scaler routing + lifecycle fixes

Closes a cluster of placement-imbalance, drain-coordination, and post-wake memory-accounting issues. Companion to #207 — that one fixed the worker-side reliability story; this one fixes the control-plane side (placement, drain, scale-up) plus a wake-path memory bug whose tail was hitting worker accounting.

All control-plane changes are pure orchestration — no goldenVersion change, no rootfs rebuild. The wake-path change rides any existing rootfs since it doesn't change the snapshot format.

What changes

| Fix | Before | After | Files |
|---|---|---|---|
| `routing:count` TTL | 15 s; expired between scheduler ticks | 60 s; bridges multiple heartbeat windows | `redis_registry.go` |
| Tie-break smoothing | `Current <= minCount+1`; a 60-vs-59 split was round-robined, so the lighter worker got no preference | strict score-based; round-robin only on genuine ties (eps `1e-9`) | `redis_registry.go` |
| Placement score | sandbox count only; 1×16 GB and 1×1 GB looked identical | `Current/Capacity + MemPct/100` (actual RSS, not committed) | `redis_registry.go` |
| Hard-exclude gate | CPU/mem/disk > 90% | ≥ 85%; aligns with scaler's 80% evacuation threshold; gradient handled by score below cap | `redis_registry.go` |
| Slot-capacity guard | routed onto `Current == Capacity` workers | skip when `Current >= Capacity` | `redis_registry.go` |
| Cross-region eligibility fallback | skipped Draining + slot checks | applies same gates as primary path | `redis_registry.go` (`collectEligibleLocked`) |
| Cross-CP drain | per-CP in-memory only; operator hits CP-A, CP-B keeps routing to the drained worker | Redis `drain:{workerID}` with 24 h TTL; `handleHeartbeat` reads + applies; `SetDraining` writes/DELs | `redis_registry.go`, `admin_workers.go` |
| Quota-aware scale-up | first quota failure aborts the launch | `compute.ErrQuotaExceeded` sentinel; Azure (7 codes) + EC2 (5 codes) wrap on detect; scaler retries through `MachineSizes` ranked list, short-circuits on non-quota errors, only counts a creation failure once the whole list is exhausted | `scaler.go`, `compute/{pool,azure,ec2}.go`, `config/config.go`, `config/keyvault.go`, `cmd/server/main.go` |
| Wake virtio-mem replug | `vm.MemoryMB` stamped to ceiling without `qom-set`; guest capped at `baseMem`, phantom committed memory | replays the plug between `LoadVM` and `Cont`; `vm.MemoryMB` and `vm.virtioMemRequestedMB` now reflect actually plugged bytes | `qemu/snapshot.go`, `qemu/manager.go` |
| Grow-gate admission | committed-memory rejection; over-reserved for idle-but-large sandboxes | `hostUsedMemoryMB()` (real RSS, MemTotal − MemAvailable) with 20% safety margin; mirrors `PrepareMigrationIncoming`'s actual-memory gate | `qemu/manager.go` (`SetResourceLimits`) |
| 128 MB block alignment | `((x + 127) / 128) * 128` open-coded at 5 sites | `alignVirtioMemBlock(mb int) int` helper + `virtioMemBlockSizeMB` const | `qemu/manager.go`, `qemu/snapshot.go` |
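
To make the placement rows concrete, here is a minimal sketch of how the score, the 85% hard-exclude gate, the slot-capacity guard, and the strict tie-break could fit together. The `worker` struct, `eligible`, `score`, and `pick` are hypothetical names for illustration and are not taken from `redis_registry.go`; only the formula `Current/Capacity + MemPct/100`, the thresholds, and the `1e-9` epsilon come from the table.

```go
package main

import (
	"fmt"
	"math"
)

// worker is a hypothetical, trimmed-down view of a registry entry.
type worker struct {
	ID       string
	Current  int     // sandboxes currently placed
	Capacity int     // slot capacity
	CPUPct   float64 // host CPU utilisation, 0-100
	MemPct   float64 // actual RSS as a percentage of host memory
	DiskPct  float64 // disk utilisation, 0-100
	Draining bool
}

const (
	hardExcludePct = 85.0 // workers at or above this on any resource are skipped
	scoreEps       = 1e-9 // scores closer than this count as a genuine tie
)

// eligible applies the drain, slot-capacity, and hard-exclude gates.
func eligible(w worker) bool {
	if w.Draining || w.Capacity <= 0 || w.Current >= w.Capacity {
		return false
	}
	return w.CPUPct < hardExcludePct && w.MemPct < hardExcludePct && w.DiskPct < hardExcludePct
}

// score is lower for lighter workers: slot fill ratio plus memory pressure.
func score(w worker) float64 {
	return float64(w.Current)/float64(w.Capacity) + w.MemPct/100
}

// pick returns the lowest-scoring eligible worker. The round-robin counter rr
// is consulted only when several workers tie within scoreEps.
func pick(workers []worker, rr uint64) (worker, bool) {
	best := math.Inf(1)
	var tied []worker
	for _, w := range workers {
		if !eligible(w) {
			continue
		}
		s := score(w)
		switch {
		case s < best-scoreEps:
			best, tied = s, []worker{w}
		case math.Abs(s-best) <= scoreEps:
			tied = append(tied, w)
		}
	}
	if len(tied) == 0 {
		return worker{}, false
	}
	return tied[rr%uint64(len(tied))], true
}

func main() {
	workers := []worker{
		{ID: "a", Current: 60, Capacity: 100, MemPct: 40},
		{ID: "b", Current: 59, Capacity: 100, MemPct: 40},
		{ID: "c", Current: 10, Capacity: 100, MemPct: 90}, // excluded: mem >= 85%
	}
	if w, ok := pick(workers, 0); ok {
		fmt.Printf("placed on %s (score %.2f)\n", w.ID, score(w))
	}
}
```

In this sketch the 60-vs-59 split always resolves to the lighter worker regardless of the round-robin counter, which is the behaviour the tie-break row describes.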

Configuration knobs

The machine-size fallback ships dormant. To activate it post-merge:

`OPENSANDBOX_AZURE_VM_SIZES` env var (CSV), or `server-azure-vm-sizes` in KV
`OPENSANDBOX_EC2_INSTANCE_TYPES` env var (CSV), or `server-ec2-instance-types` in KV

Empty values (the merge-day state) preserve pre-PR behavior — single-attempt CreateMachine at the pool's configured default.
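
The fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `scaler.go`: `createFunc`, `createWithFallback`, and `isQuota` are hypothetical names, the local `ErrQuotaExceeded` stands in for `compute.ErrQuotaExceeded`, and the size strings in `main` are example values only.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQuotaExceeded stands in for the compute.ErrQuotaExceeded sentinel.
var ErrQuotaExceeded = errors.New("quota exceeded")

// createFunc is a hypothetical stand-in for the pool's CreateMachine call.
// An empty size means "use the pool's configured default".
type createFunc func(size string) (string, error)

// isQuota reports whether err wraps the quota sentinel.
func isQuota(err error) bool { return errors.Is(err, ErrQuotaExceeded) }

// createWithFallback walks the ranked size list. Quota errors move on to the
// next size, any other error aborts immediately, and an empty list falls back
// to a single attempt at the pool default.
func createWithFallback(sizes []string, create createFunc) (string, error) {
	if len(sizes) == 0 {
		return create("")
	}
	var lastErr error
	for _, size := range sizes {
		id, err := create(size)
		if err == nil {
			return id, nil
		}
		if !isQuota(err) {
			return "", err // non-quota errors short-circuit
		}
		lastErr = err
	}
	// Only now would the caller count a creation failure.
	return "", fmt.Errorf("all %d sizes exhausted: %w", len(sizes), lastErr)
}

func main() {
	// Example values; the real list comes from the env var or Key Vault.
	sizes := []string{"Standard_D16s_v5", "Standard_D8s_v5"}
	id, err := createWithFallback(sizes, func(size string) (string, error) {
		if size == "Standard_D16s_v5" {
			return "", fmt.Errorf("create %s: %w", size, ErrQuotaExceeded)
		}
		return "vm-" + size, nil
	})
	fmt.Println(id, err)
}
```

Wrapping with `%w` keeps the sentinel reachable through `errors.Is`, so a caller can still distinguish an exhausted list from an unrelated provider error.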

Tests

| Package | What's covered |
|---|---|
| `internal/compute/quota_test.go` | `isAzureQuotaErr` × 11 codes, `isEC2QuotaErr` × 7 codes, wrap helpers preserve original error chain |
| `internal/config/config_test.go` | `splitCSV` × 7 edge cases, `Load` parses fallback lists |
| `internal/controlplane/scaler_test.go` | Fallback skips quota errors, non-quota short-circuits, all-sizes-fail returns `ErrQuotaExceeded`, empty list uses pool default |
| `internal/controlplane/redis_registry_test.go` | `SetDraining` writes/clears Redis with TTL; heartbeat applies external drain key; cross-CP propagation; placement skips drain marker (skip-on-no-Redis pattern) |
| `internal/qemu/virtio_mem_test.go` | `alignVirtioMemBlock` × 13 cases incl. 7 GB wake-replug |
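
For reference, the alignment helper covered by the last row can be sketched from the open-coded expression `((x + 127) / 128) * 128` quoted earlier. The signature and constant name match the PR description; the body is a reconstruction from that expression and may differ from the real helper, for example in how it treats non-positive input.

```go
package main

import "fmt"

// virtioMemBlockSizeMB is the virtio-mem block size in MB named in the PR.
const virtioMemBlockSizeMB = 128

// alignVirtioMemBlock rounds mb up to the next multiple of the block size,
// using the same integer arithmetic as the open-coded expression.
func alignVirtioMemBlock(mb int) int {
	return ((mb + virtioMemBlockSizeMB - 1) / virtioMemBlockSizeMB) * virtioMemBlockSizeMB
}

func main() {
	for _, mb := range []int{0, 1, 128, 129, 1000, 7168} {
		fmt.Printf("%d MB -> %d MB\n", mb, alignVirtioMemBlock(mb))
	}
}
```

A 7 GB request (7168 MB) is already a multiple of 128 and passes through unchanged, while 1000 MB rounds up to 1024 MB.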

The pre-existing `TestDrainTimeoutCancelsDrainKeepsWorker` failure also occurs on `main` (the test fixture uses `-20m` against `drainTimeout = 45m`); it is unrelated to this PR and not addressed here.

Live validation

Validated end-to-end on dev (opensandbox-prod RG, westus2):

TTL on routing:count:* keys reads ≤60 s after a placement.
4 sandboxes spread 2/2 across two equally-loaded workers (pre-fix the +1 smoothing allowed all 4 onto one).
Admin drain → drain:{id} Redis key with TTL ≈ 86 400 s; subsequent placements all land on the non-drained worker; undrain DELs the key.
1 GB sandbox scaled to 8 GB → hibernated → woken → guest sees 8 064 MB (pre-fix would have shown ~1 GB).
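
The drain step in the validation above follows a simple shared-marker pattern, sketched here with an in-memory store so it runs without Redis. Everything in the block is hypothetical scaffolding: `ttlStore` stands in for the Redis set-with-expiry, delete, and exists operations, and `setDraining` / `isDraining` only approximate what `SetDraining` and `handleHeartbeat` are described as doing. The `drain:{workerID}` key shape and the 24 h TTL come from the PR text.

```go
package main

import "fmt"

// drainTTLSeconds is the 24 h TTL from the PR description (86 400 s).
const drainTTLSeconds = 24 * 60 * 60

// drainKey builds the shared marker key for a worker.
func drainKey(workerID string) string { return "drain:" + workerID }

// ttlStore is a hypothetical in-memory stand-in for Redis. Times are Unix
// seconds passed in by the caller so the sketch stays deterministic.
type ttlStore struct {
	expiresAt map[string]int64
}

func newTTLStore() *ttlStore { return &ttlStore{expiresAt: make(map[string]int64)} }

func (s *ttlStore) setWithTTL(key string, ttl, now int64) { s.expiresAt[key] = now + ttl }

func (s *ttlStore) del(key string) { delete(s.expiresAt, key) }

func (s *ttlStore) exists(key string, now int64) bool {
	exp, ok := s.expiresAt[key]
	return ok && now < exp
}

// setDraining writes the marker on drain and deletes it on undrain.
func setDraining(s *ttlStore, workerID string, draining bool, now int64) {
	if draining {
		s.setWithTTL(drainKey(workerID), drainTTLSeconds, now)
		return
	}
	s.del(drainKey(workerID))
}

// isDraining is what any control plane would consult on heartbeat, so a drain
// issued through one control plane is visible to the others.
func isDraining(s *ttlStore, workerID string, now int64) bool {
	return s.exists(drainKey(workerID), now)
}

func main() {
	shared := newTTLStore()
	setDraining(shared, "worker-1", true, 0)
	fmt.Println("draining after drain:", isDraining(shared, "worker-1", 60))
	setDraining(shared, "worker-1", false, 120)
	fmt.Println("draining after undrain:", isDraining(shared, "worker-1", 180))
}
```

The TTL acts as a safety net in this sketch: a marker that is never explicitly cleared still expires after a day instead of excluding the worker indefinitely.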

Backwards compat

Worker binary: not touched.
Existing snapshots: their meta.MemoryMB already records the pre-hibernate ceiling, so the new replug logic Just Works on existing data.
Routing changes: read-mostly; converge within one heartbeat. Mixed-version CPs interoperate during rollout.
Drain: pre-PR CPs don't read the Redis key, so they continue split-braining until upgraded. Post-PR CPs interoperate with old SetDraining calls via the local-mirror path. No flag day.
Machine-size fallback: opt-in via env / KV. Empty = pre-PR behavior.

2027-evals · 2026-05-07T17:34:43Z

⚠️ Couldn't find a preview deployment for commit b0ff762 after 10 minutes.

2027 auto-runs evals against preview deployments of your docs. To enable this, install one of:

Mintlify — if you use Mintlify docs
Vercel — for Next.js / static sites
Netlify — for most static docs

Once a preview is deployed, open a new PR and we'll run the eval automatically.


breardon2011 added 3 commits April 30, 2026 16:44

scaler fixes

e48d12d

Merge remote-tracking branch 'origin/main' into scaler-fixes

a4c330b

merge main

b0ff762

breardon2011 marked this pull request as ready for review May 7, 2026 19:12

motatoes approved these changes May 7, 2026

breardon2011 wants to merge 3 commits into `main` from `scaler-fixes`
