
scaler fixes #209

Open
breardon2011 wants to merge 3 commits into main from scaler-fixes

@breardon2011 breardon2011 commented Apr 30, 2026

Scaler routing + lifecycle fixes

Closes a cluster of placement-imbalance, drain-coordination, and post-wake memory-accounting issues. Companion to #207: that PR fixed worker-side reliability; this one fixes the control-plane side (placement, drain, scale-up), plus a wake-path memory bug whose effects were reaching worker memory accounting.

All control-plane changes are pure orchestration — no goldenVersion change, no rootfs rebuild. The wake-path change rides any existing rootfs since it doesn't change the snapshot format.

What changes

| Fix | Before | After | Files |
| --- | --- | --- | --- |
| `routing:count` TTL | 15 s; expired between scheduler ticks | 60 s; bridges multiple heartbeat windows | `redis_registry.go` |
| Tie-break smoothing | `Current <= minCount+1`; 60-vs-59 split round-robined → no preference for lighter worker | strict score-based, RR only on genuine ties (eps `1e-9`) | `redis_registry.go` |
| Placement score | sandbox count only; 1×16 GB and 1×1 GB looked identical | `Current/Capacity + MemPct/100` (actual RSS, not committed) | `redis_registry.go` |
| Hard-exclude gate | CPU/mem/disk > 90% | ≥ 85%; aligns with scaler's 80% evacuation threshold; gradient handled by score below cap | `redis_registry.go` |
| Slot-capacity guard | routed onto `Current == Capacity` workers | skip when `Current >= Capacity` | `redis_registry.go` |
| Cross-region eligibility | fallback skipped `Draining` + slot checks | applies same gates as primary path | `redis_registry.go` (`collectEligibleLocked`) |
| Cross-CP drain | per-CP in-memory only; operator hits CP-A, CP-B keeps routing to it | Redis `drain:{workerID}` with 24 h TTL; `handleHeartbeat` reads + applies; `SetDraining` writes/DELs | `redis_registry.go`, `admin_workers.go` |
| Quota-aware scale-up | first quota failure aborts the launch | `compute.ErrQuotaExceeded` sentinel; Azure (7 codes) + EC2 (5 codes) wrap on detect; scaler retries through `MachineSizes` ranked list, short-circuits on non-quota errors, only counts a creation failure once the whole list is exhausted | `scaler.go`, `compute/{pool,azure,ec2}.go`, `config/config.go`, `config/keyvault.go`, `cmd/server/main.go` |
| Wake virtio-mem replug | `vm.MemoryMB` stamped to ceiling without `qom-set`; guest capped at `baseMem`, phantom committed memory | replays the plug between `LoadVM` and `Cont`; `vm.MemoryMB` and `vm.virtioMemRequestedMB` now reflect actually plugged bytes | `qemu/snapshot.go`, `qemu/manager.go` |
| Grow-gate admission | committed-memory rejection; over-reserved for idle-but-large sandboxes | `hostUsedMemoryMB()` (real RSS, `MemTotal − MemAvailable`) with 20% safety margin; mirrors `PrepareMigrationIncoming`'s actual-memory gate | `qemu/manager.go` (`SetResourceLimits`) |
| 128 MB block alignment | `((x + 127) / 128) * 128` open-coded at 5 sites | `alignVirtioMemBlock(mb int) int` helper + `virtioMemBlockSizeMB` const | `qemu/manager.go`, `qemu/snapshot.go` |
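
The placement rows above (score, hard-exclude gate, slot guard, tie-break) can be read together as one selection pass. The following is a minimal sketch under stated assumptions, not the code in `redis_registry.go`: the `Worker` struct, its field names other than `Current`, `Capacity`, `MemPct`, and `Draining`, and the `pick` function with its round-robin counter are all hypothetical.

```go
package main

import "math"

// Worker is a hypothetical view of one worker's latest heartbeat.
type Worker struct {
	ID       string
	Current  int     // sandboxes currently placed
	Capacity int     // slot capacity
	CPUPct   float64 // 0-100
	MemPct   float64 // 0-100, actual RSS rather than committed memory
	DiskPct  float64 // 0-100
	Draining bool
}

const (
	hardExcludePct = 85.0 // workers at or above this on any resource are skipped
	scoreEpsilon   = 1e-9 // scores closer than this count as a genuine tie
)

// eligible applies the drain, slot-capacity, and hard-exclude gates.
func eligible(w Worker) bool {
	if w.Draining || w.Capacity <= 0 || w.Current >= w.Capacity {
		return false
	}
	return w.CPUPct < hardExcludePct &&
		w.MemPct < hardExcludePct &&
		w.DiskPct < hardExcludePct
}

// score is Current/Capacity + MemPct/100; lower is better.
func score(w Worker) float64 {
	return float64(w.Current)/float64(w.Capacity) + w.MemPct/100
}

// pick returns the lowest-scoring eligible worker. The rr counter is
// consulted only when several workers tie within scoreEpsilon.
func pick(workers []Worker, rr uint64) (Worker, bool) {
	best := math.Inf(1)
	var tied []Worker
	for _, w := range workers {
		if !eligible(w) {
			continue
		}
		s := score(w)
		switch {
		case s < best-scoreEpsilon:
			best, tied = s, []Worker{w}
		case math.Abs(s-best) <= scoreEpsilon:
			tied = append(tied, w)
		}
	}
	if len(tied) == 0 {
		return Worker{}, false
	}
	return tied[rr%uint64(len(tied))], true
}
```

With this shape, a 60-vs-59 split at equal memory pressure always prefers the 59-sandbox worker, and the round-robin counter only matters when scores are equal within the epsilon.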

Configuration knobs

The machine-size fallback ships dormant. To activate it post-merge:

  • OPENSANDBOX_AZURE_VM_SIZES env var (CSV), or server-azure-vm-sizes in KV
  • OPENSANDBOX_EC2_INSTANCE_TYPES env var (CSV), or server-ec2-instance-types in KV

Empty values (the merge-day state) preserve pre-PR behavior — single-attempt CreateMachine at the pool's configured default.
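
As a rough illustration of how a CSV size list and the quota sentinel might combine, here is a sketch under stated assumptions. `ErrQuotaExceeded` stands in for `compute.ErrQuotaExceeded`, and `splitCSV` borrows the name of the helper mentioned in the tests; `createFunc`, `wrapProviderErr`, `isQuota`, and `launchWithFallback` are hypothetical, and the provider codes listed are placeholders rather than the PR's actual lists.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrQuotaExceeded stands in for the compute.ErrQuotaExceeded sentinel.
var ErrQuotaExceeded = errors.New("quota exceeded")

// createFunc is a hypothetical stand-in for a provider's CreateMachine
// call; an empty size means "use the pool's configured default".
type createFunc func(size string) error

// splitCSV trims whitespace and drops empty entries, so an unset
// variable yields an empty list.
func splitCSV(s string) []string {
	var out []string
	for _, p := range strings.Split(s, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// wrapProviderErr maps a provider error code to an error, wrapping the
// sentinel for quota-like codes. The codes here are illustrative only.
func wrapProviderErr(code string) error {
	switch code {
	case "QuotaExceeded", "SkuNotAvailable", "VcpuLimitExceeded":
		return fmt.Errorf("%s: %w", code, ErrQuotaExceeded)
	default:
		return errors.New(code)
	}
}

// isQuota reports whether err wraps the quota sentinel.
func isQuota(err error) bool { return errors.Is(err, ErrQuotaExceeded) }

// launchWithFallback walks the ranked size list. Quota errors move on to
// the next size; any other error stops immediately. An empty list makes
// a single attempt at the pool default.
func launchWithFallback(sizes []string, create createFunc) (string, error) {
	if len(sizes) == 0 {
		return "", create("")
	}
	var last error
	for _, size := range sizes {
		err := create(size)
		if err == nil {
			return size, nil
		}
		if !isQuota(err) {
			return "", err // non-quota failure: short-circuit
		}
		last = err
	}
	return "", fmt.Errorf("all %d sizes exhausted: %w", len(sizes), last)
}
```

A caller would pass something like `splitCSV(os.Getenv("OPENSANDBOX_AZURE_VM_SIZES"))` as the size list; with the variable unset, the list is empty and the single-attempt path runs.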

Tests

| Package | What's covered |
| --- | --- |
| `internal/compute/quota_test.go` | `isAzureQuotaErr` × 11 codes, `isEC2QuotaErr` × 7 codes, wrap helpers preserve original error chain |
| `internal/config/config_test.go` | `splitCSV` × 7 edge cases, `Load` parses fallback lists |
| `internal/controlplane/scaler_test.go` | Fallback skips quota errors, non-quota short-circuits, all-sizes-fail returns `ErrQuotaExceeded`, empty list uses pool default |
| `internal/controlplane/redis_registry_test.go` | `SetDraining` writes/clears Redis with TTL; heartbeat applies external drain key; cross-CP propagation; placement skips drain marker (skip-on-no-Redis pattern) |
| `internal/qemu/virtio_mem_test.go` | `alignVirtioMemBlock` × 13 cases incl. 7 GB wake-replug |
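
For reference, the alignment helper covered by the last row could look roughly like the sketch below. The name and signature come from the table in "What changes"; the body, and in particular the treatment of non-positive input, is an assumption.

```go
package main

// virtioMemBlockSizeMB is the virtio-mem block granularity in MB.
const virtioMemBlockSizeMB = 128

// alignVirtioMemBlock rounds mb up to the next multiple of the block
// size. Returning 0 for non-positive input is an assumption made for
// this sketch.
func alignVirtioMemBlock(mb int) int {
	if mb <= 0 {
		return 0
	}
	return ((mb + virtioMemBlockSizeMB - 1) / virtioMemBlockSizeMB) * virtioMemBlockSizeMB
}
```

An already aligned request such as 7 GB (7168 MB) passes through unchanged, while 7000 MB rounds up to 7040 MB.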

The pre-existing TestDrainTimeoutCancelsDrainKeepsWorker failure also occurs on main (the test fixture uses -20m against drainTimeout = 45m); it is unrelated and not addressed here.

Live validation

Validated end-to-end on dev (opensandbox-prod RG, westus2):

  • TTL on routing:count:* keys reads ≤60 s after a placement.
  • 4 sandboxes spread 2/2 across two equally-loaded workers (pre-fix the +1 smoothing allowed all 4 onto one).
  • Admin drain → drain:{id} Redis key with TTL ≈ 86 400 s; subsequent placements all land on the non-drained worker; undrain DELs the key.
  • 1 GB sandbox scaled to 8 GB → hibernated → woken → guest sees 8 064 MB (pre-fix would have shown ~1 GB).
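
The drain behaviour validated above can be sketched with an in-memory stand-in for Redis. Everything here is hypothetical apart from the `drain:{workerID}` key shape, the 24 h TTL, and the `SetDraining` / `handleHeartbeat` names: the `kv` interface, `memStore`, and `registry` types are invented for illustration, and having the heartbeat mirror key presence into the local flag is a simplifying assumption about how the real handler applies the key.

```go
package main

import "time"

const drainTTL = 24 * time.Hour

// kv is a hypothetical minimal subset of a Redis client.
type kv interface {
	Set(key string, ttl time.Duration)
	Del(key string)
	Exists(key string) bool
}

// memStore is an in-memory kv with expiry, standing in for Redis.
type memStore struct{ expiry map[string]time.Time }

func newMemStore() *memStore { return &memStore{expiry: map[string]time.Time{}} }

func (m *memStore) Set(key string, ttl time.Duration) { m.expiry[key] = time.Now().Add(ttl) }
func (m *memStore) Del(key string)                    { delete(m.expiry, key) }
func (m *memStore) Exists(key string) bool {
	exp, ok := m.expiry[key]
	return ok && time.Now().Before(exp)
}

func drainKey(workerID string) string { return "drain:" + workerID }

// registry models one control plane's view of the workers.
type registry struct {
	store    kv
	draining map[string]bool // local mirror consulted by placement
}

func newRegistry(store kv) *registry {
	return &registry{store: store, draining: map[string]bool{}}
}

// SetDraining updates the local mirror and writes or deletes the shared key.
func (r *registry) SetDraining(workerID string, on bool) {
	r.draining[workerID] = on
	if on {
		r.store.Set(drainKey(workerID), drainTTL)
		return
	}
	r.store.Del(drainKey(workerID))
}

// handleHeartbeat applies the shared drain key to the local mirror, so
// a drain issued on another control plane takes effect here.
func (r *registry) handleHeartbeat(workerID string) {
	r.draining[workerID] = r.store.Exists(drainKey(workerID))
}
```

Two `registry` values sharing one store behave like CP-A and CP-B: a drain set on one becomes visible on the other at that worker's next heartbeat, and an undrain removes the key again.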

Backwards compat

  • Worker binary: not touched.
  • Existing snapshots: their meta.MemoryMB already records the pre-hibernate ceiling, so the new replug logic Just Works on existing data.
  • Routing changes: read-mostly; converge within one heartbeat. Mixed-version CPs interoperate during rollout.
  • Drain: pre-PR CPs don't read the Redis key, so they continue split-braining until upgraded. Post-PR CPs interoperate with old SetDraining calls via the local-mirror path. No flag day.
  • Machine-size fallback: opt-in via env / KV. Empty = pre-PR behavior.


2027-evals Bot commented May 7, 2026

⚠️ Couldn't find a preview deployment for commit b0ff762 after 10 minutes.

2027 auto-runs evals against preview deployments of your docs. To enable this, install one of:

  • Mintlify — if you use Mintlify docs
  • Vercel — for Next.js / static sites
  • Netlify — for most static docs

Once a preview is deployed, open a new PR and we'll run the eval automatically.



@breardon2011 breardon2011 marked this pull request as ready for review May 7, 2026 19:12