scaler fixes #209
breardon2011 wants to merge 3 commits into main
motatoes approved these changes on May 7, 2026.
Scaler routing + lifecycle fixes
Closes a cluster of placement-imbalance, drain-coordination, and post-wake memory-accounting issues. Companion to #207 — that one fixed the worker-side reliability story; this one fixes the control-plane side (placement, drain, scale-up) plus a wake-path memory bug whose tail was hitting worker accounting.
All control-plane changes are pure orchestration — no `goldenVersion` change, no rootfs rebuild. The wake-path change rides any existing rootfs since it doesn't change the snapshot format.

## What changes
**Routing / placement** (`redis_registry.go`)

- Per-worker placement counts (`routing:count:*`) now carry a TTL, so stale counts expire on their own.
- Eligibility used to be `Current <= minCount+1`, so a 60-vs-59 split was round-robined with no preference for the lighter worker. Placement now scores each worker as `Current/Capacity + MemPct/100` (actual RSS, not committed) and takes the minimum, with an epsilon of `1e-9` for tie-breaks (sketched below).
- Full-worker exclusion widened from `Current == Capacity` to `Current >= Capacity` in `collectEligibleLocked`.

**Drain** (`redis_registry.go`, `admin_workers.go`)

- Drain state propagates across control planes through a `drain:{workerID}` Redis key with a 24 h TTL; `handleHeartbeat` reads and applies it, `SetDraining` writes/DELs it.

**Quota-aware scale-up** (`scaler.go`, `compute/{pool,azure,ec2}.go`, `config/config.go`, `config/keyvault.go`, `cmd/server/main.go`)

- New `compute.ErrQuotaExceeded` sentinel; Azure (7 codes) and EC2 (5 codes) wrap it on detect. The scaler retries through the ranked `MachineSizes` list, short-circuits on non-quota errors, and counts a creation failure only once the whole list is exhausted.

**Wake-path memory** (`qemu/snapshot.go`, `qemu/manager.go`)

- `vm.MemoryMB` was stamped to the ceiling without a `qom-set`, leaving the guest capped at `baseMem` with phantom committed memory. `LoadVM` and `Cont` now replug, and `vm.MemoryMB` / `vm.virtioMemRequestedMB` reflect actually plugged bytes.
- `SetResourceLimits` gates on `hostUsedMemoryMB()` (real RSS, MemTotal − MemAvailable) with a 20% safety margin, mirroring `PrepareMigrationIncoming`'s actual-memory gate.
- The `((x + 127) / 128) * 128` rounding, open-coded at 5 sites, becomes an `alignVirtioMemBlock(mb int) int` helper plus a `virtioMemBlockSizeMB` const.
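For orientation, here is a minimal sketch of the new scoring pass. `workerState`, `pickWorker`, and the exact field names are illustrative stand-ins, not the real `redis_registry.go` types; only the scoring formula, the `>=` capacity check, and the epsilon come from this PR.

```go
package controlplane

import "math"

// workerState is an illustrative stand-in for the registry's view of
// a worker; the real struct lives in redis_registry.go.
type workerState struct {
	ID       string
	Current  int     // placements currently on this worker
	Capacity int     // placement capacity
	MemPct   float64 // host memory in use (actual RSS), 0..100
	Draining bool
}

const scoreEpsilon = 1e-9 // scores closer than this count as a tie

// pickWorker returns the least-loaded eligible worker, or "" if none.
func pickWorker(workers []workerState) string {
	bestID, bestScore := "", math.MaxFloat64
	for _, w := range workers {
		if w.Draining || w.Current >= w.Capacity { // note >=, not ==
			continue
		}
		// Blend placement count with real memory pressure, so a
		// 60-vs-59 split now prefers the lighter worker.
		score := float64(w.Current)/float64(w.Capacity) + w.MemPct/100
		if score < bestScore-scoreEpsilon {
			bestID, bestScore = w.ID, score
		}
	}
	return bestID
}
```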
## Configuration knobs

The machine-size fallback ships dormant. To activate it post-merge:
- Azure: `OPENSANDBOX_AZURE_VM_SIZES` env var (CSV), or `server-azure-vm-sizes` in KV
- EC2: `OPENSANDBOX_EC2_INSTANCE_TYPES` env var (CSV), or `server-ec2-instance-types` in KV

Empty values (the merge-day state) preserve pre-PR behavior — single-attempt `CreateMachine` at the pool's configured default.
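A sketch of the loop those lists feed, following the behavior described above; `createWithFallback` and `createFn` are hypothetical names, and the local sentinel stands in for `compute.ErrQuotaExceeded`.

```go
package controlplane

import (
	"context"
	"errors"
)

// errQuotaExceeded stands in for the compute.ErrQuotaExceeded sentinel.
var errQuotaExceeded = errors.New("quota exceeded")

// createWithFallback walks the ranked size list, advancing only on
// quota errors; createFn stands in for the pool's machine creation.
func createWithFallback(ctx context.Context, sizes []string,
	createFn func(ctx context.Context, size string) error) error {
	if len(sizes) == 0 {
		sizes = []string{""} // empty config: one attempt at the pool default
	}
	var lastErr error
	for _, size := range sizes {
		err := createFn(ctx, size)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errQuotaExceeded) {
			return err // non-quota error: short-circuit, don't walk the list
		}
		lastErr = err // quota hit: try the next ranked size
	}
	// Only here, with every size exhausted, does the scaler count a
	// creation failure.
	return lastErr
}
```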
## Tests

- `internal/compute/quota_test.go`: `isAzureQuotaErr` × 11 codes, `isEC2QuotaErr` × 7 codes; wrap helpers preserve the original error chain
- `internal/config/config_test.go`: `splitCSV` × 7 edge cases; `Load` parses the fallback lists
- `internal/controlplane/scaler_test.go`: retries through the list on `ErrQuotaExceeded`; an empty list uses the pool default
- `internal/controlplane/redis_registry_test.go`: `SetDraining` writes/clears Redis with TTL; heartbeat applies an external drain key; cross-CP propagation; placement skips the drain marker (skip-on-no-Redis pattern)
- `internal/qemu/virtio_mem_test.go`: `alignVirtioMemBlock` × 13 cases incl. the 7 GB wake-replug
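For reference, the helper those 13 cases exercise presumably looks like this; a sketch consistent with the `((x + 127) / 128) * 128` expression quoted above, not a copy of the real code.

```go
// virtioMemBlockSizeMB is the 128 MiB granularity implied by the
// open-coded rounding this PR replaces.
const virtioMemBlockSizeMB = 128

// alignVirtioMemBlock rounds mb up to the next block boundary.
func alignVirtioMemBlock(mb int) int {
	return ((mb + virtioMemBlockSizeMB - 1) / virtioMemBlockSizeMB) * virtioMemBlockSizeMB
}
```

E.g. `alignVirtioMemBlock(7000)` returns 7040, while an already-aligned 7168 (7 GiB) passes through unchanged.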
Pre-existing `TestDrainTimeoutCancelsDrainKeepsWorker` failure is on `main` too (the test fixture uses `-20m` against `drainTimeout = 45m`) — unrelated, not addressed here.
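While we're in the weeds, a sketch of the `hostUsedMemoryMB()` gate from the changes list: MemTotal − MemAvailable out of `/proc/meminfo`. The parsing details here are illustrative; the real helper is in `qemu/manager.go`, with the 20% safety margin applied by its caller.

```go
package qemu

import (
	"os"
	"strconv"
	"strings"
)

// hostUsedMemoryMB returns MemTotal - MemAvailable in MiB, i.e. what
// the host is actually using rather than what guests have committed.
func hostUsedMemoryMB() (int64, error) {
	data, err := os.ReadFile("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	vals := map[string]int64{}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line) // e.g. ["MemTotal:", "16384256", "kB"]
		if len(f) >= 2 {
			if kb, err := strconv.ParseInt(f[1], 10, 64); err == nil {
				vals[strings.TrimSuffix(f[0], ":")] = kb
			}
		}
	}
	return (vals["MemTotal"] - vals["MemAvailable"]) / 1024, nil
}
```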
## Live validation

Validated end-to-end on dev (`opensandbox-prod` RG, westus2):

- TTL on `routing:count:*` keys reads ≤60 s after a placement.
- Draining a worker writes a `drain:{id}` Redis key with TTL ≈ 86 400 s; subsequent placements all land on the non-drained worker; undrain DELs the key.
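The mechanism being validated, sketched against a go-redis client (the PR doesn't name its Redis library, and these function names are illustrative):

```go
package controlplane

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9" // assumed client
)

const drainTTL = 24 * time.Hour // matches the observed ~86400 s TTL

// setDraining sketches SetDraining's Redis side: write the shared
// drain:{workerID} key with a TTL on drain, DEL it on undrain.
func setDraining(ctx context.Context, rdb *redis.Client, workerID string, draining bool) error {
	key := "drain:" + workerID
	if draining {
		return rdb.Set(ctx, key, "1", drainTTL).Err()
	}
	return rdb.Del(ctx, key).Err()
}

// drainedExternally is what a heartbeat handler consults, so a drain
// placed on one control plane reaches all of them within a heartbeat.
func drainedExternally(ctx context.Context, rdb *redis.Client, workerID string) bool {
	n, err := rdb.Exists(ctx, "drain:"+workerID).Result()
	return err == nil && n > 0
}
```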
## Backwards compat

- `meta.MemoryMB` already records the pre-hibernate ceiling, so the new replug logic Just Works on existing data.
- Deployments without Redis keep applying `SetDraining` calls via the local-mirror path. No flag day.
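To close the loop on the replug claim, a sketch of the wake path, reusing the `alignVirtioMemBlock` sketch from the Tests section. `qom-set` on a virtio-mem device's `requested-size` is standard QEMU QMP, but the device path, the `qmpClient` interface, and the surrounding flow here are all illustrative.

```go
package qemu

import "context"

// qmpClient is a minimal stand-in for the manager's QMP connection.
type qmpClient interface {
	Execute(ctx context.Context, cmd string, args map[string]any) error
}

// vmState is illustrative; the PR's real fields are vm.MemoryMB and
// vm.virtioMemRequestedMB.
type vmState struct {
	MemoryMB             int64
	virtioMemRequestedMB int64
}

// replugOnWake sketches the LoadVM/Cont fix: actually plug the
// pre-hibernate ceiling (meta.MemoryMB) instead of just stamping it.
func replugOnWake(ctx context.Context, qmp qmpClient, vm *vmState, metaMemoryMB int) error {
	targetMB := alignVirtioMemBlock(metaMemoryMB)
	if err := qmp.Execute(ctx, "qom-set", map[string]any{
		"path":     "/machine/peripheral/vmem0", // illustrative device path
		"property": "requested-size",
		"value":    int64(targetMB) * 1024 * 1024, // bytes
	}); err != nil {
		return err
	}
	// Accounting moves only after the plug succeeds, so both fields
	// reflect actually plugged bytes rather than a phantom ceiling.
	vm.MemoryMB = int64(targetMB)
	vm.virtioMemRequestedMB = int64(targetMB)
	return qmp.Execute(ctx, "cont", nil) // resume the guest
}
```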