Commit 677a425

BiomeOS Developer and cursoragent committed
S274: Glacial horizon — yield-to-owner dispatch (max_guest_load enforcement)
Evolve max_guest_load from types-only (S269) to enforced. check_guest_load() in ResourceOrchestrator::check_quota() branches on YieldStrategy:

- Queue: GuestLoadExceeded error, caller should retry after load drops
- Reject: immediate rejection when GPU workloads >= max_concurrent_gpu
- DeferUntilPowerCycle: error with retry-after-power-cycle guidance

Add GuestLoadExceeded error variant (distinct from QuotaExceeded for yield-to-owner semantics). Re-export GuestLoadPolicy and YieldStrategy from crate root. Server dispatch wiring deferred pending flockGate spec.

10 new tests: strategy enforcement, serde roundtrip, release-reallocation, default validation, under-threshold pass, unlimited (None).

9,140 lib tests, 88 JSON-RPC methods, 0 clippy warnings, deny clean.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent 78bb9ae commit 677a425

8 files changed

Lines changed: 332 additions & 9 deletions

CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -5,7 +5,18 @@ All notable changes to ToadStool will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased] - May 24, 2026 (Sessions 43-273)
+## [Unreleased] - May 24, 2026 (Sessions 43-274)
+
+### Session S274 (May 24, 2026) — Glacial Horizon: Yield-to-Owner Dispatch
+
+primalSpring glacial horizon response: implement `max_guest_load` yield semantics for shared-hardware covalent deployments.
+
+- ADDED: `check_guest_load()` enforcement in `ResourceOrchestrator::check_quota()` — branches on `YieldStrategy` (Queue, Reject, DeferUntilPowerCycle)
+- ADDED: `GuestLoadExceeded` error variant in `OrchestrationError` — distinct from `QuotaExceeded` for yield-to-owner semantics
+- ADDED: `GuestLoadPolicy` and `YieldStrategy` re-exported from `toadstool-runtime-orchestration` crate root
+- ADDED: 10 new tests — strategy enforcement (reject, queue, defer), under-threshold pass, unlimited (None), release-reallocation, default strategy validation, serde roundtrip, wire name verification
+- DEFERRED: Server dispatch wiring (`ResourceOrchestrator` → `DispatchHandler`) pending flockGate integration spec from upstream
+- METRICS: 88 JSON-RPC methods, 9,140 lib tests, 0 clippy warnings, deny clean

 ### Session S273 (May 24, 2026) — Deep Debt Evolution: Panic Surface, Refactoring, Capability Discovery

DEBT.md

Lines changed: 10 additions & 1 deletion
@@ -1,13 +1,22 @@
 # Active Technical Debt Register

-**Date**: May 2026 — S273
+**Date**: May 2026 — S274
 **Philosophy**: Math is universal, precision is silicon. Workarounds are
 short-term solutions that increase debt. We aim to solve deep debt over
 iterations, evolving toward vendor-agnostic, capability-based solutions—
 with production stubs surfacing typed configuration errors and capability
 guidance, and auth policy driven by explicit environment configuration
 where applicable.

+**S274 (Glacial Horizon: Yield-to-Owner Dispatch)**:
+`max_guest_load` yield semantics evolved from types-only (S269) to enforced:
+`check_guest_load()` in `ResourceOrchestrator::check_quota()` branches on
+`YieldStrategy` (Queue, Reject, DeferUntilPowerCycle). `GuestLoadExceeded`
+error variant added. `GuestLoadPolicy` and `YieldStrategy` re-exported from
+crate root. 10 new tests (strategy enforcement, serde roundtrip, release-
+reallocation, default validation). Server dispatch wiring deferred pending
+flockGate integration spec. 9,140 lib tests, 88 JSON-RPC methods, 0 clippy.
+
 **S273 (Deep Debt Evolution — Panic Surface, Refactoring, Capability Discovery)**:
 Production panic surface eliminated: 29 `.unwrap()` in kernel_health.rs ELF
 parsing → `?` with `KernelHealthError::ElfParse`; `.expect("just inserted")`

NEXT_STEPS.md

Lines changed: 5 additions & 5 deletions
@@ -1,9 +1,9 @@
 # ToadStool -- Next Steps

-**Updated**: May 2026 — S273 (Deep Debt Evolution: production panic surface eliminated, dispatch/sovereign.rs extraction, warm_init module split, CLI capability-based discovery migration, VFIO activity tracking wired. 88 JSON-RPC methods. 9,131+ lib tests. 700 cylinder tests.)
-**Status**: Production-grade | Rust edition **2024** (MSRV 1.85) | **AGPL-3.0-or-later** | **All quality gates green** | tests verified (23,000+ workspace, 0 failures; 9,131+ lib-only) | **88 JSON-RPC methods** | Wire Standard L3 (partial) | Zero C FFI deps (ecoBin v3.0) | **Zero production panics/expects** | **Zero production TODO/FIXME/HACK** | **Zero production unreachable!()** | IPC-first | workspace `unsafe_code = "deny"`, **41 crates `forbid`** | **46 unsafe blocks** (all in hw containment, all SAFETY-documented) | **rustix 1.x workspace-wide** | **capability-based primal references (no hardcoded names)** | **`async-trait` DEPRECATED** (banned in `deny.toml`) | **`deny.toml` ring + async-trait + zstd-sys bans active** | **Phase C complete — all blocking items resolved (S253)** | **Phase D dispatch live — QMD-based VFIO PBDMA dispatch wired (S258–S263)** | **`OwnedFd` VFIO fd ownership (S253)** | **`toadstool device` CLI (S253)** | **CORALREEF_* env vars deprecated with TOADSTOOL_* primaries (S253)** | **Zero `#[allow(deprecated)]` remaining** | **700 cylinder tests** | **E2E sovereign dispatch VALIDATED on Titan V (warm handoff)**
-**Latest**: S273 — **Deep Debt Evolution**: Production panic surface eliminated — 29 `unwrap()` in `kernel_health.rs` → error propagation, dispatch cache `.expect()` → `Result`, 5 `.expect()` in `ember_client.rs` → `?`, 2 fallible `Default` impls removed from `secure_enclave`. `dispatch/mod.rs` 1,638→839L via 7 sovereign handlers extracted to `dispatch/sovereign.rs` (814L). `warm_init.rs` 1,439L → module dir (`mod.rs` + `seeders.rs` + `trials.rs`). 6 CLI `well_known::*` sites migrated to capability-based discovery with legacy fallback. `activity_tracker().record()` wired into 7 VFIO dispatch paths. hw-safe abstractions validated; cylinder migration deferred.
-**Previous**: S268 — Kernel Health Preflight. S267 — Sovereign driver rotation. S266 — PLX keepalive root cause fix. S265r — Driver Lab + Containment. S264 — PCIe bridge keepalive. S263 — CPUCTL_ALIAS breakthrough, GR context scheduler, warm handoff on Titan V.
+**Updated**: May 2026 — S274 (Glacial Horizon: `max_guest_load` yield semantics enforced — `check_guest_load()` branches on Queue/Reject/DeferUntilPowerCycle. `GuestLoadExceeded` error. 10 new tests. 88 JSON-RPC methods. 9,140+ lib tests.)
+**Status**: Production-grade | Rust edition **2024** (MSRV 1.85) | **AGPL-3.0-or-later** | **All quality gates green** | tests verified (23,000+ workspace, 0 failures; 9,140+ lib-only) | **88 JSON-RPC methods** | Wire Standard L3 (partial) | Zero C FFI deps (ecoBin v3.0) | **Zero production panics/expects** | **Zero production TODO/FIXME/HACK** | **Zero production unreachable!()** | IPC-first | workspace `unsafe_code = "deny"`, **41 crates `forbid`** | **46 unsafe blocks** (all in hw containment, all SAFETY-documented) | **rustix 1.x workspace-wide** | **capability-based primal references (no hardcoded names)** | **`async-trait` DEPRECATED** (banned in `deny.toml`) | **`deny.toml` ring + async-trait + zstd-sys bans active** | **Phase C complete — all blocking items resolved (S253)** | **Phase D dispatch live — QMD-based VFIO PBDMA dispatch wired (S258–S263)** | **`OwnedFd` VFIO fd ownership (S253)** | **`toadstool device` CLI (S253)** | **CORALREEF_* env vars deprecated with TOADSTOOL_* primaries (S253)** | **Zero `#[allow(deprecated)]` remaining** | **700 cylinder tests** | **E2E sovereign dispatch VALIDATED on Titan V (warm handoff)**
+**Latest**: S274 — **Glacial Horizon: Yield-to-Owner Dispatch**: `max_guest_load` yield semantics evolved from types-only (S269) to enforced. `check_guest_load()` in `ResourceOrchestrator::check_quota()` branches on `YieldStrategy` (Queue, Reject, DeferUntilPowerCycle). `GuestLoadExceeded` error variant. `GuestLoadPolicy`/`YieldStrategy` re-exported from crate root. 10 new tests. Server dispatch wiring deferred pending flockGate integration spec.
+**Previous**: S273 — Deep Debt Evolution: panic surface, refactoring, capability discovery. S268 — Kernel Health Preflight. S267 — Sovereign driver rotation. S266 — PLX keepalive. S265r — Driver Lab. S264 — PCIe bridge keepalive. S263 — warm handoff on Titan V.

 ---

@@ -105,7 +105,7 @@ names directly. Deprecated API definitions retained for backward compatibility o
 | Item | Priority | Status |
 |------|----------|--------|
 | `compute.fan_out` at scale — Tenaillon 590 GB batch | MEDIUM | **RE-IMPLEMENTED** (S269) — handler, types, 10 tests, wire L3, semantic aliases. strandGate graph design pending upstream spec. |
-| `max_guest_load` yield semantics — power-cycle scheduling for flockGate | LOW | **TYPES SHIPPED** (S269) — `GuestLoadPolicy` + `YieldStrategy` on `TenantQuota`. Orchestrator enforcement pending flockGate integration spec. |
+| `max_guest_load` yield semantics — power-cycle scheduling for flockGate | LOW | **ENFORCED** (S274) — `check_guest_load()` branches on `YieldStrategy` (Queue/Reject/DeferUntilPowerCycle). `GuestLoadExceeded` error. 10 tests. Server dispatch wiring pending flockGate integration spec. |

 ### Key Remaining Items (S268)

crates/runtime/orchestration/src/error.rs

Lines changed: 10 additions & 0 deletions
@@ -30,6 +30,16 @@ pub enum OrchestrationError {
     #[error("Quota exceeded: {0}")]
     QuotaExceeded(String),

+    /// Guest load exceeds threshold — workload yielded to owner.
+    ///
+    /// Returned when `max_guest_load` policy is active and the current
+    /// GPU-bound workload count exceeds `max_concurrent_gpu`. The yield
+    /// strategy determines the action: `Queue` defers, `Reject` fails
+    /// immediately, `DeferUntilPowerCycle` waits for a host power-cycle
+    /// window to complete.
+    #[error("Guest load exceeded: {0}")]
+    GuestLoadExceeded(String),
+
     /// Internal lock was poisoned by a prior panic.
     #[error("Internal lock poisoned: {0}")]
     LockPoisoned(String),
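
All three strategies surface as the same `GuestLoadExceeded(String)` variant, so a caller that wants to act differently per strategy can only tell them apart by message text, as the new tests do with `err.contains("strategy: ...")`. The sketch below shows one way a caller might map errors to a follow-up action. The two-variant `OrchestrationError`, the `CallerAction` enum and `classify()` are hypothetical stand-ins written for this example, not part of the crate, and matching on message text is brittle by nature.

```rust
use std::fmt;

/// Stand-in for the crate's error type (assumed shape, two variants only).
#[derive(Debug)]
pub enum OrchestrationError {
    QuotaExceeded(String),
    GuestLoadExceeded(String),
}

impl fmt::Display for OrchestrationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Same prefixes as the `#[error(...)]` attributes in the diff.
        match self {
            Self::QuotaExceeded(m) => write!(f, "Quota exceeded: {m}"),
            Self::GuestLoadExceeded(m) => write!(f, "Guest load exceeded: {m}"),
        }
    }
}

/// Hypothetical caller-side decision, not defined anywhere in the commit.
#[derive(Debug, PartialEq, Eq)]
pub enum CallerAction {
    RetryAfterLoadDrops,
    RetryAfterPowerCycle,
    Fail,
}

/// Map an error to a follow-up action by inspecting the strategy tag
/// embedded in the message. Anything unrecognised is treated as a failure.
pub fn classify(err: &OrchestrationError) -> CallerAction {
    match err {
        OrchestrationError::GuestLoadExceeded(m) if m.contains("strategy: queue") => {
            CallerAction::RetryAfterLoadDrops
        }
        OrchestrationError::GuestLoadExceeded(m)
            if m.contains("strategy: defer_until_power_cycle") =>
        {
            CallerAction::RetryAfterPowerCycle
        }
        _ => CallerAction::Fail,
    }
}
```

If per-strategy handling becomes a real requirement once server dispatch is wired, carrying the strategy as structured data in the variant would avoid the string match; that is a design option, not something this commit does.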

crates/runtime/orchestration/src/lib.rs

Lines changed: 2 additions & 2 deletions
@@ -87,8 +87,8 @@ pub use orchestrator::{
 };
 pub use policy::SelectionPolicy;
 pub use resource_orchestrator::{
-    AvailableDevice, DeploymentModel, ResourceAllocation, ResourceOrchestrator, ResourceRequest,
-    TenantQuota, TenantUsage,
+    AvailableDevice, DeploymentModel, GuestLoadPolicy, ResourceAllocation, ResourceOrchestrator,
+    ResourceRequest, TenantQuota, TenantUsage, YieldStrategy,
 };
 pub use scheduler::{ExecutionSchedule, ScheduledTask, SchedulingStrategy, WorkloadScheduler};
 pub use workload_health::{

crates/runtime/orchestration/src/resource_orchestrator.rs

Lines changed: 38 additions & 0 deletions
@@ -445,10 +445,48 @@ impl ResourceOrchestrator {
                     request.tenant_id, quota.max_devices
                 )));
             }
+
+            if let Some(ref policy) = quota.max_guest_load {
+                self.check_guest_load(request, current, policy)?;
+            }
         }

         Ok(())
     }
+
+    /// Check guest-load policy. When GPU-bound workloads exceed the
+    /// threshold, apply the configured yield strategy.
+    fn check_guest_load(
+        &self,
+        request: &ResourceRequest,
+        current: &TenantUsage,
+        policy: &GuestLoadPolicy,
+    ) -> Result<(), OrchestrationError> {
+        let gpu_workloads = u32::try_from(current.device_allocations.len()).unwrap_or(u32::MAX);
+        if gpu_workloads < policy.max_concurrent_gpu {
+            return Ok(());
+        }
+
+        match policy.yield_strategy {
+            YieldStrategy::Reject => Err(OrchestrationError::GuestLoadExceeded(format!(
+                "Tenant {} rejected: {} GPU workloads >= max_concurrent_gpu {} (strategy: reject)",
+                request.tenant_id, gpu_workloads, policy.max_concurrent_gpu
+            ))),
+            YieldStrategy::Queue => Err(OrchestrationError::GuestLoadExceeded(format!(
+                "Tenant {} queued: {} GPU workloads >= max_concurrent_gpu {} (strategy: queue — \
+                 caller should retry after load drops)",
+                request.tenant_id, gpu_workloads, policy.max_concurrent_gpu
+            ))),
+            YieldStrategy::DeferUntilPowerCycle => {
+                Err(OrchestrationError::GuestLoadExceeded(format!(
+                    "Tenant {} deferred: {} GPU workloads >= max_concurrent_gpu {} \
+                     (strategy: defer_until_power_cycle — caller should retry after \
+                     host power-cycle window)",
+                    request.tenant_id, gpu_workloads, policy.max_concurrent_gpu
+                )))
+            }
+        }
+    }
 }

 #[cfg(test)]
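
The boundary in `check_guest_load()` is worth spelling out: the guard passes only while the current allocation count is strictly below `max_concurrent_gpu`, so a request is refused as soon as the tenant already holds that many allocations. A minimal, dependency-free sketch of the same logic follows. The simplified `GuestLoadPolicy`, the plain `usize` count parameter and the `String` error are stand-ins chosen for this example and do not match the crate's real signatures.

```rust
use std::convert::TryFrom;

/// Simplified stand-ins for the crate's types (assumed shapes, illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum YieldStrategy {
    Queue,
    Reject,
    DeferUntilPowerCycle,
}

#[derive(Debug, Clone, Copy)]
pub struct GuestLoadPolicy {
    pub max_concurrent_gpu: u32,
    pub yield_strategy: YieldStrategy,
}

/// Mirrors the guard in the diff: pass while strictly under the limit,
/// otherwise return an error whose text names the configured strategy.
pub fn check_guest_load(active_allocations: usize, policy: &GuestLoadPolicy) -> Result<(), String> {
    // Saturate instead of panicking if the count does not fit in a u32.
    let gpu_workloads = u32::try_from(active_allocations).unwrap_or(u32::MAX);
    if gpu_workloads < policy.max_concurrent_gpu {
        return Ok(());
    }

    let strategy = match policy.yield_strategy {
        YieldStrategy::Queue => "queue",
        YieldStrategy::Reject => "reject",
        YieldStrategy::DeferUntilPowerCycle => "defer_until_power_cycle",
    };
    Err(format!(
        "{} GPU workloads >= max_concurrent_gpu {} (strategy: {})",
        gpu_workloads, policy.max_concurrent_gpu, strategy
    ))
}
```

With `max_concurrent_gpu: 1`, a count of 0 passes and a count of 1 is refused, which is the sequence the strategy tests exercise with two back-to-back `allocate()` calls. A policy with `max_concurrent_gpu: 0` refuses every request under this guard.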

crates/runtime/orchestration/src/resource_orchestrator_tests.rs

Lines changed: 168 additions & 0 deletions
@@ -150,3 +150,171 @@ fn test_free_vram_saturates() {
     };
     assert_eq!(dev.free_vram_bytes(), 0);
 }
+
+#[test]
+fn test_guest_load_reject_strategy() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-a",
+        TenantQuota {
+            max_guest_load: Some(GuestLoadPolicy {
+                max_concurrent_gpu: 1,
+                yield_strategy: YieldStrategy::Reject,
+            }),
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-a", 3);
+    let _alloc1 = orch.allocate(&req).unwrap();
+    let result = orch.allocate(&req);
+    assert!(result.is_err());
+    let err = result.unwrap_err().to_string();
+    assert!(err.contains("strategy: reject"), "got: {err}");
+}
+
+#[test]
+fn test_guest_load_queue_strategy() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-b",
+        TenantQuota {
+            max_guest_load: Some(GuestLoadPolicy {
+                max_concurrent_gpu: 1,
+                yield_strategy: YieldStrategy::Queue,
+            }),
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-b", 3);
+    let _alloc1 = orch.allocate(&req).unwrap();
+    let result = orch.allocate(&req);
+    assert!(result.is_err());
+    let err = result.unwrap_err().to_string();
+    assert!(err.contains("strategy: queue"), "got: {err}");
+}
+
+#[test]
+fn test_guest_load_defer_power_cycle_strategy() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-c",
+        TenantQuota {
+            max_guest_load: Some(GuestLoadPolicy {
+                max_concurrent_gpu: 1,
+                yield_strategy: YieldStrategy::DeferUntilPowerCycle,
+            }),
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-c", 3);
+    let _alloc1 = orch.allocate(&req).unwrap();
+    let result = orch.allocate(&req);
+    assert!(result.is_err());
+    let err = result.unwrap_err().to_string();
+    assert!(
+        err.contains("strategy: defer_until_power_cycle"),
+        "got: {err}"
+    );
+}
+
+#[test]
+fn test_guest_load_under_threshold_passes() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-d",
+        TenantQuota {
+            max_guest_load: Some(GuestLoadPolicy {
+                max_concurrent_gpu: 2,
+                yield_strategy: YieldStrategy::Reject,
+            }),
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-d", 3);
+    let alloc1 = orch.allocate(&req).unwrap();
+    assert!(!alloc1.exclusive);
+}
+
+#[test]
+fn test_guest_load_none_means_unlimited() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-e",
+        TenantQuota {
+            max_guest_load: None,
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-e", 3);
+    let _alloc1 = orch.allocate(&req).unwrap();
+    let alloc2 = orch.allocate(&req);
+    assert!(alloc2.is_ok());
+}
+
+#[test]
+fn test_guest_load_release_allows_reallocation() {
+    let orch = ResourceOrchestrator::new(DeploymentModel::LocalMulti, two_gpu_devices());
+    orch.register_tenant(
+        "guest-f",
+        TenantQuota {
+            max_guest_load: Some(GuestLoadPolicy {
+                max_concurrent_gpu: 1,
+                yield_strategy: YieldStrategy::Reject,
+            }),
+            ..Default::default()
+        },
+    )
+    .unwrap();
+
+    let req = test_request("guest-f", 3);
+    let alloc1 = orch.allocate(&req).unwrap();
+    let result = orch.allocate(&req);
+    assert!(result.is_err());
+
+    orch.release("guest-f", alloc1.device_index).unwrap();
+    let alloc2 = orch.allocate(&req);
+    assert!(alloc2.is_ok());
+}
+
+#[test]
+fn test_guest_load_default_strategy_is_queue() {
+    assert_eq!(YieldStrategy::default(), YieldStrategy::Queue);
+}
+
+#[test]
+fn test_guest_load_policy_serde_roundtrip() {
+    let policy = GuestLoadPolicy {
+        max_concurrent_gpu: 4,
+        yield_strategy: YieldStrategy::DeferUntilPowerCycle,
+    };
+    let json = serde_json::to_string(&policy).unwrap();
+    let parsed: GuestLoadPolicy = serde_json::from_str(&json).unwrap();
+    assert_eq!(parsed.max_concurrent_gpu, 4);
+    assert_eq!(parsed.yield_strategy, YieldStrategy::DeferUntilPowerCycle);
+}
+
+#[test]
+fn test_yield_strategy_serde_names() {
+    assert_eq!(
+        serde_json::to_string(&YieldStrategy::Queue).unwrap(),
+        "\"queue\""
+    );
+    assert_eq!(
+        serde_json::to_string(&YieldStrategy::Reject).unwrap(),
+        "\"reject\""
+    );
+    assert_eq!(
+        serde_json::to_string(&YieldStrategy::DeferUntilPowerCycle).unwrap(),
+        "\"defer_until_power_cycle\""
+    );
+}
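
The last two tests pin a contract rather than behaviour: the default strategy is `Queue`, and the wire names are `queue`, `reject` and `defer_until_power_cycle`. The type itself shipped in S269 and its derive attributes are not part of this diff, so the std-only sketch below reproduces the same mapping by hand instead of guessing at the serde configuration. The `wire_name()` method and the `FromStr` impl are hypothetical helpers written for this example.

```rust
use std::str::FromStr;

/// Stand-in for the crate's `YieldStrategy` (assumed shape; the real type
/// is serialized through serde, which this sketch deliberately avoids).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum YieldStrategy {
    /// Default, as asserted by `test_guest_load_default_strategy_is_queue`.
    #[default]
    Queue,
    Reject,
    DeferUntilPowerCycle,
}

impl YieldStrategy {
    /// Wire name as pinned by `test_yield_strategy_serde_names`.
    pub fn wire_name(self) -> &'static str {
        match self {
            Self::Queue => "queue",
            Self::Reject => "reject",
            Self::DeferUntilPowerCycle => "defer_until_power_cycle",
        }
    }
}

impl FromStr for YieldStrategy {
    type Err = String;

    /// Parse a wire name back into a strategy; matching is case-sensitive.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "queue" => Ok(Self::Queue),
            "reject" => Ok(Self::Reject),
            "defer_until_power_cycle" => Ok(Self::DeferUntilPowerCycle),
            other => Err(format!("unknown yield strategy: {other}")),
        }
    }
}
```

Pinning the names in a test means a later rename of a variant breaks the build of the test suite rather than silently changing what goes over the wire.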
