Commit 162470b

feat: diesel engine driver sketch — GrInitSequence, WarmStateCapture, DriverProbe, PmuBootstrap, shared nv::pri
New cylinder modules for sovereign GPU driver replacement:
- nv::pri: shared PRI fault detection (eliminates 4 duplicates)
- nv::gr_init: GrInitSequence capture, replay, and validation
- nv::driver_probe: multi-driver comparison lab (FalconState, TrialResult)
- nv::pmu_init: Kepler PMU falcon bootstrap
- vfio::warm_capture: cold/warm snapshot pipeline
- PlxKeepalive/PlxGuardian for PLX D3cold prevention
- sovereign_init kepler_pgraph_ungate rewired to GrInitSequence::apply()
- ChipFamily::from_sm delegates to GenerationProfile

585+ tests pass, zero lint errors.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent cf7e212 commit 162470b

37 files changed

Lines changed: 5699 additions & 123 deletions

NEXT_STEPS.md

Lines changed: 6 additions & 4 deletions
@@ -1,9 +1,9 @@
 # ToadStool -- Next Steps

-**Updated**: May 2026 — S263 (FECS CPUCTL_ALIAS breakthrough, GR context scheduler, warm handoff validated on Titan V)
+**Updated**: May 2026 — S266 (PlxKeepalive: continuous config space heartbeat for PLX-bridged devices. Root cause: PLX D3cold from **inactivity**, not swap events. ember `PlxKeepalive` + glowplug `PlxGuardian` fleet manager. 98 ember tests, 95 glowplug tests.)
 **Status**: Production-grade | Rust edition **2024** (MSRV 1.85) | **AGPL-3.0-or-later** | **All quality gates green** | tests verified (22,900+ workspace, 0 failures; 8,849+ lib-only) | **83 JSON-RPC methods** | Wire Standard L3 (partial) | Zero C FFI deps (ecoBin v3.0) | **Zero production panics/expects** | **Zero production TODO/FIXME/HACK** | **Zero production unreachable!()** | IPC-first | workspace `unsafe_code = "deny"`, **41 crates `forbid`** | **46 unsafe blocks** (all in hw containment, all SAFETY-documented) | **rustix 1.x workspace-wide** | **capability-based primal references (no hardcoded names)** | **`async-trait` DEPRECATED** (banned in `deny.toml`) | **`deny.toml` ring + async-trait + zstd-sys bans active** | **Phase C complete — all blocking items resolved (S253)** | **Phase D dispatch live — QMD-based VFIO PBDMA dispatch wired (S258–S263)** | **`OwnedFd` VFIO fd ownership (S253)** | **`toadstool device` CLI (S253)** | **CORALREEF_* env vars deprecated with TOADSTOOL_* primaries (S253)** | **Zero `#[allow(deprecated)]` remaining** | **520 cylinder tests** | **E2E sovereign dispatch VALIDATED on Titan V (warm handoff)**
-**Latest**: S263 — **CPUCTL_ALIAS breakthrough**: discovered Volta HS falcons security-lock `CPUCTL` (0x100), always reading 0x10 (false HRESET). All probes migrated to `CPUCTL_ALIAS` (0x130) which reveals true running state. FECS confirmed alive throughout warm handoff. GR context buffer allocation + scheduler cycle (`resubmit_runlist`). NvGspBridge HS boot with corrected FBIF/DMACTL. Full e2e dispatch pipeline validated: warm handoff → VFIO open → channel create → DMA roundtrip → GR init. Current frontier: FECS PENDING_CTX_RELOAD (golden context mapping from VRAM).
-**Previous**: S262 — `device.gr.init` IPC. S261 — deep debt sweep. S259 — VFIO IPC + QMD dispatch. S258 — PBDMA dispatch wiring. S256–S257 — FECS warm-state + deep debt.
+**Latest**: S266 — **PLX Keepalive (Root Cause Fix)**: `PlxKeepalive` in ember — continuous config space heartbeat (CfgRd 0x00 every 5s) on PLX-bridged devices + full bridge chain. `KeepaliveHandle` for stop/status/heartbeat_count. `detect_plx_bridge()` checks vendor 0x10b5 in ancestry. `PlxGuardian` in glowplug — fleet-level auto-detection via `scan_and_protect()`, per-device `protect()/release()`, `status_summary()`. Root cause: PLX D3cold from inactivity (toadstool-server polling was accidental keepalive; bridge died in ~10min when it stopped).
+**Previous**: S265r — Driver Lab + Containment. S264 — PCIe bridge keepalive. S263 — CPUCTL_ALIAS breakthrough, GR context scheduler, warm handoff on Titan V. S262 — `device.gr.init` IPC. S261 — deep debt sweep. S259 — VFIO IPC + QMD dispatch. S258 — PBDMA dispatch wiring.

 ---

@@ -65,15 +65,17 @@ names directly. Deprecated API definitions retained for backward compatibility o
 | **Phase C: Multi-unit routing engine** | ✅ LANDED — `compute.route.multi_unit` handler, tolerance-based routing, heuristic fallback, shader-core fallback on every decision |
 | **Phase D: Mixed command streams** | Planned — blocked on coralReef FECS firmware loading; extends PBDMA with draw/RT/texture/tensor/framebuffer commands |

-### Key Remaining Items (S263)
+### Key Remaining Items (S265)

 | Item | Status |
 |------|--------|
 | Coverage push 83%→90% | Ongoing — hardware mocks needed for remaining gaps |
 | Phase D mixed command streams | Planned — requires coralReef FECS firmware loading |
 | VFIO PBDMA dispatch | **VALIDATED** (S258–S263) — GPFIFO + QMD dispatch works e2e on Titan V via warm handoff. FECS alive via CPUCTL_ALIAS. DMA roundtrip confirmed. |
+| PCIe bridge keepalive | **VALIDATED + EVOLVED** (S264→S266) — Phase 1 (S264): `pin_bridge_hierarchy()` + `SwapGuard` burst CfgRd during swaps. Phase 2 (S266): Root cause fix — PLX D3cold caused by **inactivity** (not swaps). `PlxKeepalive` (ember): continuous CfgRd every 5s on device + all upstream bridges. `PlxGuardian` (glowplug): fleet-level auto-detect via `scan_and_protect()`. 98 ember tests, 95 glowplug tests. |
 | E2E sovereign pipeline test | **VALIDATED** (S263) — warm handoff → VFIO open → channel → dispatch → readback. Pending: real shader execution (FECS PENDING_CTX_RELOAD frontier). |
 | FECS golden context mapping | **ACTIVE** — FECS scheduler stuck at PENDING_CTX_RELOAD. GR context buffer allocated but FECS needs golden context from VRAM. Next: map VRAM identity region or extract context init sequence from nouveau. |
+| No-FLR warm swap | **VALIDATED + IMPLEMENTED** (Exp 194, S265r) — `reset_method=""` disables FLR during vfio-pci bind. Titan V: 13/15 registers alive through nouveau→vfio-pci swap. Full BAR access verified. `WarmInitPlan` with containment architecture: bare-metal (nouveau, host-safe) vs contained (nvidia-470, agentReagents VM). `SysfsSwapExecutor::execute_warm_init()` bare-metal only — contained plans dispatch through agentReagents. Host DRM sacred. 77 tests pass. |
 | Phase 2 dep migration: procfs → toadstool-sysmon | **RESOLVED** — `procfs` default features disabled (S129); dead `procfs` dep removed where unused (S160); runtime discovery uses `toadstool-sysmon` where applicable |
 | Phase 3: tarpc binary transport | **RESOLVED** S203t — MessagePack binary framing for Rust-to-Rust peers |
 | Property-based testing for computation modules | Pending |
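
The keepalive described in the S266 notes (a config space read at offset 0x00 every 5s on the device and each upstream bridge) could look roughly like the sketch below. It is not the ember implementation: `start_keepalive`, the fields of this `KeepaliveHandle`, and the use of sysfs `config` files as the read path are all assumptions made for illustration, and only the Rust standard library is used.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical handle, loosely modelled on the `KeepaliveHandle` named in the notes.
pub struct KeepaliveHandle {
    stop: Arc<AtomicBool>,
    beats: Arc<AtomicU64>,
    worker: Option<JoinHandle<()>>,
}

impl KeepaliveHandle {
    /// Number of successful config reads so far.
    pub fn heartbeat_count(&self) -> u64 {
        self.beats.load(Ordering::Relaxed)
    }

    /// Signal the worker to exit, wait for it, and return the final count.
    pub fn stop(mut self) -> u64 {
        self.stop.store(true, Ordering::Relaxed);
        if let Some(worker) = self.worker.take() {
            worker.thread().unpark(); // cut the current sleep short
            let _ = worker.join();
        }
        self.beats.load(Ordering::Relaxed)
    }
}

/// Read the first dword of config space (vendor ID + device ID at offset 0x00).
fn read_config_dword0(path: &Path) -> io::Result<[u8; 4]> {
    let mut buf = [0u8; 4];
    File::open(path)?.read_exact(&mut buf)?;
    Ok(buf)
}

/// Assumption: each entry is a sysfs `config` file, e.g. one for the device
/// and one per upstream bridge. A failed read is skipped, not counted.
pub fn start_keepalive(config_paths: Vec<PathBuf>, interval: Duration) -> KeepaliveHandle {
    let stop = Arc::new(AtomicBool::new(false));
    let beats = Arc::new(AtomicU64::new(0));
    let (stop_flag, beat_counter) = (Arc::clone(&stop), Arc::clone(&beats));

    let worker = thread::spawn(move || {
        while !stop_flag.load(Ordering::Relaxed) {
            for path in &config_paths {
                if read_config_dword0(path).is_ok() {
                    beat_counter.fetch_add(1, Ordering::Relaxed);
                }
            }
            // Sleep for one interval; `stop()` unparks to end the wait early.
            thread::park_timeout(interval);
        }
    });

    KeepaliveHandle { stop, beats, worker: Some(worker) }
}
```

With a 5s interval, this would be started once per protected device with the `config` paths of the device and its bridge chain, and `stop()` called on release. Using `park_timeout` rather than `sleep` is a design choice in this sketch so that stopping does not block for a full interval.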

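
The notes say `detect_plx_bridge()` checks for vendor 0x10b5 in the device's ancestry. One way to do that check, assuming the Linux sysfs layout in which each PCI device directory is nested under its upstream bridge and exposes a `vendor` file, is sketched here. The signature and the helper `plx_bridges_in_ancestry` are hypothetical and may differ from the real function in ember.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// PCI vendor ID for PLX Technology, as cited in the notes.
const PLX_VENDOR_ID: u16 = 0x10b5;

/// Parse a sysfs `vendor` file such as "0x10b5\n"; None if absent or malformed.
fn read_vendor(dev_dir: &Path) -> Option<u16> {
    let text = fs::read_to_string(dev_dir.join("vendor")).ok()?;
    u16::from_str_radix(text.trim().trim_start_matches("0x"), 16).ok()
}

/// Hypothetical helper: upstream directories whose vendor is PLX, nearest first.
pub fn plx_bridges_in_ancestry(device_dir: &Path) -> Vec<PathBuf> {
    // Resolve symlinks (e.g. /sys/bus/pci/devices/<bdf>) to the nested device path.
    let resolved = fs::canonicalize(device_dir).unwrap_or_else(|_| device_dir.to_path_buf());
    resolved
        .ancestors()
        .skip(1) // skip the device itself, look only at what is upstream
        .filter(|dir| read_vendor(dir) == Some(PLX_VENDOR_ID))
        .map(Path::to_path_buf)
        .collect()
}

/// True if any upstream bridge reports the PLX vendor ID.
pub fn detect_plx_bridge(device_dir: &Path) -> bool {
    !plx_bridges_in_ancestry(device_dir).is_empty()
}
```

The list returned by `plx_bridges_in_ancestry` is the same set of directories a keepalive would need to touch, so in this sketch detection and the choice of heartbeat targets share one walk.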