Commit 3e85151

BiomeOS Developer and cursoragent committed
fix: early health responder on pre-bound socket — Wave 54
Health check unresponsive on southGate NUCLEUS: pre-bound socket accepted connections but accept loop didn't start until full handler was ready (~4-8s gap). Added spawn_early_health_responder() that responds to health.liveness/health.check/health.readiness immediately while executor initializes. BTSP not required for health probes.

3 new tests, 9,161+ lib tests, 0 clippy.

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent c836b07 commit 3e85151
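
The startup gap described in the commit message can be reproduced without any of the project's code. The following is a minimal std-only sketch, not the project's implementation: the helper name `probe_gets_reply` and the probe payload are assumptions chosen for illustration. It shows that a client can connect to a bound Unix socket and write a request even though nobody has called `accept()` yet (the kernel queues the connection), and that the read then times out because no task is serving it.

```rust
use std::io::{ErrorKind, Read, Write};
use std::os::unix::net::UnixStream;
use std::path::Path;
use std::time::Duration;

/// Hypothetical helper: sends one newline-delimited JSON-RPC health probe and
/// reports whether any reply bytes arrived within `timeout`.
fn probe_gets_reply(socket_path: &Path, timeout: Duration) -> std::io::Result<bool> {
    // connect() succeeds as soon as the listener is bound: the kernel places
    // the connection in the listen backlog even if accept() is never called.
    let mut stream = UnixStream::connect(socket_path)?;
    stream.set_read_timeout(Some(timeout))?;

    // The write also succeeds; the bytes sit in the socket buffer.
    stream.write_all(b"{\"jsonrpc\":\"2.0\",\"method\":\"health.liveness\",\"id\":1}\n")?;

    let mut buf = [0u8; 256];
    match stream.read(&mut buf) {
        Ok(n) => Ok(n > 0),
        // A read timeout surfaces as WouldBlock or TimedOut depending on the platform.
        Err(e) if matches!(e.kind(), ErrorKind::WouldBlock | ErrorKind::TimedOut) => Ok(false),
        Err(e) => Err(e),
    }
}
```

Calling `probe_gets_reply` against a listener that is bound but not yet accepting returns `Ok(false)` after the timeout, which is the symptom a health prober would see during the 4-8 second window.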

13 files changed

Lines changed: 305 additions & 24 deletions

CHANGELOG.md

Lines changed: 11 additions & 1 deletion
@@ -5,7 +5,17 @@ All notable changes to ToadStool will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [Unreleased] - May 26, 2026 (Sessions 43-276)
+## [Unreleased] - May 27, 2026 (Sessions 43-277)
+
+### Session S277 (May 27, 2026) — Wave 54: Early Health Responder
+
+primalSpring Wave 54 response: health check unresponsive on southGate fixed.
+
+- FIXED: Health probes unresponsive during startup — pre-bound socket accepted connections but nobody called accept() until full handler was ready (~4-8s gap)
+- ADDED: `spawn_early_health_responder()` — accepts connections on pre-bound socket immediately, responds to `health.liveness`/`health.check`/`health.readiness` while executor initializes
+- CHANGED: `serve_unix_prebound()` now takes `Arc<UnixListener>` — shared between early responder and full handler
+- DOCUMENTED: BTSP is NOT required for health probes (plaintext auto-detection), socket naming (TOADSTOOL_SOCKET env var override)
+- METRICS: 9,161+ lib tests, 0 clippy warnings

 ### Session S276 (May 26, 2026) — Deep Debt Evolution II

DEBT.md

Lines changed: 9 additions & 1 deletion
@@ -1,13 +1,21 @@
 # Active Technical Debt Register

-**Date**: May 2026 — S276
+**Date**: May 2026 — S277
 **Philosophy**: Math is universal, precision is silicon. Workarounds are
 short-term solutions that increase debt. We aim to solve deep debt over
 iterations, evolving toward vendor-agnostic, capability-based solutions—
 with production stubs surfacing typed configuration errors and capability
 guidance, and auth policy driven by explicit environment configuration
 where applicable.

+**S277 (Wave 54: Early Health Responder)**:
+Health check unresponsive on southGate NUCLEUS — pre-bound socket accepted
+connections but accept loop didn't start until full handler was ready (~4-8s).
+Added `spawn_early_health_responder()` that responds to health.liveness/
+health.check/health.readiness immediately on the pre-bound socket while
+executor initializes. BTSP NOT required for health probes (plaintext
+auto-detection). 9,161+ lib tests, 0 clippy.
+
 **S275 (Wave 49: Ecosystem Tightening)**:
 Showcase fossilized (35 files → `fossilRecord/primals/toadStool/showcase_wave49/`).
 36 wateringHole handoffs mirrored to central (8 active, 28 archived). Stale

DOCUMENTATION.md

Lines changed: 5 additions & 4 deletions
@@ -1,6 +1,6 @@
 # ToadStool Documentation Hub

-**Last Updated**: May 2026 — S276
+**Last Updated**: May 2026 — S277

 ---

@@ -30,11 +30,11 @@ These root documents were **fully resolved** and **fossilized** in wateringHole

 ---

-## Current State (S276 — May 2026)
+## Current State (S277 — May 2026)

 **Post-budding, dependency-sovereign, IPC-first, fully concurrent, capability-based.** barraCuda is a separate primal at `ecoPrimals/barraCuda/`. ToadStool is the hardware infrastructure layer — GPU/NPU/CPU discovery, capability probing, workload orchestration, and shader dispatch.

-- **23,000+ tests** (9,158+ lib-only), 0 failures, 0 clippy warnings, 0 fmt diffs. Full workspace concurrent test suite.
+- **23,000+ tests** (9,161+ lib-only), 0 failures, 0 clippy warnings, 0 fmt diffs. Full workspace concurrent test suite.
 - **88 JSON-RPC methods** (direct) + semantic registry. Wire Standard L3 (partial): `cost_estimates`, `operation_dependencies`. **Recommended caller timeout: ≥3 seconds** for health probes during startup.
 - **Phase C complete** (S245–S253) — toadstool-cylinder (153 .rs, 700 tests), DRM/MMIO/AMD/NVIDIA/VFIO hardware modules absorbed from `coral-driver`. `OwnedFd` VFIO fd ownership (S253). SwapOrchestrator real quiesce/persist/restore (S253). `toadstool device` CLI with swap/list/status/warm subcommands (S253). GspBridge trait boundary.
 - **Phase D: Sovereign dispatch validated** (S250–S263) — `try_local_dispatch()` via `ComputeDevice` trait before `coral_client` IPC forward. Full buffer lifecycle. AMD DRM dispatch live. **NV VFIO e2e dispatch validated on Titan V** (S263): warm handoff → VFIO open → channel → DMA roundtrip → GR init. Current frontier: FECS PENDING_CTX_RELOAD.
@@ -44,13 +44,14 @@ These root documents were **fully resolved** and **fossilized** in wateringHole
 - **Sandbox working_dir production** (S269) — `data_dependencies` pre-dispatch validation with BLAKE3 integrity. `SandboxSpec.working_directory` wired into sandbox manager. 90+ upstream clippy errors absorbed.
 - **Deep Debt** (S240–S273) — All Duration literals extracted to named constants. `CORALREEF_*` env vars deprecated with `TOADSTOOL_*` primaries + deprecation warnings (S253). Zero `#[allow(deprecated)]` remaining. All lint attrs have `reason`. Zero production mocks/TODO/FIXME/unreachable!(). All unsafe SAFETY-documented. `cargo deny check bans` passes clean.
 - **Deep Debt Evolution** (S273) — Production panic surface eliminated (`kernel_health.rs`, dispatch cache, `ember_client.rs`, `secure_enclave`). `dispatch/mod.rs` 1,638→839L via `dispatch/sovereign.rs` extraction. `warm_init.rs` → module dir. 6 CLI `well_known::*` sites migrated to capability-based discovery. VFIO `activity_tracker().record()` wired. hw-safe abstractions validated.
+- **Wave 54: Early Health Responder** (S277) — Health check unresponsive on southGate fixed. Early health responder on pre-bound socket during startup. BTSP not required for health probes.
 - **Deep Debt Evolution II** (S276) — Remaining production unwrap/expect/unreachable eliminated. `handler/sovereign.rs` 1,003L → module directory. `memmap2` removed from hw-safe (rustix mmap). 3 primal-name type aliases deprecated. `ipc.register` capability list aligned to Node Atomic set.
 - **Capability-based everywhere**: 6 CLI hardcoded primal name sites migrated to capability-based discovery (S273); ~400 intentional legacy-compat refs remain (env fallbacks, serde aliases). 0 production mocks. All production logging via `tracing`.
 - **ecoBin v3.0** — Zero C FFI deps. `deny.toml` ring + async-trait + zstd-sys bans active.
 - **46 unsafe blocks** (all in hw-safe/GPU/VFIO/display/plugin containment crates); all SAFETY-documented. Workspace `unsafe_code = "deny"`, **41 crates `forbid`**.
 - **Dual-socket IPC**`compute.sock` (JSON-RPC primary) + `compute-tarpc.sock` (tarpc hot-path).

-See [CHANGELOG.md](CHANGELOG.md) for full session-by-session history (S43–S276).
+See [CHANGELOG.md](CHANGELOG.md) for full session-by-session history (S43–S277).

 ---

NEXT_STEPS.md

Lines changed: 4 additions & 4 deletions
@@ -1,9 +1,9 @@
 # ToadStool -- Next Steps

-**Updated**: May 2026 — S276 (Deep Debt Evolution II: unwrap elimination, sovereign split, memmap2 removal. 88+ JSON-RPC methods. 9,158+ lib tests.)
-**Status**: Production-grade | Rust edition **2024** (MSRV 1.85) | **AGPL-3.0-or-later** | **All quality gates green** | tests verified (23,000+ workspace, 0 failures; 9,158+ lib-only) | **88+ JSON-RPC methods** | Wire Standard L3 (partial) | Zero C FFI deps (ecoBin v3.0) | **Zero production panics/expects** | **Zero production TODO/FIXME/HACK** | **Zero production unreachable!()** | IPC-first | workspace `unsafe_code = "deny"`, **41 crates `forbid`** | **46 unsafe blocks** (all in hw containment, all SAFETY-documented) | **rustix 1.x workspace-wide** | **capability-based primal references (no hardcoded names)** | **`async-trait` DEPRECATED** (banned in `deny.toml`) | **`deny.toml` ring + async-trait + zstd-sys bans active** | **Zero external mmap deps (memmap2 removed S276)** | **Phase D dispatch live** | **E2E sovereign dispatch VALIDATED on Titan V (warm handoff)**
-**Latest**: S276**Deep Debt Evolution II**: Production unwrap/expect/unreachable eliminated. `handler/sovereign.rs` 1,003L → module directory. `memmap2` removed from hw-safe (rustix mmap). 3 primal-name aliases deprecated. `ipc.register` capabilities aligned to Node Atomic. 9,158+ lib tests.
-**Previous**: S275 — Wave 49 Ecosystem Tightening. S274 — Glacial Horizon: max_guest_load wired. S273 — Deep Debt Evolution. S268 — Kernel Health Preflight. S267 — Sovereign driver rotation.
+**Updated**: May 2026 — S277 (Wave 54: Early Health Responder. 88+ JSON-RPC methods. 9,161+ lib tests.)
+**Status**: Production-grade | Rust edition **2024** (MSRV 1.85) | **AGPL-3.0-or-later** | **All quality gates green** | tests verified (23,000+ workspace, 0 failures; 9,161+ lib-only) | **88+ JSON-RPC methods** | Wire Standard L3 (partial) | Zero C FFI deps (ecoBin v3.0) | **Zero production panics/expects** | **Zero production TODO/FIXME/HACK** | **Zero production unreachable!()** | IPC-first | workspace `unsafe_code = "deny"`, **41 crates `forbid`** | **46 unsafe blocks** (all in hw containment, all SAFETY-documented) | **rustix 1.x workspace-wide** | **capability-based primal references (no hardcoded names)** | **`async-trait` DEPRECATED** (banned in `deny.toml`) | **`deny.toml` ring + async-trait + zstd-sys bans active** | **Zero external mmap deps (memmap2 removed S276)** | **Phase D dispatch live** | **E2E sovereign dispatch VALIDATED on Titan V (warm handoff)**
+**Latest**: S277**Wave 54: Early Health Responder**: Health check unresponsive on southGate NUCLEUS fixed. Early health responder responds to health.liveness/health.check immediately on pre-bound socket while executor initializes. BTSP not required for health probes. 9,161+ lib tests.
+**Previous**: S276 — Deep Debt Evolution II. S275 — Wave 49 Ecosystem Tightening. S274 — Glacial Horizon: max_guest_load wired. S273 — Deep Debt Evolution. S268 — Kernel Health Preflight.

 ---

README.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # ToadStool

-**Sovereign Compute Hardware** | Pure Rust | ecoBin | May 2026 | S276 | v0.2.0
+**Sovereign Compute Hardware** | Pure Rust | ecoBin | May 2026 | S277 | v0.2.0

 ---

crates/server/src/pure_jsonrpc/connection/mod.rs

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@ mod tests;
 mod unix;

 pub use tcp::serve_tcp;
-pub use unix::{prebind_unix_listener, serve_unix, serve_unix_prebound};
+pub use unix::{prebind_unix_listener, serve_unix, serve_unix_prebound, spawn_early_health_responder};

 use crate::errors::{ServerError, ServerResult};
 use crate::pure_jsonrpc::types::JsonRpcError;

crates/server/src/pure_jsonrpc/connection/tests.rs

Lines changed: 82 additions & 0 deletions
@@ -598,3 +598,85 @@ fn test_tcp_idle_timeout_env_override() {
         assert_eq!(timeout, std::time::Duration::from_secs(42));
     });
 }
+
+// ═══════════════════════════════════════════════════════════
+// Early health responder (Wave 54)
+// ═══════════════════════════════════════════════════════════
+
+#[tokio::test]
+async fn early_health_liveness_responds_alive() {
+    let dir = tempfile::tempdir().unwrap();
+    let sock = dir.path().join("test-early-health.sock");
+    let listener = Arc::new(tokio::net::UnixListener::bind(&sock).unwrap());
+    let (stop_tx, stop_rx) = tokio::sync::watch::channel(false);
+
+    let handle = super::spawn_early_health_responder(&listener, stop_rx);
+
+    let mut stream = UnixStream::connect(&sock).await.unwrap();
+    stream
+        .write_all(b"{\"jsonrpc\":\"2.0\",\"method\":\"health.liveness\",\"id\":1}\n")
+        .await
+        .unwrap();
+    stream.flush().await.unwrap();
+
+    let mut buf = vec![0u8; 4096];
+    let n = stream.read(&mut buf).await.unwrap();
+    let resp: serde_json::Value = serde_json::from_slice(&buf[..n]).unwrap();
+    assert_eq!(resp["result"]["status"], "alive");
+    assert_eq!(resp["id"], 1);
+
+    let _ = stop_tx.send(true);
+    handle.await.unwrap();
+}
+
+#[tokio::test]
+async fn early_health_check_responds_starting() {
+    let dir = tempfile::tempdir().unwrap();
+    let sock = dir.path().join("test-early-check.sock");
+    let listener = Arc::new(tokio::net::UnixListener::bind(&sock).unwrap());
+    let (stop_tx, stop_rx) = tokio::sync::watch::channel(false);
+
+    let handle = super::spawn_early_health_responder(&listener, stop_rx);
+
+    let mut stream = UnixStream::connect(&sock).await.unwrap();
+    stream
+        .write_all(b"{\"jsonrpc\":\"2.0\",\"method\":\"health.check\",\"id\":42}\n")
+        .await
+        .unwrap();
+    stream.flush().await.unwrap();
+
+    let mut buf = vec![0u8; 4096];
+    let n = stream.read(&mut buf).await.unwrap();
+    let resp: serde_json::Value = serde_json::from_slice(&buf[..n]).unwrap();
+    assert_eq!(resp["result"]["status"], "starting");
+    assert_eq!(resp["id"], 42);
+
+    let _ = stop_tx.send(true);
+    handle.await.unwrap();
+}
+
+#[tokio::test]
+async fn early_health_unknown_method_returns_error() {
+    let dir = tempfile::tempdir().unwrap();
+    let sock = dir.path().join("test-early-unknown.sock");
+    let listener = Arc::new(tokio::net::UnixListener::bind(&sock).unwrap());
+    let (stop_tx, stop_rx) = tokio::sync::watch::channel(false);
+
+    let handle = super::spawn_early_health_responder(&listener, stop_rx);
+
+    let mut stream = UnixStream::connect(&sock).await.unwrap();
+    stream
+        .write_all(b"{\"jsonrpc\":\"2.0\",\"method\":\"compute.submit\",\"id\":99}\n")
+        .await
+        .unwrap();
+    stream.flush().await.unwrap();
+
+    let mut buf = vec![0u8; 4096];
+    let n = stream.read(&mut buf).await.unwrap();
+    let resp: serde_json::Value = serde_json::from_slice(&buf[..n]).unwrap();
+    assert_eq!(resp["error"]["code"], -32002);
+    assert!(resp["error"]["message"].as_str().unwrap().contains("initializing"));
+
+    let _ = stop_tx.send(true);
+    handle.await.unwrap();
+}

crates/server/src/pure_jsonrpc/connection/unix.rs

Lines changed: 76 additions & 2 deletions
@@ -33,7 +33,7 @@ use super::process_request;
 ///
 /// Returns [`ServerError`] if directory creation, socket bind, or permission setting fails.
 pub async fn serve_unix(handler: Arc<JsonRpcHandler>, socket_path: PathBuf) -> ServerResult<()> {
-    let listener = prebind_unix_listener(&socket_path).await?;
+    let listener = Arc::new(prebind_unix_listener(&socket_path).await?);
     serve_unix_prebound(handler, listener).await
 }

@@ -93,13 +93,87 @@ pub async fn prebind_unix_listener(socket_path: &std::path::Path) -> ServerResul
     Ok(listener)
 }

+/// Spawn a minimal health-only accept loop on a pre-bound listener.
+///
+/// Accepts connections and responds to `health.liveness` / `health.check` /
+/// `health.readiness` with immediate JSON-RPC responses while the full
+/// `JsonRpcHandler` is still being constructed. All other methods return
+/// a `-32002` "server initializing" error.
+///
+/// Returns a `JoinHandle` that resolves when `stop` receives a value. The
+/// caller should send to `stop` once the full handler is ready, then
+/// pass the same `listener` to [`serve_unix_prebound`].
+pub fn spawn_early_health_responder(
+    listener: &Arc<UnixListener>,
+    mut stop: tokio::sync::watch::Receiver<bool>,
+) -> tokio::task::JoinHandle<()> {
+    let listener = Arc::clone(listener);
+    tokio::spawn(async move {
+        loop {
+            tokio::select! {
+                biased;
+                _ = stop.changed() => break,
+                result = listener.accept() => {
+                    match result {
+                        Ok((stream, _)) => {
+                            tokio::spawn(handle_early_health(stream));
+                        }
+                        Err(e) => {
+                            warn!("Early health accept error: {e}");
+                        }
+                    }
+                }
+            }
+        }
+        info!("Early health responder stopped — full handler taking over");
+    })
+}
+
+async fn handle_early_health(stream: UnixStream) {
+    let (reader, mut writer) = stream.into_split();
+    let mut reader = BufReader::new(reader);
+    let mut line = String::new();
+    if reader.read_line(&mut line).await.is_err() || line.trim().is_empty() {
+        return;
+    }
+    let trimmed = line.trim();
+
+    let method = serde_json::from_str::<serde_json::Value>(trimmed)
+        .ok()
+        .and_then(|v| v.get("method")?.as_str().map(String::from));
+    let id = serde_json::from_str::<serde_json::Value>(trimmed)
+        .ok()
+        .and_then(|v| v.get("id").cloned())
+        .unwrap_or(serde_json::Value::Null);
+
+    let response = match method.as_deref() {
+        Some("health.liveness") => {
+            serde_json::json!({"jsonrpc":"2.0","result":{"status":"alive"},"id":id})
+        }
+        Some("health.check" | "toadstool.health" | "compute.health") => {
+            serde_json::json!({"jsonrpc":"2.0","result":{"status":"starting","uptime_secs":0},"id":id})
+        }
+        Some("health.readiness") => {
+            serde_json::json!({"jsonrpc":"2.0","result":{"status":"starting"},"id":id})
+        }
+        _ => {
+            serde_json::json!({"jsonrpc":"2.0","error":{"code":-32002,"message":"Server initializing"},"id":id})
+        }
+    };
+
+    let mut buf = serde_json::to_vec(&response).unwrap_or_default();
+    buf.push(b'\n');
+    let _ = writer.write_all(&buf).await;
+    let _ = writer.flush().await;
+}
+
 /// Serve JSON-RPC on a pre-bound Unix socket listener.
 ///
 /// Used with [`prebind_unix_listener`] to start accepting connections
 /// on a listener that was bound before the full handler was constructed.
 pub async fn serve_unix_prebound(
     handler: Arc<JsonRpcHandler>,
-    listener: UnixListener,
+    listener: Arc<UnixListener>,
 ) -> ServerResult<()> {
     let env = toadstool_common::primal_sockets::SocketPathEnv::from_env();
     let btsp_required = toadstool_common::primal_sockets::is_btsp_required(&env);
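
The method routing inside `handle_early_health` can be read as a small pure function. The sketch below is a dependency-free approximation rather than the commit's code: `EarlyReply`, `route_early`, and `render` are hypothetical names, the `id` is passed as a raw JSON token instead of a `serde_json::Value`, and the `uptime_secs` field that the real `health.check` reply carries is omitted for brevity.

```rust
/// Hypothetical reply type for this sketch; the commit builds
/// `serde_json::Value` objects directly instead.
#[derive(Debug, PartialEq)]
enum EarlyReply {
    Status(&'static str),
    Error { code: i32, message: &'static str },
}

/// Maps a JSON-RPC method name to the reply sent while the server is starting.
/// Liveness is answered positively; check/readiness report "starting";
/// everything else (including a missing method) gets the -32002 error.
fn route_early(method: Option<&str>) -> EarlyReply {
    match method {
        Some("health.liveness") => EarlyReply::Status("alive"),
        Some("health.check" | "toadstool.health" | "compute.health" | "health.readiness") => {
            EarlyReply::Status("starting")
        }
        _ => EarlyReply::Error { code: -32002, message: "Server initializing" },
    }
}

/// Renders a reply as one newline-terminated JSON-RPC line.
/// `id` must already be a valid JSON token such as `1` or `null`.
fn render(reply: &EarlyReply, id: &str) -> String {
    match reply {
        EarlyReply::Status(s) => {
            format!("{{\"jsonrpc\":\"2.0\",\"result\":{{\"status\":\"{s}\"}},\"id\":{id}}}\n")
        }
        EarlyReply::Error { code, message } => format!(
            "{{\"jsonrpc\":\"2.0\",\"error\":{{\"code\":{code},\"message\":\"{message}\"}},\"id\":{id}}}\n"
        ),
    }
}
```

Keeping the routing separate from socket I/O in this way would let the table be unit-tested without binding a socket, at the cost of one extra type.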

crates/server/src/pure_jsonrpc/mod.rs

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ mod connection;
 mod handler;
 mod types;

-pub use connection::{prebind_unix_listener, process_request, serve_tcp, serve_unix, serve_unix_prebound};
+pub use connection::{prebind_unix_listener, process_request, serve_tcp, serve_unix, serve_unix_prebound, spawn_early_health_responder};
 pub use handler::HwLearnHandler;
 pub use handler::JsonRpcHandler;
 pub use types::{JsonRpcError, JsonRpcRequest, JsonRpcResponse, JsonWorkloadSubmission};

crates/server/src/unibin/execution.rs

Lines changed: 2 additions & 2 deletions
@@ -180,7 +180,7 @@ pub async fn start_servers_with_fallback(
     jsonrpc_socket: PathBuf,
     tcp_port: Option<u16>,
     cfg: &UnibinExecutionConfig,
-    jsonrpc_listener: Option<tokio::net::UnixListener>,
+    jsonrpc_listener: Option<Arc<tokio::net::UnixListener>>,
 ) -> ServerResult<()> {
     if let Some(port) = tcp_port {
         info!(" --port {port} specified: starting TCP JSON-RPC (UniBin standard)");
@@ -216,7 +216,7 @@ async fn try_unix_servers(
     jsonrpc_handler: &Arc<JsonRpcHandler>,
     socket_path: &PathBuf,
     jsonrpc_socket: &PathBuf,
-    jsonrpc_listener: Option<tokio::net::UnixListener>,
+    jsonrpc_listener: Option<Arc<tokio::net::UnixListener>>,
 ) -> ServerResult<()> {
     if let Some(parent) = socket_path.parent() {
         tokio::fs::create_dir_all(parent)
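
The signature changes above exist so that one listener can be shared between the early responder and the full handler. The following std-only sketch illustrates that handoff under stated assumptions: it swaps tokio's `select!` and `watch` channel for a polling thread with an `AtomicBool` stop flag, and the names `spawn_early` and `reply_status` and the `"ready"` status are hypothetical. The caller is assumed to have put the listener in non-blocking mode before spawning.

```rust
use std::io::{BufRead, BufReader, ErrorKind, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Reads one request line and answers with a fixed status.
/// Sketch only: the method name in the request is ignored here.
fn reply_status(stream: UnixStream, status: &str) -> std::io::Result<()> {
    // Accepted sockets can inherit non-blocking mode on some platforms.
    stream.set_nonblocking(false)?;
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    let mut writer = stream;
    let body = format!("{{\"jsonrpc\":\"2.0\",\"result\":{{\"status\":\"{status}\"}},\"id\":1}}\n");
    writer.write_all(body.as_bytes())
}

/// Serves "starting" replies on a shared, non-blocking listener until `stop`
/// is set. The Arc clone lets the caller keep using the same listener afterwards.
fn spawn_early(listener: &Arc<UnixListener>, stop: Arc<AtomicBool>) -> JoinHandle<()> {
    let listener = Arc::clone(listener);
    thread::spawn(move || {
        while !stop.load(Ordering::Acquire) {
            match listener.accept() {
                Ok((stream, _)) => {
                    let _ = reply_status(stream, "starting");
                }
                // No pending connection: back off briefly, then re-check `stop`.
                Err(e) if e.kind() == ErrorKind::WouldBlock => {
                    thread::sleep(Duration::from_millis(5));
                }
                Err(_) => break,
            }
        }
    })
}
```

A caller would bind the listener, call `set_nonblocking(true)`, spawn the early responder, finish initialization, set the stop flag, join the thread, and then accept on the same `Arc<UnixListener>` from the full handler. The socket path never goes away during the switch, so probes that connect mid-handoff wait in the backlog instead of failing.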
