feat(net): ErrSink framework — propagate the 4 silent gvproxy goroutine errors #612 didn't cover#634
Draft
G4614 wants to merge 4 commits into
Draft
Conversation
a8fe41c to
b94f472
Compare
boxlite-ai#612 surfaced `virtualnetwork.New` bind errors out of the gvproxy goroutine via an ad-hoc `initErr` channel. Four sibling failure sites still fell into `logrus.Error + return` and went unobserved by the Rust runtime: main.go:454 OverrideTCPHandler — allow_net / MITM silently disabled main.go:475 transport.AcceptVfkit — VFKit transport silently dead (macOS) main.go:484 vn.AcceptVfkit — VFKit protocol silently dead main.go:496 listener.Accept — Qemu transport silently dead (Linux) main.go:510 vn.AcceptQemu — Qemu protocol silently dead OverrideTCPHandler in particular is a security-shaped bug: when allow_net is configured but OverrideTCPHandler fails, the box appears healthy AND traffic bypasses the filter. The Accept failures are boxlite-ai#612's symptom on a different code path: the VM "connects" but the protocol pump never runs, so 20s later the guest reports "DNS lookup … i/o timeout" or every TCP RST. This change introduces `ErrSink` as the single propagation entry point for all five sites: error_sink.go (new): - ErrSink with two channels: initErr (cap 1) + runtimeErr (cap 16) - Init(source, err) / WaitInit() pair — replaces boxlite-ai#612's ad-hoc channel - Runtime(source, err) / PollRuntime() pair — non-blocking on the producer side (queue-full drops with `runtime_err_dropped` warn + atomic counter); the Rust runtime polls via the new FFI export main.go: - GvproxyInstance gains `errSink *ErrSink` (init at instance creation) - The init goroutine now: sink.Init("virtualnetwork.New", err) if New fails sink.Init("OverrideTCPHandler", err) if override fails ← MOVED sink.Init("", nil) when all init succeeds OverrideTCPHandler MOVED into init phase: a failure here used to be silent and post-init; now it blocks gvproxy_create from returning, so the box.create call fails-fast instead of producing a half-baked sinkhole. - The four runtime-phase sites now also push to sink.Runtime(...) *in addition* to logging — log preserved for backwards-compat with existing field grep filters, sink call adds the structured surface. gvproxy_poll_runtime_error (new //export): - Returns the oldest unread runtime error as a heap-allocated C string (`gvproxy_free_string` to release), or nil if queue empty. - Intended for a Rust-side background task polling ~250ms; out of scope for this PR (Go-side framework only — Rust consumer is follow-up). Tests (error_sink_test.go, 9 tests): - Init contract: nil success, error wrapping with source, WaitInit blocks until Init is called - Runtime contract: nil no-op, FIFO drain via PollRuntime, queue-full drop counts, producer-never-blocks under backpressure, concurrent producers safe (observed + dropped = sent) - Render contract: RuntimeError.String contains source + cause + RFC3339 timestamp (so the operator-visible format is pinned) Two-sided context: the sink-wiring sites in main.go can't be unit-tested without invoking the cgo `gvproxy_create` (real Unix-socket path + VM transport). That integration belongs in the Rust-side consumer PR. The ErrSink framework itself is fully two-sided in the unit tests above — each one trivially red'd by reverting the corresponding ErrSink line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
747e75a to
0314c73
Compare
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
…emu loop Adds the "不修真死" half of boxlite-ai#634's claim — direct two-sided proof that the ErrSink framework actually surfaces what would otherwise be silent, both at the Rust-FFI layer (via poll) and at one production wire site (Linux Qemu Accept). Rust-side (libgvproxy-sys/src/lib.rs, 3 tests): - poll_runtime_error_round_trips_through_ffi — full path validation: create fixture → inject error → poll → assert source+cause+timestamp → drain → destroy. Pins the operator-facing rendered format. - poll_runtime_error_preserves_fifo_through_ffi — 3 events in order via FFI (no Go-side reordering). - poll_runtime_error_unknown_id_returns_null — defensive: unknown id or already-destroyed id MUST return NULL, never panic. This is the contract the always-on Rust background poller depends on. Toggle proof: replacing `re := instance.errSink.PollRuntime()` with `re := (*RuntimeError)(nil)` reds 2/3 of these (round_trips + preserves_fifo); unknown_id stays green because that path returns NULL by design. Confirms the poll line is load-bearing for the FFI contract. Go-side wire extraction (qemu_accept_loop.go, 2 tests): - Extracted the inline Linux Qemu Accept goroutine into runQemuAcceptLoop(ctx, id, listener, vn, sink). Same body — now testable against a real Unix socket + cancellable ctx. - main.go now reads `go runQemuAcceptLoop(ctx, id, listener, vn, sink)` instead of the inline goroutine; behavior identical. - TestRunQemuAcceptLoop_ListenerCloseSurfacesViaSink — production-shape proof: open real Unix listener, spawn loop, close listener mid-Accept, assert sink got source="listener.Accept" and cause="use of closed network connection". This IS the silent error path the larger PR rewires; the test would never have observed it pre-fix. - TestRunQemuAcceptLoop_CancelDoesNotSinkRuntime — negative control: ctx cancellation (planned shutdown) must NOT push to sink, else every gvproxy_destroy would pollute the runtime error queue with self-inflicted noise. Toggle proof: removing `sink.Runtime("listener.Accept", err)` from runQemuAcceptLoop reds the first test with: "expected listener.Accept runtime error in sink, got nil — pre-fix silent return: the goroutine logged the error and vanished without anyone (including the Rust runtime) ever observing it" Restored → green. This is the "不修真死" demonstration for ONE of the 5 silent sites; the other 4 (transport.AcceptVfkit, vn.AcceptVfkit, vn.AcceptQemu, OverrideTCPHandler) share the same pattern and would red the same way. Test-only FFI exports (main.go): - gvproxy_test_create_for_polling — minimal fixture instance (no socket, no goroutines, just an ErrSink) for Rust integration tests. - gvproxy_test_inject_runtime_error(id, source, message) — pushes a synthetic event onto the sink so Rust tests can drive injection without standing up a real VM transport. Both are exported (live in `//export` blocks) but only meaningful in test contexts; harmless in production because nothing else calls them. Total: +3 Rust integration tests, +2 Go integration tests, +2 //export helpers, +1 extracted helper file. All 14 tests green; 3 toggles two-side red'd; restored → green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…routines
Completes the "5/5 silent sites have unit-level two-side proof" claim
on top of the previous Layer-3 demo (which only covered listener.Accept).
Extraction (3 new helper files):
- qemu_accept_loop.go now takes `acceptQemu qemuProtocolFn` (function
value) instead of `*virtualnetwork.VirtualNetwork`, so tests can mock
the protocol layer without standing up real virtualnetwork.New.
- vfkit_accept_loop.go (new) — same shape for macOS: `acceptVfkitTransport
vfkitTransportFn` + `acceptVfkit vfkitProtocolFn`. Compiles on Linux too;
only invoked on GOOS=="darwin" in production but tests run everywhere.
- install_tcp_override.go (new) — wraps the OverrideTCPHandler call
so the init goroutine in main.go (which still runs inline) can route
failures through a single point. `installFn func() error` injection
lets tests force the failure shape.
main.go now reads:
go runQemuAcceptLoop(ctx, id, listener, vn.AcceptQemu, sink)
go runVfkitAcceptLoop(ctx, id, conn.(*net.UnixConn),
transport.AcceptVfkit, vn.AcceptVfkit, sink)
err := installTCPOverride(hasTCPFilter, installTCPHandler)
Same behavior end-to-end; the inline goroutines are 30 lines smaller.
Test coverage delta (10 new test functions, +400 lines):
qemu_accept_loop_test.go (+1):
- ProtocolErrorSurfacesViaSink: real listener + real dialer (so Accept
succeeds), mock acceptQemu returns error → sink.PollRuntime gets
source="vn.AcceptQemu" + errors.Is(re.Err, mockErr).
vfkit_accept_loop_test.go (new file, 3 tests):
- TransportErrorSurfacesViaSink: mock acceptVfkitTransport returns
error → sink gets source="transport.AcceptVfkit".
- ProtocolErrorSurfacesViaSink: net.Pipe + successful mock transport
+ failing mock acceptVfkit → sink gets source="vn.AcceptVfkit".
- CancelDoesNotSinkRuntime: planned shutdown via ctx cancel, mock
transport blocks on ctx.Done then errors → sink stays empty.
install_tcp_override_test.go (new file, 4 tests):
- SkipsWhenNoFilterConfigured: hasTCPFilter=false → installFn NOT
called (the production guard pin).
- CallsWhenFilterConfigured: hasTCPFilter=true → installFn IS called.
- PropagatesError: installFn returns mockErr → installTCPOverride
returns the same wrapped error (the security-shaped guarantee).
- EmptyAllowNetReturnsNil: newTCPFilterFromConfig with empty/nil
AllowNet → nil (MITM-only mode).
Two-sided verification (toggle red'd):
| toggle | red message |
|---|---|
| qemu_accept_loop.go: remove `sink.Runtime("vn.AcceptQemu", err)` | `ProtocolErrorSurfacesViaSink`: "expected vn.AcceptQemu runtime error in sink, got nil — pre-fix silent return: protocol pump errored but no signal reached anyone (Rust runtime, operator, ops dashboard)" |
| vfkit_accept_loop.go: remove `sink.Runtime("transport.AcceptVfkit", err)` | `TransportErrorSurfacesViaSink`: "expected transport.AcceptVfkit runtime error in sink, got nil — pre-fix silent return: gvproxy alive but VFKit transport dead, guest fails 20s later as DNS/network unreachable (same shape as boxlite-ai#612)" |
| vfkit_accept_loop.go: remove `sink.Runtime("vn.AcceptVfkit", err)` | `ProtocolErrorSurfacesViaSink` (Vfkit): "expected vn.AcceptVfkit runtime error in sink, got nil — pre-fix silent return: VFKit protocol pump died, guest TCP connections RST silently" |
| install_tcp_override.go: swallow `installFn()` return | `PropagatesError`: "expected error to propagate; got nil — pre-fix silent install: OverrideTCPHandler error swallowed, allow_net/MITM silently disabled, box.create reports SUCCESS" |
All 4 reds restore to green when the toggle is reverted. Combined with the
prior listener.Accept toggle (in the previous commit), this gives 5/5
unit-level proof that each silent site, pre-fix, would have stayed silent.
Test totals after this commit (Go side):
- 9 error_sink_test.go (framework contracts)
- 3 qemu_accept_loop_test.go (listener.Accept + vn.AcceptQemu + cancel control)
- 3 vfkit_accept_loop_test.go (transport + protocol + cancel control)
- 4 install_tcp_override_test.go
--- ---
19 Go tests (+ 3 Rust FFI integration tests = 22 total)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k loop Two complementary additions to make boxlite-ai#634's ErrSink framework fully observable from the Rust runtime end-to-end: PART 1: Rust-side poll consumer (src/boxlite/src/net/gvproxy/) ───────────────────────────────────────────────────────────────────── Pre-this-commit: ErrSink + the 5 silent-goroutine wirings + FFI export landed in boxlite-ai#634, but no Rust consumer polled `gvproxy_poll_runtime_error`. Events queued in Go's bounded `runtimeErr` channel (cap 16) and eventually dropped when `gvproxy_destroy` ran — externally invisible. This commit: ffi.rs: - poll_runtime_error(id) -> BoxliteResult<Option<String>> Safe wrapper around the FFI; handles CString → Rust conversion + frees the C string. None on empty queue, Some(rendered) on event, Err only on UTF-8 decode failure (effectively never). instance.rs: - start_runtime_error_polling(weak instance) spawns a tokio task analogous to start_stats_logging. Polls every 250ms (see RUNTIME_ERROR_POLL_INTERVAL doc for the cache-window rationale). Drains every queued event per cycle so a burst delivers in one tick rather than one-per-tick. - poll_and_route_once(id) is the pollable body extracted from the spawn loop so tests can drive it without a real tokio loop. Routes events into tracing::warn! with target "gvproxy.runtime_error" + instance_id + event fields. mod.rs: - Wires start_runtime_error_polling alongside start_stats_logging at GvproxyBackend construction time. Tests (3 in src/boxlite/src/net/gvproxy/instance.rs, feature=gvproxy): - ffi_poll_runtime_error_round_trips — Rust ffi wrapper round-trips empty → injected event → drained, asserts source/cause/timestamp. - poll_and_route_once_drains_and_emits_tracing_per_event — inject 3 events, run consumer once, assert (a) returned count = 3, (b) each event's source + cause appears in captured tracing output, (c) "gvproxy.runtime_error" target is present, (d) instance_id field is set for correlation. - poll_and_route_once_on_empty_queue_emits_nothing — steady-state common case: empty queue → 0 emits → no tracing noise. Two-sided: short-circuiting poll_and_route_once to `return 0` (instead of running the drain loop) reds the drains test with `left: 0 right: 3`. Restored → green. PART 2: MITM cert-gen wire + silent audit (gvproxy-bridge/mitm{,_proxy}.go) ───────────────────────────────────────────────────────────────────── mitm.go BoxCA + mitm_proxy.go cert-gen path is the security-shaped silent failure the larger boxlite-ai#634 framework hadn't touched. A persistent GenerateHostCert failure means MITM is silently disabled for that hostname, the guest sees a TLS handshake error, and the only signal was a `logrus.Error` line — invisible to Rust + operators. mitm.go: - BoxCA gains an optional `errSink *ErrSink` field + `SetErrSink` setter. Optional so existing tests / pre-init paths that construct a BoxCA without a sink keep working. - `reportCertGenFailure(hostname, cause)` is the small testable helper that pushes onto the sink with source="mitm.GenerateHostCert" and a cause string that names the failing hostname (operators chasing a silent MITM regression need to know WHICH host couldn't be intercepted). mitm_proxy.go (line 22): - The pre-existing `logrus.Error + close conn` path now also calls `ca.reportCertGenFailure(hostname, err)`. logrus is preserved for backwards-compat with existing log greps; the sink call adds the structured surface that the Rust polling consumer routes to tracing. main.go: - After NewBoxCAFromPEM, the new `ca.SetErrSink(instance.errSink)` wires the per-instance sink into the CA so cert-gen failures land in the same queue as the gvproxy goroutine errors. Tests (3 in mitm_silent_audit_test.go): - TestBoxCA_ReportCertGenFailure_NoSink_IsNoop — safe to call pre-SetErrSink without panic. - TestBoxCA_ReportCertGenFailure_WithSink_RoutesToRuntime — the production wiring contract: with sink set, the call pushes a source="mitm.GenerateHostCert" event whose cause names the failing hostname and wraps the original error. - TestBoxCA_SetErrSink_NilIsHonored — passing nil reverts to no-op behavior. The test file header contains the full audit of all 6 silent sites in mitm.go / mitm_proxy.go / mitm_websocket.go with the per-site classification (YES → wire to sink, NO → per-conn legitimate log) and rationale. Only the cert-gen site is wired in this commit; per-conn TLS handshake / upstream dial / response read / websocket hijack errors are documented as "expected failure modes, not silent regressions worth surfacing" and stay log-only. Two-sided: short-circuiting reportCertGenFailure to `return` (before the sink push) reds RoutesToRuntime with "pre-fix silent: MITM cert gen failed, only logrus saw it. The Rust runtime + box log file never observed the failure". Restored → green. Test totals after this commit: - Go: 19 (prior) + 3 mitm = 22 - Rust: 4 (FFI/version/inject/destroy) + 3 polling = 7 ── ── 29 total tests; 14 two-side toggle red'd across this PR's history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Extends #612 to cover the four sibling silent-failure goroutines it didn't touch. PR slot
fix/lifecycle-exit-non-zero-on-rest-errorsreused per request.What's broken (before this PR)
#612 surfaced
virtualnetwork.Newbind errors via an ad-hocinitErrchannel. Four sibling sites in the same goroutine were stilllogrus.Error + return— the Rust runtime never sees them:main.go:454OverrideTCPHandlerallow_net/ MITM silently disabled — box healthy, traffic bypasses filtermain.go:475transport.AcceptVfkit(macOS)main.go:484vn.AcceptVfkitmain.go:496listener.Accept(Linux)main.go:510vn.AcceptQemuOverrideTCPHandleris the security-shaped one —allow_netconfigured but filter not installed = box looks fine, traffic flows anywhere.Two-sided verification — 5/5 sites toggled red
Every silent site is extracted into a testable helper, mocked with an injectable failure, and proven red when the
sink.Runtime/sink.Initline is removed.Layer 1: ErrSink framework contracts (9 unit tests,
error_sink_test.go)Toggled 7 contract lines (Init success/error, FIFO order, drop counter, producer-non-blocking, render with timestamp); each reds the relevant test. Full table at bottom.
Layer 2: Rust FFI integration (3 tests,
libgvproxy-sys/src/lib.rs)End-to-end through cgo:
gvproxy_test_create_for_polling→gvproxy_test_inject_runtime_error→gvproxy_poll_runtime_error→gvproxy_free_string→gvproxy_destroy. Validates the actual call sequence the production Rust runtime will use.Toggle: replace
re := instance.errSink.PollRuntime()withre := (*RuntimeError)(nil)→ 2/3 red.Layer 3: Production wire sites — 5/5 silent sites, each with two-sided proof
Each silent goroutine is extracted into a small testable helper (
runQemuAcceptLoop,runVfkitAcceptLoop,installTCPOverride), and tested by injecting a mock failure via function-value parameters.qemu_accept_loop.go: removesink.Runtime("listener.Accept", err)qemu_accept_loop.go: removesink.Runtime("vn.AcceptQemu", err)vfkit_accept_loop.go: removesink.Runtime("transport.AcceptVfkit", err)vfkit_accept_loop.go: removesink.Runtime("vn.AcceptVfkit", err)install_tcp_override.go: swallowinstallFn()returnPlus 2 negative-control tests (
*_CancelDoesNotSinkRuntime) on the Qemu and Vfkit loops — planned shutdown viactx.Cancelmust NOT pollute the runtime queue. Without these, everygvproxy_destroywould push self-inflicted "closed connection" noise.Test plan
19 Go tests + 3 Rust FFI tests = 22 total, all green; 12 toggle red'd (7 framework + 5 wire sites); all restored to green.
What's added (brief)
error_sink.go(~165) —ErrSinkframework: Init (cap 1, blocking viaWaitInit) + Runtime (cap 16, non-blocking producer; drop counter)qemu_accept_loop.go(~80) —runQemuAcceptLoop(ctx, id, listener, acceptQemu func, sink), function-value injection for the protocol stagevfkit_accept_loop.go(~80) —runVfkitAcceptLoop(ctx, id, conn, acceptVfkitTransport func, acceptVfkit func, sink)install_tcp_override.go(~30) —installTCPOverride(hasFilter, installFn func)so the inline init goroutine routes through a single pointmain.go—GvproxyInstance.errSinkfield; init goroutine usessink.Init/.Runtime; OverrideTCPHandler MOVED into init phase (failure now fail-fasts box.create instead of silently disabling allow_net); 3 inline goroutines replaced by extracted helper callsgvproxy_poll_runtime_error(new//export) — Rust polls via thisgvproxy_test_create_for_polling+gvproxy_test_inject_runtime_error(test-only//export) — hermetic Rust integration testsOut of scope (follow-up)
gvproxy_poll_runtime_errorevery ~250ms → writes into box log as structuredNetworkerror). Framework + FFI are ready; the consumer wiring is a separate PR.mitm.go/forked_tcp.gofor similar silent-goroutine patterns now that the framework exists.Layer 1 toggle table (full)
s.initErr <- err→s.initErr <- nilinInit()error pathInit_ErrorWrapsSourceAndUnblocks<-s.initErrblockingWaitInit_BlocksUntilInitCalledRuntime(nil)enqueues empty eventRuntime_NilIsNoopRuntime_EnqueuesAndPollDrainsFIFOatomic.AddInt64(&s.dropped, 1)Runtime_FullQueueDropsAndIncrementsCounterselect default:with blocking sendRuntime_ProducerNeverBlocksUnderBackpressureWhenfield fromString()renderRuntimeError_StringIncludesSourceTimestampAndCause🤖 Generated with Claude Code
Update — Rust consumer + MITM cert-gen wire (commit
c781a8f5)The framework is now externally observable end-to-end. The two pieces previously listed as "out of scope" are in:
Rust-side poll consumer (
src/boxlite/src/net/gvproxy/)start_runtime_error_pollingtokio task spawned alongside the existingstart_stats_logginginGvproxyBackend::new. Pollsgvproxy_poll_runtime_errorevery 250ms, drains every queued event per cycle, routes each intotracing::warn!with targetgvproxy.runtime_error+instance_id+eventfields.3 Rust tests (require
--features gvproxy):ffi_poll_runtime_error_round_tripspoll_and_route_once_drains_and_emits_tracing_per_event— inject 3, assert all 3 in captured tracing outputpoll_and_route_once_on_empty_queue_emits_nothing— steady-state no-noise pinToggle: short-circuiting
poll_and_route_oncetoreturn 0reds the drains test withleft: 0 right: 3.MITM cert-gen wire (
mitm.go/mitm_proxy.go)The security-shaped silent failure (persistent cert-gen failure = MITM silently disabled for hostname) now feeds the sink.
BoxCAgains an optionalerrSink+SetErrSinksetter;reportCertGenFailure(hostname, cause)pushessource="mitm.GenerateHostCert"events whose cause names the failing hostname.3 Go tests in
mitm_silent_audit_test.go(file header has the full audit table of all 6 silent sites + per-site YES/NO classification + rationale for each — only cert-gen wired this PR, per-conn TLS/dial/WS warnings stay log-only as documented):TestBoxCA_ReportCertGenFailure_NoSink_IsNoopTestBoxCA_ReportCertGenFailure_WithSink_RoutesToRuntimeTestBoxCA_SetErrSink_NilIsHonoredToggle: short-circuiting
reportCertGenFailureredsRoutesToRuntime.Test totals now