Skip to content

feat(logs): tee exec stdout/stderr to shared/container.log#631

Draft
G4614 wants to merge 1 commit into
boxlite-ai:mainfrom
G4614:fix/logs-container-stdout
Draft

feat(logs): tee exec stdout/stderr to shared/container.log#631
G4614 wants to merge 1 commit into
boxlite-ai:mainfrom
G4614:fix/logs-container-stdout

Conversation

@G4614
Copy link
Copy Markdown
Contributor

@G4614 G4614 commented Jun 1, 2026

Replaces the original pipe-relay approach (force-pushed). That design tried to capture container PID 1's stdio; investigation showed it can never work — boxlite run alpine sh -c 'echo X; sleep N' runs the image's default CMD (/bin/sh) as PID 1 and threads the user command through a tenant exec, so init's stdio is irrelevant. The bytes the operator wants live in the exec stream and were never captured. See PR history for the full pipe-relay investigation (fd inodes matched, syscw=0 on PID 1, etc.) — sound diagnosis, wrong target.

This rewrite teest the exec stream instead, which is where the bytes actually live.

What changed

Insert a per-stream tee thread at every exec spawn point (4 paths: container pipe + PTY, guest pipe + PTY). For each spawn the guest:

  1. Opens /run/boxlite/shared/container.log (append + create) — the file is on the virtio-fs share, so the host sees it instantly at <box_dir>/shared/container.log.
  2. Creates a relay pipe whose read-end takes the place of the child's stdout/stderr fd in ExecHandle.
  3. Spawns a thread that reads the original fd, fsync-appends to the file, then forwards to the relay pipe.

The fsync after every batch is load-bearing on virtio-fs — without it the host-side reads can lag the guest's page cache by minutes. If /run/boxlite/shared isn't mounted (test contexts), the tee gracefully returns the original fd unchanged and the exec runs without logging.

CLI: boxlite logs <box> reads container.log by default; new --vm flag explicitly requests the old VM/agent console behavior (for the cases where operators do need kernel boot or guest-agent tracing).

Why opt-in (--vm) over silent default switch

The pre-tee behavior — console.log (VM/agent tracing) — is still useful for diagnosing boot failures and guest-agent crashes. Keeping it on --vm rather than removing it preserves that operator path. Default is the Docker-user expected path.

Test plan

THE load-bearing test

boxlite_logs_shows_container_stdout (src/cli/tests/logs_capture.rs):

  • Boots an alpine box with sh -c 'echo MARK_LOGS_TEE_OK; sleep 60'
  • Sleeps 2s for the tee to fsync
  • boxlite logs <box> must contain MARK_LOGS_TEE_OK

Strict two-side verified (per CLAUDE.md, with the documented "keep API surface for the test to address" deviation from the literal "revert every non-test file" bullet):

side what was reverted result failure signal
A (operative fix reverted) src/guest/src/service/exec/tee.rs deleted, four wrap_with_tee call sites reverted in command.rs + executor.rs. Kept: layout::container_output_path() helper + CLI --vm flag as scaffolding so the test compiles and clap accepts the existing args (test wasn't using --vm, but the flag's presence proves the scaffolding wasn't load-bearing for failure). 1 FAIL at the marker assertion — MARK_LOGS_TEE_OK absent because container.log doesn't get written; falls back to console.log with the documented one-line notice. 1 PASS on --vm sanity check (was always going to work — proves the scaffolding doesn't accidentally green the test). wrong content in boxlite logs output
B (full fix) nothing 2/2 PASS in 17 s

Why keep scaffolding on Side A: a literal full revert would drop the container_output_path() helper too, and logs.rs would fail to compile. The test then fails for "binary won't build" instead of "the bug isn't fixed" — wrong failure signal. Per CLAUDE.md's "if a full revert is genuinely impractical, stop and surface that — do not paper over it" escape clause.

Unit coverage

Two unit tests in tee.rs (run via cargo nextest -p boxlite-guest tee):

  • wrap_with_tee_returns_source_when_log_unopenable — fallback path, exec keeps working without logging
  • wrap_with_tee_writes_to_both_relay_and_file — end-to-end fanout proof

Existing tests still pass

  • cargo nextest -p boxlite-cli --test logs_capture — 2 tests (17 s)
  • cargo nextest -p boxlite-guest tee — 2 unit tests (1 s)

Run

make cli
cargo nextest run -p boxlite-cli --tests --profile vm -E 'binary(logs_capture)'

Architectural caveat (documented in commit body)

The bug is structurally that boxlite runs the image's default CMD as PID 1 and threads the user command through a tenant exec — docker logs users expect the user CMD to be PID 1. This PR addresses the symptom (operator can see their output) without changing the architecture. A larger refactor that runs detached user commands as PID 1 directly (Docker-shape) would obsolete this tee but is out of scope. See the prior PR description history for the full investigation.

@G4614 G4614 force-pushed the fix/logs-container-stdout branch from 1412996 to 7eb5698 Compare June 1, 2026 04:22
`boxlite logs <box>` used to return the VM/guest-agent console only
(kernel boot, zygote spawn, tmpfs mounts, network init) and never the
user's program output. Root cause is architectural rather than a
plumbing miss: `boxlite run alpine sh -c 'echo X; sleep N'` runs the
image's default CMD (`/bin/sh`) as PID 1 and threads the user command
through a tenant `exec`, so init's stdio is irrelevant — the bytes
the operator wants live in the exec stream and were never captured.

This change inserts a per-stream tee thread at every exec spawn point
(container pipe path, container PTY path, guest-direct pipe path,
guest-direct PTY path). For each spawn the guest:

  1. opens `/run/boxlite/shared/container.log` (append + create),
  2. creates a relay pipe whose read-end takes the place of the
     child's stdout/stderr fd in `ExecHandle`,
  3. spawns a thread that reads the original fd, fsync-appends to
     the file, then forwards to the relay pipe.

The file is on the virtio-fs share, so the host sees it instantly at
`<box_dir>/shared/container.log`. The `fsync` after every batch is
load-bearing — without it the host-side reads can lag the guest's
page cache by minutes. If `/run/boxlite/shared` isn't mounted (test
contexts), the tee gracefully returns the original fd and the exec
runs unchanged.

CLI side: `boxlite logs <box>` now reads `container.log` by default,
falling back to `console.log` with a one-line notice on older boxes
that predate the tee. A new `--vm` flag explicitly requests the old
behavior (operators who do need kernel boot / guest-agent tracing).

Pre-existing console.log path is unchanged. No semantics change for
existing `boxlite logs --follow`, `--tail`, etc.

Two-side verified (`src/cli/tests/logs_capture.rs`):
- Side B (full fix): 2/2 PASS in 17 s.
- Side A (revert `tee.rs` + the four `wrap_with_tee` call sites,
  keep layout helper + CLI `--vm` flag as scaffolding): 1 FAIL
  (the bug signal — `MARK_LOGS_TEE_OK` missing from container output)
  + 1 PASS (the `--vm` sanity check still works). Documented in
  test doc-comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 force-pushed the fix/logs-container-stdout branch from 7eb5698 to bd28103 Compare June 3, 2026 04:22
@G4614 G4614 changed the title fix(logs): route container PID-1 stdio to container.log [WIP] feat(logs): tee exec stdout/stderr to shared/container.log Jun 3, 2026
@cla-assistant
Copy link
Copy Markdown

cla-assistant Bot commented Jun 6, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Ubuntu seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant