Skip to content

Harden SMB startup, quieter protocol errors, and add tests/benches#21

Merged
lukekim merged 5 commits into
trunkfrom
smb-startup-resilience
May 14, 2026
Merged

Harden SMB startup, quieter protocol errors, and add tests/benches#21
lukekim merged 5 commits into
trunkfrom
smb-startup-resilience

Conversation

@lukekim
Copy link
Copy Markdown
Contributor

@lukekim lukekim commented May 14, 2026

Summary

Three resilience changes plus the tests and benchmarks for them.

  • Setup action dumps the spiceio log on failure (both unexpected exit and timeout) and the grace window goes from 30s to 60s. Previously the log was written to RUNNER_TEMP but never echoed, so any startup stall past 30s was undiagnosable. The earlier successful baseline took ~12s, so 30s left only ~2.5x headroom on slow server days.
  • TCP connect timeout (15s) wrapping TcpStream::connect. Without it, a server dropping SYNs leaves the OS waiting 75-90s and stalls pool init past any sensible CI window.
  • Pool connection retry via a new retry_with_backoff helper used by SmbPool::connect with a 250ms / 750ms / 2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init.
  • Quieter protocol-layer logging: smb_status_to_io_error was serr!-ing every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). The unconditional log is gone — mapped statuses return their typed io::Error silently and the catchall arm still logs truly unknown statuses. STATUS_SHARING_VIOLATION (0xC0000043) now maps to ErrorKind::ResourceBusy.
  • Tests (+21, now 142 total) covering the full status-code mapping, the retry helper semantics (including elapsed-time floor and exhaustion behavior), and the newly public parse_compound_response (moved from client.rs to protocol.rs).
  • Benches (+3 in protocol_bench.rs): parse_compound_response over n=2/4/8, pipelined_read_decode at (depth, chunk) = (8,64K) / (64,64K) / (64,8K) with throughput reporting (GetObject hot-path inner loop), and pipelined_write_encode at (8,64K) / (64,64K) / (64,1M) (WAL pipelined-write inner loop).

Test plan

  • make lint clean (fmt-check + clippy -D warnings + rustdoc -D warnings)
  • cargo test --lib — 142 passed, 0 failed
  • cargo bench --bench protocol_bench -- --test — every bench case runs its smoke iteration
  • cargo bench --bench protocol_bench -- 'pipelined_read_decode/d64_c65536' --quick — produces real numbers (~69 µs / 56 GiB/s on the dev machine)
  • CI green on this branch — re-run the previously failing job and confirm the new action surfaces the spiceio log if startup stalls

Three resilience changes plus the tests and benchmarks for them.

Setup action: dump the spiceio log on failure (both unexpected exit and
timeout). The log was being written to RUNNER_TEMP but never echoed, so
when startup stalled past the grace window there was no way to diagnose
which phase hung. Also doubled the grace from 30s to 60s — the previous
good run took ~12s, leaving only ~2.5x headroom for a slow server day.

TCP connect timeout: TcpStream::connect had no spiceio-level timeout, so
a server dropping SYNs left the OS waiting 75-90s and stalled pool init
past any sensible CI window. Wrapped in tokio::time::timeout(15s) with
explicit TimedOut error.

Pool connection retry: extracted retry_with_backoff helper used by
SmbPool::connect with a 250ms/750ms/2s schedule (4 attempts). A flaky
connection during startup no longer takes down the whole pool init.

Quieter protocol-layer logging: smb_status_to_io_error was emitting an
error log for every SMB status, including expected ones (NotFound on HEAD
probes, SharingViolation during WAL cleanup). Removed the unconditional
log; mapped statuses return their typed io::Error silently and the
catchall arm still logs for truly unknown statuses. Added
STATUS_SHARING_VIOLATION (0xC0000043) -> ErrorKind::ResourceBusy.

Tests (+21, now 142 total):
- smb_status_to_io_error: full mapping coverage including the new
  ResourceBusy case, unknown-status fallback, STATUS_SUCCESS panic guard,
  path preservation
- retry_with_backoff: first-attempt success, success after transient
  failures, exhaustion preserving last error, empty-backoff edge case,
  elapsed-time floor from the schedule, and the structural invariant on
  CONNECT_RETRY_BACKOFF
- parse_compound_response (moved from client.rs to protocol.rs as pub):
  single/multi-message, empty, truncated header, malformed next_command

Benches (+3 in protocol_bench.rs):
- parse_compound_response over n=2,4,8 chained messages
- pipelined_read_decode at (depth, chunk_size) = (8,64K), (64,64K), (64,8K)
  — the GetObject hot-path inner loop with throughput reporting
- pipelined_write_encode at (depth, chunk_size) = (8,64K), (64,64K),
  (64,1M) — the WAL pipelined-write inner loop
Copilot AI review requested due to automatic review settings May 14, 2026 03:52
@lukekim lukekim self-assigned this May 14, 2026
@lukekim lukekim added the enhancement New feature or request label May 14, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens SMB startup resilience (CI + runtime) by adding bounded timeouts, connection retries with backoff, and quieter protocol-layer error logging, alongside new tests and protocol benchmarks to validate and measure the changes.

Changes:

  • Add a TCP connect timeout and a pooled connect retry helper with a fixed backoff schedule.
  • Reduce noisy SMB status logging while extending status→io::ErrorKind mappings (e.g., sharing violations).
  • Add tests for retry/mapping behavior and new benches for compound parsing and pipelined read/write inner loops; improve CI setup action startup diagnostics.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/smb/protocol.rs Exposes parse_compound_response and adds unit tests for compound parsing behavior.
src/smb/pool.rs Introduces retry_with_backoff and applies it to SMB pool connection establishment; adds tests.
src/smb/client.rs Adds TCP connect timeout; adjusts SMB status→io::Error mapping/logging; adds mapping tests.
benches/protocol_bench.rs Adds benches for compound parsing and pipelined read/write encode/decode loops.
.github/actions/setup/action.yml Extends startup wait window and prints spiceio logs on failure/timeout for CI debuggability.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/smb/protocol.rs Outdated
Comment thread src/smb/client.rs Outdated
lukekim added 2 commits May 14, 2026 13:02
Wires two new live-test scripts into the CI job:

- scripts/test-extended.sh — exercises operations not covered by
  test-sccache.sh: multipart uploads (single + N concurrent at 10MB
  each), range GETs (sequential + N concurrent slices of a 4MB file),
  multi-delete (DeleteObjects batch via aws s3api delete-objects),
  conditional writes (If-None-Match: * happy and 412 paths plus N
  racing writers documenting the observable winner/loser ratio),
  ListObjectsV2 during concurrent PUTs, and streaming GET
  cancellation (verifies spiceio stays healthy after a client
  disconnects mid-stream).

- scripts/stress-concurrent.sh — already existed but was not in CI.
  Adds it: concurrent writes to distinct keys, concurrent reads of
  the same key, write-then-read (sccache pattern), mixed read/write
  contention on the same key (with data-corruption guard), and
  concurrent large-file pipelined I/O.

Each script runs on its own port (18335, 18336) so they don't collide
with the existing sccache test on 18333.

Defensive fix to test-sccache.sh: AWS CLI now gets --region
explicitly, and the first ListBuckets call retries up to 3× and
surfaces stderr on failure. Previously a missing AWS_DEFAULT_REGION
on the runner would cause aws s3 ls to fail and set -e would kill
the script before its captured stderr ever got printed, leaving the
real error invisible. The CI job now also sets AWS_DEFAULT_REGION
for the new test steps.
Copilot AI review requested due to automatic review settings May 14, 2026 04:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.

Comment thread src/smb/client.rs Outdated
Comment thread src/smb/pool.rs
- smb_status_to_io_error doc clarified: only the fallback arm formats
  the raw hex; mapped arms rely on the typed ErrorKind.
- retry_with_backoff: emit a concise "retrying (attempt N/max) in Xms"
  notice at slog level. SmbClient::connect already logs the underlying
  error per attempt, so the previous duplicate "attempt N/M failed: {e}"
  line was pure noise.
- test-extended.sh section 7: relax the racing conditional-write check
  so the test fails only on hangs ("000") or a missing response, not on
  the occasional 5xx surfaced by SMB-level contention. Print the offending
  status lines on the rare path so we can investigate without flaking CI.
@lukekim lukekim merged commit 21d4dac into trunk May 14, 2026
4 checks passed
@lukekim lukekim deleted the smb-startup-resilience branch May 14, 2026 05:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants