Harden SMB startup, quieter protocol errors, and add tests/benches#21
Merged
Conversation
Three resilience changes plus the tests and benchmarks for them. Setup action: dump the spiceio log on failure (both unexpected exit and timeout). The log was being written to RUNNER_TEMP but never echoed, so when startup stalled past the grace window there was no way to diagnose which phase hung. Also doubled the grace from 30s to 60s — the previous good run took ~12s, leaving only ~2.5x headroom for a slow server day. TCP connect timeout: TcpStream::connect had no spiceio-level timeout, so a server dropping SYNs left the OS waiting 75-90s and stalled pool init past any sensible CI window. Wrapped in tokio::time::timeout(15s) with explicit TimedOut error. Pool connection retry: extracted retry_with_backoff helper used by SmbPool::connect with a 250ms/750ms/2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init. Quieter protocol-layer logging: smb_status_to_io_error was emitting an error log for every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). Removed the unconditional log; mapped statuses return their typed io::Error silently and the catchall arm still logs for truly unknown statuses. Added STATUS_SHARING_VIOLATION (0xC0000043) -> ErrorKind::ResourceBusy. Tests (+21, now 142 total): - smb_status_to_io_error: full mapping coverage including the new ResourceBusy case, unknown-status fallback, STATUS_SUCCESS panic guard, path preservation - retry_with_backoff: first-attempt success, success after transient failures, exhaustion preserving last error, empty-backoff edge case, elapsed-time floor from the schedule, and the structural invariant on CONNECT_RETRY_BACKOFF - parse_compound_response (moved from client.rs to protocol.rs as pub): single/multi-message, empty, truncated header, malformed next_command Benches (+3 in protocol_bench.rs): - parse_compound_response over n=2,4,8 chained messages - pipelined_read_decode at (depth, chunk_size) = (8,64K), (64,64K), (64,8K) — the GetObject hot-path inner loop with throughput reporting - pipelined_write_encode at (depth, chunk_size) = (8,64K), (64,64K), (64,1M) — the WAL pipelined-write inner loop
There was a problem hiding this comment.
Pull request overview
This PR strengthens SMB startup resilience (CI + runtime) by adding bounded timeouts, connection retries with backoff, and quieter protocol-layer error logging, alongside new tests and protocol benchmarks to validate and measure the changes.
Changes:
- Add a TCP connect timeout and a pooled connect retry helper with a fixed backoff schedule.
- Reduce noisy SMB status logging while extending status→
io::ErrorKindmappings (e.g., sharing violations). - Add tests for retry/mapping behavior and new benches for compound parsing and pipelined read/write inner loops; improve CI setup action startup diagnostics.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
src/smb/protocol.rs |
Exposes parse_compound_response and adds unit tests for compound parsing behavior. |
src/smb/pool.rs |
Introduces retry_with_backoff and applies it to SMB pool connection establishment; adds tests. |
src/smb/client.rs |
Adds TCP connect timeout; adjusts SMB status→io::Error mapping/logging; adds mapping tests. |
benches/protocol_bench.rs |
Adds benches for compound parsing and pipelined read/write encode/decode loops. |
.github/actions/setup/action.yml |
Extends startup wait window and prints spiceio logs on failure/timeout for CI debuggability. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Wires two new live-test scripts into the CI job: - scripts/test-extended.sh — exercises operations not covered by test-sccache.sh: multipart uploads (single + N concurrent at 10MB each), range GETs (sequential + N concurrent slices of a 4MB file), multi-delete (DeleteObjects batch via aws s3api delete-objects), conditional writes (If-None-Match: * happy and 412 paths plus N racing writers documenting the observable winner/loser ratio), ListObjectsV2 during concurrent PUTs, and streaming GET cancellation (verifies spiceio stays healthy after a client disconnects mid-stream). - scripts/stress-concurrent.sh — already existed but was not in CI. Adds it: concurrent writes to distinct keys, concurrent reads of the same key, write-then-read (sccache pattern), mixed read/write contention on the same key (with data-corruption guard), and concurrent large-file pipelined I/O. Each script runs on its own port (18335, 18336) so they don't collide with the existing sccache test on 18333. Defensive fix to test-sccache.sh: AWS CLI now gets --region explicitly, and the first ListBuckets call retries up to 3× and surfaces stderr on failure. Previously a missing AWS_DEFAULT_REGION on the runner would cause aws s3 ls to fail and set -e would kill the script before its captured stderr ever got printed, leaving the real error invisible. The CI job now also sets AWS_DEFAULT_REGION for the new test steps.
- smb_status_to_io_error doc clarified: only the fallback arm formats
the raw hex; mapped arms rely on the typed ErrorKind.
- retry_with_backoff: emit a concise "retrying (attempt N/max) in Xms"
notice at slog level. SmbClient::connect already logs the underlying
error per attempt, so the previous duplicate "attempt N/M failed: {e}"
line was pure noise.
- test-extended.sh section 7: relax the racing conditional-write check
so the test fails only on hangs ("000") or a missing response, not on
the occasional 5xx surfaced by SMB-level contention. Print the offending
status lines on the rare path so we can investigate without flaking CI.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three resilience changes plus the tests and benchmarks for them.
RUNNER_TEMPbut never echoed, so any startup stall past 30s was undiagnosable. The earlier successful baseline took ~12s, so 30s left only ~2.5x headroom on slow server days.TcpStream::connect. Without it, a server dropping SYNs leaves the OS waiting 75-90s and stalls pool init past any sensible CI window.retry_with_backoffhelper used bySmbPool::connectwith a 250ms / 750ms / 2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init.smb_status_to_io_errorwasserr!-ing every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). The unconditional log is gone — mapped statuses return their typedio::Errorsilently and the catchall arm still logs truly unknown statuses.STATUS_SHARING_VIOLATION(0xC0000043) now maps toErrorKind::ResourceBusy.parse_compound_response(moved fromclient.rstoprotocol.rs).protocol_bench.rs):parse_compound_responseover n=2/4/8,pipelined_read_decodeat (depth, chunk) = (8,64K) / (64,64K) / (64,8K) with throughput reporting (GetObject hot-path inner loop), andpipelined_write_encodeat (8,64K) / (64,64K) / (64,1M) (WAL pipelined-write inner loop).Test plan
make lintclean (fmt-check + clippy-D warnings+ rustdoc-D warnings)cargo test --lib— 142 passed, 0 failedcargo bench --bench protocol_bench -- --test— every bench case runs its smoke iterationcargo bench --bench protocol_bench -- 'pipelined_read_decode/d64_c65536' --quick— produces real numbers (~69 µs / 56 GiB/s on the dev machine)