Harden SMB startup, quieter protocol errors, and add tests/benches by lukekim · Pull Request #21 · spiceai/spiceio

lukekim · 2026-05-14T03:52:55Z

Summary

Three resilience changes plus the tests and benchmarks for them.

Setup action dumps the spiceio log on failure (both unexpected exit and timeout) and the grace window goes from 30s to 60s. Previously the log was written to RUNNER_TEMP but never echoed, so any startup stall past 30s was undiagnosable. The earlier successful baseline took ~12s, so 30s left only ~2.5x headroom on slow server days.
TCP connect timeout (15s) wrapping TcpStream::connect. Without it, a server dropping SYNs leaves the OS waiting 75-90s and stalls pool init past any sensible CI window.
Pool connection retry via a new retry_with_backoff helper used by SmbPool::connect with a 250ms / 750ms / 2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init.
Quieter protocol-layer logging: smb_status_to_io_error was serr!-ing every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). The unconditional log is gone — mapped statuses return their typed io::Error silently and the catchall arm still logs truly unknown statuses. STATUS_SHARING_VIOLATION (0xC0000043) now maps to ErrorKind::ResourceBusy.
Tests (+21, now 142 total) covering the full status-code mapping, the retry helper semantics (including elapsed-time floor and exhaustion behavior), and the newly public parse_compound_response (moved from client.rs to protocol.rs).
Benches (+3 in protocol_bench.rs): parse_compound_response over n=2/4/8, pipelined_read_decode at (depth, chunk) = (8,64K) / (64,64K) / (64,8K) with throughput reporting (GetObject hot-path inner loop), and pipelined_write_encode at (8,64K) / (64,64K) / (64,1M) (WAL pipelined-write inner loop).
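The bounded connect described in the summary can be sketched roughly as follows. This is a dependency-free analog, not the PR's code: the PR wraps the async TcpStream::connect in tokio::time::timeout, whereas this sketch uses the blocking std::net::TcpStream::connect_timeout. The names connect_with_timeout and CONNECT_TIMEOUT are hypothetical.

```rust
use std::io;
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

/// Hypothetical constant mirroring the 15s bound from the summary.
const CONNECT_TIMEOUT: Duration = Duration::from_secs(15);

/// Resolve `addr` and try each resolved address with a bounded connect.
/// A server that silently drops SYNs surfaces as an error (typically
/// `ErrorKind::TimedOut`) after `timeout` instead of the OS default wait.
fn connect_with_timeout(addr: &str, timeout: Duration) -> io::Result<TcpStream> {
    let mut last_err = io::Error::new(io::ErrorKind::InvalidInput, "no addresses resolved");
    for sock in addr.to_socket_addrs()? {
        match TcpStream::connect_timeout(&sock, timeout) {
            Ok(stream) => return Ok(stream),
            // Remember the failure and move on to the next resolved address.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```

A caller would pass CONNECT_TIMEOUT as the bound, so a stalled connect fails fast enough for the retry schedule to act within the CI grace window.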

Test plan

make lint clean (fmt-check + clippy -D warnings + rustdoc -D warnings)
cargo test --lib — 142 passed, 0 failed
cargo bench --bench protocol_bench -- --test — every bench case runs its smoke iteration
cargo bench --bench protocol_bench -- 'pipelined_read_decode/d64_c65536' --quick — produces real numbers (~69 µs / 56 GiB/s on the dev machine)
CI green on this branch — re-run the previously failing job and confirm the new action surfaces the spiceio log if startup stalls
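As a sanity check on the quoted bench figure, the arithmetic can be reproduced with a small helper. This assumes the bench ID d64_c65536 means depth 64 and a 65536-byte chunk, and that throughput counts payload bytes per iteration; the function name is hypothetical.

```rust
/// Throughput in GiB/s for `depth` chunks of `chunk_size` bytes
/// processed in `elapsed_us` microseconds.
fn throughput_gib_per_s(depth: u64, chunk_size: u64, elapsed_us: f64) -> f64 {
    let bytes = (depth * chunk_size) as f64;
    let seconds = elapsed_us / 1_000_000.0;
    bytes / seconds / (1024.0 * 1024.0 * 1024.0)
}
```

With depth 64, chunk size 65536 and 69 µs, this gives 4 MiB per iteration and roughly 56.6 GiB/s, consistent with the number reported above.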

Three resilience changes plus the tests and benchmarks for them.

Setup action: dump the spiceio log on failure (both unexpected exit and timeout). The log was being written to RUNNER_TEMP but never echoed, so when startup stalled past the grace window there was no way to diagnose which phase hung. Also doubled the grace from 30s to 60s: the previous good run took ~12s, leaving only ~2.5x headroom for a slow server day.

TCP connect timeout: TcpStream::connect had no spiceio-level timeout, so a server dropping SYNs left the OS waiting 75-90s and stalled pool init past any sensible CI window. Wrapped in tokio::time::timeout(15s) with explicit TimedOut error.

Pool connection retry: extracted retry_with_backoff helper used by SmbPool::connect with a 250ms/750ms/2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init.

Quieter protocol-layer logging: smb_status_to_io_error was emitting an error log for every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). Removed the unconditional log; mapped statuses return their typed io::Error silently and the catchall arm still logs for truly unknown statuses. Added STATUS_SHARING_VIOLATION (0xC0000043) -> ErrorKind::ResourceBusy.

Tests (+21, now 142 total):
- smb_status_to_io_error: full mapping coverage including the new ResourceBusy case, unknown-status fallback, STATUS_SUCCESS panic guard, path preservation
- retry_with_backoff: first-attempt success, success after transient failures, exhaustion preserving last error, empty-backoff edge case, elapsed-time floor from the schedule, and the structural invariant on CONNECT_RETRY_BACKOFF
- parse_compound_response (moved from client.rs to protocol.rs as pub): single/multi-message, empty, truncated header, malformed next_command

Benches (+3 in protocol_bench.rs):
- parse_compound_response over n=2,4,8 chained messages
- pipelined_read_decode at (depth, chunk_size) = (8,64K), (64,64K), (64,8K): the GetObject hot-path inner loop with throughput reporting
- pipelined_write_encode at (depth, chunk_size) = (8,64K), (64,64K), (64,1M): the WAL pipelined-write inner loop
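The retry semantics listed for retry_with_backoff (first-attempt success, exhaustion preserving the last error, the empty-backoff edge case) can be illustrated with a simplified synchronous sketch. The real helper is presumably async and its signature is not shown in this PR page, so the signature below is an assumption; only the name CONNECT_RETRY_BACKOFF and the 250ms / 750ms / 2s schedule come from the description.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Delays between attempts; three delays allow four attempts in total.
const CONNECT_RETRY_BACKOFF: [Duration; 3] = [
    Duration::from_millis(250),
    Duration::from_millis(750),
    Duration::from_secs(2),
];

/// Call `op` until it succeeds or the schedule is exhausted.
/// On exhaustion, the error from the final attempt is returned.
fn retry_with_backoff<T, E>(
    backoff: &[Duration],
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    for delay in backoff {
        match op() {
            Ok(value) => return Ok(value),
            // A real helper would emit its "retrying" notice here.
            Err(_) => sleep(*delay),
        }
    }
    // Final attempt: no sleep afterwards, and its result is returned as-is.
    op()
}
```

With an empty schedule the operation runs exactly once, and a run that exhausts the full schedule sleeps for at least the sum of the delays (3s here), which is the kind of elapsed-time floor the tests describe.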

Copilot

Pull request overview

This PR strengthens SMB startup resilience (CI + runtime) by adding bounded timeouts, connection retries with backoff, and quieter protocol-layer error logging, alongside new tests and protocol benchmarks to validate and measure the changes.

Changes:

Add a TCP connect timeout and a pooled connect retry helper with a fixed backoff schedule.
Reduce noisy SMB status logging while extending status→io::ErrorKind mappings (e.g., sharing violations).
Add tests for retry/mapping behavior and new benches for compound parsing and pipelined read/write inner loops; improve CI setup action startup diagnostics.
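The status mapping change can be pictured with a small sketch. The function name and the 0xC0000043 -> ErrorKind::ResourceBusy mapping come from the PR description; the signature, the other two arms, and the message format are assumptions for illustration (the two extra NTSTATUS values are the standard codes for "object name not found" and "access denied").

```rust
use std::io;

const STATUS_ACCESS_DENIED: u32 = 0xC000_0022;
const STATUS_OBJECT_NAME_NOT_FOUND: u32 = 0xC000_0034;
const STATUS_SHARING_VIOLATION: u32 = 0xC000_0043;

/// Map an SMB status to a typed io::Error. Mapped arms stay silent;
/// only the fallback arm logs and formats the raw hex status.
fn smb_status_to_io_error(status: u32, path: &str) -> io::Error {
    let kind = match status {
        STATUS_OBJECT_NAME_NOT_FOUND => io::ErrorKind::NotFound,
        STATUS_ACCESS_DENIED => io::ErrorKind::PermissionDenied,
        STATUS_SHARING_VIOLATION => io::ErrorKind::ResourceBusy,
        _ => {
            // Truly unknown status: this is the only place that logs.
            eprintln!("unmapped SMB status 0x{:08X} for {}", status, path);
            return io::Error::new(
                io::ErrorKind::Other,
                format!("SMB status 0x{:08X}: {}", status, path),
            );
        }
    };
    // Expected statuses carry the path but produce no log line.
    io::Error::new(kind, path.to_string())
}
```

Callers can then branch on kind() for expected conditions such as NotFound on a HEAD probe without the log filling up with error lines.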

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`src/smb/protocol.rs`	Exposes `parse_compound_response` and adds unit tests for compound parsing behavior.
`src/smb/pool.rs`	Introduces `retry_with_backoff` and applies it to SMB pool connection establishment; adds tests.
`src/smb/client.rs`	Adds TCP connect timeout; adjusts SMB status→`io::Error` mapping/logging; adds mapping tests.
`benches/protocol_bench.rs`	Adds benches for compound parsing and pipelined read/write encode/decode loops.
`.github/actions/setup/action.yml`	Extends startup wait window and prints spiceio logs on failure/timeout for CI debuggability.
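For readers unfamiliar with compound responses, the following is a minimal sketch of what a parser like parse_compound_response has to do. It relies on the SMB2 header layout (a 64-byte header with the little-endian NextCommand field at byte offset 20); the return type and the error handling are assumptions, not the signature used in protocol.rs.

```rust
/// Size of the fixed SMB2 header in bytes.
const SMB2_HEADER_LEN: usize = 64;
/// Byte offset of the little-endian u32 NextCommand field in the header.
const NEXT_COMMAND_OFFSET: usize = 20;

/// Split a compound response into per-message slices.
/// Returns `None` if a header is truncated or NextCommand is out of range.
fn parse_compound_response(buf: &[u8]) -> Option<Vec<&[u8]>> {
    let mut messages = Vec::new();
    let mut rest = buf;
    while !rest.is_empty() {
        if rest.len() < SMB2_HEADER_LEN {
            return None; // truncated header
        }
        let o = NEXT_COMMAND_OFFSET;
        let next = u32::from_le_bytes([rest[o], rest[o + 1], rest[o + 2], rest[o + 3]]) as usize;
        if next == 0 {
            // Zero marks the last message in the chain; it takes the remainder.
            messages.push(rest);
            break;
        }
        if next < SMB2_HEADER_LEN || next > rest.len() {
            return None; // malformed next_command
        }
        let (msg, tail) = rest.split_at(next);
        messages.push(msg);
        rest = tail;
    }
    Some(messages)
}
```

The cases exercised by the new tests (single and multi-message input, empty input, truncated header, malformed next_command) map onto the branches above.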

Wires two new live-test scripts into the CI job:
- scripts/test-extended.sh exercises operations not covered by test-sccache.sh: multipart uploads (single + N concurrent at 10MB each), range GETs (sequential + N concurrent slices of a 4MB file), multi-delete (DeleteObjects batch via aws s3api delete-objects), conditional writes (If-None-Match: * happy and 412 paths plus N racing writers documenting the observable winner/loser ratio), ListObjectsV2 during concurrent PUTs, and streaming GET cancellation (verifies spiceio stays healthy after a client disconnects mid-stream).
- scripts/stress-concurrent.sh already existed but was not in CI. Adds it: concurrent writes to distinct keys, concurrent reads of the same key, write-then-read (sccache pattern), mixed read/write contention on the same key (with data-corruption guard), and concurrent large-file pipelined I/O.

Each script runs on its own port (18335, 18336) so they don't collide with the existing sccache test on 18333.

Defensive fix to test-sccache.sh: AWS CLI now gets --region explicitly, and the first ListBuckets call retries up to 3× and surfaces stderr on failure. Previously a missing AWS_DEFAULT_REGION on the runner would cause aws s3 ls to fail and set -e would kill the script before its captured stderr ever got printed, leaving the real error invisible. The CI job now also sets AWS_DEFAULT_REGION for the new test steps.

Copilot

Pull request overview

Copilot reviewed 10 out of 11 changed files in this pull request and generated 2 comments.

- smb_status_to_io_error doc clarified: only the fallback arm formats the raw hex; mapped arms rely on the typed ErrorKind.
- retry_with_backoff: emit a concise "retrying (attempt N/max) in Xms" notice at slog level. SmbClient::connect already logs the underlying error per attempt, so the previous duplicate "attempt N/M failed: {e}" line was pure noise.
- test-extended.sh section 7: relax the racing conditional-write check so the test fails only on hangs ("000") or a missing response, not on the occasional 5xx surfaced by SMB-level contention. Print the offending status lines on the rare path so we can investigate without flaking CI.

Copilot AI review requested due to automatic review settings May 14, 2026 03:52

Copilot started reviewing on behalf of lukekim May 14, 2026 03:53 View session

lukekim self-assigned this May 14, 2026

lukekim added the enhancement New feature or request label May 14, 2026

Bump version to v0.5.2

ab7102f

Copilot AI reviewed May 14, 2026

View reviewed changes

Comment thread src/smb/protocol.rs Outdated

Comment thread src/smb/client.rs Outdated

lukekim added 2 commits May 14, 2026 13:02

Address PR review: tighten bad next_command test, remove orphan comment

5cb182b

Copilot AI review requested due to automatic review settings May 14, 2026 04:15

Copilot started reviewing on behalf of lukekim May 14, 2026 04:16 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

Comment thread src/smb/client.rs Outdated

Comment thread src/smb/pool.rs

lukekim merged commit 21d4dac into trunk May 14, 2026
4 checks passed

lukekim deleted the smb-startup-resilience branch May 14, 2026 05:05

lukekim merged 5 commits into trunk from smb-startup-resilience
