Skip to content

Commit 21d4dac

Browse files
authored
Harden SMB startup, quieter protocol errors, and add tests/benches (#21)
* Harden SMB startup, quieter protocol errors, and add tests/benches Three resilience changes plus the tests and benchmarks for them. Setup action: dump the spiceio log on failure (both unexpected exit and timeout). The log was being written to RUNNER_TEMP but never echoed, so when startup stalled past the grace window there was no way to diagnose which phase hung. Also doubled the grace from 30s to 60s — the previous good run took ~12s, leaving only ~2.5x headroom for a slow server day. TCP connect timeout: TcpStream::connect had no spiceio-level timeout, so a server dropping SYNs left the OS waiting 75-90s and stalled pool init past any sensible CI window. Wrapped in tokio::time::timeout(15s) with explicit TimedOut error. Pool connection retry: extracted retry_with_backoff helper used by SmbPool::connect with a 250ms/750ms/2s schedule (4 attempts). A flaky connection during startup no longer takes down the whole pool init. Quieter protocol-layer logging: smb_status_to_io_error was emitting an error log for every SMB status, including expected ones (NotFound on HEAD probes, SharingViolation during WAL cleanup). Removed the unconditional log; mapped statuses return their typed io::Error silently and the catchall arm still logs for truly unknown statuses. Added STATUS_SHARING_VIOLATION (0xC0000043) -> ErrorKind::ResourceBusy. Tests (+21, now 142 total): - smb_status_to_io_error: full mapping coverage including the new ResourceBusy case, unknown-status fallback, STATUS_SUCCESS panic guard, path preservation - retry_with_backoff: first-attempt success, success after transient failures, exhaustion preserving last error, empty-backoff edge case, elapsed-time floor from the schedule, and the structural invariant on CONNECT_RETRY_BACKOFF - parse_compound_response (moved from client.rs to protocol.rs as pub): single/multi-message, empty, truncated header, malformed next_command Benches (+3 in protocol_bench.rs): - parse_compound_response over n=2,4,8 chained messages - pipelined_read_decode at (depth, chunk_size) = (8,64K), (64,64K), (64,8K) — the GetObject hot-path inner loop with throughput reporting - pipelined_write_encode at (depth, chunk_size) = (8,64K), (64,64K), (64,1M) — the WAL pipelined-write inner loop * Bump version to v0.5.2 * Address PR review: tighten bad next_command test, remove orphan comment * Add comprehensive CI live tests and fix region flakiness Wires two new live-test scripts into the CI job: - scripts/test-extended.sh — exercises operations not covered by test-sccache.sh: multipart uploads (single + N concurrent at 10MB each), range GETs (sequential + N concurrent slices of a 4MB file), multi-delete (DeleteObjects batch via aws s3api delete-objects), conditional writes (If-None-Match: * happy and 412 paths plus N racing writers documenting the observable winner/loser ratio), ListObjectsV2 during concurrent PUTs, and streaming GET cancellation (verifies spiceio stays healthy after a client disconnects mid-stream). - scripts/stress-concurrent.sh — already existed but was not in CI. Adds it: concurrent writes to distinct keys, concurrent reads of the same key, write-then-read (sccache pattern), mixed read/write contention on the same key (with data-corruption guard), and concurrent large-file pipelined I/O. Each script runs on its own port (18335, 18336) so they don't collide with the existing sccache test on 18333. Defensive fix to test-sccache.sh: AWS CLI now gets --region explicitly, and the first ListBuckets call retries up to 3× and surfaces stderr on failure. Previously a missing AWS_DEFAULT_REGION on the runner would cause aws s3 ls to fail and set -e would kill the script before its captured stderr ever got printed, leaving the real error invisible. The CI job now also sets AWS_DEFAULT_REGION for the new test steps. * Address PR review and relax conditional-write race assertion - smb_status_to_io_error doc clarified: only the fallback arm formats the raw hex; mapped arms rely on the typed ErrorKind. - retry_with_backoff: emit a concise "retrying (attempt N/max) in Xms" notice at slog level. SmbClient::connect already logs the underlying error per attempt, so the previous duplicate "attempt N/M failed: {e}" line was pure noise. - test-extended.sh section 7: relax the racing conditional-write check so the test fails only on hangs ("000") or a missing response, not on the occasional 5xx surfaced by SMB-level contention. Print the offending status lines on the rare path so we can investigate without flaking CI.
1 parent 18e298c commit 21d4dac

11 files changed

Lines changed: 1015 additions & 57 deletions

File tree

.github/actions/setup/action.yml

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -132,9 +132,12 @@ runs:
132132
# auto-increment if the requested port was busy).
133133
echo "Waiting for spiceio on ${SPICEIO_BIND}..."
134134
ENDPOINT=""
135-
for i in $(seq 1 30); do
135+
for i in $(seq 1 60); do
136136
if ! kill -0 "$PID" 2>/dev/null; then
137137
echo "::error::spiceio exited unexpectedly"
138+
echo "::group::spiceio log"
139+
cat "$SPICEIO_LOG" 2>/dev/null || echo "(log file missing)"
140+
echo "::endgroup::"
138141
exit 1
139142
fi
140143
ENDPOINT=$(grep 'listening on' "$SPICEIO_LOG" 2>/dev/null | grep -o 'http://[^ ]*' | tail -1 || true)
@@ -146,6 +149,10 @@ runs:
146149
fi
147150
sleep 1
148151
done
149-
echo "::error::spiceio failed to start within 30s"
152+
echo "::error::spiceio failed to start within 60s"
153+
echo "::group::spiceio log"
154+
cat "$SPICEIO_LOG" 2>/dev/null || echo "(log file missing)"
155+
echo "::endgroup::"
156+
kill "$PID" 2>/dev/null || true
150157
exit 1
151158

.github/workflows/ci.yml

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -96,6 +96,37 @@ jobs:
9696
SPICEIO_REGION: ${{ vars.SPICEIO_REGION || 'us-west-1' }}
9797
run: ./scripts/test-sccache.sh
9898

99+
- name: Run extended S3 operations test
100+
# Multipart, range, multi-delete, conditional writes, list-under-load,
101+
# streaming cancellation. Uses bucket 'extended' and port 18336 so it
102+
# doesn't collide with the sccache test above.
103+
if: ${{ env.HAS_SMB_PASS == 'true' }}
104+
env:
105+
SPICEIO_SMB_SERVER: ${{ vars.SPICEIO_SMB_SERVER || '192.168.3.148' }}
106+
SPICEIO_SMB_USER: ${{ vars.SPICEIO_SMB_USER || 'runner' }}
107+
SPICEIO_SMB_PASS: ${{ secrets.UNAS_SMB_PASS }}
108+
SPICEIO_SMB_SHARE: ${{ vars.SPICEIO_SMB_SHARE || 'ai_platform_dev' }}
109+
SPICEIO_BUCKET: extended
110+
SPICEIO_REGION: ${{ vars.SPICEIO_REGION || 'us-west-1' }}
111+
SPICEIO_BIND: 127.0.0.1:18336
112+
AWS_DEFAULT_REGION: ${{ vars.SPICEIO_REGION || 'us-west-1' }}
113+
run: ./scripts/test-extended.sh
114+
115+
- name: Run concurrent stress test
116+
# Concurrent writes/reads/contention, write-then-read patterns,
117+
# mixed read/write on same key, and large-file pipelined I/O.
118+
if: ${{ env.HAS_SMB_PASS == 'true' }}
119+
env:
120+
SPICEIO_SMB_SERVER: ${{ vars.SPICEIO_SMB_SERVER || '192.168.3.148' }}
121+
SPICEIO_SMB_USER: ${{ vars.SPICEIO_SMB_USER || 'runner' }}
122+
SPICEIO_SMB_PASS: ${{ secrets.UNAS_SMB_PASS }}
123+
SPICEIO_SMB_SHARE: ${{ vars.SPICEIO_SMB_SHARE || 'ai_platform_dev' }}
124+
SPICEIO_BUCKET: stress
125+
SPICEIO_REGION: ${{ vars.SPICEIO_REGION || 'us-west-1' }}
126+
SPICEIO_BIND: 127.0.0.1:18335
127+
AWS_DEFAULT_REGION: ${{ vars.SPICEIO_REGION || 'us-west-1' }}
128+
run: ./scripts/stress-concurrent.sh
129+
99130
- name: Build release artifact
100131
run: cargo build --release --locked --bin spiceio
101132

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "spiceio"
3-
version = "0.5.1"
3+
version = "0.5.2"
44
edition = "2024"
55
description = "S3-compatible API proxy to SMB file shares"
66
license = "Apache-2.0"

benches/protocol_bench.rs

Lines changed: 130 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -152,6 +152,133 @@ fn bench_encode_set_info_rename(c: &mut Criterion) {
152152
group.finish();
153153
}
154154

155+
// ── Parser benches for new public API ────────────────────────────────────────
156+
157+
/// Bench parsing of a compound response (chained SMB2 messages in one frame).
158+
/// Compound responses are the wire format for create+read+close and similar
159+
/// batched operations — relevant CPU cost when the S3 layer issues compounds.
160+
fn bench_parse_compound_response(c: &mut Criterion) {
161+
let mut group = c.benchmark_group("parse_compound_response");
162+
// Body size of 16 bytes is representative of close/create-response payloads.
163+
let body_len = 16usize;
164+
let entry_size = SMB2_HEADER_SIZE + body_len;
165+
for n in [2usize, 4, 8] {
166+
let mut data = vec![0u8; entry_size * n];
167+
for i in 0..n {
168+
let mut hdr = Header::new(Command::Create, i as u64);
169+
hdr.next_command = if i + 1 < n { entry_size as u32 } else { 0 };
170+
let mut buf = BytesMut::with_capacity(SMB2_HEADER_SIZE);
171+
hdr.encode(&mut buf);
172+
let start = i * entry_size;
173+
data[start..start + SMB2_HEADER_SIZE].copy_from_slice(&buf);
174+
for b in &mut data[start + SMB2_HEADER_SIZE..start + entry_size] {
175+
*b = 0xAB;
176+
}
177+
}
178+
group.bench_with_input(
179+
criterion::BenchmarkId::from_parameter(n),
180+
&data,
181+
|b, data| b.iter(|| parse_compound_response(black_box(data))),
182+
);
183+
}
184+
group.finish();
185+
}
186+
187+
/// Build one framed SMB2 read response message (header + read response body +
188+
/// data) ready for `Header::decode` + `decode_read_response_owned`.
189+
fn build_read_response_msg(msg_id: u64, data_len: usize) -> Vec<u8> {
190+
let body_len = 16 + data_len;
191+
let mut msg = vec![0u8; SMB2_HEADER_SIZE + body_len];
192+
193+
let mut hdr_buf = BytesMut::with_capacity(SMB2_HEADER_SIZE);
194+
let mut hdr = Header::new(Command::Read, msg_id);
195+
hdr.status = 0;
196+
hdr.encode(&mut hdr_buf);
197+
msg[..SMB2_HEADER_SIZE].copy_from_slice(&hdr_buf);
198+
199+
let body = &mut msg[SMB2_HEADER_SIZE..];
200+
// StructureSize = 17
201+
body[0..2].copy_from_slice(&17u16.to_le_bytes());
202+
// DataOffset (from start of SMB2 message)
203+
let data_offset = (SMB2_HEADER_SIZE + 16) as u16;
204+
body[2..4].copy_from_slice(&data_offset.to_le_bytes());
205+
// DataLength
206+
body[4..8].copy_from_slice(&(data_len as u32).to_le_bytes());
207+
// Remaining 8 bytes (DataRemaining + Flags) stay zero. Data bytes stay zero.
208+
msg
209+
}
210+
211+
/// Bench the CPU-bound per-batch work of `pipelined_read`: header decode,
212+
/// slot computation from message_id, and `decode_read_response_owned`. This
213+
/// is the inner loop of GetObject streaming once the wire bytes are in.
214+
fn bench_pipelined_read_decode(c: &mut Criterion) {
215+
let mut group = c.benchmark_group("pipelined_read_decode");
216+
// (depth, chunk_size) — depth=64 matches PIPELINE_DEPTH in ops.rs.
217+
let cases = [(8usize, 65536usize), (64, 65536), (64, 8192)];
218+
for (depth, chunk_size) in cases {
219+
let base_msg_id = 1_000u64;
220+
let messages: Vec<Vec<u8>> = (0..depth)
221+
.map(|i| build_read_response_msg(base_msg_id + i as u64, chunk_size))
222+
.collect();
223+
group.throughput(criterion::Throughput::Bytes((depth * chunk_size) as u64));
224+
group.bench_with_input(
225+
criterion::BenchmarkId::from_parameter(format!("d{depth}_c{chunk_size}")),
226+
&messages,
227+
|b, messages| {
228+
b.iter(|| {
229+
let n = messages.len();
230+
let mut slots: Vec<Option<bytes::Bytes>> = (0..n).map(|_| None).collect();
231+
for msg in messages.iter() {
232+
let header = Header::decode(black_box(msg)).unwrap();
233+
let slot = header.message_id.wrapping_sub(base_msg_id) as usize;
234+
let body = msg[SMB2_HEADER_SIZE..].to_vec();
235+
slots[slot] = decode_read_response_owned(body);
236+
}
237+
slots
238+
});
239+
},
240+
);
241+
}
242+
group.finish();
243+
}
244+
245+
/// Bench the CPU-bound per-batch work of `pipelined_write`: header construction
246+
/// (with credit charge), `encode_write_request`, and `build_request` framing.
247+
/// This is the inner loop of WAL pipelined writes before any I/O happens.
248+
fn bench_pipelined_write_encode(c: &mut Criterion) {
249+
let mut group = c.benchmark_group("pipelined_write_encode");
250+
let file_id = [1u8; 16];
251+
// (depth, chunk_size) — depth=64 matches WRITE_PIPELINE_DEPTH in ops.rs.
252+
let cases = [(8usize, 65536usize), (64, 65536), (64, 1024 * 1024)];
253+
for (depth, chunk_size) in cases {
254+
let chunk = vec![0u8; chunk_size];
255+
group.throughput(criterion::Throughput::Bytes((depth * chunk_size) as u64));
256+
group.bench_with_input(
257+
criterion::BenchmarkId::from_parameter(format!("d{depth}_c{chunk_size}")),
258+
&chunk,
259+
|b, chunk| {
260+
b.iter(|| {
261+
let mut packets = Vec::with_capacity(depth);
262+
let mut offset = 0u64;
263+
for i in 0..depth {
264+
let mut hdr = Header::new(Command::Write, i as u64)
265+
.with_credit_charge(chunk.len() as u32);
266+
hdr.tree_id = 42;
267+
hdr.session_id = 0xdead_beef;
268+
let packet = build_request(&hdr, |buf| {
269+
encode_write_request(buf, &file_id, offset, black_box(chunk));
270+
});
271+
packets.push(packet);
272+
offset += chunk.len() as u64;
273+
}
274+
packets
275+
});
276+
},
277+
);
278+
}
279+
group.finish();
280+
}
281+
155282
fn bench_parse_directory_entries(c: &mut Criterion) {
156283
// Build 50 entries
157284
let mut data = Vec::new();
@@ -192,6 +319,9 @@ criterion_group!(
192319
bench_decode_read_response,
193320
bench_decode_read_response_owned,
194321
bench_build_request,
322+
bench_parse_compound_response,
323+
bench_pipelined_read_decode,
324+
bench_pipelined_write_encode,
195325
bench_parse_directory_entries,
196326
);
197327
criterion_main!(benches);

scripts/stress-concurrent.sh

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -34,7 +34,7 @@ SPICEIO_PID=""
3434
cleanup() {
3535
echo ""
3636
echo "[stress] cleaning up..."
37-
aws --endpoint-url "$ENDPOINT" --no-sign-request \
37+
aws --endpoint-url "$ENDPOINT" --no-sign-request --region "$REGION" \
3838
s3 rm "s3://${BUCKET}/${PREFIX}/" --recursive --quiet 2>/dev/null || true
3939
if [[ -n "$SPICEIO_PID" ]]; then
4040
kill "$SPICEIO_PID" 2>/dev/null || true

0 commit comments

Comments
 (0)