Skip to content

Commit e31c63c

Browse files
authored
Fix sharing-violation 500s + carry forward review fixes (v0.5.4) (#23)
* Fix sharing-violation 500s + carry forward review fixes (v0.5.4) The previous PR squash-merged with four Copilot review threads unaddressed (auto-merge fired before the fixes landed). Brings those forward, and adds the actual bug they were a symptom of — sharing violations leaking through io_to_s3_error's catch-all arm as 500 InternalError with noisy `error:` log lines. Sharing violations are an expected transient under concurrent writes (the SMB server refuses a file held by another handle). The typed io::ErrorKind::ResourceBusy mapping was already correct; the HTTP layer just had no arm for it. Now maps to 503 SlowDown — the standard retryable S3 status — and logs via slog! instead of serr!, so SDKs auto-retry instead of seeing a fatal 500 and CI stops painting these red. Adds an io_to_s3_error unit test covering all four kinds (NotFound, PermissionDenied, ResourceBusy, fallback) so this regression has a test gate next time. Adds a stderr guard to the three CI integration scripts (test-sccache, test-extended, stress-concurrent) that fails the run if spiceio emits `[spiceio] error:` lines. Expected transients log through slog! without the prefix, so any hit means a new failure mode is leaking through the 500 InternalError arm. Would have caught the original regression. Review fixes carried forward from edc1579: - decode_read_response_from_msg rejects data_offset < SMB2_HEADER_SIZE so a malformed server response cannot slice into the SMB2 header. Adds two unit tests. - GetObject streaming channel capped by an 8 MiB per-request memory budget on top of READ_PIPELINE_DEPTH; at 1 MiB chunks the channel falls from 64 to 8 instead of buffering up to 64 MiB. - bench-live.sh mount_smbfs baseline no longer embeds the SMB password in argv (leaked via shell history / ps) and no longer shells out to perl -MURI::Escape (not in minimal Perl installs). Uses macOS Keychain credentials; skips clearly if not configured. * Address PR review feedback - stream_channel_capacity: extracted from the inline GetObject channel sizing into a tested helper. Floors chunk_size at 1 before the divide-by, so an SMB-negotiated max_read_size of 0 cannot panic the streaming task with a divide-by-zero. Four new unit tests cover the zero-chunk guard plus the default 64 KiB, large 1 MiB, and budget- exceeding cases. - Integration test stderr guards: kill+wait the spiceio PID(s) explicitly before grepping the captured stderr. The previous `sync` was a no-op for the tee process-substitution pipe, and the cleanup trap hadn't fired yet so spiceio's stderr stream was still open. test-sccache.sh has two spiceio instances; the guard now kills both. Misleading comment removed.
1 parent 5736b5f commit e31c63c

8 files changed

Lines changed: 289 additions & 19 deletions

File tree

Cargo.lock

Lines changed: 1 addition & 1 deletion
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

Cargo.toml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
[package]
22
name = "spiceio"
3-
version = "0.5.3"
3+
version = "0.5.4"
44
edition = "2024"
55
description = "S3-compatible API proxy to SMB file shares"
66
license = "Apache-2.0"

scripts/bench-live.sh

Lines changed: 14 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -193,18 +193,27 @@ bench_concurrent_read() {
193193
# and runs the same dd-based write/read tests. Establishes the hard
194194
# ceiling for what the link can do, so we can attribute spiceio's
195195
# translation overhead.
196+
#
197+
# The mount uses the macOS Keychain for credentials rather than embedding
198+
# the password in the mount URL — embedded passwords leak via shell
199+
# history and `ps` argv listings, and shelling out to `perl -MURI::Escape`
200+
# adds a non-portable dependency. To prep the keychain once:
201+
# mount_smbfs //user@server/share /tmp/mnt # interactive; offers
202+
# "Remember in keychain"
203+
# umount /tmp/mnt
204+
# If the keychain entry is missing, the baseline is skipped.
196205
bench_mount_baseline() {
197206
local user="$SPICEIO_SMB_USER"
198-
local pass="$SPICEIO_SMB_PASS"
199207
local server="$SMB_SERVER"
200208
local share="$SMB_SHARE"
201209

202210
MOUNT_POINT="/tmp/spiceio-bench-mount-$$"
203211
mkdir -p "$MOUNT_POINT"
204-
local escaped_pass
205-
escaped_pass=$(printf '%s' "$pass" | perl -MURI::Escape -ne 'print uri_escape($_)')
206-
if ! mount_smbfs -N "//${user}:${escaped_pass}@${server}/${share}" "$MOUNT_POINT" 2>/dev/null; then
207-
echo " (mount_smbfs failed — skipping baseline)"
212+
# -N: no interactive password prompt. mount_smbfs will pull credentials
213+
# from the macOS Keychain if available; we deliberately do not pass the
214+
# password in argv so it doesn't leak via process listings.
215+
if ! mount_smbfs -N "//${user}@${server}/${share}" "$MOUNT_POINT" 2>/dev/null; then
216+
echo " (mount_smbfs failed — set up keychain credentials first; skipping baseline)"
208217
rmdir "$MOUNT_POINT" 2>/dev/null
209218
MOUNT_POINT=""
210219
return

scripts/stress-concurrent.sh

Lines changed: 41 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -28,6 +28,12 @@ FAIL=0
2828
CONCURRENCY="${SPICEIO_STRESS_CONCURRENCY:-16}"
2929
CURL_TIMEOUT="${SPICEIO_STRESS_TIMEOUT:-30}"
3030

31+
# Capture spiceio stderr for a post-run guard. We still tee to the script's
32+
# stderr so CI logs stream live; the file is only used to fail the build if
33+
# any unexpected `[spiceio] error:` lines slip through (e.g., something we
34+
# thought was an expected transient turns out to be logged as an error).
35+
SPICEIO_STDERR="${TMPDIR_BASE}/spiceio-stderr.log"
36+
3137
# ── Cleanup ────────────────────────────────────────────────────────────────
3238

3339
SPICEIO_PID=""
@@ -95,7 +101,7 @@ SPICEIO_SMB_DOMAIN="$SMB_DOMAIN" \
95101
SPICEIO_SMB_SHARE="$SMB_SHARE" \
96102
SPICEIO_BUCKET="$BUCKET" \
97103
SPICEIO_REGION="$REGION" \
98-
"$SPICEIO_BIN" &
104+
"$SPICEIO_BIN" 2> >(tee "$SPICEIO_STDERR" >&2) &
99105
SPICEIO_PID=$!
100106

101107
for i in $(seq 1 30); do
@@ -394,6 +400,40 @@ for i in $(seq 1 "$LARGE_N"); do
394400
done
395401
echo " integrity: $((PASS - PREV_PASS))/${LARGE_N} verified"
396402

403+
# ════════════════════════════════════════════════════════════════════════════
404+
# Stderr guard
405+
# ════════════════════════════════════════════════════════════════════════════
406+
#
407+
# Any `[spiceio] error:` lines indicate spiceio surfaced something at error
408+
# level — that should not happen during a clean stress run. Expected
409+
# transients (sharing violations, NotFound, etc.) are mapped to typed
410+
# `io::ErrorKind`s and logged via `slog!` (without the "error:" prefix).
411+
# A hit here means we regressed: a new failure mode is leaking through the
412+
# generic 500 InternalError arm of `io_to_s3_error`. Fail the build so
413+
# someone investigates instead of merging it silently again.
414+
415+
# Stop spiceio *before* grepping the captured stderr. spiceio's stderr is
416+
# piped through `tee` running in a process substitution; the tee process
417+
# only finishes draining and closes the capture file when spiceio closes its
418+
# stderr fd. The cleanup trap runs on EXIT, so it would fire too late for
419+
# the guard below — we have to kill+wait here. Set SPICEIO_PID empty so the
420+
# trap's own kill becomes a no-op.
421+
if [[ -n "$SPICEIO_PID" ]]; then
422+
kill "$SPICEIO_PID" 2>/dev/null || true
423+
wait "$SPICEIO_PID" 2>/dev/null || true
424+
SPICEIO_PID=""
425+
fi
426+
# tee in process substitution isn't directly waitable from bash, so give it
427+
# a brief moment to drain after spiceio closed its stderr.
428+
sleep 0.2
429+
430+
if [[ -s "$SPICEIO_STDERR" ]] && grep -q '\[spiceio\] error:' "$SPICEIO_STDERR"; then
431+
echo ""
432+
echo "FAIL: spiceio emitted unexpected error log lines during stress run:"
433+
grep '\[spiceio\] error:' "$SPICEIO_STDERR" | head -20
434+
FAIL=$((FAIL + 1))
435+
fi
436+
397437
# ════════════════════════════════════════════════════════════════════════════
398438
# Summary
399439
# ════════════════════════════════════════════════════════════════════════════

scripts/test-extended.sh

Lines changed: 32 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,10 @@ FAIL=0
3131
CONCURRENCY="${SPICEIO_EXT_CONCURRENCY:-8}"
3232
CURL_TIMEOUT="${SPICEIO_EXT_TIMEOUT:-60}"
3333

34+
# Capture spiceio stderr so the post-run guard can flag unexpected
35+
# `[spiceio] error:` lines. We tee back to stderr so CI still streams live.
36+
SPICEIO_STDERR="${TMPDIR_BASE}/spiceio-stderr.log"
37+
3438
# ── Cleanup ────────────────────────────────────────────────────────────────
3539

3640
SPICEIO_PID=""
@@ -113,7 +117,7 @@ SPICEIO_SMB_DOMAIN="$SMB_DOMAIN" \
113117
SPICEIO_SMB_SHARE="$SMB_SHARE" \
114118
SPICEIO_BUCKET="$BUCKET" \
115119
SPICEIO_REGION="$REGION" \
116-
"$SPICEIO_BIN" &
120+
"$SPICEIO_BIN" 2> >(tee "$SPICEIO_STDERR" >&2) &
117121
SPICEIO_PID=$!
118122

119123
for i in $(seq 1 60); do
@@ -427,6 +431,33 @@ ORIG_MD5=$(md5 -q "${SC_DIR}/src")
427431
GOT_MD5=$(md5 -q "${SC_DIR}/dl-after")
428432
assert_eq "full GET after cancellation succeeds" "$ORIG_MD5" "$GOT_MD5"
429433

434+
# ════════════════════════════════════════════════════════════════════════════
435+
# Stderr guard
436+
# ════════════════════════════════════════════════════════════════════════════
437+
#
438+
# Expected transients (sharing violations, NotFound, etc.) are mapped to
439+
# typed `io::ErrorKind`s and logged via `slog!` without the "error:" prefix.
440+
# A `[spiceio] error:` hit here means a new failure mode is leaking through
441+
# the generic 500 InternalError arm of `io_to_s3_error` — fail the build.
442+
443+
# Stop spiceio *before* grepping the captured stderr. tee in the process
444+
# substitution only closes the capture file when spiceio's stderr fd closes,
445+
# so we have to kill+wait here rather than rely on the EXIT trap. Clear
446+
# SPICEIO_PID so the trap's own kill becomes a no-op.
447+
if [[ -n "$SPICEIO_PID" ]]; then
448+
kill "$SPICEIO_PID" 2>/dev/null || true
449+
wait "$SPICEIO_PID" 2>/dev/null || true
450+
SPICEIO_PID=""
451+
fi
452+
sleep 0.2 # tee in <(...) isn't directly waitable; let it drain.
453+
454+
if [[ -s "$SPICEIO_STDERR" ]] && grep -q '\[spiceio\] error:' "$SPICEIO_STDERR"; then
455+
echo ""
456+
echo "FAIL: spiceio emitted unexpected error log lines during extended run:"
457+
grep '\[spiceio\] error:' "$SPICEIO_STDERR" | head -20
458+
FAIL=$((FAIL + 1))
459+
fi
460+
430461
# ════════════════════════════════════════════════════════════════════════════
431462
# Summary
432463
# ════════════════════════════════════════════════════════════════════════════

scripts/test-sccache.sh

Lines changed: 37 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,10 @@ TEST_PREFIX="spiceio-test-$$"
2424
PASS=0
2525
FAIL=0
2626

27+
# Capture spiceio stderr so the post-run guard can flag unexpected
28+
# `[spiceio] error:` lines. We tee back to stderr so CI still streams live.
29+
SPICEIO_STDERR=$(mktemp /tmp/spiceio-sccache-stderr.XXXXXX)
30+
2731
# ── Cleanup on exit ─────────────────────────────────────────────────────────
2832

2933
SPICEIO_PID=""
@@ -43,6 +47,7 @@ cleanup() {
4347
wait "$SPICEIO_PID" 2>/dev/null || true
4448
fi
4549
rm -rf /tmp/spiceio-test-*
50+
rm -f "$SPICEIO_STDERR"
4651
}
4752
trap cleanup EXIT
4853

@@ -116,7 +121,7 @@ SPICEIO_SMB_DOMAIN="$SMB_DOMAIN" \
116121
SPICEIO_SMB_SHARE="$SMB_SHARE" \
117122
SPICEIO_BUCKET="$BUCKET" \
118123
SPICEIO_REGION="$REGION" \
119-
"$SPICEIO_BIN" &
124+
"$SPICEIO_BIN" 2> >(tee -a "$SPICEIO_STDERR" >&2) &
120125
SPICEIO_PID=$!
121126

122127
echo "[test] waiting for spiceio on ${BIND}..."
@@ -281,7 +286,7 @@ SPICEIO_SMB_SHARE="$SMB_SHARE" \
281286
SPICEIO_BUCKET="$BUCKET" \
282287
SPICEIO_REGION="$REGION" \
283288
SPICEIO_LOG_FILE="$SPICEIO_LOG2" \
284-
"$SPICEIO_BIN" &
289+
"$SPICEIO_BIN" 2> >(tee -a "$SPICEIO_STDERR" >&2) &
285290
SPICEIO_PID2=$!
286291

287292
echo "[test] waiting for second spiceio instance..."
@@ -404,3 +409,33 @@ else
404409
echo "[test] FAIL: expected cache hits > 0 (got ${CACHE_HITS:-0}) and write errors == 0 (got ${WRITE_ERRORS:-0})"
405410
exit 1
406411
fi
412+
413+
# ── Stderr guard ────────────────────────────────────────────────────────────
414+
#
415+
# Any `[spiceio] error:` line means a code path is leaking through the
416+
# generic 500 InternalError arm of `io_to_s3_error`. Expected transients
417+
# (NotFound on HEAD probes, sharing violations, etc.) are mapped to typed
418+
# `io::ErrorKind`s and logged via `slog!` without the "error:" prefix.
419+
420+
# Stop both spiceio instances *before* grepping the captured stderr. tee in
421+
# the process substitutions only closes the capture file when each spiceio
422+
# closes its stderr fd, so we have to kill+wait here rather than rely on
423+
# the EXIT trap. Clear PIDs so the trap's own kills become no-ops.
424+
if [[ -n "$SPICEIO_PID2" ]]; then
425+
kill "$SPICEIO_PID2" 2>/dev/null || true
426+
wait "$SPICEIO_PID2" 2>/dev/null || true
427+
SPICEIO_PID2=""
428+
fi
429+
if [[ -n "$SPICEIO_PID" ]]; then
430+
kill "$SPICEIO_PID" 2>/dev/null || true
431+
wait "$SPICEIO_PID" 2>/dev/null || true
432+
SPICEIO_PID=""
433+
fi
434+
sleep 0.2 # tee in <(...) isn't directly waitable; let it drain.
435+
436+
if [[ -s "$SPICEIO_STDERR" ]] && grep -q '\[spiceio\] error:' "$SPICEIO_STDERR"; then
437+
echo ""
438+
echo "[test] FAIL: spiceio emitted unexpected error log lines:"
439+
grep '\[spiceio\] error:' "$SPICEIO_STDERR" | head -20
440+
exit 1
441+
fi

src/s3/router.rs

Lines changed: 124 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -413,6 +413,24 @@ async fn handle_list_objects(
413413

414414
// ── GetObject (streaming, with Range + Conditional) ─────────────────────────
415415

416+
/// Compute the SMB-reads→HTTP-writes channel capacity for a streaming
417+
/// GetObject given the SMB-negotiated chunk size.
418+
///
419+
/// Sized to hold one full SMB read pipeline batch (so back-to-back batches
420+
/// can overlap) but capped by an 8 MiB per-request memory budget so a
421+
/// configured `SPICEIO_SMB_MAX_IO` of 1 MiB doesn't blow up to 64 MiB of
422+
/// buffering per concurrent stream.
423+
///
424+
/// Guards against `chunk_size = 0`: `handle.max_chunk` ultimately comes from
425+
/// the SMB server's `max_read_size` in the negotiate response. If a server
426+
/// (or a misconfiguration) yields 0, naive `BUDGET / chunk_size` would
427+
/// divide-by-zero panic and crash the streaming task. We floor at 1 here.
428+
fn stream_channel_capacity(chunk_size: u32) -> usize {
429+
const STREAM_CHANNEL_MAX_BYTES: usize = 8 * 1024 * 1024;
430+
let chunk_size_for_cap = (chunk_size as usize).max(1);
431+
(STREAM_CHANNEL_MAX_BYTES / chunk_size_for_cap).clamp(1, crate::smb::ops::READ_PIPELINE_DEPTH)
432+
}
433+
416434
async fn handle_get_object(
417435
hdrs: &http::HeaderMap,
418436
share: &ShareSession,
@@ -593,14 +611,18 @@ async fn handle_get_object(
593611

594612
// Build response with streaming body.
595613
//
596-
// Channel capacity is sized to match the SMB pipeline depth so a full
597-
// batch of reads can dump into the channel without blocking the producer.
598-
// That lets the SMB-reading task immediately issue the next pipelined
599-
// batch (incurring its round-trip) while the HTTP-sending task drains
600-
// the previous batch into the wire — back-to-back batches overlap, which
601-
// is the difference between filling and starving the 10G link.
602-
let (body, tx) = SpiceioBody::channel(crate::smb::ops::READ_PIPELINE_DEPTH);
614+
// Channel capacity is sized so a full SMB pipeline batch can dump into
615+
// the channel without blocking the producer — that lets the SMB-reading
616+
// task immediately issue the next pipelined round-trip while the
617+
// HTTP-sending task drains the previous batch, overlapping back-to-back
618+
// batches. We also cap by a per-request memory budget: with a configured
619+
// `SPICEIO_SMB_MAX_IO` of 1 MiB and `READ_PIPELINE_DEPTH = 64`, an
620+
// uncapped channel would buffer up to 64 MiB per concurrent GetObject.
621+
// At default 64 KiB chunks the channel stays at the full pipeline depth
622+
// (4 MiB); at 1 MiB chunks it falls to 8 (still room to overlap).
603623
let chunk_size = handle.max_chunk;
624+
let channel_cap = stream_channel_capacity(chunk_size);
625+
let (body, tx) = SpiceioBody::channel(channel_cap);
604626

605627
// Spawn background task to stream pipelined SMB reads into the channel.
606628
// Sends batches of read requests to fill the network pipe, then pushes
@@ -1452,6 +1474,19 @@ fn io_to_s3_error(e: &io::Error) -> Response<SpiceioBody> {
14521474
crate::serr!("[spiceio] access denied: {e}");
14531475
error_response(StatusCode::FORBIDDEN, "AccessDenied", "Access Denied")
14541476
}
1477+
// Sharing violations are an expected transient under concurrent
1478+
// writes (the underlying SMB server refused to open a file held by
1479+
// another handle). Map to 503 SlowDown — the standard S3 retryable
1480+
// status — and log at info level instead of `serr!`, since these
1481+
// are not bugs and shouldn't paint CI logs red.
1482+
io::ErrorKind::ResourceBusy => {
1483+
crate::slog!("[spiceio] busy (retry): {e}");
1484+
error_response(
1485+
StatusCode::SERVICE_UNAVAILABLE,
1486+
"SlowDown",
1487+
"Please reduce your request rate.",
1488+
)
1489+
}
14551490
_ => {
14561491
crate::serr!("[spiceio] error: {e}");
14571492
error_response(
@@ -1510,3 +1545,85 @@ async fn collect_body(req: Request<Incoming>) -> Bytes {
15101545
}
15111546
}
15121547
}
1548+
1549+
#[cfg(test)]
1550+
mod tests {
1551+
use super::*;
1552+
1553+
/// The whole reason this file exists: unit tests in `smb::client` already
1554+
/// verify `STATUS_SHARING_VIOLATION → io::ErrorKind::ResourceBusy`. What
1555+
/// was missing was the next hop — that the HTTP layer maps that kind to
1556+
/// a retryable S3 status, not 500 InternalError. Without this assertion,
1557+
/// the prior regression (sharing violations leaking as 500 with noisy
1558+
/// `error:` logs) had no test to catch it.
1559+
#[test]
1560+
fn io_to_s3_error_maps_resource_busy_to_slow_down() {
1561+
let err = io::Error::new(io::ErrorKind::ResourceBusy, "sharing violation: foo");
1562+
let resp = io_to_s3_error(&err);
1563+
assert_eq!(resp.status(), StatusCode::SERVICE_UNAVAILABLE);
1564+
}
1565+
1566+
#[test]
1567+
fn io_to_s3_error_maps_not_found_to_404() {
1568+
let err = io::Error::new(io::ErrorKind::NotFound, "missing");
1569+
let resp = io_to_s3_error(&err);
1570+
assert_eq!(resp.status(), StatusCode::NOT_FOUND);
1571+
}
1572+
1573+
#[test]
1574+
fn io_to_s3_error_maps_permission_denied_to_403() {
1575+
let err = io::Error::new(io::ErrorKind::PermissionDenied, "nope");
1576+
let resp = io_to_s3_error(&err);
1577+
assert_eq!(resp.status(), StatusCode::FORBIDDEN);
1578+
}
1579+
1580+
#[test]
1581+
fn io_to_s3_error_other_falls_back_to_500() {
1582+
let err = io::Error::other("unexpected");
1583+
let resp = io_to_s3_error(&err);
1584+
assert_eq!(resp.status(), StatusCode::INTERNAL_SERVER_ERROR);
1585+
}
1586+
1587+
// ── stream_channel_capacity ─────────────────────────────────────────────
1588+
1589+
#[test]
1590+
fn stream_channel_capacity_handles_zero_chunk() {
1591+
// The SMB negotiate response is the source of `max_chunk`; an
1592+
// unusual server (or a misconfiguration) could yield 0. The naive
1593+
// `BUDGET / 0` would panic and crash the streaming task — the
1594+
// floor-at-1 inside `stream_channel_capacity` keeps it safe. The
1595+
// exact value doesn't matter (a 0-byte chunk is nonsense anyway);
1596+
// we only assert (a) we don't panic and (b) the result is a sane
1597+
// pipeline-depth-bounded slot count.
1598+
let cap = stream_channel_capacity(0);
1599+
assert!(
1600+
(1..=crate::smb::ops::READ_PIPELINE_DEPTH).contains(&cap),
1601+
"expected 1..={} got {cap}",
1602+
crate::smb::ops::READ_PIPELINE_DEPTH
1603+
);
1604+
}
1605+
1606+
#[test]
1607+
fn stream_channel_capacity_default_chunk_uses_full_pipeline() {
1608+
// 64 KiB chunks → 8 MiB / 64 KiB = 128, clamped to the pipeline
1609+
// depth so a full SMB read batch can dump into the channel.
1610+
assert_eq!(
1611+
stream_channel_capacity(65536),
1612+
crate::smb::ops::READ_PIPELINE_DEPTH
1613+
);
1614+
}
1615+
1616+
#[test]
1617+
fn stream_channel_capacity_large_chunk_falls_below_pipeline() {
1618+
// 1 MiB chunks → 8 MiB / 1 MiB = 8, well below pipeline depth.
1619+
// This is the case the per-request memory budget exists to handle.
1620+
assert_eq!(stream_channel_capacity(1024 * 1024), 8);
1621+
}
1622+
1623+
#[test]
1624+
fn stream_channel_capacity_huge_chunk_floors_at_one() {
1625+
// A chunk bigger than the entire budget still yields a 1-slot
1626+
// channel, never 0 — the producer can always make progress.
1627+
assert_eq!(stream_channel_capacity(16 * 1024 * 1024), 1);
1628+
}
1629+
}

0 commit comments

Comments
 (0)