Add an in-process observability surface so consumers (applications, AI agents, humans behind a UI) can ask a running
SmbClient what's going on inside: credits, in-flight requests, negotiated parameters, per-connection counters, DFS
cache state, sessions. One snapshot type, one tree of data, two render formats — Display for a terminal, optional
serde::Serialize for everything else.
This doc is the source of truth during implementation. When something drifts, update this doc rather than the code. Plan v2 incorporates fresh-eyes review feedback (see § "Plan v2 changes" at the bottom).
The library has rich internal state but no way to ask "what's the current state?" without grepping logs or adding ad-hoc
println!s. The recent CHANGE_NOTIFY loss-window investigation (commits e3a35f1 → cbe94df) would have been minutes
instead of days if oplock_breaks_received, status_pending_loops, and a current "in-flight requests" gauge had been
queryable. Same for DFS failovers, compound splits (QNAP/Samba), signature verification failures, and credit starvation.
Why snapshot-only (and not an event stream), even though cmdr's principles say "subscribe, don't poll":
- State-shaped, not event-shaped. What consumers want to see — credits, in-flight count, negotiated dialect, per-counter totals — is current state, not a sequence of past events.
- Event-shaped things already exist. Oplock breaks, session expiry, DFS failovers, decrypt failures all log through
the
logfacade today. Consumers wanting a temporal view subscribe tolog(tracing,env_logger, or a custom appender) at the application layer. - Polling cost is genuinely nothing. Snapshot is a handful of atomic loads + a few short critical sections under
StdMutex(uncontended typical wait <100 ns). Polling at 1 Hz for a dashboard is below noise. - An event channel is the kind of API mistake you can't take back. Add it once, every consumer wires their own subscriber, the channel's bounded/unbounded/drop-policy is a forever decision. Adding it later when there's a real workload is strictly safer than adding it now and discovering nobody uses it.
If a real event-driven workload appears later, diagnostics() and a future event_stream() coexist cleanly.
Consumers that benefit:
cmdrwants a "connection details" dev panel and an MCP tool returning the same snapshot. Smart-backend/thin-frontend: smb2 owns the data, cmdr does the rendering.- Agents diagnosing real SMB issues over MCP need a single structured tree they can read once, not a log scrape.
- Tests can replace fragile log-grep assertions with typed ones (
assert!(diag.primary.metrics.status_pending_loops > 0)). - Operators (humans behind
examples/diagnostics.rsor a futuresmb2-diagCLI) get a readable terminal dump.
- No web frontend, no HTML, no JS — wrong layer for a protocol library.
- No
tracingintegration, notokio-consolewiring. These belong in consumers; smb2 keeps thelogfacade. - No async event channel, broadcast bus, or callback hooks. See "Why" above.
- No reset/clear API for counters. Monotonic is easier; consumers diff two snapshots for rate.
- No per-NTSTATUS error histogram. Too noisy for now; can be added later as a separate optional field.
- No exposing key material. Lengths, algorithm IDs, and "active/inactive" flags only — same rule as the existing logger.
- No tracking of caller-owned objects (open
FileIds, liveFileDownload/FileUpload/FileWriter/Watcherinstances). The library doesn't own these — the caller does. Forcing registration back would requireArcsoup. Document that consumers fold their own stream handles into their own diagnostics view.
Diagnostics ← SmbClient::diagnostics()
├── client : ClientInfo ← config + ClientMetricsSnapshot
├── primary : ConnectionDiagnostics ← primary connection
│ ├── server, negotiated (Option), credits, signing, encryption, compression,
│ │ rtt_estimate, disconnected, dfs_trees, metrics : MetricsSnapshot,
│ └── session : Option<SessionDiagnostics> ← None until session-setup
├── extra_connections : Vec<ConnectionDiagnostics> ← one per DFS cross-server conn,
│ each with its own .session
└── dfs_cache : Vec<DfsCacheEntry>
ConnectionDiagnostics::session carries the session info per connection, because DFS extra connections each have
their own Session (one auth per target server). There is no top-level session field — client.diagnostics().primary.session
is the conventional path for the main connection's session.
Tree, FileDownload, FileUpload, FileWriter, Watcher already expose per-instance progress (bytes_received,
bytes_written, Progress). The snapshot does not try to enumerate live instances — the library doesn't own them.
Two atomic-bearing structs, internal:
Metricslives on eachInner(per connection).ClientMetricslives on eachSmbClient(above the connection layer).
Both expose one method: snapshot() returning a plain Copy value type. The atomic-bearing types are
crate-private; the snapshot value types are public.
All increments are Relaxed (single uncontended cache line, ~1 ns). Reads are Relaxed. Fields may skew between
samples — this is documented as the eventual-consistency contract on the snapshot.
// Per-connection (per-Inner). Snapshotted as MetricsSnapshot.
struct Metrics {
// Send path
requests_sent: AtomicU64, // every msg_id allocated -- centralized in allocate_msg_id
compound_requests_sent: AtomicU64, // every execute_compound call (chain count, not sub-op count)
wire_bytes_sent: AtomicU64, // bytes written to transport (post-encrypt/compress/sign)
explicit_cancels_sent: AtomicU64, // send_cancel calls (drop-cancel is invisible — Phase-3 E9)
// Receive path
responses_routed_ok: AtomicU64, // sub-frames where waiters.remove → Some(tx) AND tx.send(Ok(frame)) delivered
responses_routed_err: AtomicU64, // sub-frames where waiters.remove → Some(tx) AND tx.send(Err(_)) delivered.
// Today this is the union: signature_failures + session_expired_events.
// Don't sum those WITH this counter; they're a partition of it.
responses_late_after_drop: AtomicU64,// sub-frames where waiters.remove → Some(tx) BUT tx.send fails
// (caller's oneshot::Receiver was dropped — typical for spawn/abort)
responses_stray: AtomicU64, // sub-frames where waiters.remove → None
// (msg_id never registered: server bug, or send-error cleanup race)
wire_bytes_received: AtomicU64, // bytes off transport (pre-decrypt/decompress)
// Protocol events
status_pending_loops: AtomicU64, // interim STATUS_PENDING frames the receiver kept-waiter on
unsolicited_notifications_received: AtomicU64, // MessageId::UNSOLICITED frames
// (today: all oplock breaks; future: lease breaks)
signature_failures: AtomicU64, // signature verify failed — routed to waiter as Err (also ticks responses_routed_err)
decrypt_failures: AtomicU64, // decrypt failed — counted ONCE before fan_error_to_waiters
// (counter survives connection teardown; see invariant below)
decompress_failures: AtomicU64, // decompress failed — counted ONCE before fan
malformed_frames: AtomicU64, // header/parse failed — counted ONCE before fan
// (covers split_compound parse failure + prepare_sub_frame parse failure)
session_expired_events: AtomicU64, // STATUS_NETWORK_SESSION_EXPIRED sub-frames (not session-expiry events).
// A compound of N expired sub-ops ticks N times. For the event-shaped
// signal "did we reconnect" use ClientMetrics::reconnects.
// Subset of responses_routed_err; don't sum.
// Caller-observed outcomes
requests_returned_err: AtomicU64, // execute / execute_with_credits / execute_compound returned outer Err
// to a caller that polled to completion. Per-compound, not per-sub-op.
// Caller-drop is captured by responses_late_after_drop, not here.
}
// Client-level. Snapshotted as ClientMetricsSnapshot.
struct ClientMetrics {
reconnects: AtomicU64, // SmbClient::reconnect() invocations
dfs_referrals_resolved: AtomicU64, // DfsResolver::resolve hit a server (cache miss)
dfs_cache_hits: AtomicU64, // DfsResolver::resolve served from cache
}- Metrics survive teardown.
Metricslives onInner, which outlivesdisconnected = true. A consumer callingdiagnostics()on a torn-down connection sees the counts at the moment of death. Documented onConnection::diagnostics. - Counters reset across
SmbClient::reconnect(). Reconnect builds a freshConnectionwith a freshInner, so per-connection counters return to zero.ClientMetricssurvive (they live onSmbClient), soreconnectsis monotonic across the client's lifetime. Documented onSmbClient::diagnosticsand onDiagnostics. wire_bytes_sentandwire_bytes_receivedmeasure the wire layer, after any sign/encrypt/compress on the send side and before any of those on the receive side. This is the byte count a packet capture would observe. Documented on each field. A futureplaintext_bytes_*pair can land additively if a use case appears.- Fields may skew.
credits.available,credits.in_flight, and each counter is sampled independently. Their sum (e.g.available + in_flight) is not invariant. Documented onCreditInfo.
A "compound response was split across N transport frames" counter would be useful for diagnosing QNAP/Samba splits
(AGENTS.md pitfall #13). Wiring it cleanly is not trivial under the actor model: each sub-op's oneshot::Receiver
resolves independently as its MessageId arrives, so neither the receiver task (no notion of "N expected for this
caller") nor the caller (no view into transport-frame count) naturally has the data.
Spec'ing it concretely would mean adding "source-frame-index" tagging to each routed sub-frame and a check in
execute_compound that all its sub-op responses carry the same index. That's meaningful new plumbing in the demux path.
Decision: drop this counter from the initial diagnostics PR. Track as a follow-up. The DEBUG log line in the receiver loop already covers manual diagnosis today, and a future PR can revisit once the demux path is touched for another reason.
All in a new src/client/diagnostics.rs, re-exported from lib.rs.
Snapshot lock order. The snapshot acquires these locks, one at a time, in this order, and never holds one across
an .await: crypto → waiters → dfs_trees → estimated_rtt. Each is held only as long as it takes to copy
primitives out and release. params (OnceLock) is a wait-free atomic read, not a lock. preauth_hasher and
receiver_task (the other Inner mutexes) are not touched by the snapshot — they're handshake-only / lifecycle-only.
Top-of-file doc comment in diagnostics.rs restates this and warns: "if you add a field that touches a new lock,
extend this order, don't reshuffle it."
pub struct Diagnostics {
pub client: ClientInfo,
pub primary: ConnectionDiagnostics,
pub extra_connections: Vec<ConnectionDiagnostics>,
pub dfs_cache: Vec<DfsCacheEntry>,
}
pub struct ClientInfo {
pub primary_server: String,
pub timeout: Duration,
pub auto_reconnect: bool,
pub dfs_enabled: bool,
pub metrics: ClientMetricsSnapshot,
}
pub struct ConnectionDiagnostics {
pub server: String,
pub negotiated: Option<NegotiatedSummary>, // None until negotiate() runs
pub credits: CreditInfo,
pub signing: SigningInfo,
pub encryption: EncryptionInfo,
pub compression: CompressionInfo,
pub rtt_estimate: Option<Duration>,
pub disconnected: bool,
pub dfs_trees: Vec<TreeId>,
pub session: Option<SessionDiagnostics>, // None until session-setup runs
pub metrics: MetricsSnapshot,
}
pub struct NegotiatedSummary { /* dialect, max_*_size, server_guid, signing_required, capabilities, gmac_negotiated, cipher, compression_supported */ }
pub struct CreditInfo {
pub available: u16,
pub in_flight: usize, // waiters.len(); see "fields may skew" invariant
pub next_message_id: u64, // the id that will be assigned to the next request
}
pub struct SigningInfo { pub active: bool, pub algorithm: Option<SigningAlgorithm> }
pub struct EncryptionInfo { pub active: bool, pub cipher: Option<Cipher> }
pub struct CompressionInfo{ pub requested: bool, pub negotiated: bool }
pub struct SessionDiagnostics {
pub session_id: SessionId,
pub should_sign: bool,
pub should_encrypt: bool,
pub signing_algorithm: SigningAlgorithm,
}
pub struct DfsCacheEntry {
pub path_prefix: String,
pub target_count: usize,
pub expires_in: Option<Duration>, // None if already expired
}
#[derive(Copy, Clone)]
pub struct MetricsSnapshot { /* one u64 per Metrics field */ }
#[derive(Copy, Clone)]
pub struct ClientMetricsSnapshot { /* one u64 per ClientMetrics field */ }Trait bounds: every public diagnostics type is Debug + Clone. Not PartialEq (snapshots aren't equality-shaped),
not Eq, not Deserialize. MetricsSnapshot and ClientMetricsSnapshot are also Copy.
Applied to the top-level container types only: Diagnostics, ClientInfo, ConnectionDiagnostics,
NegotiatedSummary, SessionDiagnostics, DfsCacheEntry, MetricsSnapshot, ClientMetricsSnapshot. These can
plausibly grow. The leaf info types — CreditInfo, SigningInfo, EncryptionInfo, CompressionInfo — are not marked
#[non_exhaustive]: their shape is locked by the protocol. Consumers can synthesize them in tests with struct literals.
Document this distinction in the rustdoc.
impl Connection {
/// Capture an eventually-consistent snapshot of this connection's state and counters.
///
/// Survives connection teardown: a torn-down `Connection` returns its final counter values
/// alongside `disconnected: true`. Fields are sampled independently; values may skew.
pub fn diagnostics(&self) -> ConnectionDiagnostics;
}
impl SmbClient {
/// Capture a tree of diagnostics for the client, its primary and DFS-extra connections,
/// each connection's session, and the DFS referral cache.
///
/// Per-connection counters reset on `reconnect()`. Client-level counters survive.
pub fn diagnostics(&self) -> Diagnostics;
}[features]
serde = ["dep:serde"]When enabled, every public diagnostics type gains #[derive(Serialize)]. No Deserialize — snapshots are
write-only outputs of the library; nothing inside the library reads one back.
The snapshot embeds these existing types. Per-type plan:
| Type | Location | Today's derives | Action |
|---|---|---|---|
Dialect |
types/mod.rs |
num_enum::IntoPrimitive, Debug, Clone, Copy, PartialEq, Eq, Hash |
Add cfg_attr(feature = "serde", derive(Serialize)). Plain enum, trivial. |
SigningAlgorithm |
crypto/signing.rs |
similar | Add cfg_attr. |
Cipher |
crypto/encryption.rs |
similar | Add cfg_attr. |
Capabilities |
types/flags.rs |
bitflags newtype around u32 |
Add a manual Serialize impl (3 lines): serialize as the underlying u32 bits. Document "JSON form is bits, not field list" on the field. |
Guid |
pack/guid.rs |
wire-format struct (mixed-endian) | Add cfg_attr. JSON form is field-shape, not wire-shape — document on the field. |
TreeId, SessionId, MessageId |
types/mod.rs |
newtype u32/u64/u64 |
Add cfg_attr with transparent so JSON form is the bare integer. |
Duration, String, primitives |
std | serde has built-in impls | No action. |
All listed cfg_attr additions are non-breaking even with the feature off (cfg-attr disappears at parse time).
Diagnostics (top level only) gets a Display impl that prints a compact terminal view. The Display impl prints
raw u64 byte counts (no humanization in the lib). Example Display output:
SMB client → 192.168.1.100:445
dialect: SMB 3.1.1 rtt: 2.3 ms
signing: active (AesGmac) encryption: inactive
compression: requested, not negotiated dfs: enabled (cache: 3 entries)
reconnects: 0 dfs hits: 12 / 1 miss
Primary connection (192.168.1.100:445)
credits: 63 available · 2 in flight · next msg_id 1407
wire bytes: 14894592 sent · 328007680 received
responses: 1405 ok · 0 wire-err · 2 late · 0 stray (sent: 1407, caller-err: 2)
protocol events: 8 status-pending · 1 unsolicited
errors: 0 signature · 0 decrypt · 0 decompress · 0 malformed · 0 session-expired
DFS extra connections: (0)
The example binary (examples/diagnostics.rs) applies its local humanize_bytes helper for human display, so its
output looks like wire bytes: 14.2 MiB sent · 312.8 MiB received.
Only one Display impl, on the top level. No Display on ConnectionDiagnostics — consumers who want per-connection
text can render their own, or use the serde feature. Keeps the lib slim.
The Display format is not part of the SemVer contract. For programmatic use, enable the serde feature.
examples/diagnostics.rs: connects to an env-driven server, lists a directory, prints the snapshot.
- Default:
Displayform, withhumanize_bytesapplied to byte counts. --jsonflag: only honored if compiled with--features serde. Otherwise prints a one-line stderr error and exits 2:"--json requires building with --features serde".
Counters at the send sites are factored so they can't drift apart from the code.
-
requests_sentticks insideallocate_msg_id(the one funnel every send path goes through:negotiate,execute_with_credits,execute_with_credits_capturing_request,dispatch_with_credits, andexecute_compound's loop). This catchesdispatchand Watcher's pre-arm CHANGE_NOTIFY, which the v1 plan missed. Note: because the funnel is shared,requests_sentcounts negotiate and session-setup messages too, not just user-issued ops. Document this on the field. -
wire_bytes_sentis collected by a small private helper onInner:async fn send_and_count(&self, bytes: &[u8]) -> Result<()> { self.metrics.wire_bytes_sent.fetch_add(bytes.len() as u64, Relaxed); self.sender.send(bytes).await }
Every
inner.sender.send(...)call site (innegotiate, both branches ofexecute_with_creditsandexecute_with_credits_capturing_request, both branches ofdispatch_with_credits, both branches ofexecute_compound, both branches ofsend_cancel) flips toinner.send_and_count(...). Why factor: v1 listed "four sites plus cancel"; the actual count is ten once compressed/plain and encrypted/plain branches are enumerated. Hand-placing ten bumps invites a missed branch and a leaking counter. The helper is purely additive — sameawait, sameResult. -
compound_requests_sentticks once at the top ofexecute_compound. -
explicit_cancels_sentticks once at the top ofsend_cancel.
-
wire_bytes_receivedticks at the top of the receiver_loop, right aftertransport_recv.receive().awaitreturnsOk(bytes), before decrypt/decompress. -
The four routing outcomes (the
waiters.remove(&msg_id)branch) all tick exactly one counter — disjoint, together covering every routed sub-frame. Naming chosen so the meanings can't blur:Some(tx)+tx.send(Ok(frame))succeeded →responses_routed_ok.Some(tx)+tx.send(Err(_))succeeded (signature failure / session expired branches) →responses_routed_err.Some(tx)+tx.send(...)returnedErr(_)(the caller'soneshot::Receiverwas dropped — the spawn/abort path documented in the actor spec) →responses_late_after_drop. This is the one that catches caller-drop.None(msg_id not in map) →responses_stray. This is the genuine orphan: server sent a frame for an id we never allocated, or send-error cleanup raced with frame arrival.
Together:
responses_routed_ok + responses_routed_err + responses_late_after_drop + responses_straycovers every sub-frame whose msg_id was looked up. (Sub-frames that never reach the lookup — STATUS_PENDING skips, UNSOLICITED skips — have their own counters.) -
status_pending_loopsticks in the STATUS_PENDINGSkipbranch (counted once per interim frame). -
unsolicited_notifications_receivedticks whereMessageId::UNSOLICITEDis matched. -
signature_failuresticks where signature verify fails. Also causesresponses_routed_errto tick when the Err is sent to the waiter (documented on both fields). -
decrypt_failuresticks indecrypt_framefailure branch, beforefan_error_to_waiters. -
decompress_failuresticks in the decompress failure branch (connection.rs:~1325), beforefan_error_to_waiters. -
malformed_framesticks in both thesplit_compoundparse-failure branch and theprepare_sub_frameparse-failure branch, beforefan_error_to_waiters. -
session_expired_eventsticks where STATUS_NETWORK_SESSION_EXPIRED is detected. Also causesresponses_routed_errto tick when the Err is sent to the waiter. If a compound returns N session-expired sub-frames, this counter ticks N times — that's truthful (each sub-frame is one routed expiry event) but documented so consumers aren't surprised.
Intent. Land the cheap, hot-path-touching part in isolation. If something breaks, it's a regression caught here, not muddled with the public surface.
- Add
pub(crate) struct Metrics { … AtomicU64 … }toconnection.rswithDefaultimpl. - Add
metrics: Metricsfield toInner. - Add the increments per § "Send-site / Receive-site counter consolidation" above. No call site renames, no signature
changes — only
+= 1and+= bytes.len()insertions. - Add
pub(crate) struct ClientMetrics { … }toclient/mod.rswithDefaultimpl; field onSmbClient. Increment onreconnect()and in the DFS resolver paths. - Crate-internal
Metrics::snapshot() -> MetricsSnapshotand likewise forClientMetrics. These can bepub(crate)until M2 promotes them. - Tests in
tests/diagnostics_counters.rsexercise each counter viaMockTransport. Each test asserts the counter ticks exactly N times for N expected events. Access via a#[cfg(test)]helper onConnection/SmbClientthat returns the snapshot. Public surface unchanged.
Intent. Lock the public surface.
- New file
src/client/diagnostics.rswith all snapshot types (§ "Snapshot types") and oneDisplayimpl onDiagnostics. Connection::diagnostics() -> ConnectionDiagnostics. Implements the lock order documented in this file's top doc comment. Asserts (via doc, not runtime) that no two locks are held simultaneously and no lock crosses an.await.SmbClient::diagnostics() -> Diagnostics. Walks primary, extras, DFS cache.- Re-export from
lib.rs:Diagnostics,ConnectionDiagnostics,MetricsSnapshot,ClientMetricsSnapshot,ClientInfo,SessionDiagnostics,NegotiatedSummary,CreditInfo,SigningInfo,EncryptionInfo,CompressionInfo,DfsCacheEntry. #[non_exhaustive]on the types listed above (top-level containers); leaf info types stay constructible.- Doc comments call out: eventual consistency, lock order, counter-survives-teardown, reconnect-resets-per-conn-counters, field skew, wire-layer byte semantics.
Intent. Make it nice to use.
Cargo.toml: optionalserdedep withderivefeature;features.serde = ["dep:serde"]. Hidden behind cfg-attr so the default tree is unchanged.- Per-type derives from the audit table above (including the manual
CapabilitiesSerialize impl). examples/diagnostics.rs: env-driven connect, run a couple of ops, printDisplayor--json. Localhumanize_byteshelper.tests/diagnostics_snapshot.rs: shape tests (pre/post negotiate, in-flight, lock-order regression, smokeDisplay, serde round-trip intoserde_json::Value). Test thatDiagnostics: Send + Sync + Clone + Debug.tests/docker_integration.rsadditions (gated#[ignore], hooked intojust test-dockernot consumer): hitsmb-encryption,smb-flaky,smb-slow,smb-dfs-root/target, assert the counters tick. Live indocker_integration.rsto match the existing layout (tests/docker/diagnostics_docker.rswould be a new directory style that doesn't exist here).- README: short "Diagnostics" section under "Quick start" with two code blocks (
Displayand JSON). AGENTS.md: add "Diagnostics" subsection under § Architecture (one paragraph + pointer).src/client/CLAUDE.md: add "Diagnostics" subsection describing the snapshot + counter model + lock order.- CHANGELOG entry. No version bump (user decides).
- Stale CLAUDE.md gotcha.
src/client/CLAUDE.mdsays "Silent frame discard on decrypt/decompress/malformed header: receiver task currently log+continues, hangs the waiter forever." Code (connection.rs:1280, :1311, :1333, :1354, :1381) callsfan_error_to_waiterson all of those paths. Remove the gotcha. Replace with a Decision/Why line about the teardown-on-unrecoverable invariant. - Duplicated STATUS_PENDING bullets in
src/client/CLAUDE.mdGotchas. Collapse into one. Connection::next_message_iddocstring. Change "next id" wording to "the id that will be assigned to the next request" — unambiguous at boundary including the initial 0-state.
Deferred (not load-bearing for diagnostics; tracked in their own follow-ups):
Connection::send_cancelis the lone&mut selfon a Phase-3CloneableConnection. Inconsistent but doesn't block diagnostics. Open a separate issue.Inner::Dropdoesn't synchronously fanErr(Disconnected). The receiver-task's transport-error branch covers it shortly after. Theoretical TOCTOU; flag in a follow-up.Connection::credits()clamps au32tou16; behavior is right, docstring is silent. One-line doc follow-up.
tests/diagnostics_counters.rs:
requests_sentandwire_bytes_senttick for a single execute.compound_requests_sentticks for anexecute_compoundof 3 sub-ops;requests_sentticks by 3 as well (each sub-op allocates a msg_id).requests_returned_errticks when the connection is torn down mid-await and the caller polled to completion.responses_late_after_dropticks for a dropped caller future: spawn anexecute,abort()the task before the response arrives, queue the response on the mock, observe the counter bumps andresponses_straydoes NOT.responses_strayticks for a queued frame with a msg_id that was never allocated.responses_routed_okticks for a normal execute.responses_routed_errticks for a frame with a signature failure (also bumpssignature_failures).explicit_cancels_sentticks forsend_cancel.unsolicited_notifications_receivedticks for a queued frame withMessageId::UNSOLICITED.status_pending_loopsticks for a queued STATUS_PENDING + final response (via the mock-helper chosen in § "Mock STATUS_PENDING helper").decrypt_failuresticks via the existingphase3_decrypt_failure_errors_waiter_not_hangsshape — extended to also assert the counter is1after teardown (proves "counters survive teardown" invariant).decompress_failuresticks for a queued frame with a bogus compression header (mirrors decrypt test shape).malformed_framesticks for a queued frame with a junk SMB2 header (after decrypt).session_expired_eventsticks for a queued STATUS_NETWORK_SESSION_EXPIRED (andresponses_routed_errbumps too).dispatch(Watcher's pre-arm path) bumpsrequests_sentandwire_bytes_sentexactly once per call.- Routing outcomes are disjoint: for N sent requests with K dropped futures and L stray frames, assert
responses_routed_ok + responses_routed_err + responses_late_after_drop + responses_stray ==total sub-frames routed.
tests/diagnostics_snapshot.rs:
- Pre-negotiate connection:
negotiated == None, signing/encryption inactive, counters zero,disconnected == false. - Post-negotiate: dialect, max_*_size, server_guid, etc. populated.
- One in-flight execute (blocked on never-arriving frame):
credits.in_flight == 1. Diagnostics: Send + Sync + Clone + Debug(compile-time + a smokeassert_send_sync!).Displayoutput contains the server name, the dialect, "credits", "wire bytes", "requests" — smoke, not byte-exact.- With
--features serde:serde_json::to_string(&diag)parses back intoserde_json::Value; check a couple of nested keys (primary.credits.available,primary.metrics.requests_sent). - Lock-order regression test: snapshot a connection holding
crypto, thenwaiters, thendfs_trees(a trivial ordering check viaMutex::try_lockafter the snapshot returns — confirms no lock left held).
tests/docker_integration.rs additions:
smb-encryption: snapshot →encryption.active == true,encryption.cipher == Some(...),wire_bytes_received > 0.smb-flaky: force disconnect →disconnected == true,responses_orphanedmay or may not tick,requests_returned_errticks for the in-flight call.smb-slow: read a small file →wire_bytes_received > 0,responses_received >= 1, RTT estimate sensible.smb-dfs-root+smb-dfs-target: first op triggers referral →client.metrics.dfs_referrals_resolved == 1; second op via cache →dfs_cache_hits == 1. Extra connection appears indiagnostics().extra_connections.
tests/integration.rs: one extra test that dumps diagnostics() after a read_file from QNAP — purely smoke, to
verify the field values look sane on a real server. Not asserted on; the test prints and passes if no panic.
None. The snapshot is Relaxed atomic loads + cloned primitives. No parsing.
README.md: add a "Diagnostics" section. Two snippets.AGENTS.md: short "Diagnostics" subsection under § Architecture.src/client/CLAUDE.md: M4 fixes plus a new "Diagnostics model" subsection summarising counters + lock order + invariants.src/client/diagnostics.rstop-of-file doc-comment: full lock order, eventual-consistency contract, "snapshot survives teardown", "counters reset across reconnect", "fields may skew".CHANGELOG.md: entry under Unreleased.
Sequential is fine. Only doc updates can run independently; save them for the end.
| Risk | Mitigation |
|---|---|
| Counter increments on the hot path add overhead. | AtomicU64::fetch_add(_, Relaxed) is single uncontended cache line; ~1 ns. Re-run bench_100_tiny_files_seq_vs_parallel before/after M1; expect noise. If a real regression appears, the candidates are wire_bytes_* (one extra atomic per send/recv); easy to drop. |
| Snapshot consistency: fields skew. | Documented eventual-consistency contract on the public methods. Consumers who need atomicity quiesce ops first. |
serde feature compile-time cost. |
Optional, off by default. Plain derive on plain structs. Measure with cargo check --features serde; budget +50 ms. |
| Field additions break consumers. | #[non_exhaustive] on top-level types. Adding fields is minor; renaming is major. Document on rustdoc. |
Display becomes a parse target. |
Rustdoc says explicitly: "format is human-only and may change; use the serde feature for programmatic access." |
Holding crypto lock briefly for the snapshot contends with the receiver task. |
Receiver task takes the lock to verify signatures. Diagnostics hold it for ~50 ns to copy bool + Option<algo enum> + Option<cipher enum>. Measured negligible; documented. |
| Mock-transport STATUS_PENDING test design. | See § "Mock STATUS_PENDING helper" below — needs a small mock-side extension. |
| Counters reset across reconnect surprises consumers. | Documented on SmbClient::diagnostics. Consumers tracking long-running totals fold per-snapshot deltas into their own counters. |
| The compound-split counter site is more involved than other counters. | If the actor-internal "frames per fulfilled compound" plumbing turns out to be a thicket, downgrade response_splits_observed to a separate follow-up. Not load-bearing for M1/M2; can ship as a zero. |
Today's MockTransport::enable_auto_rewrite_msg_id pops one pending_sent_msg_ids entry per receive() call. A
STATUS_PENDING flow needs two responses (interim PENDING + final) for one sent request — the auto-rewrite FIFO
doesn't fit.
Two viable shapes; pick whichever is cleaner once implementing:
-
queue_pending_then_final(final_bytes: Vec<u8>)— pushes two responses tagged "pair". The mock'sreceive()pops the next sent msg_id once, reuses it for both responses, doesn't pop again on the secondreceive(). Requires a smallResponseRecordenum extension on the mock:{ Single(bytes), PairFirst(bytes), PairSecond(bytes) }so the second member knows to reuse the prior msg_id. -
peek_next_pending_msg_id() -> Option<MessageId>— read-only peek into the FIFO. Test code calls it after thedispatchfuture has registered the waiter, builds both response frames with the peeked msg_id, callsqueue_responsetwice with auto-rewrite OFF. More invasive at the call site, but no enum extension.
Decision goes to whichever lands smaller; M1's STATUS_PENDING test is the only consumer until proven otherwise. The plan tracks the constraint here so the implementer doesn't discover it mid-test.
- New types only.
- Two new public methods (
Connection::diagnostics,SmbClient::diagnostics). Additive. - New optional
serdefeature, off by default. Doesn't affect anyone not opting in. - Per-type
cfg_attr(feature = "serde", derive(Serialize))on a handful of existing public types — additive even with the feature on. #[non_exhaustive]on top-level types so future field additions are minor.
- Reset/clear counters? No. Monotonic. Diff snapshots for rate.
- Per-NTSTATUS histogram? Not now. Possible as a future optional field, gated behind a sub-feature.
- Track open file handles? No. Library doesn't own them. Caller folds in.
- SMB3 multi-channel? Not implemented. "Channels" in smb2 here means open connections (primary + DFS-extra) plus the in-flight set on each.
- Add
tracingspans? No. Keeplog. Atracingfeature can land later without touching diagnostics. - Polling vs event stream? Snapshot only. State is state-shaped; events that exist (oplock break, session expiry, DFS failover) already log. Polling at 1 Hz is essentially free. See "Why".
Connection::diagnostics()andSmbClient::diagnostics()are public, documented, tested.- All counters listed under
MetricsandClientMetricsare wired, including thedispatchpath. Displayimpl renders the example shown above (close — exact wording is editorial).serdefeature compiles green; JSON round-trip test passes;Capabilitiesserializes as bits.examples/diagnostics.rsruns against a Docker container and prints a sensible snapshot.just checkgreen;just check-livegreen;just test-dockergreen.README.md,AGENTS.md,src/client/CLAUDE.md,CHANGELOG.mdupdated; M4 stale gotcha removed.- Counter set documented as eventually consistent, surviving teardown, resetting on reconnect.
Addresses second-round fresh-eyes review (Opus). Material changes:
-
Caller-drop counter rewired to the correct receiver-loop branch. v2 incorrectly attributed caller-drop to the
Nonebranch (msg_id not in map). The actual branch isSome(tx)+tx.send().is_err(). Counter set split:responses_routed_ok—Some(tx)+ delivered Okresponses_routed_err—Some(tx)+ delivered Err (signature / session-expired)responses_late_after_drop—Some(tx)+ send failed (caller-drop)responses_stray—None(true orphan)
Disjoint by construction; their sum equals every sub-frame whose msg_id was looked up.
-
wire_bytes_sentenumerated correctly via a helper. v2 listed "four sites plus cancel" but the actual count is ~10 once compressed/encrypted branches are included. Added a privateInner::send_and_count(bytes)helper that wrapssender.sendand bumps the counter. Everyinner.sender.send(...)call site flips to it. Can't drift. -
response_splits_observeddeferred to a follow-up. v2's site ("caller-side compound assembly") was hand-waved against an actor model that doesn't carry the needed metadata. Wiring it requires "source-frame-index" tagging through the demux — not in scope for this PR. The DEBUG log line in the receiver loop still aids manual diagnosis today. -
decompress_failuresadded. v2 had counters for decrypt and malformed but not for the decompress fatal-frame branch (connection.rs:~1325). Added its own counter so each of the four fatal-frame paths has a distinct attribution. -
malformed_framesscope made explicit. Covers both thesplit_compoundparse-failure branch and theprepare_sub_frameparse-failure branch. Documented on the field. -
requests_sentincludes negotiate / session-setup messages. Funnel-counted viaallocate_msg_id. Documented on the field so consumers aren't surprised by non-zero counts pre-app-traffic. -
requests_returned_errper-compound, not per-sub-op. Disambiguated in the field comment: outer Err only; inner per-sub-op errors don't tick it. -
session_expired_eventsper sub-frame explicitly documented (compound with N expired sub-ops ticks N times). -
Lock order corrected. Removed
params(it'sOnceLock, not a mutex). Listed only the locks the snapshot actually touches. Added a note aboutpreauth_hasher/receiver_tasknot being touched by the snapshot. -
Displayshowcase output rewritten with rawu64byte counts so it matches the "Display is raw" contract. Humanized example moved to a sentence aboutexamples/diagnostics.rs. -
Mock STATUS_PENDING helper specified concretely. Two viable designs documented; implementer picks the smaller. Previous version promised a helper that couldn't work against existing
auto_rewritemechanics.
Plan v2 vs v1 changelog (kept for history):
- Per-DFS-target session model:
SessionDiagnosticsmoved ontoConnectionDiagnostics::session. dispatchpath counted:requests_sentticks insideallocate_msg_id.- v2 introduced
responses_orphanedto "capture caller-drop" — v3 corrects the branch (see #1 above). - Byte semantics: wire-layer (post-encrypt on send, pre-decrypt on recv).
- v2 renamed
compound_splits→response_splits_observed; v3 defers the counter entirely. oplock_breaks_received→unsolicited_notifications_received.- Counter teardown invariant documented.
- Reconnect counter-reset semantics documented.
- Lock order initially specified (corrected in v3 #9).
#[non_exhaustive]vsPartialEq: noPartialEq,#[non_exhaustive]only on top-level containers.humanize_bytesmoved to the example binary.- Re-used types serde audit with per-type plan;
CapabilitiesmanualSerialize. --jsonflag withoutserde: stderr error + exit 2.- Example binary in
examples/; docker integration tests in existingtests/docker_integration.rs. - "Subscribe vs poll" rationale spelled out in "Why".
Displayimpl on top level only.