feat(upload): idempotent retry of the last committed chunk #145
rubenhensen merged 2 commits into main
Conversation
When a chunk PUT response is lost in flight the client retains the old
CryptifyToken (the one it sent) but the server has already advanced the
rolling-token chain. Without server-side help, retry is impossible: the
naïve resend hits 'token mismatch' or 'wrong offset' depending on what
the client tries.
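The failure window can be sketched as follows, assuming the rolling chain has the shape `next_token = H(prev_token || chunk)`. This is a minimal illustration, not the production code: the hash is std's `DefaultHasher` standing in for the real SHA-256 chain, and `tokens_after_lost_response` is an illustrative helper name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the rolling-token hash (assumed shape:
// next_token = H(prev_token || chunk)). DefaultHasher is illustrative
// only; the real chain uses SHA-256.
pub fn compute_hash(prev_token: u64, chunk: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    prev_token.hash(&mut h);
    chunk.hash(&mut h);
    h.finish()
}

// The failure mode: the server commits a chunk and advances t0 -> t1,
// but the response carrying t1 is lost. The client still holds t0, so
// a naive resend presents a token the server no longer expects.
pub fn tokens_after_lost_response(t0: u64, chunk: &[u8]) -> (u64, u64) {
    let t1 = compute_hash(t0, chunk); // server side, after commit
    (t0, t1)                          // client side still holds t0
}
```

Because the chain has already advanced, the server needs a memory of the last committed step to tell a lost-response retry apart from a genuinely bad request.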
Server now caches the just-committed chunk's
(prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token) on
the FileState. The chunk handler is restructured around a
`classify_chunk_request` helper with three outcomes:
- NormalNext: state.uploaded == start && header_token == current_token
→ write, advance, refresh last_chunk
- ReplayLastChunk: every cached field matches AND body sha256 matches
→ return cached response_token, no write, no double-count
- Reject: regular 400s. Stricter 400s when the request looks like a
retry but the body or length differ — never accept different bytes
for the same offset.
The body is now read once at the top of the handler so the same bytes
feed both the replay-detection hash and the normal write path. Per-
upload limit and file-open are still checked only on the normal-write
path; on retry the original PUT already passed those.
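The three outcomes above can be sketched as a pure classifier over the cached state. Field and variant names follow the commit message; `sha256_hex` is a cheap stand-in hasher (not real SHA-256), and the exact handler signature is an assumption, not the actual code.

```rust
// Minimal sketch of the three-way classifier described above.
#[derive(Debug, PartialEq)]
pub enum ChunkClass {
    NormalNext,
    ReplayLastChunk, // answer with the cached response_token, no write
    Reject(&'static str),
}

pub struct LastChunkRecord {
    pub prev_token: String,
    pub prev_uploaded: u64,
    pub chunk_len: u64,
    pub chunk_sha256: String,
    pub response_token: String,
}

pub struct FileState {
    pub uploaded: u64,
    pub current_token: String,
    pub last_chunk: Option<LastChunkRecord>,
}

// Stand-in for a real SHA-256 of the body.
pub fn sha256_hex(body: &[u8]) -> String {
    let mixed = body
        .iter()
        .fold(0u64, |a, b| a.wrapping_mul(31).wrapping_add(*b as u64));
    format!("{:x}", mixed)
}

pub fn classify_chunk_request(
    state: &FileState,
    header_token: &str,
    start: u64,
    body: &[u8],
) -> ChunkClass {
    // Normal forward progress: next expected offset with the current token.
    if state.uploaded == start && header_token == state.current_token {
        return ChunkClass::NormalNext;
    }
    // Retry candidate: offset and token match the just-committed chunk.
    if let Some(last) = &state.last_chunk {
        if last.prev_uploaded == start && last.prev_token == header_token {
            if body.len() as u64 == last.chunk_len && sha256_hex(body) == last.chunk_sha256 {
                return ChunkClass::ReplayLastChunk;
            }
            // Looks like a retry, but the bytes differ: never accept
            // different content for the same offset.
            return ChunkClass::Reject("retry body does not match committed chunk");
        }
    }
    ChunkClass::Reject("Content-Range or token mismatch")
}
```

Keeping the classifier free of I/O is what makes the seven unit tests below cheap to write: each case is just a `FileState`, a header token, an offset, and a body.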
api-description.yaml documents the contract on PUT /fileupload/{uuid}.
Tests: seven new unit tests cover NormalNext, ReplayLastChunk match,
length mismatch, body-hash mismatch, offset mismatch with no replay
record, token mismatch at correct offset, and prev_token mismatch with
a replay record present (falls through to regular Content-Range 400).
Refs #136 (resume protocol design from
@dobby-coder's audit), encryption4all/postguard-website#117.
/dobby review this
Dobby is here, Master Ruben! Dobby reads the request and Dobby will review the chunk-retry protocol with great care! Dobby will check the classifier logic, the tests, and the wire-compat story — then Dobby will leave a proper review!
Review
Solid PR. The classifier abstraction reads well, the test coverage is thorough (especially the negative cases), and the wire-compat story checks out — old clients never present a (prev_token, prev_uploaded) pair that matches a cached last_chunk, so they fall straight through to the existing paths.
I built and ran the suite locally on feat/idempotent-chunk-retry: cargo test --all-targets → 49 passed, 0 failed. CI's Rust-quality job is green; the amd64/arm64 image builds were still pending at review time.
Things I'd consider before merge
- Body is now read before the per-upload-limit and token checks. Pre-PR ordering rejected a request with the wrong `CryptifyToken` or wrong offset before consuming any body. Now the body — up to `chunk_size` (~1 MiB) — is always read into memory first so the classifier can hash it. UUIDs are unguessable, so this is narrow, but it does slightly widen the work an attacker with a leaked UUID can force per request. Probably acceptable; flagging in case you want to gate the body read behind a cheap pre-check (e.g., only read the body when `state.uploaded == start` or `last_chunk.prev_uploaded == start`).
- `chunk_sha256` is arguably redundant. `compute_hash(last.prev_token, body) == last.response_token` already proves the body is byte-identical (modulo a SHA-256 collision, which would also break the rolling-token chain itself). You could drop `chunk_sha256` (and `chunk_len`, since `body.len()` is implicit in the hash input) from `LastChunkRecord` and verify replay by recomputing the rolling hash. Two upsides: a smaller `FileState`, and one less invariant to keep in sync when Phase 2 moves this to Postgres. Downside: the current explicit fields are more obvious to a reader. Judgment call — not a blocker.
- Duplicated literal `"Cryptify Token header does not match"` in the classifier (`src/main.rs:615`) vs. `check_cryptify_token` (still used by finalize). The comment notes the wording is intentional; a `const TOKEN_MISMATCH_MSG: &str` referenced by both call sites would prevent silent drift if the message ever changes.
- Minor: two `shasum.clone()` calls (`src/main.rs:546` and `:553`) where one would do — the final `CryptifyToken(shasum)` could take the moved value, and `state.cryptify_token` / `response_token` could share a single clone.
- Phase 2 note (FYI, not for this PR): when `last_chunk` migrates to Postgres, `response_token` becomes a row that must be written atomically with the file write + `state.uploaded` advance, otherwise a server crash between "write file" and "persist replay record" reintroduces exactly the scenario this PR fixes. Worth keeping in mind for the migration design — probably wants the chunk write to be idempotent at the storage layer too, not just the response-replay layer.
Nothing here is blocking. Approve from this agent's side once CI's image builds finish.
Four small follow-ups from @dobby-coder's review:

1. Cheap pre-check before reading the chunk body. Mirrors the structural part of `classify_chunk_request` (normal-next OR replay-candidate) and rejects mismatches before consuming up to `chunk_size` bytes, restoring the prior security profile against UUIDs leaked to an attacker.
2. Drop `chunk_sha256` and `chunk_len` from `LastChunkRecord`. Body identity falls out of the rolling-token chain itself: recompute `compute_hash(prev_token, body)` and compare to `response_token`. Length divergence surfaces as a hash mismatch — one less invariant to keep in sync, smaller `FileState`, simpler Phase 2 schema. The length-specific test folds into the body-mismatch test.
3. Promote the duplicated "Cryptify Token header does not match" string to a `TOKEN_MISMATCH_MSG` const referenced by both `check_cryptify_token` (finalize path) and the chunk classifier, so the wording can't drift silently between call sites.
4. One fewer `shasum.clone()` — the final responder takes the moved value while `state.cryptify_token` and `last_chunk.response_token` share a single clone. `sha256_of` is removed (sole caller is gone).

Tests gain a `last_chunk_for(prev_token, offset, body)` helper that builds a `LastChunkRecord` whose `response_token` is the actual rolling hash, matching the production handler's invariant.

Refs #145 review.
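Follow-up 2 in a nutshell: once `response_token` is itself the rolling hash of `(prev_token, body)`, recomputing it is a sufficient replay check on its own. A sketch under that assumption — `compute_hash` stands in for the production hash (here std's `DefaultHasher`, not SHA-256), and `is_exact_replay` is an illustrative name, not the actual function:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the production rolling-token hash.
pub fn compute_hash(prev_token: &str, body: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    prev_token.hash(&mut h);
    body.hash(&mut h);
    format!("{:016x}", h.finish())
}

// Slimmed record: no chunk_len, no chunk_sha256.
pub struct LastChunkRecord {
    pub prev_token: String,
    pub prev_uploaded: u64,
    pub response_token: String, // doubles as the body-identity check
}

pub fn is_exact_replay(
    last: &LastChunkRecord,
    header_token: &str,
    start: u64,
    body: &[u8],
) -> bool {
    last.prev_uploaded == start
        && last.prev_token == header_token
        // Recomputing the rolling hash proves the bytes are identical:
        // any length or content divergence changes the hash.
        && compute_hash(&last.prev_token, body) == last.response_token
}
```

The length check disappears because the body length is an input to the hash: a truncated or padded retry body simply fails the final comparison.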
Thanks @dobby-coder. Pushed.
Skipping #5 here since it's a Phase 2 concern. Filing it as a tracking note on the Postgres-sessions design (the persist-and-advance ordering will need to be transactional with the file write to keep the same guarantees). 49 tests pass, fmt + clippy clean.
Bumps `@e4a/pg-js` from `^1.3.0` to `^1.4.0` to pick up the new
upload/download retry machinery (chunk PUTs and download GETs retry
with exponential backoff + full jitter on 5xx and network errors).
The bump itself is a clean semver minor — additive config, no API
changes for the website.
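The retry schedule named here — exponential backoff with full jitter — is language-agnostic: `sleep = uniform(0, min(cap, base * 2^attempt))`. A dependency-free sketch of that schedule, written in Rust to match the server-side snippets in this thread; `Lcg` and `backoff_ms` are illustrative names and have nothing to do with the actual `@e4a/pg-js` API.

```rust
// Tiny deterministic PRNG so the sketch needs no external crates.
pub struct Lcg(pub u64);

impl Lcg {
    // Uniform value in [0, 1).
    pub fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

// Full jitter: draw the whole delay uniformly from [0, ceiling), where
// ceiling doubles per attempt up to a hard cap. Full jitter spreads
// retries out, so simultaneous failures don't retry in lockstep.
pub fn backoff_ms(attempt: u32, base_ms: u64, cap_ms: u64, rng: &mut Lcg) -> u64 {
    let ceiling = cap_ms.min(base_ms.saturating_mul(1u64 << attempt.min(20)));
    (rng.next_f64() * ceiling as f64) as u64
}
```

With a 500 ms base and a 30 s cap, attempt 0 sleeps somewhere in [0, 500) ms, attempt 1 in [0, 1000) ms, and so on until the cap dominates.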
UX additions on top of the bump:
- **Retry indicator.** `src/lib/postguard.ts` now wires an `onRetry`
callback into the SDK that updates a small Svelte store
(`retryStatus`). Both `SendButton.svelte` (upload) and
`download/+page.svelte` subscribe to it and render a single line
("Connection hiccup, retrying… (attempt N of M)") underneath the
active spinner during the retry window. Cleared on success or
terminal error so a stale event can't leak between operations.
- **`UploadSessionExpiredError` distinguished from generic 5xx.** The
SDK now raises this dedicated error when cryptify reports
`upload_session_not_found` (idle past the configured TTL or the
server restarted). On upload it surfaces a regular Error state
with `serverError = false` so the existing copy doesn't blame the
server for a state-loss problem retry can't fix; on download it
gets its own state and copy ("Upload session expired" / "Ask the
sender to send the files once more"). Localised in en + nl.
Also unblocks the husky pre-commit hook: the repo had
`prettier-plugin-svelte` as a devDependency but never registered it
in the prettier config, so `prettier --write` on staged paths failed
on .svelte files (see #143's bypass note). The plugin is now wired
up via the `plugins` field in the package.json `prettier` block, and
all 41 .svelte files were reformatted to apply the resulting style
(mostly stray semicolons + minor whitespace). Six pre-existing
`eslint-disable-next-line` comments that prettier broke by wrapping
their target tags onto multiple lines were widened to
`eslint-disable` / `eslint-enable` blocks.
Refs: postguard-website#117, encryption4all/cryptify#136,
encryption4all/cryptify#145, encryption4all/postguard-js#47.
Summary
When a chunk PUT response is lost in flight the client retains the previous `CryptifyToken` (the one it sent), but the server has already advanced the rolling-token chain. Without server-side help, retry is impossible: the naïve resend hits "token mismatch" or "wrong offset" depending on what the client tries.
This PR adds the missing protocol piece — server-side detection of duplicate retries — so client-side retry (coming next in postguard-js) can actually work.
How it works
The server now caches the most recently committed chunk's `(prev_token, prev_uploaded, chunk_len, chunk_sha256, response_token)` on the `FileState` (in-memory for now; will migrate to Postgres in Phase 2). The chunk handler is restructured around a `classify_chunk_request` helper with three outcomes:

- `NormalNext` — `state.uploaded == start && header_token == current_token`. Write, advance, refresh `last_chunk`.
- `ReplayLastChunk` — every cached field matches AND the body's SHA-256 matches the stored hash. Return the cached `response_token`, no write, no double-count.
- `Reject` — regular 400s. Stricter 400s when the request looks like a retry (matching `prev_token` + offset) but the body or length differ — never accept different bytes for the same offset.

The body is now read once at the top of the handler so the same bytes feed both the replay-detection hash and the normal write path. Per-upload-limit and file-open are still checked only on the normal-write path; on retry the original PUT already passed those.
Wire compatibility
Existing clients that don't retry see no change. The contract only kicks in when a client sends `(prev_token, prev_uploaded)` matching the cached record — old clients never do. Documented in `api-description.yaml` so future clients (postguard-js v1.4 in particular) can rely on it.
Test plan
7 new unit tests cover the classifier:

- `classify_normal_next_chunk` — happy path
- `classify_replays_last_chunk_on_matching_retry` — full match returns the cached token
- `classify_rejects_retry_with_different_body` — same prev_token + offset, body bytes differ → 400
- `classify_rejects_retry_with_different_length` — same prev_token + offset, length differs → 400
- `classify_rejects_offset_mismatch_with_no_replay` — no `last_chunk` cached → regular 400
- `classify_rejects_token_mismatch_at_correct_offset` — right offset, wrong token, no replay match → 400 with the existing token-mismatch message
- `classify_does_not_replay_when_prev_token_does_not_match` — `last_chunk` cached but a different `prev_token` → falls through to regular Content-Range 400

CI gate:

- `cargo fmt --check` passes
- `cargo clippy --all-targets -- -D warnings` passes
- `cargo test --all-targets` passes (49 tests, none regressed)

Refs
The deliberately-deferred status-endpoint design from #136 (cross-refresh resume) will be filed as a separate issue once Phase 2 (Postgres-backed sessions) is in place to support it.