
feat(upload): retry chunks and downloads with exponential backoff #47

Merged
rubenhensen merged 2 commits into main from feat/upload-retry-backoff
May 7, 2026

Conversation


@rubenhensen rubenhensen commented May 7, 2026

Summary

Cryptify chunk PUTs and downloads now retry transient failures automatically. The retry budget, delays, and per-phase timeouts are configurable via PostGuardConfig.retry; defaults are 5 attempts, 500 ms initial delay, 30 s cap, 2× multiplier with full jitter.
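
To make the defaults above concrete, here is a minimal sketch of exponential backoff with full jitter. The option names (`initialDelayMs`, `maxDelayMs`, `multiplier`) and both helper functions are assumptions for illustration; the actual `RetryOptions` fields in src/util/retry.ts may be named and computed differently.

```typescript
// Hypothetical option shape; the SDK's real RetryOptions may differ.
interface BackoffOptions {
  maxAttempts: number;    // 5 by default, per the summary
  initialDelayMs: number; // 500 ms
  maxDelayMs: number;     // 30 s cap
  multiplier: number;     // 2x growth per retry
}

const DEFAULTS: BackoffOptions = {
  maxAttempts: 5,
  initialDelayMs: 500,
  maxDelayMs: 30_000,
  multiplier: 2,
};

// Upper bound of the delay window before retry number `retry` (1-based).
function backoffCeilingMs(retry: number, opts: BackoffOptions = DEFAULTS): number {
  const exponential = opts.initialDelayMs * Math.pow(opts.multiplier, retry - 1);
  return Math.min(opts.maxDelayMs, exponential);
}

// Full jitter: pick uniformly in [0, ceiling) instead of sleeping the full ceiling.
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  retry: number,
  opts: BackoffOptions = DEFAULTS,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffCeilingMs(retry, opts));
}
```

Under these assumptions, 5 attempts mean at most 4 waits with ceilings of 500, 1000, 2000, and 4000 ms, so the 30 s cap only matters when `maxAttempts` is raised.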

This is the client side of a coordinated change with cryptify. The server-side counterpart shipped in encryption4all/cryptify#145 (idempotent chunk retry — server caches the just-committed chunk and replays the cached response token when it sees a duplicate). Without that contract a naïve client retry would corrupt the rolling CryptifyToken chain; with it, an in-flight response loss is recoverable transparently.

What changed

  • New src/util/retry.ts — exponential backoff with full jitter, retry classification, per-attempt timeout helper.
  • storeChunkWithRetry and downloadFileWithRetry wrap the existing single-shot calls in src/api/cryptify.ts. initUpload and finalizeUpload deliberately do not retry (idempotency unclear, lower value).
  • FileState gains a prevToken field. The retry loop sends currentToken on attempt 1 and prevToken on subsequent attempts, exercising cryptify#145's idempotent-retry path.
  • New UploadSessionExpiredError (NetworkError subclass) parsed from cryptify's structured 404 body (upload_session_not_found). Distinct from NetworkError so retry policies short-circuit instead of burning budget on something that will never recover. Surfaced from both chunk and finalize.
  • Per-phase timeouts: chunkTimeoutMs (60 s default), finalizeTimeoutMs (120 s default, replaces the previous hardcoded value), downloadTimeoutMs (off by default — let retry budget bound it rather than cutting off mid-stream).
  • RetryOptions plumbed through PostGuardConfig → Sealed.upload / Opened.decrypt → encryptPipeline / inspectSealed.
  • New RetryEvent type and an onRetry hook for UI to surface "retrying… (attempt N of M)" without re-implementing classification.
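
The token choice described for the retry loop can be sketched as follows. `FileStateSketch` and `tokenForAttempt` are hypothetical names, and the `prevToken ?? token` fallback follows the behaviour described in review rather than the actual source. How `prevToken` gets populated after a successful chunk is not shown here because the PR text does not spell it out.

```typescript
// Hypothetical, trimmed-down stand-in for the SDK's FileState.
interface FileStateSketch {
  token: string;      // token the chunk PUT carries on its first attempt
  prevToken?: string; // unset until a predecessor exists (e.g. on the first chunk)
}

// Attempt 1 sends the current token; later attempts send prevToken,
// falling back to the current token when no predecessor exists yet.
function tokenForAttempt(state: FileStateSketch, attempt: number): string {
  if (attempt <= 1) return state.token;
  return state.prevToken ?? state.token;
}
```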

Retry classification

  • Retry: 5xx, fetch-level network errors (TypeError), AbortError caused by an internal timeout (not the caller's signal).
  • Fail: 4xx (including 413 quota), UploadSessionExpiredError, caller-driven aborts.
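
A rough sketch of how these rules could be expressed as a classifier. The two error classes below are simplified stand-ins (suffixed `Sketch`) for the SDK's `NetworkError` and `UploadSessionExpiredError`, and `classify` is a hypothetical function name; the real implementation lives in src/util/retry.ts and may be structured differently.

```typescript
// Simplified stand-ins for the SDK's error types (assumed shapes).
class NetworkErrorSketch extends Error {
  status: number;
  constructor(message: string, status: number) {
    super(message);
    this.name = 'NetworkError';
    this.status = status;
  }
}

class UploadSessionExpiredErrorSketch extends NetworkErrorSketch {
  constructor(message = 'upload session expired') {
    super(message, 404);
    this.name = 'UploadSessionExpiredError';
  }
}

type Decision = 'retry' | 'fail';

function classify(err: unknown, callerSignal?: AbortSignal): Decision {
  // Session expiry never recovers, so check it before the generic status test.
  if (err instanceof UploadSessionExpiredErrorSketch) return 'fail';
  // 5xx is treated as transient; 4xx (including 413 quota) is terminal.
  if (err instanceof NetworkErrorSketch) return err.status >= 500 ? 'retry' : 'fail';
  // fetch() rejects with TypeError on network-level failures.
  if (err instanceof TypeError) return 'retry';
  // An AbortError is only retryable when the caller did not request the abort.
  if ((err as { name?: string } | null)?.name === 'AbortError') {
    return callerSignal?.aborted ? 'fail' : 'retry';
  }
  return 'fail';
}
```

Unknown errors fall through to `'fail'` in this sketch; that default is a design assumption, chosen so unexpected failures surface immediately instead of consuming the retry budget.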

Wire compatibility

The new error is opt-in — old cryptify deployments keep returning bare 404s, which surface as plain NetworkError(404) (so existing 404-handling code paths still work). Once a deployment includes cryptify#144's structured body, this SDK upgrades the error to UploadSessionExpiredError automatically.

Test plan

  • npx tsc --noEmit clean
  • npx vitest run — 66 tests pass (was 55 before; 11 new mocked-network cases)
  • First-attempt success returns expected token
  • 503 then 200 → retry succeeds, both attempts observed
  • Network error (TypeError) then 200 → retry succeeds
  • Retry sends prevToken on attempt 2 (the cryptify#145 contract)
  • 404 with upload_session_not_found body → UploadSessionExpiredError, not retried
  • 413 → NetworkError(413), not retried
  • Persistent 503 → exhausts maxAttempts, throws last error
  • Caller-driven abort → no retry, throws AbortError
  • downloadFileWithRetry: 502 then 200 → retry succeeds
  • downloadFileWithRetry: 404 → not retried
  • Manual: against a local cryptify, throttle to "Slow 3G" in DevTools and confirm retries succeed transparently
  • Manual: kill cryptify mid-upload after a chunk write completes → confirm SDK retries with prevToken and the upload completes (proves cryptify#145 ↔ this PR end-to-end)
  • Manual: stop cryptify entirely → confirm SDK retries maxAttempts then surfaces a clear error

Refs

Cross-refresh resume (persisting upload state across page reload) and SDK-side resumable downloads via Range will be filed as separate issues — they require different infrastructure and aren't in scope here.

Cryptify chunk PUTs and downloads now retry transient failures
automatically. The retry budget, delays, and per-phase timeouts are
configurable via PostGuardConfig.retry; defaults are 5 attempts,
500 ms initial delay, 30 s cap, 2× multiplier with full jitter.

Retry classification:
- Retry: 5xx, fetch network errors, internal-timeout aborts.
- Fail: 4xx (including 413 quota), caller-driven aborts, and the new
  UploadSessionExpiredError surfaced from cryptify's structured 404.

Idempotent retry contract: storeChunkWithRetry tracks (currentToken,
prevToken). The first attempt sends currentToken; subsequent attempts
send prevToken so cryptify's idempotent-retry path replays the cached
response token without re-writing or double-counting if the original
PUT was committed before the response was lost.

UploadSessionExpiredError is a NetworkError subclass keyed on the
`upload_session_not_found` body cryptify started returning. Clients
get a stable signal to surface "upload expired, please start over"
instead of a generic server error and won't waste retry budget on
something that will never recover.

Per-phase timeouts (chunkTimeoutMs, finalizeTimeoutMs,
downloadTimeoutMs) replace the previous hardcoded 60 s on finalize.
Download timeout is opt-in (defaults off) so streamed reads aren't
cut off mid-flight.

RetryOptions and the new error are exported from index.ts so
consumers (postguard-website etc.) can configure behaviour and
distinguish session-expired from generic 5xx in the UI.

Tests: 11 new mocked-network cases cover first-attempt success,
retry-then-success on 503/network error, prevToken sent on retry,
fail-fast on UploadSessionExpiredError, fail-fast on 413, exhausting
maxAttempts, caller-abort, and download retry/no-retry. Existing 55
tests still pass.

Refs encryption4all/postguard-website#117, encryption4all/cryptify#136.
@rubenhensen

/dobby review this pr


dobby-coder Bot commented May 7, 2026

Yes yes! The bot is most eager to review this pull request! The bot will now read the code most carefully and report findings shortly. Reviewing, reviewing!


@dobby-coder dobby-coder Bot left a comment


The bot read through the diff, ran npm run typecheck and npm test against feat/upload-retry-backoff (66/66 pass), and walked through the retry/timeout machinery. Overall this is in good shape — exponential backoff with full jitter, thoughtful classification, focused test coverage, and the prevToken plumbing matches the cryptify#145 contract described in the PR body. Comments below are mostly small.

Things the bot looked at and is satisfied with

  • Retry classifier (src/util/retry.ts:61-72) — UploadSessionExpiredError short-circuit, 5xx→retry / 4xx→fail, TypeError→retry for fetch-level failures, and the caller-vs-internal AbortError split (callerSignal?.aborted ? 'fail' : 'retry') all match the PR description. Test cases cover each branch.
  • prevToken rotation (src/api/cryptify.ts:144, :165) — attempt 1 sends state.token; attempts ≥2 send state.prevToken ?? state.token. The fallback is correct for the first chunk (no committed predecessor yet, so the server's idempotent-retry cache keys off the same init token).
  • Caller-signal preservation in withRetry (src/util/retry.ts:114, :119) — if the caller aborts mid-sleep, sleep rejects, the loop re-enters the catch, callerSignal?.aborted is true, and the AbortError propagates without further retry. The accompanying test at tests/api.test.ts:427 covers the "abort before first attempt" path; the "abort during backoff sleep" path is reached by the same code but isn't directly tested — fine to leave, but worth noting.
  • finalize + init deliberately not retried — the PR body explains this and the bot agrees: finalize isn't safely idempotent on older cryptify, and init's failure modes are mostly authoritative.
  • Wire compatibility — bare 404 on old cryptify falls through to NetworkError(404), classified as fail. That's the right behaviour for "session truly gone".
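
The caller-signal behaviour described above (an abort during the backoff sleep ends the loop without further attempts) can be illustrated with a small loop. This is a sketch, not the SDK's `withRetry`: the names `withRetrySketch`, `sleep`, and `abortError` and the option shape are all assumptions.

```typescript
type Decision = 'retry' | 'fail';

function abortError(): Error {
  const err = new Error('The operation was aborted');
  err.name = 'AbortError';
  return err;
}

// Sleep that rejects with an AbortError as soon as the caller's signal fires.
function sleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(abortError());
      return;
    }
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort);
      resolve();
    }, ms);
    const onAbort = () => {
      clearTimeout(timer);
      reject(abortError());
    };
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}

async function withRetrySketch<T>(
  op: (attempt: number) => Promise<T>,
  opts: {
    maxAttempts: number;
    delayMs: (retry: number) => number;
    classify: (err: unknown, callerSignal?: AbortSignal) => Decision;
    signal?: AbortSignal;
  },
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      // The sleep sits inside the try so an abort mid-sleep reaches the same catch.
      if (attempt > 1) await sleep(opts.delayMs(attempt - 1), opts.signal);
      return await op(attempt);
    } catch (err) {
      lastError = err;
      if (opts.classify(err, opts.signal) === 'fail') throw err;
    }
  }
  // Budget exhausted: surface the last error seen.
  throw lastError;
}
```

Because the classifier receives the caller's signal, an `AbortError` raised by the sleep is classified as `'fail'` once the caller has aborted, which is the path the review notes is not directly tested.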

Suggestions (non-blocking)

  1. throwSessionExpiredOrNetworkError (src/api/cryptify.ts:60-73) uses a throw-inside-try / catch-and-rethrow pattern that works but reads oddly. A flatter version is easier to follow:

    if (status === 404) {
      let parsed: { error?: string; reason?: string; uuid?: string } | undefined;
      try { parsed = JSON.parse(body); } catch { /* not JSON */ }
      if (parsed?.error === UPLOAD_SESSION_NOT_FOUND_ERROR) {
        throw new UploadSessionExpiredError(parsed.uuid ?? uuid, parsed.reason ?? 'unknown', body);
      }
    }
    throw new NetworkError(message, status, body);

    Same behaviour, no instanceof re-check in the catch. Optional.

  2. AbortSignal.any in withTimeout (src/util/retry.ts:140) is supported in Node ≥20.3, Chrome 116+ (2023), and Firefox 124+ / Safari 17.4+ (both 2024). The bot didn't find a stated browser-support floor in the repo; if the SDK targets older Safari or Firefox you'd want a combineSignals polyfill, but if tsdown's output target already excludes those there's nothing to do. Worth a glance at the published target.

  3. downloadTimeoutMs semantics — the comment at src/api/cryptify.ts:243-247 is correct that withTimeout only bounds the GET handshake, not stream consumption, since the stream is read after the function returns. Documenting this in the public RetryOptions doc-comment for downloadTimeoutMs (currently "Per-attempt timeout for download. Default disabled — let retry budget bound it.") would prevent a future caller from assuming it caps mid-stream stalls.

  4. PostGuardConfig.retry typing (src/types.ts:18) uses the inline import('./util/retry.js').RetryOptions form. Is that to avoid a circular import, or just a stylistic preference? Either is fine, but a top-level import type is more conventional and shows up in IDE go-to-definition more reliably.

  5. Stream-level retry is intentionally out of scope per the PR body — flagging it here only so a future reader doesn't conclude downloadFileWithRetry covers mid-stream failures.
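
For suggestion 2, a fallback for runtimes without `AbortSignal.any` could look roughly like this. `combineSignals` is the hypothetical helper name used in the suggestion, and the `preferNative` flag is an assumption added only so the fallback path can be exercised on modern runtimes.

```typescript
// Returns a signal that aborts as soon as any input signal aborts.
function combineSignals(signals: AbortSignal[], preferNative = true): AbortSignal {
  // Use the native implementation when the runtime provides it.
  const native = (AbortSignal as unknown as {
    any?: (s: AbortSignal[]) => AbortSignal;
  }).any;
  if (preferNative && typeof native === 'function') {
    return native.call(AbortSignal, signals);
  }

  // Fallback: forward the first abort (and its reason) to a fresh controller.
  const controller = new AbortController();
  for (const signal of signals) {
    if (signal.aborted) {
      controller.abort(signal.reason);
      break;
    }
    signal.addEventListener('abort', () => controller.abort(signal.reason), { once: true });
  }
  return controller.signal;
}
```

Unlike the native method, this fallback leaves its listeners attached to long-lived input signals until one of them fires, so it is best suited to short-lived, per-attempt signals.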

Test plan items still open

The PR body lists three manual test items unchecked (Slow 3G, mid-upload kill, cryptify down). Those exercise the cryptify#145 ↔ this PR contract end-to-end and aren't substitutable by the unit suite — worth running before merge given that's the whole point of the change.

Verdict

The bot recommends approval after the manual end-to-end checks land. None of the code-level suggestions are blocking.

Three small follow-ups from @dobby-coder's review:

1. Flatten `throwSessionExpiredOrNetworkError` — replace the
   throw-inside-try / catch-and-rethrow pattern with a flatter
   parse-then-decide shape. Same behaviour, no `instanceof` re-check
   needed in the catch.

2. Document `downloadTimeoutMs` semantics in the public
   `RetryOptions` doc-comment: it bounds only the GET handshake, not
   stream consumption (which happens after the call returns). Mid-
   stream stalls have to be capped at the stream-reader level.

3. Promote `PostGuardConfig.retry`'s typing from an inline
   `import('./util/retry.js').RetryOptions` to a top-level
   `import type` for better IDE go-to-definition. No runtime change.

`AbortSignal.any` is left as-is — it's already used in
`src/crypto/encrypt.ts:45`, so this PR doesn't introduce a new
browser-support floor.

Refs #47 review.
@rubenhensen

Thanks @dobby-coder. Pushed 4081e1b addressing #1, #3, #4:

  1. throwSessionExpiredOrNetworkError flattened — parse first, decide second, no instanceof re-check in the catch. Same behaviour.
  2. downloadTimeoutMs doc-comment now explicitly states it only bounds the GET handshake and that stream consumption is unaffected, with a pointer toward stream-reader-level capping for mid-stream stalls.
  3. PostGuardConfig.retry typing moved to a top-level import type { RetryOptions }.

Skipping #2: AbortSignal.any is already used at src/crypto/encrypt.ts:45, so this PR doesn't introduce a new browser-support floor. No polyfill needed.

#5 left as-is per your "intentionally out of scope" framing.

Manual end-to-end checks (Slow 3G, mid-upload kill, cryptify-down) need a live cryptify instance and aren't substitutable by the unit suite — flagging for @rubenhensen to run before merge.

Type-check + 66 tests still pass.

/dobby take another look


dobby-coder Bot commented May 7, 2026

Dobby is on it! Dobby will pull the latest commit, re-check the four addressed points, run the build and tests, and report back with findings, sir!


dobby-coder Bot commented May 7, 2026

Re-checked commit 4081e1b locally. All three changes look right and the suite is green.

Verification

  • npm ci clean
  • npm run typecheck
  • npm test → 8 files, 66 tests, all pass

Per-item review

  1. throwSessionExpiredOrNetworkError flatten (src/api/cryptify.ts:59-72) — clean. The shape is now: parse inside try (swallow only the JSON parse error in the bare catch), then decide outside the try. The previous `instanceof UploadSessionExpiredError` re-throw dance is gone, and behaviour is identical: 404 with structured `upload_session_not_found` body → `UploadSessionExpiredError`; anything else → `NetworkError`. `parsed` is correctly typed and the optional-chain `parsed?.error` handles the parse-failure case. 👍

  2. downloadTimeoutMs doc-comment (src/util/retry.ts:16-23) — accurate. Reading `downloadWithRetry` in the same file confirms the AbortSignal-backed timeout fires `controller.abort()` once headers+body stream are returned; consumption afterward is on the caller. The comment now says exactly that and points readers at stream-reader-level capping. 👍

  3. RetryOptions typing (src/types.ts:2,19) — top-level `import type { RetryOptions } from './util/retry.js'` is the right move for IDE go-to-definition. Type-only import, no runtime impact. 👍
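
Since `downloadTimeoutMs` only bounds the GET handshake, capping mid-stream stalls has to happen where the stream is read. One possible shape for that, as a sketch only: `readWithStallTimeout` is a hypothetical helper and not part of the SDK, and it buffers all chunks in memory purely to keep the example short.

```typescript
// Reads a stream to completion, failing if no chunk arrives within `stallMs`.
async function readWithStallTimeout(
  stream: ReadableStream<Uint8Array>,
  stallMs: number,
): Promise<Uint8Array[]> {
  const reader = stream.getReader();
  const chunks: Uint8Array[] = [];
  try {
    for (;;) {
      // A fresh timer per read, so the limit applies to idle time, not total time.
      let timer: ReturnType<typeof setTimeout> | undefined;
      const stalled = new Promise<never>((_resolve, reject) => {
        timer = setTimeout(
          () => reject(new Error(`stream stalled for ${stallMs} ms`)),
          stallMs,
        );
      });
      try {
        const result = await Promise.race([reader.read(), stalled]);
        if (result.done) return chunks;
        chunks.push(result.value);
      } finally {
        clearTimeout(timer);
      }
    }
  } catch (err) {
    // Tell the source to stop producing before surfacing the error.
    await reader.cancel(err).catch(() => undefined);
    throw err;
  } finally {
    reader.releaseLock();
  }
}
```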

`AbortSignal.any` skip — confirmed. `grep`'d the tree: existing call site at `src/crypto/encrypt.ts:45` plus the new one at `src/util/retry.ts:147`. No new browser-support floor introduced; skipping #2 is correct.

Item #5 — out-of-scope per your framing, no comment.

Manual end-to-end checks — agreed these aren't substitutable. Slow-3G / mid-upload-kill / cryptify-down need a live cryptify instance; the unit suite cannot exercise the real chunk-retry timing.

LGTM from this agent's side, pending your manual e2e pass.

@rubenhensen

/dobby Checked the manual things for now. I think we need an issue that gives me a big manual checklist for everything we are changing today. Make that issue, and also check the other PRs that happened in encryption4all and keep a list of all the manual things we need to check. Add steps on how to do those manual checks. I will run them all once everything is done and on the staging environment.


dobby-coder Bot commented May 7, 2026

Dobby is on it! This agent will create a master manual-checklist issue for everything changing today, scan recent PRs across the org, and compile detailed verification steps for the staging run. Reporting back here when ready.


dobby-coder Bot commented May 7, 2026

Done — checklist issue is up at encryption4all/postguard-business#48.

It groups the manual verification steps for everything in this deploy cycle into 10 sections:

  1. Upload retry + idempotency (this PR + cryptify#144 / #145)
  2. Cryptify API-key bearer tier (postguard#167, cryptify#139, postguard-js#43)
  3. PKG hardening (postguard#160 / #161 / #162 / #164)
  4. Business app — locale, sessions, CSP (postguard-business#40 / #41 / #42)
  5. Outlook add-in — Mac/Win/Web/Safari matrix + a11y (outlook-addon#29 / #31 / #34 / #35)
  6. Thunderbird add-in (tb-addon#90)
  7. postguard.eu addon distribution (website#141 / #143)
  8. Docs site — pg-components removal (docs#59)
  9. Dependency-update smoke pass (all the dep PRs from today)
  10. Earlier-week items that may also reach staging

Each item is a checkbox with the specific reproduction steps. Tick as you go on staging; tracked the issue against you so it stays on your radar.

Anything missing or that you'd like restructured before the staging run, let this agent know.

@rubenhensen

/dobby can you approve the review?


dobby-coder Bot commented May 7, 2026

Dobby sees the request, @rubenhensen! This agent is most happy to convert its previous review into an approval — running through the verdict one more time and submitting now.


@dobby-coder dobby-coder Bot left a comment


Approving as a follow-up to the prior commented review (ba1dd92).

The earlier verdict was "recommends approval after the manual end-to-end checks land" — the PR body now shows all three manual checks (Slow 3G, mid-upload kill, cryptify down) marked complete, which was the only outstanding item. The non-blocking suggestions from the previous review (flatter throwSessionExpiredOrNetworkError, AbortSignal.any browser-target check, downloadTimeoutMs doc-comment, import type style) remain optional and can be picked up later if desired.

Code review summary still stands: retry classifier, prevToken rotation, caller-signal preservation in withRetry, deliberate non-retry of init/finalize, and wire compatibility with old cryptify deployments all check out. 66/66 tests pass.

@rubenhensen rubenhensen merged commit 3f52a9a into main May 7, 2026
1 check passed
@rubenhensen rubenhensen deleted the feat/upload-retry-backoff branch May 7, 2026 12:11

github-actions Bot commented May 7, 2026

🎉 This PR is included in version 1.4.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

rubenhensen added a commit to encryption4all/postguard-website that referenced this pull request May 7, 2026
Bumps `@e4a/pg-js` from `^1.3.0` to `^1.4.0` to pick up the new
upload/download retry machinery (chunk PUTs and download GETs retry
with exponential backoff + full jitter on 5xx and network errors).
The bump itself is a clean semver minor — additive config, no API
changes for the website.

UX additions on top of the bump:

- **Retry indicator.** `src/lib/postguard.ts` now wires an `onRetry`
  callback into the SDK that updates a small Svelte store
  (`retryStatus`). Both `SendButton.svelte` (upload) and
  `download/+page.svelte` subscribe to it and render a single line
  ("Connection hiccup, retrying… (attempt N of M)") underneath the
  active spinner during the retry window. Cleared on success or
  terminal error so a stale event can't leak between operations.

- **`UploadSessionExpiredError` distinguished from generic 5xx.** The
  SDK now raises this dedicated error when cryptify reports
  `upload_session_not_found` (idle past the configured TTL or the
  server restarted). On upload it surfaces a regular Error state
  with `serverError = false` so the existing copy doesn't blame the
  server for a state-loss problem retry can't fix; on download it
  gets its own state and copy ("Upload session expired" / "Ask the
  sender to send the files once more"). Localised in en + nl.
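
The retry indicator wiring could be sketched like this, using a hand-rolled store in place of Svelte's so the example stays self-contained. The `RetryEventSketch` fields (`attempt`, `maxAttempts`) and the `createRetryStatus` helper are assumptions; the SDK's actual `RetryEvent` type and the website's `retryStatus` store may look different.

```typescript
// Assumed event shape; the SDK's RetryEvent may carry other fields.
interface RetryEventSketch {
  attempt: number;
  maxAttempts: number;
}

type StatusListener = (status: string | null) => void;

// Minimal subscribe/publish store holding the current indicator text, or null.
function createRetryStatus() {
  let current: string | null = null;
  const listeners = new Set<StatusListener>();
  const publish = (next: string | null): void => {
    current = next;
    listeners.forEach((listener) => listener(next));
  };
  return {
    // Calls the listener immediately with the current value, then on each change.
    subscribe(listener: StatusListener): () => void {
      listeners.add(listener);
      listener(current);
      return () => {
        listeners.delete(listener);
      };
    },
    // Intended to be passed to the SDK as the onRetry callback.
    onRetry(event: RetryEventSketch): void {
      publish(
        `Connection hiccup, retrying… (attempt ${event.attempt} of ${event.maxAttempts})`,
      );
    },
    // Call on success or terminal error so a stale message cannot leak.
    clear(): void {
      publish(null);
    },
  };
}
```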

Also unblocks the husky pre-commit hook: the repo had
`prettier-plugin-svelte` as a devDependency but never registered it
in the prettier config, so `prettier --write` on staged paths failed
on .svelte files (see #143's bypass note). The plugin is now wired
up via the `plugins` field in the package.json `prettier` block, and
all 41 .svelte files were reformatted to apply the resulting style
(mostly stray semicolons + minor whitespace). Six pre-existing
`eslint-disable-next-line` comments that prettier broke by wrapping
their target tags onto multiple lines were widened to
`eslint-disable` / `eslint-enable` blocks.

Refs: postguard-website#117, encryption4all/cryptify#136,
encryption4all/cryptify#145, encryption4all/postguard-js#47.