Concurrency hardening, commit rollback, and CI bumps #12

Merged
d4rken merged 3 commits into main from concurrency-and-commit-fixes on Apr 29, 2026

@d4rken d4rken commented Apr 29, 2026

Summary

  • Prevent device resurrection without serializing auth — concurrent updateDevice + deleteDevice could resurrect a deleted device because per-device sync was acquired without coordinating against the global devices map. Holding the global mutex through file I/O would have stalled all auth on the server behind any slow per-device operation. New design holds the global mutex only for short critical sections (map ops, identity-pinned commit) and uses per-instance reference equality + a PendingDeviceDeletion tracker (CompletableDeferred indexed by both DeviceKey and DeviceId) to gate createDevice against in-flight path deletions.
  • Roll back consumed blobs and quota on commit failure: the old rollbackMoves() only unlinked moved blob files; it didn't release reservations or remove the upload sessions whose payloads had been consumed. A commit that failed after consumeAndMoveCompletedBlob (e.g. the module.json rename throws) leaked reservedBytes and left orphaned sessions. commitModule now tracks each consumed blob explicitly and, if the commit fails before the module.json rename, rolls back payload, session, and reservation. After the rename (the commit point), an access.json failure is logged and swallowed.
  • CI maintenance: action pin bumps (checkout v6.0.2, setup-java v5.2.0, cache v5.0.3, action-gh-release v3), JDK distribution adopt → temurin (Adoptium rebrand), drop jimschubert/query-tag-action in favor of github.ref_name, collapse dual release-tag steps with conditional prerelease (-beta only, intentional), KSP 2.3.5 → 2.3.6, Dagger 2.59.1 → 2.59.2.

Test plan

  • ./gradlew check passes locally (5m 40s)
  • DeviceRepoTest — 4 tests: 2 resurrection-prevention + 2 no-head-of-line-blocking
  • BlobQuotaBoundaryFlowTest.failed PUT after consuming blob releases reservation
  • CI green: code-checks.yml (assemble, check, Docker build, cross-version regression replay)
  • Next tag cut produces a release with the simplified conditional prerelease step (no functional difference for non--beta tags)

d4rken added 3 commits April 29, 2026 13:20
- actions/checkout v4 -> v6.0.2, actions/setup-java v3 -> v5.2.0, actions/cache v3 -> v5.0.3 (commit-pinned)

- JDK distribution adopt -> temurin (Adoptium rebrand)

- drop jimschubert/query-tag-action; read github.ref_name directly; collapse dual release-tag steps with a conditional prerelease (-beta only)

- KSP 2.3.5 -> 2.3.6, Dagger 2.59.1 -> 2.59.2

- fix stale JDK 17 / Gradle 8.13 line in build-commands.md
Concurrent updateDevice + deleteDevice could resurrect a deleted device because per-device sync was acquired without coordinating against the global devices map. Holding the global mutex through file I/O would have fixed correctness but stalled all auth on the server behind any slow per-device operation, since updateDevice runs on every authenticated request.

New design holds the global mutex only for short critical sections:

- updateDevice fetches under a brief global mutex, then re-checks under sync that the same Device instance is still in the map (reference-equality on the per-device Mutex), commits the map update under another brief mutex slice, and never holds the global lock during file I/O.

- deleteDevice / deleteDevices register a PendingDeviceDeletion (CompletableDeferred indexed by DeviceKey + DeviceId) under the brief global mutex, then drop the path under per-device sync only. Tracker cleanup runs in a NonCancellable scope so it always clears.

- createDevice awaits any matching pendingDeletion before recreating, preventing a freshly recreated path from being clobbered by an in-flight delete.

Adds four DeviceRepoTest cases pinning resurrection prevention and the no-head-of-line-blocking property for unrelated devices.
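
The locking scheme described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `DeviceRepoSketch`, the simplified `Device` class, the string keys, the `deletePath` hook, and `demoResurrectionGuard` are all hypothetical names, and the sketch merges the re-check and the commit into a single brief global-lock slice. It assumes kotlinx.coroutines is on the classpath.

```kotlin
import kotlinx.coroutines.CompletableDeferred
import kotlinx.coroutines.NonCancellable
import kotlinx.coroutines.async
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock
import kotlinx.coroutines.withContext
import kotlinx.coroutines.yield

// Hypothetical stand-in for the real Device; an updated copy keeps the same per-device Mutex.
class Device(val id: String, val label: String, val sync: Mutex = Mutex())

class DeviceRepoSketch(private val deletePath: suspend (String) -> Unit = {}) {
    private val mapLock = Mutex() // global lock, held only for short map operations
    private val devices = HashMap<String, Device>()
    private val pendingDeletions = HashMap<String, CompletableDeferred<Unit>>()

    suspend fun createDevice(id: String, label: String): Device {
        while (true) {
            val pending = mapLock.withLock {
                pendingDeletions[id] ?: run {
                    val device = Device(id, label)
                    devices[id] = device
                    return device
                }
            }
            pending.await() // wait for the in-flight delete outside the global lock
        }
    }

    suspend fun updateDevice(id: String, label: String): Boolean {
        val fetched = mapLock.withLock { devices[id] } ?: return false
        return fetched.sync.withLock {
            // Slow per-device file I/O would run here, without the global lock.
            mapLock.withLock {
                // Identity pin: commit only if the map still holds the same per-device Mutex.
                if (devices[id]?.sync === fetched.sync) {
                    devices[id] = Device(id, label, fetched.sync)
                    true
                } else {
                    false
                }
            }
        }
    }

    suspend fun deleteDevice(id: String) {
        val (device, pending) = mapLock.withLock {
            val removed = devices.remove(id) ?: return
            val tracker = CompletableDeferred<Unit>()
            pendingDeletions[id] = tracker
            removed to tracker
        }
        try {
            device.sync.withLock { deletePath(id) } // path deletion under per-device sync only
        } finally {
            // Tracker cleanup must run even if the caller was cancelled.
            withContext(NonCancellable) {
                mapLock.withLock { pendingDeletions.remove(id) }
                pending.complete(Unit)
            }
        }
    }
}

// Usage: a delete is held open by a gate while an update and a re-create race against it.
fun demoResurrectionGuard(): String = runBlocking {
    val gate = CompletableDeferred<Unit>()
    val repo = DeviceRepoSketch(deletePath = { gate.await() })
    repo.createDevice("a", "old")
    val deletion = launch { repo.deleteDevice("a") }
    yield() // let the delete register its tracker and block on the gate
    val updated = repo.updateDevice("a", "new")
    val recreate = async { repo.createDevice("a", "fresh") }
    yield()
    val blocked = !recreate.isCompleted
    gate.complete(Unit)
    val label = recreate.await().label
    deletion.join()
    "updated=$updated blocked=$blocked recreated=$label"
}
```

In this sketch the update arriving during the delete is refused rather than re-inserting the device, and the re-create only proceeds once the tracker completes. Because the global lock is never held across `deletePath`, operations on unrelated devices are not queued behind a slow delete.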
Old rollbackMoves() only unlinked moved blob files; it did not release reservations or remove the upload sessions whose payloads had been consumed. A commit that failed after consumeAndMoveCompletedBlob (e.g. module.json rename throws) leaked reservedBytes and orphan sessions.

commitModule now tracks each consumed blob in a ConsumedBlob struct (meta, live blob path, session dir) and exposes an explicit commitPointReached flag set right after the module.json.tmp -> module.json rename. Before that point, rollbackConsumedBlobs() does the full reverse: delete payload, removeConsumedSession, releaseReservation. After the commit point, even an access.json write failure is logged-and-swallowed - the module is committed and must not turn into a quota rollback.

UploadSessionRepo.ConsumeResult.Ready now carries sessionDir so callers don't re-resolve it. New removeConsumedSession() deletes a session whose payload was already moved out, without touching quota (the caller owns the reserved -> used transition).

Adds a BlobQuotaBoundaryFlowTest case that corrupts module.json.tmp (creates a directory with that name so rename fails), asserts 500 + reservedBytes back to 0 + session removed + a fresh session can be created.
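
The commit-point pattern from this commit message might look roughly like the sketch below. It is an illustration under stated assumptions, not the actual commitModule: the field layout of `ConsumedBlob`, the `QuotaSketch` class, the placement of the reserved -> used transition, and the function signatures are all hypothetical. The demo borrows the fault-injection idea from the test (a directory named module.json.tmp); in this sketch it is the write to the temp path that fails, before the rename is attempted.

```kotlin
import java.io.IOException
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical shape; the PR only states that ConsumedBlob holds meta, live blob path, and session dir.
data class ConsumedBlob(val sizeBytes: Long, val livePath: Path, val sessionDir: Path)

// Hypothetical quota holder standing in for the real reservation bookkeeping.
class QuotaSketch(var reservedBytes: Long = 0, var usedBytes: Long = 0)

fun rollbackConsumedBlobs(consumed: List<ConsumedBlob>, quota: QuotaSketch) {
    for (blob in consumed) {
        Files.deleteIfExists(blob.livePath)          // delete the moved payload
        blob.sessionDir.toFile().deleteRecursively() // stand-in for removeConsumedSession
        quota.reservedBytes -= blob.sizeBytes        // stand-in for releaseReservation
    }
}

fun commitModuleSketch(moduleDir: Path, json: String, consumed: List<ConsumedBlob>, quota: QuotaSketch) {
    var commitPointReached = false
    val tmp = moduleDir.resolve("module.json.tmp")
    try {
        Files.writeString(tmp, json)
        Files.move(tmp, moduleDir.resolve("module.json"), StandardCopyOption.REPLACE_EXISTING)
        commitPointReached = true
        try {
            Files.writeString(moduleDir.resolve("access.json"), "{}")
        } catch (e: IOException) {
            // Past the commit point: log and swallow, never roll back quota.
            println("access.json write failed after commit: ${e.message}")
        }
        // Assumed placement of the reserved -> used transition.
        for (blob in consumed) {
            quota.reservedBytes -= blob.sizeBytes
            quota.usedBytes += blob.sizeBytes
        }
    } catch (e: IOException) {
        if (!commitPointReached) rollbackConsumedBlobs(consumed, quota)
        throw e
    }
}

// Usage: force a failure before the commit point and inspect what is left behind.
fun demoFailedCommit(): String {
    val root = Files.createTempDirectory("commit-sketch")
    val moduleDir = Files.createDirectories(root.resolve("module"))
    val sessionDir = Files.createDirectories(root.resolve("session-1"))
    val blob = Files.writeString(root.resolve("blob-1"), "0123456789")
    val quota = QuotaSketch(reservedBytes = 10)
    Files.createDirectory(moduleDir.resolve("module.json.tmp")) // fault injection
    val failed = try {
        commitModuleSketch(moduleDir, "{}", listOf(ConsumedBlob(10, blob, sessionDir)), quota)
        false
    } catch (e: IOException) {
        true
    }
    return "failed=$failed reserved=${quota.reservedBytes} " +
        "blobExists=${Files.exists(blob)} sessionExists=${Files.exists(sessionDir)}"
}
```

Keeping an explicit `commitPointReached` flag makes the rule visible at the catch site: a failure before the rename undoes payload, session, and reservation together, while anything after it leaves the committed module and its quota untouched.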
@d4rken d4rken added the bug and enhancement labels Apr 29, 2026
@d4rken d4rken merged commit 2e831e3 into main Apr 29, 2026
5 checks passed
@d4rken d4rken deleted the concurrency-and-commit-fixes branch April 29, 2026 11:35
emtee40 pushed a commit to emtee40/octi-sync-server-kotlin that referenced this pull request May 4, 2026
Tier B batch 3 from code review:

- BlobFlowTest: new test asserts that the per-route RequestBodyLimit (1 MB) overrides the global 128 KB body limit on PATCH. Empirical result: 204 NoContent. Codex's claim that the child plugin install does not override the parent is wrong on Ktor 3 — closing finding d4rken-org#12 as a false positive.

- DeviceRoute.deleteDevice: added a comment documenting why lifecycleService.deleteForDevice runs before deviceRepo.deleteDevice. The Codex finding d4rken-org#20 ("orphan data behind") doesn't materialise because DeviceRepo.deleteDevice does a recursive delete of the whole device path under target.sync — any in-flight write landing between the two calls is wiped clean. Closing d4rken-org#20 as no-op.

- ModuleLifecycleService.commitModule: KDoc note about the §"Concurrency And Locking" plan deviation (module lock acquired before session lock instead of the prescribed reverse). No deadlock today; revisit if the deferred per-module lifecycle flag is added.

- PLAN-BLOB-SUPPORT.md §"Implementation Deferrals": added the lock-order inversion as an explicit deferred item so the deviation is visible in the plan.