Skip to content

feat(09): backup API for embedded and server modes#81

Merged
tazarov merged 20 commits into
mainfrom
feat/phase-09-backup-api
Mar 27, 2026
Merged

feat(09): backup API for embedded and server modes#81
tazarov merged 20 commits into
mainfrom
feat/phase-09-backup-api

Conversation

@tazarov

@tazarov tazarov commented Mar 27, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 9: Backup API
Goal: Users can back up Chroma data from both embedded and server modes, producing a directory with a manifest file that records backup metadata
Status: Verified (6/6 must-haves)

Implements Java backup API across core, JNA, and Panama modules. BackupExecutor handles the close-copy-manifest-restart lifecycle with transactional error handling. Both EmbeddedSession.backup() and ServerSession.backup() delegate to this executor with mode-specific close/restart actions. Produces a backup directory containing a persist/ snapshot and backup_manifest.json with SHA-256 checksums.

Changes

Plan 01: Core backup implementation

  • BackupExecutor — static utility implementing close-copy-manifest-restart with cleanup-on-failure
  • BackupResult<S> — generic record wrapping manifest + new session
  • BackupMode — enum for embedded/server wire format
  • BackupOptions — builder with destination, includeMetadata, leaveInactive flags
  • Backup callback slots wired into EmbeddedSession and ServerSession
  • JNA and Panama runtimes construct backup lambdas with config extraction and session restart

Plan 02: Integration tests

  • JNA embedded + server backup tests (5 methods each)
  • Panama embedded + server backup tests (5 methods each)
  • Sentinel file verification, manifest parsing, mode-specific validation

Review fixes (18 findings addressed)

  • Fix native handle leak on pre-validation failure
  • Fix premature backup directory deletion when restart succeeds
  • Reject non-existent source path instead of silent empty backup
  • Narrow exception catch to IOException only
  • Propagate IOException in deleteDirectoryQuietly
  • Return "0644" fallback on non-POSIX instead of null
  • Remove dead Gson no-arg constructor from BackupManifest
  • Add numeric bounds checks on fileCount/totalBytes/sizeBytes
  • Replace internal "Phase 10" references with user-facing messages
  • Remove dead BackupOptions.toJson()
  • Add 12 new test methods covering null rejection, list immutability, session lifecycle

Requirements Addressed

  • BKUP-01: EmbeddedSession.backup(options) performs directory copy with manifest
  • BKUP-02: ServerSession.backup(options) performs stop-backup-restart cycle
  • BKUP-03: BackupOptions builder supports destination, includeMetadata, leaveInactive
  • BKUP-04: Integration tests verify backup in both JNA and Panama backends

Verification

  • Automated verification: 6/6 must-haves passed
  • UAT: 7/7 scenarios passed
  • Code review: 18 findings identified and fixed

Key Files

Created:

  • java/core/.../BackupExecutor.java — core backup algorithm
  • java/core/.../BackupResult.java — generic result record
  • java/core/.../BackupMode.java — embedded/server enum
  • java/core/.../BackupFileMetadataTest.java — metadata validation tests
  • java/jna/.../JnaEmbeddedBackupTest.java + JnaServerBackupTest.java
  • java/panama/.../PanamaEmbeddedBackupTest.java + PanamaServerBackupTest.java

Modified:

  • java/core/.../EmbeddedSession.java — backup method + lock
  • java/core/.../ServerSession.java — backup method + lock
  • java/core/.../BackupManifest.java — bounds checks, removed dead constructor
  • java/core/.../BackupOptions.java — removed dead toJson
  • java/jna/.../JnaChromaRuntime.java — backup lambda construction
  • java/panama/.../PanamaChromaRuntime.java — backup lambda construction

tazarov added 18 commits March 27, 2026 07:04
…t constructors

- BackupResult<S> wraps BackupManifest + new session (nullable for leave-closed/stopped)
- BackupExecutor implements close-copy-manifest-restart algorithm with SHA-256 hashing
- All-args package-private constructors added to BackupManifest and BackupFileMetadata
- Mode validation: embedded rejects leaveStopped, server rejects leaveClosed
- Unit tests cover copy, manifest JSON round-trip, mode rejection, path containment
- EmbeddedSession expanded to 8-param constructor with backup callback slot
- ServerSession expanded to 7-param constructor, backup stub replaced with real impl
- JNA backend constructs backup lambdas in doStartEmbedded and doStartServer
- Panama backend constructs backup lambdas in doStartEmbedded and doStartServer
- BackupExecutor made public for cross-module access from backends
- Tests updated for new constructor signatures with null-rejection coverage
- SUMMARY.md with execution metrics and deviation documentation
- STATE.md updated with phase 9 position and decisions
- ROADMAP.md marks phase 9 plan 1 complete
- REQUIREMENTS.md marks BKUP-01, BKUP-02, BKUP-03 complete
# Conflicts:
#	.planning/ROADMAP.md
#	.planning/STATE.md
- Sentinel file copy, manifest parsing, leaveClosed behavior
- Mode-specific option rejection (leaveStopped on embedded)
- Null options and closed session guard tests
- Sentinel file copy, manifest parsing, server restart after backup
- leaveStopped behavior, leaveClosed rejection
- Null options and closed session guard tests
- findFreePort helper for port allocation
- SUMMARY.md with 20 tests across 4 files
- STATE.md, ROADMAP.md, REQUIREMENTS.md updated
- BKUP-04 marked complete
Critical:
- Fix native handle leak: move closed.set(true) to finally block after
  backupAction runs, preventing handle leak on pre-validation failures
- Fix premature directory deletion: only clean up backup dir in
  leave-inactive and restart-failure paths, not when restart succeeds

High:
- Reject non-existent source path with explicit error instead of
  silently creating empty backup
- Narrow catch to IOException only; let RuntimeException propagate
  immediately instead of treating programming errors as recoverable
- Propagate IOException exc in deleteDirectoryQuietly postVisitDirectory

Important:
- Return "0644" fallback on non-POSIX systems instead of null for
  manifest mode field interop with Go implementation
- Remove dead Gson no-arg constructor from BackupManifest
- Add fileCount/totalBytes/sizeBytes non-negative bounds checks
- Replace internal "Phase 10" references with user-facing messages
- Remove dead BackupOptions.toJson()

Tests:
- Add 5 null-arg rejection tests for BackupExecutor.execute()
- Add backupSetsClosedAfterActionNotBefore test
- Add closeAfterBackupIsNoOp tests for both session types
- Add manifest list immutability tests
- Add extractPersistPath precedence test
- Add BackupFileMetadataTest with null/bounds validation
@tazarov

tazarov commented Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

@claude

@claude

claude Bot commented Mar 27, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 2m 24s —— View job


Code Review — feat(09): backup API

  • Read key changed files
  • Review BackupExecutor core logic
  • Review BackupManifest / BackupResult / BackupMode
  • Review EmbeddedSession and ServerSession changes
  • Review JNA and Panama runtime wiring
  • Review test coverage
  • Compile findings and post review

Overall the implementation is solid — the transactional error-handling model is well-thought-out, the lock-based concurrency in session classes is correct, and the test coverage is thorough. A few issues are worth addressing before merge.


Findings

1. Bug: Partial backup directory not cleaned up when copy fails and restart succeeds

BackupExecutor.java lines 109–127

When backupError != null (copy/manifest write failed) and restartAction succeeds, the code at line 125 throws backupError without deleting dest. This leaves a partial backup directory — files without a valid manifest — that looks like a real backup but is not.

// restart succeeded
if (backupError != null) {
    throw toRuntimeException(backupError);  // dest NOT cleaned up
}

The leaveInactive=true path correctly cleans up (lines 102–104), and the restart-failure path also cleans up (lines 114–115), but the restart-success path with backup failure does not.

Fix this →

The test restartIsAttemptedOnCopyFailure verifies that restart happens but does not assert that dest is absent afterwards — a gap that would catch this.


2. Inconsistency: embedded persist path sourced from YAML, server sourced from FFI

JnaChromaRuntime.java lines 108–113 vs 141

For embedded sessions, persistPath is extracted from the YAML config string via BackupExecutor.extractPersistPath(configYaml). If the YAML contains a relative path (e.g. persist_path: ./data), the value captured is the raw string "./data". The executor then resolves it against the JVM working directory via Path.of(persistPath).toAbsolutePath().

For server sessions (line 141), the path is fetched from the running shim via serverPersistPath(handle), which returns the path as the native process resolved it.

If the Chroma shim resolves relative paths differently from the JVM (e.g., relative to the shim's own working directory or binary location), the embedded backup would copy from the wrong directory. In practice the working directories are likely the same, but the server approach is more robust. Consider using an FFI accessor for embedded sessions too, or documenting the assumption.


3. Minor: savedYaml is a redundant alias

JnaChromaRuntime.java line 114 and PanamaChromaRuntime.java (equivalent)

final String savedYaml = configYaml;

configYaml is already effectively final as a method parameter. The alias adds noise. Use configYaml directly in the lambda.


4. Minor: wrapperVersion is hardcoded to "java", not a version number

BackupExecutor.java line 85

"java",   // wrapperVersion

This field is named wrapperVersion but contains "java" — a language identifier rather than a version. If the intent is to record which wrapper produced the backup, it should include a version (e.g. "java-1.0.0" or read from a build constant). As-is, it offers no version differentiation between releases.


5. Minor: BackupFileMetadata.mode is nullable but undocumented in the API

BackupFileMetadata.java line 5–12

The compact constructor comment says "mode is intentionally nullable" but the public API exposes mode() without a @Nullable annotation or Javadoc. readFileMode() actually never passes null (returns "0644" on non-POSIX), so callers can't observe null today — but the record allows it, which is surprising. Either enforce non-null (since the fallback is in the caller) or document the nullability clearly.


6. Minor: BackupManifest.files() returns empty list when includeMetadata=false

BackupManifest.java line 53

public List<BackupFileMetadata> files() { 
    return files == null ? Collections.emptyList() : Collections.unmodifiableList(files); 
}

When includeMetadata=false, files is null internally, but callers get an empty list. This means callers can't distinguish "no files were found" from "metadata was not collected". Checking includeMetadata() first disambiguates, but it's an extra step. A Javadoc note here would help.


Test Coverage Assessment

The unit-test suite for BackupExecutor is comprehensive — covers happy paths, error flows, cleanup semantics, SHA-256 correctness, symlink rejection, YAML extraction edge cases, and immutability. The integration test structure (JNA + Panama × embedded + server) is well-factored.

One gap: no test verifies that the partial dest is absent after copy-failure-with-successful-restart (see finding #1).


Summary

Priority Finding
Bug #1 — Partial backup not cleaned up when copy fails + restart succeeds
Design #2 — Embedded persistPath sourced from YAML vs FFI
Minor #3, #4, #5, #6 — Cosmetic / documentation

- Fix partial backup cleanup on copy-fail + restart-success path
- Use FFI chroma_embedded_persist_path instead of YAML parsing
- Remove redundant savedYaml aliases in JNA/Panama runtimes
- Pass actual shim version as wrapperVersion in backup manifest
- Enforce non-null mode in BackupFileMetadata record
@tazarov

tazarov commented Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

@claude

@claude

claude Bot commented Mar 27, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 2m 37s —— View job


Code Review — feat(09): backup API (round 2)

  • Read key changed files
  • Check if previous findings were addressed
  • Review BackupExecutor core logic
  • Review session lifecycle and handle management
  • Post final review

Previous Findings — Status

All 6 findings from round 1 have been addressed:

# Finding Status
1 Bug: partial backup not cleaned up on copy-fail + restart-success ✅ Fixed (BackupExecutor.java:124-127 + test at line 287)
2 Inconsistency: embedded persist path from YAML vs FFI ✅ Fixed — both modes now use embeddedPersistPath(handle) via FFI
3 Minor: redundant savedYaml alias ✅ Fixed — lambda captures configYaml directly
4 Minor: wrapperVersion hardcoded to "java" ✅ Fixed — now uses doVersion() (actual shim version)
5 Minor: BackupFileMetadata.mode nullable but undocumented ✅ Fixed — compact constructor now rejects null via requireNonNull
6 Minor: BackupManifest.files() returns empty list without Javadoc Still present — cosmetic only

New Finding

Bug: Native handle leaked on pre-validation failure in backup()

EmbeddedSession.java:130-134 and ServerSession.java:137-141

Both session classes have the same pattern:

try {
    return backupAction.apply(options);
} finally {
    closed.set(true);          // runs even when BackupExecutor throws before close
}

BackupExecutor.execute() can throw before calling closeAction — specifically during pre-validation at:

  • line 52: isWithinPath(dest, source)IllegalArgumentException if dest is inside source
  • line 57: ensureEmptyDir(dest)IllegalArgumentException if dest is non-empty/file, or ChromaException on I/O failure

If either of these throws, the session's closeAction has not been called — the native handle is still alive. But closed.set(true) still executes in the finally block, which means:

  • The session is now in a zombie state: not open, but handle not freed
  • Subsequent session.close() calls see closed == true and skip closeAction
  • The native handle is permanently leaked

The PR description mentions "Fix native handle leak on pre-validation failure" as one of the 18 previously-addressed findings, but the current code still has this gap for failures that occur before the closeAction runs inside BackupExecutor.

Fix: closed should only be set true once the close action has actually been invoked. One approach: propagate pre-validation exceptions without touching closed, and only mark the session closed after closeAction succeeds. A cleaner approach is to have BackupExecutor signal whether the close action ran (e.g. a boolean [] sentinel, or a wrapper exception type), so the session can decide whether to mark itself closed.

Fix this →


Summary

The implementation is in very good shape — all round-1 findings were resolved cleanly. Finding #6 (Javadoc on files()) remains but is cosmetic. The new finding above is a real resource-leak bug affecting the pre-validation error path; the happy path and post-close error paths are all correct.

EmbeddedSession and ServerSession set closed=true in a finally block
during backup(), even when BackupExecutor threw before invoking
closeAction (e.g. destination-inside-source or non-empty-destination).
This leaked the native handle since subsequent close() calls were no-ops.

Add PreValidationFailure marker exception to BackupExecutor so sessions
can distinguish pre-validation failures (session stays open) from
post-closeAction failures (session marked closed to prevent double-free).
@tazarov

tazarov commented Mar 27, 2026

Copy link
Copy Markdown
Contributor Author

@claude

@claude

claude Bot commented Mar 27, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 1m 18s —— View job


Code Review — feat(09): backup API (round 3)

  • Read key changed files
  • Check if round-2 findings were addressed
  • Post final review

Round-2 Finding — Status

# Finding Status
Bug Native handle leaked on pre-validation failure (EmbeddedSession, ServerSession) ✅ Fixed
Cosmetic BackupManifest.files() returns empty list when includeMetadata=false — no Javadoc Still present (cosmetic only)

Fix Verification

The pre-validation leak fix is clean. The solution introduces a PreValidationFailure sentinel exception in BackupExecutor (lines 39–41) that wraps any RuntimeException thrown during path validation (lines 65–67). Both session classes catch it separately:

// EmbeddedSession.java:130-139 (ServerSession mirrors this)
try {
    BackupResult<EmbeddedSession> result = backupAction.apply(options);
    closed.set(true);           // set only on success
    return result;
} catch (BackupExecutor.PreValidationFailure e) {
    throw (RuntimeException) e.getCause();  // no closed.set — session stays alive ✅
} catch (RuntimeException e) {
    closed.set(true);           // close action ran (or tried), mark invalidated ✅
    throw e;
}

This correctly handles all three states:

  • Pre-validation failure → session stays open, handle not leaked, close() still works
  • Copy/manifest failure → session invalidated (close action ran)
  • Success → session invalidated (close action ran, new session returned)

backupPreValidationFailureLeavesSessionOpen test in both EmbeddedSessionTest:287 and ServerSessionTest:333 verifies the contract: exception is unwrapped correctly, session.handle() still works, and the explicit close() call still invokes closeAction exactly once.


One Remaining Cosmetic Note

BackupManifest.java line 53

public List<BackupFileMetadata> files() { 
    return files == null ? Collections.emptyList() : Collections.unmodifiableList(files); 
}

When includeMetadata=false, callers receive an empty list rather than null, with no Javadoc to explain the distinction between "no files found" and "metadata not collected". The includeMetadata() accessor lets callers disambiguate, but it requires two calls. A one-line Javadoc note (e.g., /** Returns the per-file metadata list, or an empty list if {@link #includeMetadata()} is false. */) would make the API self-documenting. This is the last outstanding item from round 1 — fine to address in a follow-up or inline here.


Summary

All substantive findings across both rounds are resolved. The PreValidationFailure sentinel pattern is a clean, type-safe fix that avoids polluting the exception hierarchy or adding boolean flags. The implementation is ready to merge.

@tazarov tazarov merged commit 64fb198 into main Mar 27, 2026
11 checks passed
@tazarov tazarov deleted the feat/phase-09-backup-api branch March 27, 2026 15:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant