Skip to content

Phase 10: Server Maintenance#82

Merged
tazarov merged 22 commits into
mainfrom
feat/phase-10-server-maintenance
Mar 29, 2026
Merged

Phase 10: Server Maintenance#82
tazarov merged 22 commits into
mainfrom
feat/phase-10-server-maintenance

Conversation

@tazarov

@tazarov tazarov commented Mar 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 10: Server Maintenance
Goal: Implement stop-embed-op-restart orchestration for all maintenance operations on ServerSession
Status: Verified (9/9 must-haves)

Adds server-side maintenance operations (rebuild, compact, compact-all, WAL prune, WAL prune-all) to the Java bindings. Each operation stops the server, starts a temporary embedded session, runs the maintenance operation, closes embedded, and restarts the server — with comprehensive error handling for all failure combinations.

Changes

Plan 10-01: Core maintenance API

  • MaintenanceResult<R,S> — generic result container with partial-failure semantics (session may be null if restart failed)
  • MaintenanceExecutor — static orchestrator implementing stop-embed-op-restart with 8-branch error matrix
  • ServerSession — 5 new maintenance methods via extracted executeMaintenanceOp template, lock-protected with lifecycleLock
  • JNA and Panama backends wired with try-finally close lambdas guaranteeing serverFree runs

Key files created:

  • java/core/src/main/java/tech/amikos/chroma/local/core/MaintenanceResult.java
  • java/core/src/main/java/tech/amikos/chroma/local/core/MaintenanceExecutor.java

Key files modified:

  • java/core/src/main/java/tech/amikos/chroma/local/core/ServerSession.java
  • java/jna/src/main/java/tech/amikos/chroma/local/jna/JnaChromaRuntime.java
  • java/panama/src/main/java/tech/amikos/chroma/local/panama/PanamaChromaRuntime.java

Plan 10-02: Integration tests

  • JNA and Panama integration tests for all 5 operations with real server lifecycle
  • HTTP-based data seeding and verification via Chroma v2 REST API
  • Null-rejection, post-close, and nonexistent collection error tests

Key files created:

  • java/jna/src/test/java/tech/amikos/chroma/local/jna/JnaServerMaintenanceTest.java
  • java/panama/src/test/java/tech/amikos/chroma/local/panama/PanamaServerMaintenanceTest.java

Review fixes

  • Fixed session invalidation on validation errors (null args no longer kill session)
  • Wrapped closeServerAction in try-catch
  • Enforced MaintenanceResult co-invariant
  • Added MaintenanceExecutorTest covering all error matrix branches
  • Added session invalidation + constructor null-rejection tests

Verification

  • Automated verification: 9/9 must-haves passed
  • UAT: 6/6 scenarios passed
  • Code review: 14 issues found and resolved
  • Follow-up review: 0 critical issues remaining

Test plan

  • gradle :core:test — unit tests (MaintenanceExecutorTest, ServerSessionTest)
  • make test-java — JNA + Panama integration tests (requires CHROMA_LIB_PATH)
  • make lint — Go + Rust linters pass

tazarov added 21 commits March 27, 2026 19:32
- MaintenanceResult<R,S> carries operation result, new session, and optional restart error
- MaintenanceExecutor.execute() implements Go's stop-embed-op-restart error matrix
…ma backends

- Expand ServerSession constructor from 7 to 12 params with 5 maintenance callbacks
- Replace UnsupportedOperationException stubs with lock-protected callback delegation
- Wire MaintenanceExecutor.execute in both JNA and Panama doStartServer()
- Update ServerSessionTest for new constructor signature
…tenance

# Conflicts:
#	.planning/ROADMAP.md
#	.planning/STATE.md
- 5 data-seeded tests for rebuild, compact, compactAll, pruneWAL, pruneAllWAL
- HTTP helpers for collection creation and heartbeat verification
- Error path tests: throws-after-close and rejects-null-options (split per operation)
- Each test verifies stop-embed-op-restart lifecycle with collection survival
- Mirror of JNA tests with PanamaChromaRuntime substitution
- 11 tests: 5 data-seeded ops, 1 throws-after-close, 5 rejects-null-options
- Identical test structure ensures backend parity
- SUMMARY.md with 11 tests per backend, self-check passed
- STATE.md updated: phase 10 complete, all 11 plans done
- ROADMAP.md and REQUIREMENTS.md updated with plan progress
- Fix session invalidation on validation errors (null args no longer
  kill session)
- Wrap closeServerAction in try-catch with ChromaException
- Guarantee serverFree runs via try-finally in close lambdas
- Enforce MaintenanceResult co-invariant (session/restartError)
- Extract executeMaintenanceOp helper, rename backupLock to
  lifecycleLock
- Simplify embedded-start recovery path
- Add MaintenanceExecutorTest covering all error matrix branches
- Add session invalidation + constructor null tests to ServerSessionTest
- Fix integration test session leak with try-with-resources
- Add nonexistent collection test to JNA and Panama
@tazarov

tazarov commented Mar 29, 2026

Copy link
Copy Markdown
Contributor Author

@claude PTAL

@claude

claude Bot commented Mar 29, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 2m 6s —— View job


PR Review: Phase 10 - Server Maintenance

  • Read core files (MaintenanceResult, MaintenanceExecutor, ServerSession)
  • Read backend implementations (JNA, Panama)
  • Read test files
  • Post review feedback

Overall this is a well-structured implementation. The 8-branch error matrix in MaintenanceExecutor is clearly documented and thoroughly tested. The session invalidation semantics are consistent with the existing backup pattern. A few findings below:


MaintenanceExecutor.javarestartError object lost in closeError + restartError branch

Lines 82–89 — when both closeError and restartError are non-null, the returned MaintenanceResult wraps a ChromaException whose cause is closeError, while restartError is only represented as a string in the message. Callers who want to inspect or log the restart exception programmatically cannot access it:

return new MaintenanceResult<>(result, null,
        new ChromaException(
                "close embedded failed: " + closeError.getMessage()
                        + "; restart failed: " + restartError.getMessage()
                        + "; server remains stopped",
                closeError));
// restartError is never addSuppressed'd here — it's silently dropped

The opError + restartError branch correctly uses addSuppressed (line 74–77), so the fix would be consistent: combined.addSuppressed(restartError) after constructing the ChromaException. Fix this →


MaintenanceExecutor.javaError from operation leaks embedded session

Lines 43–55 — Steps 3 and 4 are separate try/catch blocks catching only RuntimeException. If the operation lambda throws an Error (e.g. OutOfMemoryError), it propagates directly, skipping step 4 (close embedded). The embedded session leaks:

try {
    result = operation.apply(embedded);
} catch (RuntimeException e) { opError = e; }   // Error escapes here

// Step 4 never runs — embedded stays open
try { embedded.close(); } catch (RuntimeException e) { closeError = e; }

Since EmbeddedSession implements AutoCloseable, wrapping embedded in a try-with-resources for steps 3+4 would guarantee cleanup regardless of Error vs RuntimeException. This is admittedly an extreme edge case but the fix is trivial.


MaintenanceResult.javarestartError typed as Exception not RuntimeException

Line 17restartError is declared Exception but all callers only ever supply RuntimeException subclasses. Callers accessing restartError() then need to handle a checked Exception type for a value that can only be a RuntimeException at runtime. Minor, but widening the declared type beyond what's actually possible reduces type-level safety. Fix this →


JnaChromaRuntime.java / PanamaChromaRuntime.java — repeated close lambda

JNA lines 149/153/157/161/165, Panama lines 291/295/299/303/307 — the closeServerAction lambda () -> { try { serverStop(handle); } finally { serverFree(handle); } } is duplicated 5 times in each backend. It's readable as-is but a one-liner local variable before the new ServerSession(...) call would eliminate the repetition and make the constructor call less dense. This is style feedback only, not a correctness issue.


Observations (no action required)

  • Lock hold time: executeMaintenanceOp holds lifecycleLock for the entire maintenance cycle (stop → embed → op → restart), which can be long. This is intentional per the design, but callers should be aware that concurrent close() calls will block for the full duration.
  • Error matrix coverage: MaintenanceExecutorTest covers all 8 branches cleanly. The null-rejection and PreValidationFailure-style tests in ServerSessionTest correctly mirror the backup lifecycle tests.
  • Duplicate test class structure: JnaServerMaintenanceTest and PanamaServerMaintenanceTest appear structurally identical (both 380 lines). This is consistent with the JNA/Panama pattern in the codebase for integration tests.

Summary: The implementation is solid. The restartError dropped-as-suppressed issue (finding 1) is the most meaningful — it's observable by callers in a real failure scenario. The Error leak and type widening are lower priority but easy wins. Everything else is clean.

Add restartError as suppressed exception in closeError+restartError
branch for consistency with opError+restartError branch. Narrow
MaintenanceResult.restartError from Exception to RuntimeException
to match actual usage.
@tazarov

tazarov commented Mar 29, 2026

Copy link
Copy Markdown
Contributor Author

@claude

@claude

claude Bot commented Mar 29, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 59s —— View job


PR Review: Phase 10 - Server Maintenance (follow-up)

  • Read recent commits to understand what was fixed
  • Read MaintenanceExecutor.java — check suppressed exception fix
  • Read MaintenanceResult.java — check type narrowing fix
  • Read JNA/Panama backends — check close lambda issue
  • Post updated review

Two of the four previous findings are resolved in commit 5559c98:

Finding 1 — restartError suppressed ✅ Fixed
combined.addSuppressed(restartError) is now called before returning in the closeError + restartError branch (lines 88–90). Consistent with the opError + restartError branch.

Finding 3 — restartError type narrowed ✅ Fixed
MaintenanceResult.restartError field, constructor parameter, and accessor are all now RuntimeException. Callers no longer need to handle a checked Exception.


Finding 2 — Error from operation leaks embedded session — still open

Steps 3 and 4 remain in separate try/catch (RuntimeException) blocks. A java.lang.Error thrown by operation.apply(embedded) still escapes, skipping embedded.close(). Given EmbeddedSession holds a native handle, the leak is permanent for the process lifetime (not just GC-reachable). Wrapping steps 3+4 in a single try-with-resources would close this:

// Before (current)
try { result = operation.apply(embedded); } catch (RuntimeException e) { opError = e; }
try { embedded.close(); } catch (RuntimeException e) { closeError = e; }

// After
try (EmbeddedSession emb = embedded) {
    result = operation.apply(emb);
} catch (RuntimeException e) {
    opError = e;
} catch (Exception e) {   // close() threw — record as closeError
    closeError = new ChromaException("embedded close failed", e);
}

Fix this →

Finding 4 — Duplicate close lambda — style only, no action required (same as before)


Overall the implementation is in good shape. Finding 2 is the only remaining correctness issue — it's an edge case but worth a one-time fix given the native handle exposure.

@tazarov tazarov merged commit dd1b757 into main Mar 29, 2026
11 checks passed
@tazarov tazarov deleted the feat/phase-10-server-maintenance branch March 29, 2026 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant