Skip to content

Phase 6: Core Foundation Types#76

Merged
tazarov merged 36 commits into
mainfrom
feature/java-api-surface
Mar 26, 2026
Merged

Phase 6: Core Foundation Types#76
tazarov merged 36 commits into
mainfrom
feature/java-api-surface

Conversation

@tazarov

@tazarov tazarov commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Phase 6: Core Foundation Types
Goal: The core module contains all shared interfaces, builders, result types, and FFI safety infrastructure so that backend modules (JNA, Panama) can implement against stable contracts without any FFI dependency in core.
Verification: Passed (14/14 must-haves, 128 tests, 0 failures)

Establishes the complete Java type system for the v0.5.0 milestone. Creates 21 production classes in the core module covering result POJOs, option/request builders, config builders, FFI safety infrastructure, and server session lifecycle — all with zero JNA or Panama dependencies.

Changes

Plan 06-01: Result POJOs + Dependencies

Added Gson 2.13.2 and SnakeYAML 2.6 to core module. Created JsonUtil with shared Gson instance (LOWER_CASE_WITH_UNDERSCORES policy). Implemented 7 result POJOs: RebuildCollectionResult, CompactionCollectionResult, CompactionResult, WALPruneCollectionResult, WALPruneResult, BackupFileMetadata, BackupManifest. All use final fields, package-private constructors, accessor-style methods, and Collections.unmodifiableList() for list returns.

Plan 06-02: Option/Request Types + Config Builders

Created 5 option/request types with nested Builders and toJson() serialization: RebuildOptions, WALPruneOptions, BackupOptions, CompactCollectionRequest, CompactAllRequest. Created ServerConfigBuilder and EmbeddedConfigBuilder producing YAML output matching Go's config format, with rawYaml() escape hatch and validation at build() time. Extracted Validation utility for shared tenant/database name checks.

Plan 06-03: FFI Safety Infrastructure + ServerSession

Created AbstractChromaRuntime with static ReentrantLock and template methods (callFfiHandle, callFfiJson, callFfiVoid, callFfiBorrowedString) for safe FFI call patterns. Extended ChromaRuntime interface with startServer(String). Created ServerSession with callback slots for stop/free/port/address/persistPath, try-finally close semantics, and maintenance method stubs for Phases 8-10.

Post-implementation hardening

  • Session close safety: try-finally ensures freeAction always runs; non-retriable close semantics
  • FFI error checking: callFfiVoid checks readLastError() after void calls
  • JSON guards: null/empty/malformed JSON detection with ChromaException wrapping and cause chain preservation
  • Immutability: Collections.unmodifiableList() on all list-returning accessors
  • Validation consistency: tenantId validation across all builders; maxAgeSeconds/maxBytes reject negative values
  • Config builders made final; rawYaml blank validation added
  • Lock scope reduction: callFfiJson releases FFI lock before JSON deserialization

Requirements Addressed

  • FOUND-01: Core module shared types (result POJOs, option types)
  • FOUND-02: Builder pattern for server/embedded configuration
  • FOUND-03: FFI safety infrastructure (AbstractChromaRuntime with global lock)
  • FOUND-04: ServerSession type with accessor callbacks
  • FOUND-05: ChromaRuntime interface extension (startServer)
  • FOUND-06: Validation and error handling contracts

Verification

  • Automated verification: 14/14 must-haves passed
  • 128 unit tests, 0 failures
  • Zero JNA/Panama imports in core module
  • 4 rounds of automated code review (code, types, tests, errors, simplify)

Key Decisions

Decision Rationale
Boxed Long for optional numerics Nullable fields (e.g., pending compaction ops) distinguish "not present" from zero, mirroring Go's pointer-based optionality
Static ReentrantLock in AbstractChromaRuntime Process-wide lock for FFI calls — the Rust shim uses global error state (chroma_last_error)
try-finally for ServerSession close Ensures freeAction always runs even if stopAction fails; non-retriable to prevent double-free
No readLastError() drain on success paths Rust shim's chroma_get_last_error borrows but never clears the error slot
Lock released before JSON deserialization Only FFI call + string read need the lock; deserialization is pure Java

Test plan

  • All 128 core module tests pass (gradle :core:test)
  • make test-java passes (JNA + Panama smoke tests)
  • make test-all passes (Go + Rust + Java)

tazarov added 30 commits March 22, 2026 16:01
- Add Gson 2.13.2 and SnakeYAML 2.6 to core build.gradle.kts
- Create package-private JsonUtil with shared Gson instance using LOWER_CASE_WITH_UNDERSCORES
- Fix Java toolchain: use sourceCompatibility/targetCompatibility instead of strict toolchain for JDK portability
- RebuildCollectionResultTest: full JSON, missing optionals, snake_case mapping, long values
- CompactionResultTest: nested collections, boxed Long null/present semantics
- WALPruneResultTest: full result, null optional Longs, dry_run/vacuum_requested booleans
- BackupManifestTest: full manifest, empty files, metadata fields, int/long types
- RebuildCollectionResult: 12 fields matching Go rebuild.go struct
- CompactionCollectionResult: boxed Long for optional pending ops
- CompactionResult: nested List<CompactionCollectionResult>
- WALPruneCollectionResult: 5 boxed Long fields for optional sequence numbers
- WALPruneResult: nested List<WALPruneCollectionResult> with dry_run/vacuum booleans
- BackupFileMetadata: path, sizeBytes, mode, sha256, modifiedAt as String
- BackupManifest: nested List<BackupFileMetadata> with int fileCount and long totalBytes
- SUMMARY.md documenting 7 result POJOs, JsonUtil, and 15 tests
- STATE.md advanced to plan 2 of 3, performance metrics recorded
- ROADMAP.md updated with phase 06 progress (1/3 plans)
- REQUIREMENTS.md: FOUND-01 and FOUND-04 marked complete
- RebuildOptionsTest: defaults, toJson, validation rejection
- WALPruneOptionsTest: toJson, policy validation, watermark pairs
- BackupOptionsTest: toJson, destination validation, flag defaults
- CompactRequestTest: collection and all request toJson, validation
…omaRuntime interface

- AbstractChromaRuntime with static ReentrantLock and callFfiHandle/callFfiJson/callFfiVoid/callFfiBorrowedString template methods
- Three abstract methods (readBorrowedString, readOwnedString, readLastError) for backend implementation
- ChromaRuntime interface now includes startServer(String) returning ServerSession
- ServerSession with full public API surface (lifecycle, accessors, maintenance stubs)
- Tests verify lock serialization, error propagation, and template method delegation
- RebuildOptions: name required, keepBackup defaults true, toJson() matches Go FFI format
- WALPruneOptions: boxed Long for nullable policies, watermark pair validation, dryRun/policy rule
- BackupOptions: destinationPath required, includeMetadata/leaveStopped/leaveClosed flags
- CompactCollectionRequest: name required, databaseName min 3 chars validation
- CompactAllRequest: optional tenantId/databaseName with validation
- 32 tests covering defaults, toJson serialization, and validation rejection
…intenance stubs

- Constructor validation tests for zero handle and null callbacks
- Accessor delegation tests verifying handle is passed to callbacks
- URL composition test (http://address:port)
- Close idempotency and try-finally ordering tests
- ensureOpen guard verification for all public methods after close
- Maintenance method stubs throw UnsupportedOperationException when open
- ServerConfigBuilderTest: default YAML, custom values, CORS, OTel, rawYaml, validation
- EmbeddedConfigBuilderTest: default YAML, custom values, rawYaml, validation
- ServerConfigBuilder: defaults match Go's DefaultServerConfig(), CORS/OTel conditional sections
- EmbeddedConfigBuilder: defaults match Go's DefaultEmbeddedConfig(), 3-field YAML
- Both support rawYaml() escape hatch overriding all other fields
- Both validate at build() time: port range, non-null persistPath
- Golden tests parse YAML semantically to avoid quoting brittleness
- SUMMARY.md with 4 TDD commits, 49 tests, 13 files
- STATE.md updated with decisions and metrics
- ROADMAP.md updated with phase 06 progress
- REQUIREMENTS.md: FOUND-02, FOUND-03 marked complete
… calls

ServerSession.close() now wraps stopAction/freeAction in try-finally so
freeAction always runs, preventing native handle leaks on stop failure.

Removed readLastError() drain calls from callFfiHandle, callFfiJson, and
callFfiBorrowedString success paths — the Rust shim's chroma_get_last_error
borrows but never clears the error slot, making drain calls no-ops in
production. The error-checking protocol (only read errors when return value
indicates failure) is sufficient.
…serialization

- Extract Validation utility with validateTenantId/validateDatabaseName to
  eliminate 4x copy-pasted validation blocks across builder types
- Release FFI_LOCK before JSON deserialization in callFfiJson to reduce
  lock contention — only the FFI call and string read need the lock
- Fix maxBytes error message to use snake_case (max_bytes) for consistency
tazarov added 2 commits March 23, 2026 13:04
Both runtimes were missing the startServer(String) method required by
the ChromaRuntime interface, causing compilation failures on all CI
Java jobs.
@tazarov

tazarov commented Mar 23, 2026

Copy link
Copy Markdown
Contributor Author

@claude PTAL

@claude

claude Bot commented Mar 23, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 3m 17s —— View job


Phase 6 Code Review

  • Gather context
  • Review core production classes
  • Review FFI safety infrastructure
  • Review config builders
  • Review tests
  • Post findings

Overall this is a well-structured phase. The type system is clean, builder validation is thorough, and the try-finally close semantics in ServerSession are correct. A few issues are worth addressing before merge.


Bug: Panama serverStop will throw WrongMethodTypeException at runtime

PanamaChromaRuntime.java:213

// chromaServerStop handle has type (MemorySegment)int — NOT void
MethodHandle chromaServerStop = linker.downcallHandle(...,
    FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS));  // returns int

// serverStop calls it as void — invokeExact enforces exact type match
chromaServerStop.invokeExact(MemorySegment.ofAddress(handleAddress));  // WrongMethodTypeException

MethodHandle.invokeExact enforces the full type signature at the call site. When the handle returns int but the call site is void, a WrongMethodTypeException is thrown at runtime. This means every ServerSession.close() call through the Panama backend will fail with an exception from stopAction. The try-finally in ServerSession.close() means freeAction still runs (good), but the user sees a spurious error.

JNA is fine — its proxy silently discards the return value. Panama is strict.

Fix options:

  • Capture the return: int ignored = (int) chromaServerStop.invokeExact(MemorySegment.ofAddress(handleAddress));
  • Or use FunctionDescriptor.ofVoid(ValueLayout.ADDRESS) if the stop result is intentionally ignored

The same pattern applies to any future Panama void calls that wrap functions returning error codes. Fix this →


Design concern: AbstractChromaRuntime is unused by both backends

AbstractChromaRuntime.java establishes the static FFI_LOCK and template methods, but JnaChromaRuntime and PanamaChromaRuntime both implement ChromaRuntime directly and never extend AbstractChromaRuntime. The lock protects nothing in the current code.

This appears intentional for Phase 8 wiring, but it means:

  1. The FFI lock in AbstractChromaRuntime is dead code today
  2. If JNA/Panama are used concurrently, the global error state (chroma_last_error) is unprotected
  3. AbstractChromaRuntimeTest tests an unused path

Not a blocker if Phase 8 hooks it up, but worth a // wired in Phase 8 comment on the class to avoid future confusion.


Design inconsistency: callFfiVoid always reads the error slot

AbstractChromaRuntime.java:57-68

protected void callFfiVoid(Runnable ffiCall) {
    FFI_LOCK.lock();
    try {
        ffiCall.run();
        String error = readLastError();  // reads error slot on EVERY call
        if (error != null && !error.isEmpty()) {
            throw new ChromaException("FFI call failed: " + error);
        }
    } ...
}

The PR's Key Decisions state: "No readLastError() drain on success paths — Rust shim's chroma_get_last_error borrows but never clears the error slot."

But callFfiVoid does read the error slot unconditionally. If the shim's error slot retains a value from a previous failed call, any subsequent successful void call will spuriously throw. This contradicts the stated design.

callFfiHandle and callFfiBorrowedString are consistent (only read on null/0 return), but callFfiVoid is not. Since void calls have no return to inspect, the error-slot check may be the only option — but this needs either a mechanism to clear the slot after reading, or careful documentation. Fix this →


Minor: chroma_server_stop return value not checked in JNA

JnaChromaRuntime.java:139-152

The JNA binding declares int chroma_server_stop(Pointer handle), but serverStop ignores the return value. If the Rust shim returns a non-zero error code on stop failure, the error is silently swallowed. At minimum this matches Panama (which also ignores it), but worth noting. The Panama fix above is the more urgent concern.


Minor: BackupManifest and result POJOs return null for absent lists

BackupManifest.java:39, RebuildCollectionResult.java:46, WALPruneResult.java:43

public List<String> sourcePaths() { 
    return sourcePaths == null ? null : Collections.unmodifiableList(sourcePaths); 
}

All list accessors can return null when the field wasn't in the JSON response. Callers must null-check before iterating, which is easy to forget. The rationale (distinguish "not present" vs "empty") is valid, but consider documenting the null contract or offering an orEmpty() helper. Not a blocker.


Nit: url() in ServerSession hardcodes http://

ServerSession.java:47-49 — acceptable for an embedded server, but a @SuppressWarnings or inline comment noting that TLS is out of scope for v0.5.0 would help future maintainers.


Summary

Severity Issue File
Bug Panama serverStop throws WrongMethodTypeException PanamaChromaRuntime.java:213
Design AbstractChromaRuntime dead code until Phase 8 AbstractChromaRuntime.java
Design callFfiVoid reads stale error slot AbstractChromaRuntime.java:61
Minor chroma_server_stop return not checked JnaChromaRuntime.java:142
Minor List accessors return null result POJOs

The Panama serverStop bug is the only issue that will cause a test failure at make test-java (if the smoke test calls close() on a ServerSession). Everything else is either a future concern or a documentation/design note.

Panama's invokeExact enforces exact type matching — discarding the int
return caused WrongMethodTypeException at runtime. JNA silently ignored
the error code; now checks rc and throws on failure.
@tazarov

tazarov commented Mar 25, 2026

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Mar 25, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 3m 7s —— View job


Phase 6 Code Review

  • Gather context
  • Review core production classes (result POJOs, option types)
  • Review FFI safety infrastructure (AbstractChromaRuntime)
  • Review config builders
  • Review ServerSession
  • Review JNA/Panama runtime changes
  • Review tests
  • Post findings

Good incremental progress. The previous review's main bugs are resolved. Two latent issues remain worth addressing before Phase 8 wires things up, and a few minor observations below.


Fixed since last review ✅

Issue Status
Panama serverStop WrongMethodTypeException Fixed — PanamaChromaRuntime.java:213 captures int ignored = (int) chromaServerStop.invokeExact(...)
JNA serverStop return not checked Fixed — JnaChromaRuntime.java:142-144 now throws on rc != 0
url() hardcodes http:// without comment Fixed — comment added on ServerSession.java:47

Design: AbstractChromaRuntime is still dead code

Neither JnaChromaRuntime nor PanamaChromaRuntime extends AbstractChromaRuntime — both implement ChromaRuntime directly. The static FFI_LOCK in AbstractChromaRuntime protects nothing in production, and AbstractChromaRuntimeTest exercises code that is not reachable through either backend.

This is clearly pre-wired for Phase 8, and the class is well-designed. The ask is the same as last review: a // wired in Phase 8 comment on the class declaration would prevent future readers from treating it as unused infrastructure or silently deleting it.


Bug: callFfiVoid will read stale error from Rust shim

AbstractChromaRuntime.java:61

protected void callFfiVoid(Runnable ffiCall) {
    FFI_LOCK.lock();
    try {
        ffiCall.run();
        String error = readLastError();  // called unconditionally
        if (error != null && !error.isEmpty()) {
            throw new ChromaException("FFI call failed: " + error);
        }
    } finally { ... }
}

The shim's last_error_cstring() (shim/src/lib.rs:95-102) reads LAST_ERROR without clearing it:

fn last_error_cstring() -> Option<CString> {
    let slot = match LAST_ERROR.lock() { ... };
    let msg = slot.as_ref()?;  // borrows — slot is not cleared
    CString::new(msg.as_str()).ok()
}

So after any failing call, LAST_ERROR retains its value indefinitely. A subsequent successful void call through callFfiVoid will read the old error and throw a false-positive ChromaException.

The unit test at AbstractChromaRuntimeTest.java:180 (callFfiVoid_notAffectedByStaleError_whenProtocolFollowed) masks this because the TestChromaRuntime mock clears the value on read (lastErrorValue = null), unlike the real shim.

Two viable fixes:

  • Add a clearLastError() shim function and call it at the start of each FFI call path (the cleanest contract)
  • Skip readLastError() in callFfiVoid and rely on the Rust function returning an error code instead of having side effects in the error slot — but this only works if void-returning shim calls always have a paired indicator

Worth resolving before Phase 8, since callFfiVoid will be the path for maintenance operations. Fix this →


Minor: WALPruneOptions.Builder(String name) skips name validation

WALPruneOptions.java:56-58, 100-124

public Builder(String name) {
    this.name = name;  // blank/empty accepted without validation
}

RebuildOptions.Builder.build() validates name is non-blank, but WALPruneOptions.Builder.build() does not. A caller can do new WALPruneOptions.Builder(" ").dryRun(true).build() and get a whitespace name silently passed to the API. If name is required for per-collection prune, the no-arg constructor intent (prune-all with no name) is fine, but the named constructor should validate the name isn't blank. Not a blocker since Phase 8 will catch this at the API layer, but worth aligning. Fix this →


Minor: Null list accessor contract is undocumented

BackupManifest.java:39, WALPruneResult.java:43, CompactionResult.java:26, RebuildCollectionResult.java:46

public List<String> sourcePaths() {
    return sourcePaths == null ? null : Collections.unmodifiableList(sourcePaths);
}

All list accessors can return null when the field was absent in the JSON response. The null-vs-empty distinction mirrors Go's pointer semantics, which is a reasonable design. But none of the methods document this — callers have no indicator that null (field absent) is distinct from an empty list. A @return Javadoc noting the nullable contract, or an isEmpty() helper that coerces null to empty, would make this safer to use. Not blocking.


Nit: PanamaChromaRuntime.close() rolls back closed on exception

PanamaChromaRuntime.java:341-342

} catch (IllegalStateException e) {
    closed.set(false);  // rolled back — allows retry

This is a deliberate retry-friendly design (documented in the error message: "ensure all EmbeddedSession instances are closed first"). The concern is that re-setting closed to false is only safe if the arena.close() failure is truly transient. If arena.close() fails halfway, attempting a second close may corrupt state. Given that IllegalStateException from Arena.close() indicates active sessions are still open (not a half-closed arena), the rollback is correct here. Worth a brief inline comment explaining the rollback rationale so it doesn't look like an error.


Summary

Severity Issue File
Design AbstractChromaRuntime dead code until Phase 8 — add comment AbstractChromaRuntime.java
Bug callFfiVoid reads stale error slot — shim never clears LAST_ERROR on read AbstractChromaRuntime.java:61 + shim/src/lib.rs:95
Minor WALPruneOptions.Builder(String name) skips name validation WALPruneOptions.java:56
Minor List accessor null contract undocumented result POJOs
Nit PanamaChromaRuntime.close() rollback rationale worth a comment PanamaChromaRuntime.java:341

The callFfiVoid stale-error issue is the only one that can cause incorrect runtime behavior once Phase 8 wires in AbstractChromaRuntime. Everything else is documentation/validation hygiene.

tazarov added 3 commits March 25, 2026 18:31
…date WALPruneOptions name

- Change Rust shim last_error_cstring/last_error_message to use
  slot.take() instead of slot.as_ref()/slot.clone() so the error
  is consumed on read, preventing false-positive ChromaExceptions
  from stale errors in callFfiVoid
- Add Phase 8 wiring comment to AbstractChromaRuntime
- Add blank name validation to WALPruneOptions.Builder.build()
  consistent with RebuildOptions
- Fix Panama serverStop silently discarding return code (now checks rc != 0)
- Fix JNA serverStop missing ChromaException rethrow guard
- Fix ServerSession.close() losing stop exception when free also throws
  (stop exception is now primary, free exception is suppressed)
- Fix EmbeddedSession.close() to roll back closed flag on failure (retriable)
- Fix chroma_server_port to set LAST_ERROR on null handle
- Add negative port validation in JNA and Panama serverPort methods
- Make ServerSession.handle() package-private to prevent raw handle leaks
- Return Collections.emptyList() instead of null from list accessors in
  BackupManifest, CompactionResult, RebuildCollectionResult, WALPruneResult
- Add sqliteFilename validation to EmbeddedConfigBuilder and ServerConfigBuilder
- Add maxPayloadSizeBytes validation to ServerConfigBuilder
- Defensively copy corsAllowOrigins in ServerConfigBuilder
- Extract validateRequiredName/validateOptionalName to Validation helper
- Document thread-safety limitation on JNA and Panama runtime classes
- Add startServer smoke tests for JNA and Panama
- Add tests for otelEndpoint without serviceName, empty corsAllowOrigins,
  config builder validation, EmbeddedSession handle accessor
The Rust shim stores the config port in ServerHandle.port without
updating it after the OS assigns an ephemeral port, so port 0 tests
always see port=0. Use findFreePort() to reserve a real port, matching
the Go test pattern (reserveFreeLoopbackPort).
@tazarov

tazarov commented Mar 26, 2026

Copy link
Copy Markdown
Contributor Author

@claude PTAL

@claude

claude Bot commented Mar 26, 2026

Copy link
Copy Markdown

Claude finished @tazarov's task in 2m 15s —— View job


Phase 6 Code Review

  • Gather context
  • Review outstanding issues from previous reviews
  • Review current state of all changed files
  • Post findings

All four issues flagged in the previous review (2026-03-25) have been resolved. This is clean and ready to merge.


Fixed since last review ✅

Issue Status Location
AbstractChromaRuntime dead code — no Phase 8 comment Fixed — // Wired in Phase 8 added at class declaration AbstractChromaRuntime.java:6
callFfiVoid reads stale error slot Fixed — Rust shim last_error_cstring() now uses slot.take() instead of slot.as_ref(), so the slot is cleared after each read shim/src/lib.rs:100
WALPruneOptions.Builder(String name) skips name validation Fixed — build() now calls Validation.validateOptionalName(name) WALPruneOptions.java:101
List accessors return null for absent JSON fields Fixed — all list accessors now return Collections.emptyList() as the absent-field sentinel (BackupManifest, WALPruneResult, CompactionResult, RebuildCollectionResult) result POJOs

Remaining nit (non-blocking)

PanamaChromaRuntime.java:354 — the closed.set(false) rollback on IllegalStateException still has no inline comment. The error message ("ensure all EmbeddedSession instances are closed first") explains the cause, but not why rolling closed back to false is safe (i.e., arena.close() failing on active sessions means the arena was not modified). A one-liner would help future readers avoid second-guessing this rollback. Not a blocker.


Summary

The phase is in good shape. All previously identified bugs and design issues are resolved. The slot.take() fix in the Rust shim is the right solution — it makes callFfiVoid's error-slot check correct and removes the false-positive risk without requiring a separate clearLastError() FFI symbol. The ServerSession close semantics (non-retriable) and the PanamaChromaRuntime close semantics (retriable on open-arena failure) are intentionally different and both correct for their respective invariants.

LGTM from a code review standpoint.

@tazarov tazarov merged commit 4acb91f into main Mar 26, 2026
11 checks passed
@tazarov tazarov deleted the feature/java-api-surface branch March 26, 2026 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant