This plan reflects Lode’s clarified scope: persistence structure only. It introduces an explicit contract-freezing phase before any implementation details harden.
Lode’s core principle:
Lode defines structure, not execution.
It stores facts, not interpretations.
```
Data Producer (any language / system)
        ↓
Lode Write API (structure + metadata)
        ↓
Storage Adapter (FS, S3, etc.)
        ↓
Immutable Objects + Manifests
        ↓
External Readers (query engines, tools)
```
Establish repo discipline and enforce boundary constraints before adding behavior.
- Repo skeleton with `lode/`, `internal/`, `docs/`, `examples/`
- `AGENTS.md` constraints applied
- Taskfile scaffolding (even if minimal)
- Guardrails exist and are visible
- Task targets are present (build / test / lint)
- No execution logic introduced
Freeze interfaces and invariants required for parallel implementation, without committing to internal details.
This phase produces authoritative documents, not code.
Defines:
- Dataset and snapshot identity
- Manifest self-description requirements
- Metadata explicitness and persistence
- Immutability rules and linear history
Defines:
- `Dataset.Write` semantics and error behavior
- Snapshot creation and parent linkage
- Empty dataset behavior
- Prohibited behaviors (no in-place mutation)
Defines:
- Adapter obligations for `Put` / `Get` / `List` / `Exists` / `Delete`
- Safe write semantics (no overwrite)
- Atomic commit expectations (manifest presence as commit signal)
Defines:
- Logical key layout ownership
- Partitioner / compressor / codec roles
- NoOp components and explicit recording
- Optional codec semantics for raw-byte data
- Canonical object key persistence
Defines:
- Object iterator lifecycle semantics
- Close/Err/Next behavior guarantees
Defines:
- Adapter-agnostic read surface
- Range-read requirements
- Manifest-driven discovery
Defines:
- Public entrypoints and defaults
- Option-based configuration shape
- Default bundle behavior
- All contract documents exist
- Ambiguity is resolved by contract references
- Phase 0 plan assertions are captured in contracts
- Public API spec exists and documents defaults and options
Implement the minimal, end-to-end write path with immutable snapshots.
- Dataset abstraction
- Manifest schema and persistence
- Snapshot creation with explicit metadata
- Safe write semantics enforced
- Write → manifest → snapshot works
- No metadata inference or defaults
- Empty dataset behavior matches contracts
Make storage and format choices pluggable without changing semantics.
- Filesystem and in-memory adapters
- JSONL codec
- Gzip compression
- Hive-style partitioner
- NoOp components for partitioning/compression
- Unified Layout abstraction (publicly single, internally composable)
- Default layout (dataset-modeled, segment-anchored, flat)
- Curated layout implementations (default + hive-style + flat)
- Manifests record component names
- No nil component handling
- Layout is portable across adapters
Provide a minimal read facade for external consumers.
- Dataset / partition / segment listing
- Manifest access
- Object open and random access via adapter
- Layout-aware discovery and path parsing
- Layout-specific dataset errors (when datasets are not modeled)
- Range reads are true range reads
- Manifests are sufficient for discovery
- No planning or query logic added
Stabilize behavior and validate invariants with tests and examples.
- Error taxonomy and test coverage
- Example: write → list → read
- Example: single-object blob upload (public API only)
- Example: manifest-driven discovery (pattern, not a layout)
- Explicit metadata visibility on disk
- Examples for default layout and hive-style layout
- Tests enforce immutability
- Examples align with contracts
- Single-object blob example uses default bundle and explicit metadata
- No hidden state detected
Explore new adapters or codecs without expanding the public API.
- S3 adapter (if needed)
- Parquet codec
- Manifest stats (optional, additive)
- Minimal API surface (only `s3.New` and `s3.Config` exported)
- No backend-specific conditionals
- CONTRACT_STORAGE.md compliance verified
- Zstd compressor added as an additive compression option
- Parquet codec implemented
- Manifest stats extensions finalized (additive)
Status: Promoted to public API (lode/s3)
The S3 adapter is a stable public component supporting AWS S3, MinIO, LocalStack, Cloudflare R2, and other S3-compatible object stores. Client construction uses the AWS SDK directly (see PUBLIC_API.md for examples).
Location: lode/s3/
Consistency notes:
- S3 provides strong read-after-write consistency (since December 2020)
- List operations are also strongly consistent
- Commit semantics rely on manifest presence: write data objects before manifest
Integration testing:
- Gated behind `LODE_S3_TESTS=1` environment variable
- Requires Docker Compose for LocalStack/MinIO services
- Run: `task s3:up && task s3:test && task s3:down`
Usage:

```go
import (
	"github.com/aws/aws-sdk-go-v2/config"
	awss3 "github.com/aws/aws-sdk-go-v2/service/s3"

	"github.com/pithecene-io/lode/lode/s3"
)

// AWS S3 (use standard AWS SDK config)
cfg, _ := config.LoadDefaultConfig(ctx)
client := awss3.NewFromConfig(cfg)
store, _ := s3.New(client, s3.Config{Bucket: "my-bucket", Prefix: "datasets/"})
```

Any change that affects contract behavior must:
- Update the relevant `docs/contracts/CONTRACT_*.md` file.
- Include a compatibility note (breaking vs additive).
- Avoid expanding public API without explicit justification.
- Evaluate and adopt stronger S3 conditional-write guarantees for multipart paths
- Update storage contract/docs with backend-specific guarantee notes where needed
- Add integration coverage for large-upload overwrite prevention paths
- Prioritize Parquet codec delivery
- Define additive manifest stats needed for Parquet-oriented pruning workflows
- Add/refresh examples for columnar and streaming workflows
- Design reference-based path for GRIB2/NetCDF-to-Zarr workflows within Lode boundaries
- Define docs/examples for chunk/region safety expectations
- Evaluate native Zarr encoding path after reference-based workflow is validated
Status: Validated — no new API surface required.
Lode's existing persistence primitives cover the full vector artifact lifecycle:
- Embedding batches as raw blob snapshots (`Write`)
- Serialized indices via `StreamWrite` (large binary payloads)
- `Latest()` as an atomic active-index pointer
- Snapshot history for rollback and version progression
- Explicit metadata for provenance and rebuild contracts
Scope boundary: Lode is durability infrastructure, not a vector database. It stores embedding batches and serialized indices as opaque blobs. Similarity search, index construction, and query execution remain the caller's responsibility.
Deliverables:
- `examples/vector_artifacts/` — end-to-end pipeline example
Deferred:
- Custom codec or format work for vector-specific encodings (not needed; raw blobs suffice)
- ANN index introspection or validation (execution concern, out of scope)
Introduce a second first-class persistence paradigm, Volume, alongside Dataset,
without blurring Lode's scope boundary.
Dataset streaming builds complete objects before committing truth; Volume streaming commits truth incrementally.
- `Dataset` remains complete, structured, object-centric persistence
- `Volume` adds sparse, resumable, range-addressable byte-space persistence
- Both abstractions are coequal and explicit; neither is implemented as a special case of the other
- Runtime concerns (torrent peers, scheduling, networking, execution policy) remain outside Lode
- Public `Volume` abstraction and constructor shape documented
- Volume snapshot manifest model documented (`volume_id`, `total_length`, committed blocks)
- Explicit range-read semantics documented (missing ranges are errors, no zero-fill inference)
- Filesystem and memory validation examples for sparse/resumable flows
- Restart/resume correctness tests for committed blocks
- Clarify in docs and API guidance that Dataset streaming is atomic object construction
- Preserve guarantee: object is either fully committed in snapshot or absent
- Explicitly route sparse/resumable/range-verified workflows to Volume
- No torrent protocol concepts (pieces, peers, trackers)
- No networking, scheduling, or runtime orchestration
- No hash policy ownership in Lode core
- No deployables/daemons added to Lode
| ID | Decision | Resolution |
|---|---|---|
| DD-1 | Naming disambiguation | Reader → DatasetReader. SegmentRef → ManifestRef (DatasetReader), BlockRef (Volume). ListSegments → ListManifests. Volume uses composed VolumeWriter + VolumeReader interface. |
| DD-2 | Manifest model | Cumulative. Each snapshot is self-contained. |
| DD-3 | Storage layout | Fixed internal layout. Layouts are Dataset-specific. |
| DD-4 | Staged data lifecycle | Direct to final path. Manifest = visibility. No Abort API. |
| DD-5 | Resume semantics | Explicit/caller-driven. No uncommitted data discovery. |
| DD-6 | VolumeOption | WithVolumeChecksum only for v0.6. |
| DD-7 | Type naming | Snapshot → DatasetSnapshot, SnapshotID → DatasetSnapshotID. Volume gets symmetric types. |
| DD-8 | Overlap validation | Strict on full cumulative manifest. Supersession explored later. |
Additional resolved items:
- Empty commits (no new blocks) are invalid
- No block count limit for v0.6
- `ErrOverlappingBlocks` sentinel added
- S3 validation deferred to post-v0.6
- Unix nanosecond IDs (consistent with Dataset)
- Future roadmap: CAS (multi-process) + parallel staging (single-process)
- Finalize `CONTRACT_VOLUME.md` as authoritative (remove draft status, apply DD-1–DD-8)
- Add concurrency matrices to `CONTRACT_WRITE_API.md` and `CONTRACT_VOLUME.md`
- Add `ErrOverlappingBlocks` to `CONTRACT_ERRORS.md`
- Add Layout scope note to `CONTRACT_LAYOUT.md` (Dataset-specific)
- Update `PUBLIC_API.md` Volume section with finalized API surface
- Update `IMPLEMENTATION_PLAN.md` with resolved decisions
- Rename `Snapshot` → `DatasetSnapshot`, `SnapshotID` → `DatasetSnapshotID` in `api.go`
- Rename `Reader` → `DatasetReader`, `NewReader` → `NewDatasetReader`
- Rename `SegmentRef` → `ManifestRef`, `ListSegments` → `ListManifests` in DatasetReader API
- Add Volume types: `VolumeID`, `VolumeSnapshotID`, `BlockRef`, `VolumeManifest`, `VolumeSnapshot`
- Add sentinels: `ErrRangeMissing`, `ErrOverlappingBlocks`
- Add `VolumeOption` type with `WithVolumeChecksum`
- Cascade renames through all code, tests, and examples
- `NewVolume` constructor with validation
- `StageWriteAt`: write data to final path, return `BlockRef`
- `Commit`: validate, build cumulative manifest, write to store
- `ReadAt`: resolve range from manifest, read from store
- `Latest` / `Snapshots` / `Snapshot` for history access
- Fixed layout: `volumes/<id>/snapshots/<snap>/manifest.json`, `volumes/<id>/data/<offset>-<length>.bin`
- Overlap validation (sort + sweep on the full cumulative block set)
- Works with fsStore and memoryStore
- Sparse write patterns, multi-snapshot progression
- Resume pattern: new Volume instance, load latest, continue staging
- Edge cases: zero-length read, boundary reads, offset validation
- Missing range errors, overlap rejection
- Checksum validation when configured
- FS and memory store coverage
- Manifest round-trip (serialize → deserialize → validate)
- Runnable `examples/volume_sparse/` demonstrating stage → commit → read
- Demonstrates `ErrRangeMissing` for uncommitted ranges
- Demonstrates incremental commits and completeness
- Follows `CONTRACT_EXAMPLES.md` conventions
- Remove all "planned"/"draft" language from shipped Volume features
- Update README with Volume overview
- Update `CONTRACT_TEST_MATRIX.md` with Volume coverage
- Final cross-reference validation across all docs