Open
6 of 6 issues completed
Labels: epic (Top-level epic issue)
Description
Parent
- Parent issue: Milestone 1 - Shardlake Post-Prototype Roadmap #5
Goal
Make dataset, embeddings, and index artifacts first-class immutable objects with deterministic reproducibility.
Why this matters
In large-scale vector systems, artifact drift and incompatible index versions are common operational failure modes.
Detailed tasks
- 1.1 Dataset manifest specification
- Define the canonical schema for dataset manifests.
- Require the following fields:
dataset_id, dataset_version, vector_dimension, distance_metric, embedding_model, embedding_version, shard_count, record_count, artifact_locations, checksum/hash
- Add JSON Schema validation.
- Implement a Rust struct plus serde validation.
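A minimal sketch of what the 1.1 struct could look like, using only the standard library. The field names come from the list above; the field types, the `DistanceMetric` variants, the single `checksum` field standing in for checksum/hash, and the `validate` method are assumptions made for illustration. The serde derives and the JSON Schema check are left out here so the snippet has no dependencies.

```rust
use std::collections::BTreeMap;

/// Distance metrics a manifest may declare (an illustrative set, not taken from the spec).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DistanceMetric {
    Cosine,
    Euclidean,
    DotProduct,
}

/// Dataset manifest carrying the required fields from task 1.1.
#[derive(Debug, Clone, PartialEq)]
pub struct DatasetManifest {
    pub dataset_id: String,
    pub dataset_version: String,
    pub vector_dimension: u32,
    pub distance_metric: DistanceMetric,
    pub embedding_model: String,
    pub embedding_version: String,
    pub shard_count: u32,
    pub record_count: u64,
    /// Logical artifact name -> location (path or URI); this shape is assumed.
    pub artifact_locations: BTreeMap<String, String>,
    /// Hex-encoded digest; the hash algorithm is left open in this sketch.
    pub checksum: String,
}

impl DatasetManifest {
    /// Collects every rule violation instead of stopping at the first one.
    pub fn validate(&self) -> Result<(), Vec<String>> {
        let mut errors = Vec::new();
        let required_text = [
            ("dataset_id", &self.dataset_id),
            ("dataset_version", &self.dataset_version),
            ("embedding_model", &self.embedding_model),
            ("embedding_version", &self.embedding_version),
            ("checksum", &self.checksum),
        ];
        for (name, value) in required_text.iter() {
            if value.trim().is_empty() {
                errors.push(format!("{} must not be empty", name));
            }
        }
        if self.vector_dimension == 0 {
            errors.push("vector_dimension must be greater than zero".to_string());
        }
        if self.shard_count == 0 {
            errors.push("shard_count must be greater than zero".to_string());
        }
        if self.artifact_locations.is_empty() {
            errors.push("artifact_locations must list at least one artifact".to_string());
        }
        if errors.is_empty() {
            Ok(())
        } else {
            Err(errors)
        }
    }
}
```

Returning all violations at once is a design choice that suits a validation CLI, where a user would rather see the full list of problems than fix them one run at a time.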
- 1.2 Index manifest specification
- Define the schema for index artifacts.
- Include the following fields:
index_version, dataset_version, build_timestamp, algorithm, shard_metadata, compression_method, quantization_parameters, recall_estimates, build_duration
- Implement compatibility checks for:
- dimension match
- dataset version match
- algorithm compatibility
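The three compatibility checks could be sketched roughly as follows. `DatasetInfo`, `IndexInfo`, `Incompatibility`, and `check_compatibility` are hypothetical names. The sketch assumes the index manifest also records the vector dimension it was built for, and it reads "algorithm compatibility" as "the reader supports this algorithm"; both are assumptions, not something this issue specifies.

```rust
/// The dataset manifest fields needed for compatibility checks.
#[derive(Debug, Clone)]
pub struct DatasetInfo {
    pub dataset_version: String,
    pub vector_dimension: u32,
}

/// The index manifest fields needed for compatibility checks.
/// `vector_dimension` is an assumed field; the index field list does not name it.
#[derive(Debug, Clone)]
pub struct IndexInfo {
    pub dataset_version: String,
    pub algorithm: String,
    pub vector_dimension: u32,
}

/// One reason an index cannot be used with a dataset.
#[derive(Debug, PartialEq, Eq)]
pub enum Incompatibility {
    DimensionMismatch { dataset: u32, index: u32 },
    DatasetVersionMismatch { dataset: String, index: String },
    UnsupportedAlgorithm(String),
}

/// Runs all three checks and returns every mismatch found.
pub fn check_compatibility(
    dataset: &DatasetInfo,
    index: &IndexInfo,
    supported_algorithms: &[&str],
) -> Vec<Incompatibility> {
    let mut issues = Vec::new();
    if dataset.vector_dimension != index.vector_dimension {
        issues.push(Incompatibility::DimensionMismatch {
            dataset: dataset.vector_dimension,
            index: index.vector_dimension,
        });
    }
    if dataset.dataset_version != index.dataset_version {
        issues.push(Incompatibility::DatasetVersionMismatch {
            dataset: dataset.dataset_version.clone(),
            index: index.dataset_version.clone(),
        });
    }
    if !supported_algorithms
        .iter()
        .any(|name| *name == index.algorithm.as_str())
    {
        issues.push(Incompatibility::UnsupportedAlgorithm(index.algorithm.clone()));
    }
    issues
}
```

An empty result means the index is usable with the dataset. A typed enum rather than a string message lets callers decide which mismatches are fatal.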
- 1.3 Manifest integrity validation
- Add the CLI command shardlake validate-manifest.
- Verify that shard files exist.
- Verify shard metadata consistency.
- Verify vector dimension consistency.
- Verify artifact checksums.
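As a rough illustration of the existence and checksum checks, the sketch below reads each shard file and compares its digest with the value recorded in the manifest. `ShardEntry` and `validate_shards` are hypothetical names. FNV-1a is used only as a dependency-free stand-in; it is not collision resistant, and a real validator would presumably use a cryptographic hash.

```rust
use std::fs;
use std::path::Path;

/// One shard entry as it might appear in a manifest (hypothetical shape).
pub struct ShardEntry {
    pub file_name: String,
    pub checksum: u64,
}

/// 64-bit FNV-1a, a stand-in hash so the sketch needs only the standard library.
pub fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Checks that every shard exists under `root` and matches its recorded checksum.
/// Returns one human-readable message per problem; empty means the shards are valid.
pub fn validate_shards(root: &Path, shards: &[ShardEntry]) -> Vec<String> {
    let mut problems = Vec::new();
    for shard in shards {
        let path = root.join(&shard.file_name);
        match fs::read(&path) {
            Err(err) => {
                problems.push(format!("{}: cannot read shard ({})", shard.file_name, err));
            }
            Ok(bytes) => {
                let actual = fnv1a64(&bytes);
                if actual != shard.checksum {
                    problems.push(format!(
                        "{}: checksum mismatch (expected {:016x}, got {:016x})",
                        shard.file_name, shard.checksum, actual
                    ));
                }
            }
        }
    }
    problems
}
```

A CLI wrapper could print each message and exit with a non-zero status when the list is not empty.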
- 1.4 Deterministic artifact builds
- Ensure index builds produce deterministic outputs.
- Pin RNG seeds.
- Log algorithm parameters.
- Write build metadata to the manifest.
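One way to picture seed pinning: every random choice in the build flows from a single recorded seed, so the same inputs and seed give the same output. The sketch below is an assumption-heavy illustration, not the project's build code. `SplitMix64` is a small public-domain generator used here to avoid dependencies, and `BuildMetadata`, `seeded_build`, and the idea of picking initial centroids are all hypothetical.

```rust
/// SplitMix64: a tiny deterministic generator, used so the sketch has no dependencies.
pub struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64 { state: seed }
    }

    pub fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Parameters that would be logged and written to the manifest (hypothetical shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct BuildMetadata {
    pub seed: u64,
    pub algorithm: String,
    pub num_centroids: usize,
}

/// Picks `k` distinct record ids with a partial Fisher-Yates shuffle driven by `seed`,
/// and returns them together with the metadata needed to reproduce the choice.
pub fn seeded_build(record_count: usize, k: usize, seed: u64) -> (Vec<usize>, BuildMetadata) {
    let mut rng = SplitMix64::new(seed);
    let mut ids: Vec<usize> = (0..record_count).collect();
    let k = k.min(record_count);
    for i in 0..k {
        // Simple modulo reduction; slightly biased, which is fine for a sketch.
        let j = i + (rng.next_u64() % (record_count - i) as u64) as usize;
        ids.swap(i, j);
    }
    ids.truncate(k);
    let metadata = BuildMetadata {
        seed,
        algorithm: "example".to_string(), // placeholder name, not a real algorithm id
        num_centroids: k,
    };
    (ids, metadata)
}
```

Calling `seeded_build` twice with the same arguments yields identical ids and metadata. Pinned seeds alone do not guarantee determinism: parallel reductions and unordered map iteration would also need attention.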
- 1.5 Artifact registry abstraction
- Implement a local artifact registry layout:
artifacts/, datasets/, indexes/, manifests/
- Provide a trait abstraction for the storage backend.
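A possible shape for the storage trait, sketched with a local filesystem backend and an in-memory backend. The names `ArtifactStore`, `LocalStore`, `MemoryStore`, and `save_manifest` are hypothetical, as are the method signatures and the `.json` file suffix; only the `manifests/` directory name comes from the layout above.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::PathBuf;

/// Storage backend abstraction: callers address artifacts by a relative key.
pub trait ArtifactStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()>;
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
    fn exists(&self, key: &str) -> bool;
}

/// Local filesystem backend rooted at a directory.
pub struct LocalStore {
    root: PathBuf,
}

impl LocalStore {
    pub fn new<P: Into<PathBuf>>(root: P) -> Self {
        LocalStore { root: root.into() }
    }
}

impl ArtifactStore for LocalStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()> {
        let path = self.root.join(key);
        // Create intermediate directories such as `manifests/` on demand.
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(path, bytes)
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        fs::read(self.root.join(key))
    }

    fn exists(&self, key: &str) -> bool {
        self.root.join(key).is_file()
    }
}

/// In-memory backend, convenient for tests.
#[derive(Default)]
pub struct MemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl ArtifactStore for MemoryStore {
    fn put(&mut self, key: &str, bytes: &[u8]) -> io::Result<()> {
        self.objects.insert(key.to_string(), bytes.to_vec());
        Ok(())
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_string()))
    }

    fn exists(&self, key: &str) -> bool {
        self.objects.contains_key(key)
    }
}

/// Code written against the trait works with either backend.
pub fn save_manifest(store: &mut dyn ArtifactStore, name: &str, json: &str) -> io::Result<String> {
    let key = format!("manifests/{}.json", name);
    store.put(&key, json.as_bytes())?;
    Ok(key)
}
```

This sketch does not sanitize keys. A real backend would likely reject keys containing `..` or absolute paths so that writes cannot escape the registry root.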
Definition of done
- Dataset and index manifests are versioned, validated, and compatibility checked.
- Manifest validation is exposed through the CLI.
- Artifact builds are reproducible from identical inputs.
- Artifact storage is abstracted behind a registry interface.
Child issue breakdown
- Epic 1.0 - Canonical artifact registry layout and path API #31
- Depends on: none
- Blocks: Epic 1.1 - Dataset manifest schema and persistence #32, Epic 1.2 - Index manifest schema hardening and compatibility checks #33, 1.3 - Manifest integrity validation engine #34, 1.4 - CLI validate-manifest command #35, 1.5 - Deterministic artifact builds and reproducibility metadata #36
- Epic 1.1 - Dataset manifest schema and persistence #32
- Epic 1.2 - Index manifest schema hardening and compatibility checks #33
- 1.3 - Manifest integrity validation engine #34
- 1.4 - CLI validate-manifest command #35
- Depends on: 1.3 - Manifest integrity validation engine #34
- Blocks: none
- 1.5 - Deterministic artifact builds and reproducibility metadata #36
Dependency graph
- Epic 1.0 - Canonical artifact registry layout and path API #31 -> Epic 1.1 - Dataset manifest schema and persistence #32 and Epic 1.2 - Index manifest schema hardening and compatibility checks #33
- Epic 1.1 - Dataset manifest schema and persistence #32 + Epic 1.2 - Index manifest schema hardening and compatibility checks #33 -> 1.3 - Manifest integrity validation engine #34 and 1.5 - Deterministic artifact builds and reproducibility metadata #36
- 1.3 - Manifest integrity validation engine #34 -> 1.4 - CLI validate-manifest command #35