Epic 1 - Artifact and Dataset Lifecycle #6

@rmax

Parent

Goal

Make dataset, embeddings, and index artifacts first-class immutable objects with deterministic reproducibility.

Why this matters

In large-scale vector systems, artifact drift and incompatible index versions are common operational failure modes. Immutable, versioned manifests let both be caught at validation time rather than at serving time.

Detailed tasks

  • 1.1 Dataset manifest specification
    • Define the canonical schema for dataset manifests.
    • Require the following fields:
      • dataset_id
      • dataset_version
      • vector_dimension
      • distance_metric
      • embedding_model
      • embedding_version
      • shard_count
      • record_count
      • artifact_locations
      • checksum / hash
    • Add JSON Schema validation.
    • Implement a Rust struct plus serde validation.
  • 1.2 Index manifest specification
    • Define the schema for index artifacts.
    • Include the following fields:
      • index_version
      • dataset_version
      • build_timestamp
      • algorithm
      • shard_metadata
      • compression_method
      • quantization_parameters
      • recall_estimates
      • build_duration
    • Implement compatibility checks for:
      • dimension match
      • dataset version match
      • algorithm compatibility
  • 1.3 Manifest integrity validation
    • Add the CLI command shardlake validate-manifest.
    • Verify that shard files exist.
    • Verify shard metadata consistency.
    • Verify vector dimension consistency.
    • Verify artifact checksums.
  • 1.4 Deterministic artifact builds
    • Ensure index builds produce deterministic outputs.
    • Pin RNG seeds.
    • Log algorithm parameters.
    • Write build metadata to the manifest.
  • 1.5 Artifact registry abstraction
    • Implement a local artifact registry layout:
      • artifacts/
      • datasets/
      • indexes/
      • manifests/
    • Provide a trait abstraction for the storage backend.
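
As a rough illustration of tasks 1.1 and 1.2, the sketch below models a subset of the manifest fields and the three compatibility checks. It is only a starting point under stated assumptions: the `vector_dimension` field on the index manifest, the `CompatError` type, the `check_compatibility` function, and the `SUPPORTED_ALGORITHMS` allow-list are all hypothetical and not part of this issue. It uses only the standard library; the serde derives and JSON Schema validation mentioned in 1.1 would be layered on top.

```rust
/// Subset of the dataset manifest fields listed in task 1.1.
#[derive(Debug, Clone, PartialEq)]
pub struct DatasetManifest {
    pub dataset_id: String,
    pub dataset_version: String,
    pub vector_dimension: u32,
    pub distance_metric: String,
    pub shard_count: u32,
    pub record_count: u64,
}

/// Subset of the index manifest fields listed in task 1.2.
/// ASSUMPTION: `vector_dimension` is not in the issue's index field list;
/// it is added here so the dimension check has a value to compare.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexManifest {
    pub index_version: String,
    pub dataset_version: String,
    pub algorithm: String,
    pub vector_dimension: u32,
}

/// Hypothetical error type, one variant per check named in task 1.2.
#[derive(Debug, Clone, PartialEq)]
pub enum CompatError {
    DimensionMismatch { dataset: u32, index: u32 },
    DatasetVersionMismatch { dataset: String, index: String },
    UnsupportedAlgorithm(String),
}

/// ASSUMPTION: "algorithm compatibility" is modeled as a simple allow-list.
/// The names below are placeholders, not a list taken from the project.
const SUPPORTED_ALGORITHMS: &[&str] = &["ivf_flat", "hnsw"];

/// Runs all three checks and reports every failure, not just the first.
pub fn check_compatibility(
    dataset: &DatasetManifest,
    index: &IndexManifest,
) -> Result<(), Vec<CompatError>> {
    let mut errors = Vec::new();

    if dataset.vector_dimension != index.vector_dimension {
        errors.push(CompatError::DimensionMismatch {
            dataset: dataset.vector_dimension,
            index: index.vector_dimension,
        });
    }
    if dataset.dataset_version != index.dataset_version {
        errors.push(CompatError::DatasetVersionMismatch {
            dataset: dataset.dataset_version.clone(),
            index: index.dataset_version.clone(),
        });
    }
    if !SUPPORTED_ALGORITHMS.contains(&index.algorithm.as_str()) {
        errors.push(CompatError::UnsupportedAlgorithm(index.algorithm.clone()));
    }

    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Collecting all failures into a `Vec` instead of returning on the first one is a design choice that suits a CLI validator, where a single run should report every problem in a manifest pair.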

Definition of done

  • Dataset and index manifests are versioned, validated, and compatibility checked.
  • Manifest validation is exposed through the CLI.
  • Artifact builds are reproducible from identical inputs.
  • Artifact storage is abstracted behind a registry interface.
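
One possible shape for the registry interface in the last item is sketched below, using only the standard library. The trait name `ArtifactStore`, its methods, and the `LocalRegistry` type are hypothetical. The sketch also assumes that `datasets/`, `indexes/`, and `manifests/` sit under a single root directory such as `artifacts/`, which the task list does not state explicitly.

```rust
use std::fs;
use std::io::{self, Write};
use std::path::PathBuf;

/// Artifact kinds, mirroring the directories named in task 1.5.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ArtifactKind {
    Dataset,
    Index,
    Manifest,
}

impl ArtifactKind {
    /// Directory name used for this kind under the registry root.
    fn dir(self) -> &'static str {
        match self {
            ArtifactKind::Dataset => "datasets",
            ArtifactKind::Index => "indexes",
            ArtifactKind::Manifest => "manifests",
        }
    }
}

/// Hypothetical storage backend abstraction; other backends
/// (for example object storage) would implement the same trait.
pub trait ArtifactStore {
    fn put(&self, kind: ArtifactKind, name: &str, bytes: &[u8]) -> io::Result<()>;
    fn get(&self, kind: ArtifactKind, name: &str) -> io::Result<Vec<u8>>;
    fn exists(&self, kind: ArtifactKind, name: &str) -> bool;
}

/// Local filesystem registry rooted at a single directory.
pub struct LocalRegistry {
    root: PathBuf,
}

impl LocalRegistry {
    pub fn new(root: impl Into<PathBuf>) -> Self {
        Self { root: root.into() }
    }

    fn path(&self, kind: ArtifactKind, name: &str) -> PathBuf {
        self.root.join(kind.dir()).join(name)
    }
}

impl ArtifactStore for LocalRegistry {
    fn put(&self, kind: ArtifactKind, name: &str, bytes: &[u8]) -> io::Result<()> {
        let path = self.path(kind, name);
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?;
        }
        // `create_new` fails with `AlreadyExists` if the file is present,
        // so a stored artifact is never overwritten (immutability).
        let mut file = fs::OpenOptions::new()
            .write(true)
            .create_new(true)
            .open(&path)?;
        file.write_all(bytes)
    }

    fn get(&self, kind: ArtifactKind, name: &str) -> io::Result<Vec<u8>> {
        fs::read(self.path(kind, name))
    }

    fn exists(&self, kind: ArtifactKind, name: &str) -> bool {
        self.path(kind, name).is_file()
    }
}
```

Refusing to overwrite in `put` is one way to enforce the immutability stated in the Goal; a new version of an artifact would be stored under a new name instead.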

Child issue breakdown

Dependency graph
