[Proposal]: Backup/Restore strategy (simple chart-native + Velero advanced) #135

@tazarov

Context

This follows up on #6 and captures research findings so we can implement backup/restore pragmatically.

Constraints (must-haves)

  • Backup must be consistent and include both Chroma index/segment data and SQLite metadata/WAL.
  • For filesystem-level backup, we should stop Chroma during backup to prevent writes while data is copied.
  • Keep setup open-source, easy to adopt, and not overly complex.

Proposal: two support tiers

1) Simple mode (default, chart-native)

A chart-managed backup workflow with intentional downtime during backup:

  • A CronJob orchestrates the backup schedule.
  • Backup run:
    1. Read current StatefulSet replica count.
    2. Scale the Chroma StatefulSet to 0 and wait until all pods have terminated.
    3. Run backup against PVC data path (persistDirectory).
    4. Scale StatefulSet back to original replica count (always, via trap/finally).
  • Tools supported:
    • restic to S3-compatible storage (recommended default).
    • optional direct archive/copy to S3 for ultra-simple installs.
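
The backup run steps above could be sketched as follows. This is a minimal illustration and not the chart's implementation: the function names, the `run` runner parameter, the pod label selector, and the polling approach are all assumptions. The backup itself (restic or direct S3 copy) is passed in as a callable so the scale-down/scale-up logic stays independent of the tool.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def kubectl(args: Sequence[str]) -> str:
    """Run kubectl and return trimmed stdout; raises CalledProcessError on failure."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def wait_for_no_pods(run: Runner, namespace: str, selector: str,
                     timeout_s: float = 300.0, poll_s: float = 2.0) -> None:
    """Poll until no pods match the selector (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    # `kubectl get pods -o name` prints nothing to stdout when nothing matches.
    while run(["get", "pods", "-n", namespace, "-l", selector, "-o", "name"]):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"pods matching {selector!r} did not terminate")
        time.sleep(poll_s)


def run_backup(statefulset: str, namespace: str, pod_selector: str,
               backup: Callable[[], None], run: Runner = kubectl) -> int:
    """Scale down, back up, and always scale back to the original replica count."""
    target = f"statefulset/{statefulset}"
    # 1. Read the current replica count.
    replicas = int(run(["get", target, "-n", namespace,
                        "-o", "jsonpath={.spec.replicas}"]))
    try:
        # 2. Scale to 0 and wait for the pods to terminate.
        run(["scale", target, "-n", namespace, "--replicas=0"])
        wait_for_no_pods(run, namespace, pod_selector)
        # 3. Run the backup against the PVC data path.
        backup()
    finally:
        # 4. Restore the original replica count on success and failure alike.
        run(["scale", target, "-n", namespace, f"--replicas={replicas}"])
    return replicas
```

The `try`/`finally` plays the role of the trap mentioned in step 4: if the backup or the wait fails, the exception still propagates (so the Job is marked failed), but only after the StatefulSet has been scaled back.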

Suggested values (example shape):

backup:
  enabled: true
  schedule: "0 2 * * *"
  suspend: false
  tool: restic # restic|s3
  retention:
    keepDaily: 7
    keepWeekly: 4
  s3:
    endpoint: ""
    bucket: ""
    prefix: "chromadb"
    region: ""
  credentials:
    existingSecret: ""
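
To make the intent of these values concrete, here is a rough sketch of how they might translate into restic inputs. The helper names are hypothetical and the mapping is an assumption about how the chart could consume the values. It relies only on restic's `s3:<endpoint>/<bucket>` repository syntax and the `forget` flags `--keep-daily`, `--keep-weekly`, and `--prune`; `region` and credentials handling are left out.

```python
from typing import Any, List, Mapping


def restic_repository(s3: Mapping[str, Any]) -> str:
    """Build a restic S3 repository string from the backup.s3 values."""
    bucket = s3.get("bucket") or ""
    if not bucket:
        raise ValueError("backup.s3.bucket must be set")
    # Assumption: an empty endpoint means AWS S3 itself.
    endpoint = (s3.get("endpoint") or "s3.amazonaws.com").rstrip("/")
    parts = [endpoint, bucket]
    prefix = (s3.get("prefix") or "").strip("/")
    if prefix:
        parts.append(prefix)
    return "s3:" + "/".join(parts)


def restic_forget_args(retention: Mapping[str, Any]) -> List[str]:
    """Translate backup.retention into arguments for `restic forget`."""
    flags = {"keepDaily": "--keep-daily", "keepWeekly": "--keep-weekly"}
    args = ["forget", "--prune"]
    for key, flag in flags.items():
        if key in retention:
            args += [flag, str(retention[key])]
    return args
```

With the example values and a hypothetical bucket name, `restic_forget_args({"keepDaily": 7, "keepWeekly": 4})` yields `forget --prune --keep-daily 7 --keep-weekly 4`, which would run after each successful backup.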

2) Advanced mode (documented integration)

Document Velero-based strategy for teams that already run Velero:

  • Use Velero backup orchestration with pre/post hooks or controlled quiesce.
  • Prefer CSI snapshots where available; otherwise fall back to Velero's file system backup.
  • Keep this out of core chart logic (docs-only guidance), since setup complexity is significantly higher.
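
As one possible shape for the hook-based option, the sketch below builds the pod annotations Velero reads for pre/post backup hooks. The annotation keys are Velero's; the use of `fsfreeze`, the container name, and the mount path are assumptions for illustration only. Freezing the filesystem typically requires a privileged container, and whether a frozen filesystem is enough for Chroma's SQLite WAL is not established here, so scaling down remains the conservative choice.

```python
import json
from typing import Dict


def velero_fsfreeze_annotations(container: str, mount_path: str) -> Dict[str, str]:
    """Build Velero backup hook annotations that freeze and unfreeze a mount.

    `container` and `mount_path` are hypothetical; Velero expects each
    command annotation to be a JSON array encoded as a string.
    """
    return {
        "pre.hook.backup.velero.io/container": container,
        "pre.hook.backup.velero.io/command": json.dumps(
            ["/sbin/fsfreeze", "--freeze", mount_path]
        ),
        "post.hook.backup.velero.io/container": container,
        "post.hook.backup.velero.io/command": json.dumps(
            ["/sbin/fsfreeze", "--unfreeze", mount_path]
        ),
    }
```

The resulting dictionary would go under the pod template's `metadata.annotations`, for example via whatever pod-annotations value the chart exposes.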

Why this split

  • Most users get a practical, low-complexity backup path quickly.
  • Advanced users can keep platform-standard DR tooling (Velero) without forcing complexity on everyone.

Open questions

  • Do we want direct-S3 mode in v1, or only restic first?
  • Should restore be shipped in the first iteration (Job + runbook), or immediately after backup MVP?

Acceptance criteria (MVP)

  • Configurable backup CronJob exists and is disabled by default.
  • Backups are performed only after Chroma is scaled down.
  • StatefulSet is restored to original replica count on success and failure paths.
  • Restore Job/runbook is documented and tested in the CI smoke workflow.
  • README includes both support tiers: Simple mode (native) and Advanced mode (Velero).
