Context
This follows up on #6 and captures research findings so we can implement backup/restore pragmatically.
Constraints (must-haves)
- Backup must be consistent and include both Chroma index/segment data and SQLite metadata/WAL.
- For filesystem-level backup, we should stop Chroma during backup to prevent writes while data is copied.
- Keep setup open-source, easy to adopt, and not overly complex.
Proposal: two support tiers
1) Simple mode (default, chart-native)
A chart-managed backup workflow with intentional downtime during backup:
- CronJob orchestrates the backup schedule.
- Backup run:
  - Read the current StatefulSet replica count.
  - Scale the Chroma StatefulSet to 0 and wait until the pods have terminated.
  - Run the backup against the PVC data path (persistDirectory).
  - Scale the StatefulSet back to the original replica count (always, via trap/finally).
- Tools supported:
  - restic to S3-compatible storage (recommended default).
  - Optional direct archive/copy to S3 for ultra-simple installs.
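The backup run steps above could be sketched roughly as follows. This is a minimal illustration, not the chart's implementation: the StatefulSet name, namespace, pod label selector, and timeout are hypothetical defaults, and the backup step is passed in as a callable so that either tool could be plugged in. The `finally` block plays the role of the trap/finally guarantee.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cmd(args: Sequence[str]) -> str:
    """Run a command and return stdout; raises CalledProcessError on failure."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout


def backup_with_downtime(
    backup: Callable[[], None],
    statefulset: str = "chromadb",  # hypothetical name
    namespace: str = "default",  # hypothetical namespace
    pod_selector: str = "app.kubernetes.io/name=chromadb",  # hypothetical label
    run: Runner = run_cmd,
) -> None:
    target = ["statefulset", statefulset, "-n", namespace]

    # 1. Remember the current replica count so it can be restored later.
    out = run(["kubectl", "get", *target, "-o", "jsonpath={.spec.replicas}"])
    replicas = int(out.strip())

    try:
        # 2. Scale to zero and wait until the pods are gone.
        #    (How kubectl wait behaves when no pods match may vary by version.)
        run(["kubectl", "scale", *target, "--replicas=0"])
        run(["kubectl", "wait", "--for=delete", "pod", "-l", pod_selector,
             "-n", namespace, "--timeout=300s"])

        # 3. Run the actual backup while nothing writes to the PVC.
        backup()
    finally:
        # 4. Always scale back, whether the backup succeeded or failed.
        run(["kubectl", "scale", *target, f"--replicas={replicas}"])
```

Injecting `run` keeps the sequence testable without a cluster. In a real Job the same flow might equally be a shell script with `trap`; the point is only that the scale-up runs on both success and failure paths.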
Suggested values (example shape):
backup:
  enabled: true
  schedule: "0 2 * * *"
  suspend: false
  tool: restic # restic|s3
  retention:
    keepDaily: 7
    keepWeekly: 4
  s3:
    endpoint: ""
    bucket: ""
    prefix: "chromadb"
    region: ""
  credentials:
    existingSecret: ""
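To illustrate how these values might translate into restic invocations, here is a small sketch. The helper names are hypothetical and the mapping is an assumption about how the chart could consume the values; `--keep-daily`, `--keep-weekly`, and `--prune` are existing `restic forget` flags, and `s3:<endpoint>/<bucket>/<path>` is restic's S3 repository format.

```python
from typing import Dict, List, Mapping

# Assumed mapping from the example retention keys to restic forget flags.
RETENTION_FLAGS: Dict[str, str] = {
    "keepDaily": "--keep-daily",
    "keepWeekly": "--keep-weekly",
}


def restic_forget_args(retention: Mapping[str, int]) -> List[str]:
    """Build a `restic forget` command line from the retention values."""
    args = ["restic", "forget", "--prune"]
    for key, flag in RETENTION_FLAGS.items():
        value = retention.get(key)
        if value:  # skip unset or zero entries
            args += [flag, str(value)]
    return args


def restic_s3_repository(endpoint: str, bucket: str, prefix: str) -> str:
    """Build a restic S3 repository string from the s3 values."""
    return f"s3:{endpoint.rstrip('/')}/{bucket}/{prefix.strip('/')}"
```

For the example values, `restic_forget_args({"keepDaily": 7, "keepWeekly": 4})` returns `["restic", "forget", "--prune", "--keep-daily", "7", "--keep-weekly", "4"]`.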
2) Advanced mode (documented integration)
Document Velero-based strategy for teams that already run Velero:
- Use Velero backup orchestration with pre/post hooks or controlled quiesce.
- Prefer CSI snapshots where available; otherwise fall back to Velero's filesystem backup.
- Keep this out of core chart logic (docs-only guidance), since setup complexity is significantly higher.
Why this split
- Most users get a practical, low-complexity backup path quickly.
- Advanced users can keep platform-standard DR tooling (Velero) without forcing complexity on everyone.
Open questions
- Do we want direct-S3 mode in v1, or only restic first?
- Should restore be shipped in the first iteration (Job + runbook), or immediately after backup MVP?
Acceptance criteria (MVP)
- Configurable backup CronJob exists and is disabled by default.
- Backups are performed only after Chroma is scaled down.
- StatefulSet is restored to original replica count on success and failure paths.
- Restore Job/runbook is documented and tested in CI smoke workflow.
- README includes both support tiers: Simple mode (native) and Advanced mode (Velero).