[Proposal]: Backup/Restore strategy (simple chart-native + Velero advanced) #135

## Context

This follows up on #6 and captures research findings so we can implement backup/restore pragmatically.

## Constraints (must-haves)

- Backups must be consistent and include both the Chroma index/segment data and the SQLite metadata/WAL.
- Filesystem-level backups should run with Chroma stopped, so that no writes occur while data is copied.
- Keep setup open-source, easy to adopt, and not overly complex.

## Proposal: two support tiers

### 1) Simple mode (default, chart-native)

A chart-managed backup workflow with intentional downtime during backup:

- A `CronJob` triggers the backup on a schedule.
- Backup run:
  1. Read the current StatefulSet replica count.
  2. Scale the Chroma StatefulSet to `0` and wait until all pods have terminated.
  3. Run the backup against the PVC data path (`persistDirectory`).
  4. Scale the StatefulSet back to the original replica count (always, via `trap`/`finally`).
- Tools supported:
  - `restic` to S3-compatible storage (recommended default).
  - optional direct archive/copy to S3 for ultra-simple installs.
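
The backup run above could be sketched as a small shell script executed by the CronJob pod. This is a minimal sketch under assumptions, not the final implementation: the StatefulSet name, namespace, pod label, and mount path are hypothetical defaults, the job image is assumed to ship `kubectl` and `restic`, and the job's ServiceAccount is assumed to have RBAC permission to read and scale the StatefulSet.

```shell
#!/bin/sh
# Sketch only: every default below is a hypothetical placeholder.
STS_NAME="${STS_NAME:-chromadb}"
NAMESPACE="${NAMESPACE:-default}"
POD_SELECTOR="${POD_SELECTOR:-app.kubernetes.io/name=chromadb}"
DATA_DIR="${DATA_DIR:-/data}"        # where the PVC (persistDirectory) is mounted in the backup pod
WAIT_TIMEOUT="${WAIT_TIMEOUT:-300}"  # seconds to wait for pod termination

# Poll until no Chroma pods remain; fail on kubectl errors or on timeout.
wait_for_no_pods() {
  waited=0
  while true; do
    pods="$(kubectl -n "$NAMESPACE" get pods -l "$POD_SELECTOR" -o name)" || return 1
    [ -n "$pods" ] || return 0
    [ "$waited" -lt "$WAIT_TIMEOUT" ] || return 1
    sleep 5
    waited=$((waited + 5))
  done
}

# The body runs in a subshell so the EXIT trap fires as soon as the run ends.
run_backup() (
  # 1. Read the current replica count.
  replicas="$(kubectl -n "$NAMESPACE" get statefulset "$STS_NAME" \
    -o jsonpath='{.spec.replicas}')" || exit 1
  [ -n "$replicas" ] || exit 1

  # 4. Registered first so it runs on both success and failure paths.
  trap 'kubectl -n "$NAMESPACE" scale statefulset "$STS_NAME" --replicas="$replicas"' EXIT

  # 2. Scale to 0 and wait until the pods are gone.
  kubectl -n "$NAMESPACE" scale statefulset "$STS_NAME" --replicas=0 || exit 1
  wait_for_no_pods || exit 1

  # 3. Back up the data path. restic reads RESTIC_REPOSITORY and
  #    RESTIC_PASSWORD from the environment.
  restic backup "$DATA_DIR" || exit 1
)
```

Calling `run_backup` returns a non-zero status if any step fails, while the trap still scales the StatefulSet back to the recorded replica count. Registering the trap before scaling down means there is no window in which a failure leaves Chroma at zero replicas.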

Suggested values (example shape):

```yaml
backup:
  enabled: true
  schedule: "0 2 * * *"
  suspend: false
  tool: restic # restic|s3
  retention:
    keepDaily: 7
    keepWeekly: 4
  s3:
    endpoint: ""
    bucket: ""
    prefix: "chromadb"
    region: ""
  credentials:
    existingSecret: ""
```
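
To illustrate how these values might feed the `restic` tool, here is a hedged sketch of two helper functions. The function names are hypothetical, and the mapping of `s3.*` and `retention.*` onto restic inputs is an assumption about how the chart could wire things, not a settled design. The `s3:<endpoint>/<bucket>/<prefix>` repository form and the `--keep-daily`, `--keep-weekly`, and `--prune` flags of `restic forget` are standard restic features.

```shell
#!/bin/sh
# Hypothetical helpers that translate chart values into restic inputs.

# Build a restic S3 repository URL from s3.endpoint, s3.bucket, and s3.prefix.
restic_repo_url() {
  endpoint="$1"; bucket="$2"; prefix="$3"
  # Strip one trailing slash from the endpoint to avoid a double slash.
  printf 's3:%s/%s/%s\n' "${endpoint%/}" "$bucket" "$prefix"
}

# Turn retention.keepDaily and retention.keepWeekly into `restic forget` flags.
retention_args() {
  keep_daily="$1"; keep_weekly="$2"
  printf '%s\n' "--keep-daily $keep_daily --keep-weekly $keep_weekly --prune"
}
```

A backup script could then run `export RESTIC_REPOSITORY="$(restic_repo_url "$endpoint" "$bucket" "$prefix")"` followed by `restic forget $(retention_args 7 4)` after each successful backup. The Secret named in `credentials.existingSecret` is assumed to provide `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `RESTIC_PASSWORD` as environment variables.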

### 2) Advanced mode (documented integration)

Document Velero-based strategy for teams that already run Velero:

- Use Velero backup orchestration with pre/post hooks or a controlled quiesce.
- Prefer CSI snapshots where available; otherwise fall back to Velero's filesystem backup.
- Keep this out of core chart logic (docs-only guidance), since setup complexity is significantly higher.
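
For the docs, the choice between CSI snapshots and filesystem backup could be illustrated with a small wrapper around `velero backup create`. This is a sketch under assumptions: the function name and the `csi`/`fs` mode argument are hypothetical, Velero 1.10 or later is assumed (for `--default-volumes-to-fs-backup`), and CSI support is assumed to be enabled on the Velero server already. Quiescing Chroma, via hooks or a scale-down, is left out of the sketch.

```shell
#!/bin/sh
# Hypothetical wrapper: back up one namespace with either CSI snapshots
# or Velero's filesystem backup.
velero_backup() {
  name="$1"; namespace="$2"; mode="$3"
  case "$mode" in
    csi) volume_flag="--snapshot-volumes" ;;               # volume snapshots
    fs)  volume_flag="--default-volumes-to-fs-backup" ;;   # file-level copy
    *)   echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  # --wait blocks until the backup completes, so a caller can un-quiesce afterwards.
  velero backup create "$name" --include-namespaces "$namespace" "$volume_flag" --wait
}
```

For example, `velero_backup chroma-nightly chroma fs` would request a filesystem backup of everything in a namespace called `chroma` (a placeholder name).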

## Why this split

- Most users get a practical, low-complexity backup path quickly.
- Advanced users can keep platform-standard DR tooling (Velero) without forcing complexity on everyone.

## Open questions

- Do we want direct-S3 mode in v1, or only `restic` first?
- Should restore be shipped in the first iteration (Job + runbook), or immediately after backup MVP?

## Acceptance criteria (MVP)

- A configurable backup CronJob exists and is disabled by default.
- Backups are performed only after Chroma is scaled down.
- The StatefulSet is restored to its original replica count on both success and failure paths.
- The restore Job/runbook is documented and tested in a CI smoke workflow.
- The README covers both support tiers: Simple mode (native) and Advanced mode (Velero).

## References

- Chroma backup strategy: https://cookbook.chromadb.dev/strategies/chromadb-backups/
- Chroma data layout FAQ: https://cookbook.chromadb.dev/faq/#dude-wheres-my-data
- Kubernetes CronJob: https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/
- Kubernetes StatefulSet scaling: https://kubernetes.io/docs/tasks/run-application/scale-stateful-set/
- restic project/docs: https://restic.net/, https://restic.readthedocs.io/en/stable/030_preparing_a_new_repo.html
- Velero docs (hooks/CSI/filesystem backup): https://velero.io/docs/main/backup-hooks/, https://velero.io/docs/main/csi/, https://velero.io/docs/main/file-system-backup/

