Sietch Vault uses content-defined chunking and cryptographic deduplication to minimize redundant data storage.
Instead of storing an entire file whenever it changes, Sietch breaks each file into small, fixed- or variable-sized chunks (default: 4 MB), computes unique hashes for each chunk, and only stores chunks that haven't been stored before.
This makes syncing and storage efficient: identical data across files, folders, or even vaults is stored only once, reducing disk space and sync time.
Deduplication in Sietch happens at chunk level, not file level.
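To illustrate why variable-sized chunks can help, the sketch below compares fixed-size chunking with a simple rolling-hash chunker when one byte is inserted at the start of a file. It is a toy model, not Sietch's actual chunker: the function names, the polynomial rolling hash, and the tiny window and size limits (bytes rather than megabytes) are all assumptions chosen to keep the demo small.

```python
import hashlib
import random

# All constants are demo-sized assumptions; real chunks are measured in MB.
WINDOW = 16                      # bytes covered by the rolling hash
BASE = 257                       # polynomial base
MOD = (1 << 61) - 1              # large prime modulus
BOUNDARY_MASK = (1 << 6) - 1     # cut when the low 6 bits are zero (about 1 in 64 positions)
MIN_SIZE, MAX_SIZE = 16, 256     # MIN_SIZE >= WINDOW, so every cut decision sees a full window


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Split where a rolling hash of the last WINDOW bytes matches a bit pattern."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte that leaves the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD  # drop the oldest byte
        h = (h * BASE + byte) % MOD                 # append the newest byte
        size = i - start + 1
        if (size >= MIN_SIZE and (h & BOUNDARY_MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(a: list[bytes], b: list[bytes]) -> int:
    """Count distinct chunks (by SHA-256) that appear in both chunk lists."""
    digests_a = {hashlib.sha256(c).digest() for c in a}
    digests_b = {hashlib.sha256(c).digest() for c in b}
    return len(digests_a & digests_b)


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = b"X" + data  # one inserted byte shifts every later offset

print("fixed-size chunks shared:  ", shared(fixed_chunks(data), fixed_chunks(edited)))
print("rolling-hash chunks shared:", shared(cdc_chunks(data), cdc_chunks(edited)))
```

With fixed offsets, the inserted byte shifts every later chunk, so almost nothing matches. The rolling-hash boundaries depend only on nearby content, so they realign shortly after the edit and most chunks are reused.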
- Chunking
  - Every file added to a vault is split into smaller data chunks.
  - Chunk size is configurable (`--chunk-size` flag, default 4 MB).
  - Chunk boundaries can be static or computed using a rolling hash, which helps detect identical regions even when a file's contents shift slightly.
- Hashing
  - Each chunk is assigned two hashes:
    - Content Hash: A cryptographic hash (e.g., SHA-256) of the unencrypted chunk.
    - Storage Hash: A hash of the encrypted chunk (post-encryption).
- Deduplication Check
  - When a new chunk is processed, Sietch checks whether its content hash already exists in the vault index.
  - If found, it reuses the existing chunk reference instead of storing a duplicate.
- Encryption
  - Chunks are encrypted after the deduplication check.
  - This ensures that identical plaintext chunks yield identical content hashes (so dedup works), while the storage hash maintains uniqueness in encrypted storage.
  - Supported encryption modes:
    - AES-256-GCM
    - ChaCha20-Poly1305
- Manifest Tracking
  - Each file's metadata (its list of chunks, sizes, hashes) is stored in a manifest.
  - The manifest maps logical files to their deduplicated chunks in the storage backend.
  - Manifests are versioned, so rolling back or verifying integrity is easy.
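The steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `ToyVault` and its methods are hypothetical names, the chunk size is 4 bytes instead of 4 MB, chunking is fixed-size, and the encryption step is left out (it would sit where a new chunk is written).

```python
import hashlib

CHUNK_SIZE = 4  # bytes, demo-sized assumption; the documented default is 4 MB


class ToyVault:
    """Hypothetical in-memory stand-in for a vault index, storage backend and manifests."""

    def __init__(self) -> None:
        self.index: dict[str, bytes] = {}            # content hash -> stored chunk
        self.manifests: dict[str, list[dict]] = {}   # file name -> ordered chunk references

    def add_file(self, name: str, data: bytes) -> int:
        """Chunk, hash and deduplicate a file; return how many new chunks were stored."""
        new_chunks = 0
        refs = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            content_hash = hashlib.sha256(chunk).hexdigest()
            if content_hash not in self.index:       # deduplication check
                self.index[content_hash] = chunk     # a real vault would encrypt here
                new_chunks += 1
            refs.append({"hash": content_hash, "size": len(chunk)})
        self.manifests[name] = refs                  # manifest maps the file to its chunks
        return new_chunks

    def read_file(self, name: str) -> bytes:
        """Rebuild a file by following its manifest."""
        return b"".join(self.index[ref["hash"]] for ref in self.manifests[name])


vault = ToyVault()
print(vault.add_file("a.txt", b"AAAABBBBCCCC"))  # 3 new chunks stored
print(vault.add_file("b.txt", b"AAAABBBBDDDD"))  # 1 new chunk; AAAA and BBBB are reused
print(len(vault.index))                          # 4 unique chunks back 6 chunk references
```

The second file shares two of its three chunks with the first, so only one new chunk is written even though the files differ.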
| Type | Definition | Purpose | Scope |
|---|---|---|---|
| Content Hash | Hash of the raw (unencrypted) chunk data. | Used to identify identical content for deduplication before encryption. | Computed once during ingestion. |
| Storage Hash | Hash of the encrypted chunk data. | Used to verify integrity and locate stored encrypted blobs. | Used internally during sync and retrieval. |
In short:
- Content Hash = Dedup identity
- Storage Hash = Integrity + storage mapping
By separating these two, Sietch maintains efficient deduplication while ensuring secure encryption and data integrity.
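A minimal sketch of the two hashes follows. It assumes SHA-256 for both and a fresh random nonce per encryption, which is common practice with AES-256-GCM and ChaCha20-Poly1305 but is not confirmed here for Sietch. `toy_encrypt` is a hypothetical stand-in built from the standard library so the example runs anywhere; it is NOT secure and is not one of the supported ciphers.

```python
import hashlib
import os


def toy_encrypt(key: bytes, chunk: bytes) -> bytes:
    """Stand-in cipher for illustration only. NOT secure; a real vault would use an AEAD."""
    nonce = os.urandom(12)
    stream = b""
    counter = 0
    while len(stream) < len(chunk):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return nonce + bytes(p ^ s for p, s in zip(chunk, stream))


def hash_pair(key: bytes, chunk: bytes) -> tuple[str, str, bytes]:
    """Return (content hash, storage hash, encrypted blob) for one chunk."""
    content_hash = hashlib.sha256(chunk).hexdigest()   # dedup identity, from plaintext
    blob = toy_encrypt(key, chunk)
    storage_hash = hashlib.sha256(blob).hexdigest()    # integrity of the stored blob
    return content_hash, storage_hash, blob


def verify_blob(blob: bytes, storage_hash: str) -> bool:
    """Integrity check on a retrieved blob; no decryption key is needed."""
    return hashlib.sha256(blob).hexdigest() == storage_hash


key = os.urandom(32)
c1, s1, blob1 = hash_pair(key, b"same plaintext chunk")
c2, s2, blob2 = hash_pair(key, b"same plaintext chunk")

print(c1 == c2)                         # True: identical plaintext, identical dedup identity
print(s1 == s2)                         # False: the random nonce makes each blob different
print(verify_blob(blob1, s1))           # True
print(verify_blob(blob1 + b"\x00", s1)) # False: any change to the blob is detected
```

Because the content hash is computed before encryption, duplicates are recognized no matter what the ciphertext looks like, and the storage hash lets a stored blob be checked without access to the key.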
If you created a vault before deduplication was enabled, follow this step-by-step process to migrate safely.
Before any operation:
sietch backup --output ./vault-backup

This ensures you can roll back if migration fails.
Enable dedup in the configuration:
sietch config set dedup.enabled true
sietch config set dedup.chunk-size 4MB

Or manually in the config file (`~/.sietch/config.json`):

{
  "dedup": {
    "enabled": true,
    "chunk_size": "4MB"
  }
}

Run the re-indexing tool to compute chunk hashes for existing data:
sietch dedup reindex

This step scans all files, computes content hashes, and builds a deduplication index.
Once the dedup index is ready:
sietch dedup gc

This removes orphaned or redundant chunks not referenced in any manifest.
To finalize:
sietch dedup optimize

This reorganizes chunks and manifests for better read/write performance.
Note: For very large vaults, perform these steps on a local copy or use the --dry-run flag first to estimate changes.
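The migration sequence can also be scripted so that it stops at the first failing step. The sketch below only strings together the commands listed above; the helper names `migration_plan` and `run_migration` are hypothetical, and `--dry-run` is left out because it is not stated which subcommands accept it.

```python
import subprocess


def migration_plan(backup_dir: str = "./vault-backup", chunk_size: str = "4MB") -> list[list[str]]:
    """Ordered commands for the migration, taken from the steps in this guide."""
    return [
        ["sietch", "backup", "--output", backup_dir],
        ["sietch", "config", "set", "dedup.enabled", "true"],
        ["sietch", "config", "set", "dedup.chunk-size", chunk_size],
        ["sietch", "dedup", "reindex"],
        ["sietch", "dedup", "gc"],
        ["sietch", "dedup", "optimize"],
    ]


def run_migration(plan: list[list[str]], runner=subprocess.run) -> None:
    """Run each step in order; check=True raises on a non-zero exit and aborts the rest."""
    for cmd in plan:
        print("$", " ".join(cmd))
        runner(cmd, check=True)


# Print the plan for review; call run_migration(migration_plan()) to execute it.
for cmd in migration_plan():
    print(" ".join(cmd))
```

Running the backup first and aborting on any error means a failed re-index never proceeds to garbage collection.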
Chunk size directly affects both storage efficiency and CPU performance:
| Chunk Size | Use Case | Storage Efficiency | CPU Cost |
|---|---|---|---|
| 1 MB | Rapidly changing files, e.g., source code, logs | High | High |
| 4 MB (default) | Balanced general purpose | Medium | Medium |
| 8–16 MB | Large static files (media, backups) | Low (fewer dedup hits) | Low |
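To make the trade-off in the table concrete, the arithmetic below counts how many chunks must be hashed and indexed for a hypothetical 100 GiB vault at three chunk sizes. It assumes fixed-size chunks and treats 1 MB as 2^20 bytes; the vault size is an illustrative figure, not a measurement.

```python
MIB = 1024 * 1024


def chunk_count(total_bytes: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_size)


vault_bytes = 100 * 1024 * MIB  # hypothetical 100 GiB vault

for size_mib in (1, 4, 16):
    n = chunk_count(vault_bytes, size_mib * MIB)
    # A one-byte edit invalidates the chunk that contains it, so that chunk is stored again.
    print(f"{size_mib:>2} MiB chunks: {n:>7,} index entries, "
          f"{size_mib} MiB re-stored per one-byte edit")
```

Going from 1 MiB to 16 MiB chunks cuts the number of hashes and index entries by a factor of 16, but each small edit then costs 16 times as much new storage.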
Tips:
- Smaller chunks → better deduplication but slower processing.
- Larger chunks → faster sync and less overhead but fewer dedup hits.
- For mixed workloads, keep 4 MB or use `--adaptive-chunking` (if available).
- Run `sietch dedup stats` regularly to monitor chunk reuse and storage savings.
- Avoid changing chunk size after initial vault creation, as this can break dedup references.
- Use `sietch dedup optimize` monthly to defragment storage.
- Keep your manifests backed up; they're critical for mapping files to chunks.
- Use `--dry-run` with dedup operations before running them in production.
# Initialize vault with ChaCha20 encryption
sietch init --name research --key-type chacha20
# Add files
sietch add ./datasets ./vault/data
# Check dedup stats
sietch dedup stats
# Clean up unreferenced chunks
sietch dedup gc
# Optimize layout
sietch dedup optimize

Output might look like:
Deduplication Statistics
------------------------
Total Chunks: 12,843
Unique Chunks: 9,557
Space Saved: 4.21 GB (32%)
Garbage Collected: 152 chunks
Optimization Complete: OK

- Adaptive Chunking --> variable chunk sizes based on content entropy.
- Cross-Vault Dedup --> share dedup indices securely across multiple vaults.
- Dedup Metrics API --> expose storage savings via REST/CLI metrics.