are not on the remote, and pulls any that are on the remote and not local.
5. After a sync, any new local changelogs are merged into the entity tables.

## Changelog Abstraction

The sync engine uses a `Changelog` trait that provides a unified interface for change tracking.
There are multiple implementations:

- **DbChangelog**: Reads/writes changes directly from/to the database tables
- **BasicStorageChangelog**: Simple storage-based changelog (one file per change)
- **BatchingStorageChangelog**: Optimized implementation that batches changes for efficiency
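
The trait itself is not spelled out in this document; the following is only a rough sketch of what such an interface could look like, where the method names, the `ChangeRecord` type, and the error handling are all assumptions:

```rust
// Hypothetical sketch of the `Changelog` trait; names and signatures
// are assumptions, not the actual API.
use std::error::Error;

/// One tracked change: a ZV_CHANGE row plus its ZV_CHANGE_FIELD rows.
pub struct ChangeRecord {
    pub change_id: String,
    pub author_id: String,
    // field-level data elided
}

pub trait Changelog {
    /// IDs of all changes known to this changelog.
    fn change_ids(&self) -> Result<Vec<String>, Box<dyn Error>>;

    /// Fetch the full records for the given change IDs.
    fn read(&self, ids: &[String]) -> Result<Vec<ChangeRecord>, Box<dyn Error>>;

    /// Append new changes to this changelog.
    fn write(&mut self, changes: &[ChangeRecord]) -> Result<(), Box<dyn Error>>;
}
```

Under this shape, a sync is a set difference of `change_ids()` between the local and remote changelogs, followed by `read` on one side and `write` on the other.
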
## Conflict Resolution

The Sync Engine treats the database tables as grow-only lists and the columns
as last-write-wins registers resolved by the global change order. Deletes must
be handled at the user level with tombstones.
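
Because nothing is ever removed from the tables, deletion has to be expressed as ordinary data. A hypothetical illustration (not from the source) of a user-level tombstone:

```rust
// Hypothetical user-level tombstone: deletion is modeled as a normal
// last-write-wins field change on a user-defined column, never as a
// row removal.
struct Task {
    id: String,
    title: String,
    deleted: bool, // a "delete" is a change that sets this flag to true
}
```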

# Sync Storage

The sync engine supports multiple storage implementations through the `SyncStorage` trait:
- S3 and S3-compatible storage (primary target)
- Local filesystem storage
- In-memory storage (for testing)
- Encrypted storage wrapper using age encryption
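
The trait's exact methods are not given here; below is a minimal sketch of the blob-store shape it presumably abstracts over (method names and signatures are assumptions):

```rust
// Hypothetical sketch of a `SyncStorage` trait over key/value blob storage;
// names and signatures are assumptions, not the actual API.
use std::error::Error;

pub trait SyncStorage {
    /// List all object keys under a prefix (e.g. "changes/").
    fn list(&self, prefix: &str) -> Result<Vec<String>, Box<dyn Error>>;

    /// Read the raw bytes of one object.
    fn get(&self, key: &str) -> Result<Vec<u8>, Box<dyn Error>>;

    /// Write raw bytes to one object.
    fn put(&mut self, key: &str, data: &[u8]) -> Result<(), Box<dyn Error>>;
}
```

An encrypted wrapper can then implement the same trait by encrypting in `put` and decrypting in `get` while delegating to an inner backend, which is presumably how the age-based layer composes with the S3, file, and memory backends.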

## Storage Formats

### Basic Storage Format
Each change is stored as an individual file containing a `RemoteChangeRecord` which combines:
- One record from the ZV_CHANGE table (change metadata)
- Associated records from the ZV_CHANGE_FIELD table (individual field changes)
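
In Rust terms this could map onto a pair of serde structs. The following is a sketch only: the outer `change`/`fields` keys match the JSON examples later in this document, but the inner fields are assumptions, since the full column sets of ZV_CHANGE and ZV_CHANGE_FIELD are not reproduced here:

```rust
// Hypothetical serde mapping for a change file. The "change"/"fields"
// structure matches the examples below; the inner fields are assumptions.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
pub struct RemoteChangeRecord {
    /// One ZV_CHANGE row (change metadata).
    pub change: Change,
    /// The associated ZV_CHANGE_FIELD rows (individual field changes).
    pub fields: Vec<ChangeField>,
}

#[derive(Serialize, Deserialize)]
pub struct Change {
    pub change_id: String, // assumed columns; see the ZV_CHANGE schema
    pub author_id: String,
}

#[derive(Serialize, Deserialize)]
pub struct ChangeField {
    pub field_name: String, // assumed columns and value representation;
    pub value: String,      // see the ZV_CHANGE_FIELD schema
}
```

With a mapping like this, `rmp_serde::to_vec_named` and `rmp_serde::from_slice` handle the MessagePack encoding; `to_vec_named` keeps field names in the output, matching the keyed layout shown in the examples.
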
### Batching Storage Format (Optimized)
To improve sync performance and reduce memory usage for large datasets, the `BatchingStorageChangelog`
implementation groups changes into batches:

- **Manifests**: Map change IDs to batch IDs (one per author)
- **Batches**: Contain the actual change data (limited to 100MB per batch)

Large sync operations are automatically split into manageable chunks, which
prevents memory exhaustion when syncing large datasets.

## Directory Structure

### Basic Storage Layout
```
storage_root/
└── changes/
    └── {change_uuid}.msgpack
```

### Batching Storage Layout
```
storage_root/
├── manifests/   # Maps change IDs to batch IDs
│   └── {author_id}.msgpack
└── batches/     # Contains batched change data
    └── {batch_uuid}.msgpack
```

Storage paths support multiple protocols:
- `s3://endpoint/bucket/base_path/`
- `file://base_path/`
- `memory://base_path/`
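
A storage factory can dispatch on the scheme. Here is a minimal sketch, assuming placeholder backend types; none of these constructors are actual crate APIs:

```rust
// Hypothetical factory dispatching on the scheme of a storage path.
// The backend types are placeholders standing in for real implementations.
struct S3Storage;     // would wrap an S3 client
struct FileStorage;   // would wrap a local base directory
struct MemoryStorage; // would wrap an in-memory map (testing)

enum Storage {
    S3(S3Storage),
    File(FileStorage),
    Memory(MemoryStorage),
}

fn open_storage(path: &str) -> Result<Storage, String> {
    // Split "scheme://rest" into its two halves.
    let (scheme, _rest) = path
        .split_once("://")
        .ok_or_else(|| format!("missing scheme in {path}"))?;
    match scheme {
        "s3" => Ok(Storage::S3(S3Storage)),             // rest = endpoint/bucket/base_path
        "file" => Ok(Storage::File(FileStorage)),       // rest = base_path
        "memory" => Ok(Storage::Memory(MemoryStorage)), // rest = base_path
        other => Err(format!("unsupported storage scheme: {other}")),
    }
}
```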

# Sync Schema

## Data Structures

### Basic Change File Format

Stored in `changes/{change_uuid}.msgpack` using MessagePack binary format:

```json
{
  "change": { ... },
  "fields": [ ... ]
}
```

### Batching Format Structures

**Manifest File** (`manifests/{author_id}.msgpack`):
Maps change IDs to their corresponding batch IDs:
```json
{
  "change-001": "batch-uuid-1",
  "change-002": "batch-uuid-1",
  "change-003": "batch-uuid-2",
  ...
}
```
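
Since a manifest is just a map from change ID to batch ID, reading one is a single rmp-serde call. A minimal sketch, assuming string keys and values:

```rust
// Sketch: deserialize a manifest file into a change-ID -> batch-ID map.
// The String/String typing is an assumption based on the example above.
use std::collections::HashMap;

type Manifest = HashMap<String, String>;

fn load_manifest(bytes: &[u8]) -> Result<Manifest, rmp_serde::decode::Error> {
    rmp_serde::from_slice(bytes)
}
```

To fetch a single change, look up its batch ID here, read `batches/{batch_uuid}.msgpack`, and scan the decoded array for the matching change.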

**Batch File** (`batches/{batch_uuid}.msgpack`):
Contains an array of changes (same format as basic change files):
```json
[
  {
    "change": { ... },
    "fields": [ ... ]
  },
  {
    "change": { ... },
    "fields": [ ... ]
  },
  ...
]
```

Batches are automatically split when they exceed 100MB to prevent memory issues during sync operations.
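
The splitting amounts to a running size check while batching. A minimal sketch, reusing the hypothetical `RemoteChangeRecord` struct from above; the 100MB figure comes from the text, everything else is illustrative:

```rust
// Sketch of size-bounded batching: flush the current batch whenever adding
// the next record would push it past the limit. Per-record encoded sizes
// approximate the final array encoding.
const MAX_BATCH_BYTES: usize = 100 * 1024 * 1024;

fn split_into_batches(
    records: Vec<RemoteChangeRecord>,
) -> Result<Vec<Vec<RemoteChangeRecord>>, rmp_serde::encode::Error> {
    let mut batches: Vec<Vec<RemoteChangeRecord>> = Vec::new();
    let mut current: Vec<RemoteChangeRecord> = Vec::new();
    let mut current_bytes = 0usize;

    for record in records {
        let encoded_len = rmp_serde::to_vec_named(&record)?.len();
        // A single oversized record still gets a batch of its own.
        if !current.is_empty() && current_bytes + encoded_len > MAX_BATCH_BYTES {
            batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(record);
        current_bytes += encoded_len;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    Ok(batches)
}
```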