
Commit 46405fd

Update docs with sync batching changes
1 parent 3a1f9e9 commit 46405fd

File tree

2 files changed: +76 −10 lines


TODO.md

Lines changed: 0 additions & 3 deletions

````diff
@@ -6,6 +6,3 @@ No particular order yet, just dumping notes here.
 we always just perform a merge? So it's Entity -> save() -> changelog -> tables.
 This would ensure the changes are always handled in the same exact way whether
 it's sync or save.
-
-- I think it might ultimately be better after all to use manifest files, instead
-  of list. See https://www.backblaze.com/cloud-storage/transaction-pricing
````

docs/SYNC_ENGINE.md

Lines changed: 76 additions & 7 deletions
````diff
@@ -10,6 +10,15 @@ providing a global order across all replicas.
 are not on the remote, and pulls any that are on the remote and not local.
 5. After a sync, any new local changelogs are merged into the entity tables.
 
+## Changelog Abstraction
+
+The sync engine uses a `Changelog` trait that provides a unified interface for change tracking.
+There are multiple implementations:
+
+- **DbChangelog**: Reads/writes changes directly from/to the database tables
+- **BasicStorageChangelog**: Simple storage-based changelog (one file per change)
+- **BatchingStorageChangelog**: Optimized implementation that batches changes for efficiency
+
 ## Conflict Resolution
 
 The Sync Engine treats the database tables as grow only lists and the columns
````
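The docs don't show the trait's actual methods, so here is a minimal Rust sketch of the shape such an abstraction might take. The method names, the `ChangeRecord` fields, and the toy in-memory implementation are all illustrative assumptions, not the project's real API:

```rust
// Hypothetical sketch of a `Changelog`-style trait; names and
// signatures are assumptions for illustration only.

#[derive(Debug, Clone, PartialEq)]
struct ChangeRecord {
    id: String,
    author: String,
    payload: Vec<u8>,
}

trait Changelog {
    /// Record a new change in the log.
    fn push(&mut self, change: ChangeRecord);
    /// Return all changes recorded so far.
    fn changes(&self) -> Vec<ChangeRecord>;
}

/// A toy in-memory implementation, standing in for the database- and
/// storage-backed implementations listed above.
struct MemoryChangelog {
    entries: Vec<ChangeRecord>,
}

impl Changelog for MemoryChangelog {
    fn push(&mut self, change: ChangeRecord) {
        self.entries.push(change);
    }
    fn changes(&self) -> Vec<ChangeRecord> {
        self.entries.clone()
    }
}

fn main() {
    let mut log = MemoryChangelog { entries: Vec::new() };
    log.push(ChangeRecord {
        id: "change-001".into(),
        author: "replica-a".into(),
        payload: vec![1, 2, 3],
    });
    // The sync engine can be written against the trait, so any of the
    // implementations can be swapped in.
    assert_eq!(log.changes().len(), 1);
}
```

The point of the trait is that sync and save paths can share one code path regardless of whether changes live in database tables, one file per change, or batched files.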
````diff
@@ -18,23 +27,52 @@ be handled at the user level with tombstones.
 
 # Sync Storage
 
-Each replica stores one or more change files in the storage.
+The sync engine supports multiple storage implementations through the `SyncStorage` trait:
+- S3 and S3-compatible storage (primary target)
+- Local filesystem storage
+- In-memory storage (for testing)
+- Encrypted storage wrapper using age encryption
 
-Each change file contains a `RemoteChangeRecord` which combines:
+## Storage Formats
+
+### Basic Storage Format
+Each change is stored as an individual file containing a `RemoteChangeRecord` which combines:
 - One record from the ZV_CHANGE table (change metadata)
 - Associated records from ZV_CHANGE_FIELD table (individual field changes)
 
+### Batching Storage Format (Optimized)
+To improve sync performance and reduce memory usage for large datasets, the `BatchingStorageChangelog`
+implementation groups changes into batches:
+
+- **Manifests**: Map change IDs to batch IDs (one per author)
+- **Batches**: Contain the actual change data (limited to 100MB per batch)
+- Automatically splits large sync operations into manageable chunks
+- Prevents memory exhaustion when syncing large datasets
+
 
 ## Directory Structure
 
+### Basic Storage Layout
 ```
-s3://endpoint/bucket/base_path/ or
-file://base_path or
-memory://base_path
+storage_root/
 └── changes/
-    └── {change_uuid}.msgpack
+    └── {change_uuid}.msgpack
 ```
 
+### Batching Storage Layout
+```
+storage_root/
+├── manifests/           # Maps change IDs to batch IDs
+│   └── {author_id}.msgpack
+└── batches/             # Contains batched change data
+    └── {batch_uuid}.msgpack
+```
+
+Storage paths support multiple protocols:
+- `s3://endpoint/bucket/base_path/`
+- `file://base_path/`
+- `memory://base_path/`
+
 
 # Sync Schema
 
````
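As a rough illustration of how the two layouts map identifiers to object keys under the storage root, here is a sketch in Rust. The function names are hypothetical, not the project's actual API; only the path shapes come from the layout diagrams above:

```rust
// Sketch: object keys under the two storage layouts described in the
// docs. Function names are illustrative assumptions.

/// Basic layout: one file per change.
fn basic_change_key(change_uuid: &str) -> String {
    format!("changes/{}.msgpack", change_uuid)
}

/// Batching layout: one manifest per author...
fn manifest_key(author_id: &str) -> String {
    format!("manifests/{}.msgpack", author_id)
}

/// ...plus batch files holding the actual change data.
fn batch_key(batch_uuid: &str) -> String {
    format!("batches/{}.msgpack", batch_uuid)
}

fn main() {
    assert_eq!(basic_change_key("c0ffee"), "changes/c0ffee.msgpack");
    assert_eq!(manifest_key("replica-a"), "manifests/replica-a.msgpack");
    assert_eq!(batch_key("b-42"), "batches/b-42.msgpack");
}
```

These relative keys are then resolved against whichever protocol the storage root uses (`s3://`, `file://`, or `memory://`), which is what lets the `SyncStorage` trait swap backends without changing the layout.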
````diff
@@ -68,7 +106,7 @@ CREATE TABLE IF NOT EXISTS ZV_CHANGE_FIELD (
 
 ## Data Structures
 
-### Change File Format
+### Basic Change File Format
 
 Stored in `changes/{change_uuid}.msgpack` using MessagePack binary format:
 
````
````diff
@@ -98,3 +136,34 @@ Stored in `changes/{change_uuid}.msgpack` using MessagePack binary format:
   ]
 }
 ```
+
+### Batching Format Structures
+
+**Manifest File** (`manifests/{author_id}.msgpack`):
+Maps change IDs to their corresponding batch IDs:
+```json
+{
+  "change-001": "batch-uuid-1",
+  "change-002": "batch-uuid-1",
+  "change-003": "batch-uuid-2",
+  ...
+}
+```
+
+**Batch File** (`batches/{batch_uuid}.msgpack`):
+Contains an array of changes (same format as basic change files):
+```json
+[
+  {
+    "change": { ... },
+    "fields": [ ... ]
+  },
+  {
+    "change": { ... },
+    "fields": [ ... ]
+  },
+  ...
+]
+```
+
+Batches are automatically split when they exceed 100MB to prevent memory issues during sync operations.
````
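The splitting step described above can be sketched as follows. The 100MB limit is from the docs; the function name and approximating batch size by raw payload length are illustrative assumptions (the real implementation presumably measures serialized MessagePack size):

```rust
// Sketch of the batching strategy: changes accumulate in the current
// batch until adding another would exceed the size limit, at which
// point a new batch is started. Names here are hypothetical.

const MAX_BATCH_BYTES: usize = 100 * 1024 * 1024; // 100MB per batch, per the docs

struct Change {
    id: String,
    payload: Vec<u8>,
}

/// Split a stream of changes into batches no larger than `max_bytes`,
/// approximating batch size by summed payload length.
fn split_into_batches(changes: Vec<Change>, max_bytes: usize) -> Vec<Vec<Change>> {
    let mut batches: Vec<Vec<Change>> = Vec::new();
    let mut current: Vec<Change> = Vec::new();
    let mut current_size = 0usize;

    for change in changes {
        let size = change.payload.len();
        // Start a new batch if this change would push the current one
        // over the limit (but never emit an empty batch).
        if !current.is_empty() && current_size + size > max_bytes {
            batches.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += size;
        current.push(change);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}

fn main() {
    // Three 40-byte changes with a 100-byte demo limit: the first two
    // fit in one batch (80 bytes), the third spills into a second.
    let changes: Vec<Change> = (0..3)
        .map(|i| Change { id: format!("change-{i}"), payload: vec![0u8; 40] })
        .collect();
    let batches = split_into_batches(changes, 100);
    assert_eq!(batches.len(), 2);
    assert_eq!(batches[0].len(), 2);
    assert_eq!(batches[1].len(), 1);
}
```

Capping batch size this way is what lets a large sync stream and apply changes a batch at a time instead of holding every change in memory at once.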
