Skip to content

Commit b5f00f1

Browse files
chore: added wal/snapshot doc
1 parent 244ac56 commit b5f00f1

File tree

1 file changed

+214
-0
lines changed

1 file changed

+214
-0
lines changed

influxdb3_wal/wal.md

+214
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,214 @@
1+
# WAL (write-ahead-log) and Snapshot process
2+
3+
All writes get buffered into a wal buffer and then every flush interval all the contents are flushed to disk at which point the writes become durable and
4+
the clients are sent the confirmation that their writes are successful. At each flush interval there is also a check to see if queryable buffer need to
5+
evict data to disk, this process is called snapshotting. The rest of this doc discusses all the moving parts in wal flush and snapshotting.
6+
7+
8+
## Overview
9+
10+
```
11+
12+
13+
14+
┌────────────┐ ┌────────────┐
15+
│flush buffer│──────────►│ wal buffer │
16+
└────────────┘ └────────────┘
17+
18+
19+
(takes everything from wal buffer and writes to wal file)
20+
21+
│ (actual background_wal_flush uses wal trait - which wal obj store impls)
22+
│ ┌──────────────(background create)───────────────────────────┐
23+
│ │ ▼
24+
┌─────┴──────┐ │ ┌────────┐
25+
┌────►│wal objstore├─┴──(remove wal,need each wal file num / drop semaphore)─►│wal file│
26+
│ └─────┬──────┘ └────────┘
27+
│ │
28+
│ │
29+
│ │
30+
│ │
31+
│ │
32+
│ │
33+
│ (notifies wal
34+
│ and snapshot optionally & snapshot semaphore taken) ┌─────────────┐
35+
│ │ ┌───────────────►│update buffer│ (whatever wal flushed into chunks)
36+
│ ▼ │ └─────────────┘
37+
│ ┌────────────┐ │ ┌─────────────┐
38+
│ │query buffer├───────────────────────┼───────────────►│parquet file │
39+
│ └─────┬──────┘ │ └─────────────┘
40+
│ │ │
41+
│ (run snapshot) │ ┌─────────────┐
42+
└───────────┘ ├───────────────►│snapshot file│ (holds last wal seq number)
43+
│ └─────────────┘
44+
│ ┌────────────┐
45+
└───────────────►│clear buffer│ (whatever snapshotted is removed)
46+
└────────────┘
47+
```
48+
49+
50+
#### Steps
51+
52+
1. When _writes_ comes in, they go into a write batch in wal buffer. These batches are held per database and the batches keep track of min
53+
and max times within each batch. These batches further hold per table chunks. This chunk is created by taking incoming rows and pinning
54+
them to a period. It is done by `t - (t % gen_1_duration)`. If `gen_1_duration` is 10 mins, then all of the rows will be divided into
55+
10 min chunks. As an example if there are rows for 10.29 and 10.35 then they both go into 2 separate chunks (10.20 and 10.30). And this
56+
10.20 and 10.30 are used later as the key in queryable buffer.
57+
2. Every flush interval, the wal buffer is flushed and all batches are written to to wal file (converts to wal content and gets min/max
58+
times from all batches) and every time wal file contents are written, we add wal period to snapshot tracker that tracks min/max times across
59+
batches that went into single wal file
60+
3. Then snapshot tracker is checked to see if there are enough wal periods to trigger snapshot. There are different conditions checked, the
61+
common scenarios are,
62+
- wal periods > (1.5 * snapshot size). e.g default settings will lead to snapshot size = 600, if we have 900 wal periods trigger snapshot
63+
- force snapshot, irrespective of sizes go ahead and run snapshot
64+
65+
If going ahead with force snapshotting, pick all the wal periods in the tracker and find the max time from most recent wal period. This will be
66+
used as the `end_time_marker` to evict data from query buffer. Because forcing a snapshot can be triggered when wal buffer is empty (even though
67+
queryable buffer is full), we need to add `Noop` (a no-op WalOp) to the wal file to hold the snapshot details in wal file.
68+
69+
```
70+
71+
Snapshot (all, emptying wal periods in tracker)
72+
▲ ▲
73+
│◄───wal in snapshot───────►│
74+
┌──────┬──────┬──────┬──────┤
75+
│ 0 │ 1 │ 2 │ 3 │
76+
│ │ │ │ │
77+
────┴──────┴──────┴──────┴──────┴─────► (time)
78+
79+
```
80+
81+
If it is a normal snapshot, then leave one wal period (`3` in eg below) and pick the last one (`2` in eg below) max time used as `end_time_marker`
82+
and every wal period with max time is less than this `end_time_marker` will be removed from snapshot tracker.
83+
84+
```
85+
Snapshot (wal max < last wal period max,
86+
3 is max here that is left behind)
87+
▲ ▲ ▲
88+
│◄─ wal in snapshot─►│ │
89+
┌──────┬──────┬──────┬──────┤
90+
│ 0 │ 1 │ 2 │ 3 │
91+
│ │ │ │ │
92+
────┴──────┴──────┴──────┴──────┴─────────────► (time)
93+
94+
```
95+
At this point we may or may not have to snapshot, but the wal buffer will still need to be emptied into the wal file. The clients that have been
96+
writing data are held up at most of flush interval time (defaults to 1s). The clients get notified together that the writes have been successful
97+
when writes are persisted to wal file.
98+
99+
4. Then query buffer is notified to update the buffer, and query buffer does the following things in sequence,
100+
- it updates the buffer with incoming rows
101+
- writes to parquet file,
102+
- writes snapshot summary file
103+
- clears the buffer (using the `end_time_marker` that has been passed along)
104+
105+
It is useful to visualize how query buffer holds data internally to understand how the buffer is cleared.
106+
- query buffer holds data as mapping between chunk time and MutableTableChunks (these are col id -> arrow array builder mappings with
107+
min/max times for that chunk), looks roughly like below
108+
```
109+
110+
111+
├───10.20───────────►┌────────────────────────────┐
112+
│ │ chunk 10.20 - 10.29 │
113+
│ │ │
114+
├───10.30───────────►├────────────────────────────┤
115+
│ │ chunk 10.30 - 10.39 │
116+
│ │ │
117+
├───10.40───────────►└────────────────────────────┘
118+
119+
120+
time
121+
122+
```
123+
The crucial thing to note here is, `10.20` and `10.30` are the keys and they're taken from the write batches that have been derived
124+
from newly added rows. These are added to query buffer which manages an arrow backed buffer that holds all the newly added rows.
125+
126+
- Above mutable chunks are added each time a wal flush happens, when snapshotting it uses the `end_time_marker` to evict data. Say,
127+
10.30 is the `end_time_marker` to evict data from queryable buffer, then query buffer evicts all data before 10.30 and holds it as
128+
snapshot chunk which has converted the arrow arrays to a record batch that's ready to be passed to sort/dedupe before writing to
129+
parquet file. Then a summary snapshot file is written which tracks what wal file number was last snapshotted along with pointers to
130+
parquet files created. This data is updated in persisted files so that any new query which spans that time can find data from the
131+
parquet files as buffer will not have them anymore.
132+
133+
5. Once snapshotting process is complete, now deletes all the wal files up to a configured number of wal files to retain. When replaying
134+
the last snapshotted file is looked up from snapshot summary file that has been created so none of the wal files that've already been
135+
snapshotted is loaded into the buffer.
136+
6. If server crashed and restarted wal replay happens. But before that all snapshots are loaded, but even though snapshot file is written
137+
out it doesn't guarantee that wal files relevant to snapshot has been removed.
138+
139+
## Example
140+
141+
Below section walks through some of the nuances touched in the overall process described above
142+
143+
- These are the wal files (1-4) with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods
144+
```
145+
1[20 - 50]
146+
2[31 - 70]
147+
3[51 - 90]
148+
4[45 - 110] - snapshot! <= 2 (wal file num), <= 70 (end_time_marker). snapshot details has a file number and end_time_marker.
149+
```
150+
151+
- This is a mapping between queryable buffer's chunk time and what wal files the data comes from. This is not how queryable buffer
152+
maps the data internally, it doesn't know about wal files - however this mapping helps visualise the dependency with wal file better.
153+
The data for time slice 20 originates purely from wal file 1, for 40 it holds rows from wal files 1, 2 and 4 etc.
154+
```
155+
20 - 1
156+
30 - 1
157+
40 - 1, 2, 4
158+
50 - 2, 4
159+
60 - 2, 3, 4
160+
70 - 2, 3, 4
161+
80 - 3, 4
162+
90 - 3, 4
163+
```
164+
165+
- At this point, lets say we need to snapshot as per the details above (`snapshot! <= 2 (wal file num), <= 70 (end_time_marker)`) file is written out
166+
& parquet files removing all items in queryable buffer upto time 70. This is the left over in queryable buffer -
167+
mainly the data that was in wal files 3, 4 are now in parquet files already.
168+
```
169+
80 - 3, 4
170+
90 - 3, 4
171+
```
172+
173+
- If there's been a restart at this point data is loaded from wal files (due to restart), times 60 and 70 will be loaded into memory. so it may look like this,
174+
```
175+
60 - 3, 4
176+
70 - 3, 4
177+
80 - 3, 4
178+
90 - 3, 4
179+
```
180+
181+
- And if further snapshot is kicked off by replaying earlier snapshot from wal file 4, because the `end_time_marker`(=70) is still there in wal file
182+
we'd end up removing it from the query buffer leaving the query buffer in state as expected.
183+
```
184+
80 - 3, 4
185+
90 - 3, 4
186+
```
187+
188+
Say instead we force snapshotted instead of normal snapshotting,
189+
190+
- Everything will be removed from query buffer. These are the wal files with data for different time ranges [20 - 50] etc. Notice data can be overlapping between time periods
191+
```
192+
1[20 - 50]
193+
2[31 - 70]
194+
3[51 - 90]
195+
4[45 - 110] - snapshot! <= 4 (wal file num), <= 110 (end_time_marker)
196+
```
197+
198+
- And query buffer looks like,
199+
```
200+
20 - 1
201+
30 - 1
202+
40 - 1, 2, 4
203+
50 - 2, 4
204+
60 - 2, 3, 4
205+
70 - 2, 3, 4
206+
80 - 3, 4
207+
90 - 3, 4
208+
100 - 4
209+
110 - 4
210+
```
211+
212+
- At this point, snapshot file is written out & parquet files removing all items in queryable buffer upto time 110. So, all files 1, 2, 3, 4 are ready for deletion
213+
- There should be nothing in queryable buffer as end time marker is 110 and if there's been a restart at this point there are no wal files to load and everything is in parquet files (snapshotted)
214+

0 commit comments

Comments
 (0)