# WAL (write-ahead log) and snapshot process

All writes are buffered into a wal buffer, and at every flush interval the buffer's contents are flushed to disk. At that point the
writes become durable and the clients are sent confirmation that their writes were successful. At each flush interval there is also a
check to see whether the queryable buffer needs to evict data to disk; this process is called snapshotting. The rest of this doc
discusses all the moving parts in wal flush and snapshotting.

## Overview

```
      ┌────────────┐           ┌────────────┐
      │flush buffer│──────────►│ wal buffer │
      └────────────┘           └────────────┘
            ▲
            │
  (takes everything from wal buffer and writes to wal file)
            │
            │   (actual background_wal_flush uses wal trait - which wal obj store impls)
            │        ┌──────────────(background create)────────────────────────────┐
            │        │                                                             ▼
      ┌─────┴──────┐ │                                                         ┌────────┐
┌────►│wal objstore├─┴──(remove wal, need each wal file num / drop semaphore)─►│wal file│
│     └─────┬──────┘                                                           └────────┘
│           │
│           │
│           │
│           │
│  (notifies wal
│  and snapshot optionally & snapshot semaphore taken) ┌─────────────┐
│           │                         ┌───────────────►│update buffer│  (whatever wal flushed into chunks)
│           ▼                         │                └─────────────┘
│     ┌────────────┐                  │                ┌─────────────┐
│     │query buffer├──────────────────┼───────────────►│parquet file │
│     └─────┬──────┘                  │                └─────────────┘
│           │                         │
│     (run snapshot)                  │                ┌─────────────┐
└───────────┘                         ├───────────────►│snapshot file│  (holds last wal seq number)
                                      │                └─────────────┘
                                      │                ┌────────────┐
                                      └───────────────►│clear buffer│  (whatever snapshotted is removed)
                                                       └────────────┘
```
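
The flush loop that drives all of this is, at its core, an interval timer: drain the wal buffer, persist a wal file, notify the
queryable buffer (and optionally the snapshot tracker), and only then release the waiting clients. Below is a minimal, std-only
sketch of that shape; the names (`BufferedWrite`, the ack channel) are illustrative and not the real `Wal` trait API:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// A buffered write plus the channel used to tell the waiting client that its
// write is now durable.
struct BufferedWrite {
    payload: String,
    ack: Sender<()>,
}

fn main() {
    let wal_buffer: Arc<Mutex<Vec<BufferedWrite>>> = Arc::new(Mutex::new(Vec::new()));
    let flush_interval = Duration::from_millis(100); // the real default is 1s

    // Background flush loop: every interval, drain the buffer, "persist" a wal file,
    // then notify every client whose write went into that file.
    let flusher_buffer = Arc::clone(&wal_buffer);
    thread::spawn(move || loop {
        thread::sleep(flush_interval);
        let drained: Vec<BufferedWrite> = flusher_buffer.lock().unwrap().drain(..).collect();
        if drained.is_empty() {
            continue;
        }
        // Stand-in for writing the wal file to object store and notifying the
        // queryable buffer / snapshot tracker.
        println!("flushed wal file with {} writes", drained.len());
        for write in drained {
            println!("  persisted: {}", write.payload);
            let _ = write.ack.send(()); // the client unblocks here
        }
    });

    // A client write: buffer it, then block until the flush loop confirms durability.
    let (ack_tx, ack_rx): (Sender<()>, Receiver<()>) = channel();
    wal_buffer.lock().unwrap().push(BufferedWrite {
        payload: "cpu,host=a usage=0.5".to_string(),
        ack: ack_tx,
    });
    ack_rx.recv().expect("flush loop confirms the write");
    println!("write acknowledged as durable");
}
```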

#### Steps

1. When _writes_ come in, they go into a write batch in the wal buffer. These batches are held per database and keep track of the min
   and max times within each batch. Each batch further holds per-table chunks. A chunk is created by taking the incoming rows and
   pinning them to a period, computed as `t - (t % gen_1_duration)`. If `gen_1_duration` is 10 mins, all rows are divided into 10 min
   chunks. For example, rows at 10.29 and 10.35 go into 2 separate chunks (10.20 and 10.30), and these chunk times (10.20 and 10.30)
   are later used as the keys in the queryable buffer (see the sketch below).
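
   A minimal sketch of this bucketing; the helper names, the `main`, and the use of nanosecond timestamps are illustrative only and
   not the actual functions in the codebase:

   ```rust
   /// Pin a row timestamp to the start of its generation-1 period.
   fn chunk_time(t: i64, gen_1_duration: i64) -> i64 {
       t - (t % gen_1_duration)
   }

   fn main() {
       let gen_1_duration = 10 * 60 * 1_000_000_000_i64; // 10 minutes, in nanoseconds
       let to_ns = |seconds: i64| seconds * 1_000_000_000;

       // 10:29 and 10:35 expressed as seconds since midnight, then converted to nanoseconds.
       let row_a = to_ns(10 * 3600 + 29 * 60);
       let row_b = to_ns(10 * 3600 + 35 * 60);

       let chunk_a = chunk_time(row_a, gen_1_duration);
       let chunk_b = chunk_time(row_b, gen_1_duration);
       assert_ne!(chunk_a, chunk_b); // the two rows land in two separate chunks

       // Prints "10:20 10:30" - the chunk times later used as keys in the queryable buffer.
       println!(
           "{}:{:02} {}:{:02}",
           chunk_a / to_ns(3600),
           chunk_a % to_ns(3600) / to_ns(60),
           chunk_b / to_ns(3600),
           chunk_b % to_ns(3600) / to_ns(60)
       );
   }
   ```
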
2. Every flush interval, the wal buffer is flushed and all batches are written to a wal file (the batches are converted to wal
   contents, which carry the min/max times across all batches). Every time wal file contents are written, a wal period is added to the
   snapshot tracker; a wal period tracks the min/max times across the batches that went into a single wal file.
3. Then the snapshot tracker is checked to see whether there are enough wal periods to trigger a snapshot. There are different
   conditions checked (sketched at the end of this step); the common scenarios are:
   - wal periods > (1.5 * snapshot size), e.g. default settings lead to a snapshot size of 600, so 900 wal periods trigger a snapshot
   - force snapshot: run the snapshot irrespective of sizes

   If going ahead with a force snapshot, all the wal periods in the tracker are taken and the max time of the most recent wal period
   is used as the `end_time_marker` to evict data from the query buffer. Because a forced snapshot can be triggered when the wal
   buffer is empty (even though the queryable buffer is full), a `Noop` (a no-op `WalOp`) is added to the wal file so the snapshot
   details can still be held in a wal file.

   ```
        Snapshot (all, emptying wal periods in tracker)
        ▲                           ▲
        │◄──────wal in snapshot────►│
        ┌──────┬──────┬──────┬──────┤
        │  0   │  1   │  2   │  3   │
        │      │      │      │      │
     ───┴──────┴──────┴──────┴──────┴─────► (time)
   ```

   If it is a normal snapshot, the most recent wal period (`3` in the example below) is left behind and the max time of the wal period
   before it (`2` in the example below) is used as the `end_time_marker`; every wal period whose max time is less than this
   `end_time_marker` is removed from the snapshot tracker.

   ```
        Snapshot (wal max < last wal period max,
                  3 is max here that is left behind)
        ▲                    ▲      ▲
        │◄─ wal in snapshot─►│      │
        ┌──────┬──────┬──────┬──────┤
        │  0   │  1   │  2   │  3   │
        │      │      │      │      │
     ───┴──────┴──────┴──────┴──────┴─────────────► (time)
   ```
   At this point we may or may not have to snapshot, but the wal buffer still needs to be emptied into the wal file. The clients that
   have been writing data are held up for at most the flush interval (defaults to 1s), and they are all notified together that their
   writes were successful once the writes have been persisted to the wal file.

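   A sketch of the trigger check and `end_time_marker` selection described in this step. The type and field names are hypothetical
   stand-ins for the real snapshot tracker types, and the wal periods in `main` reuse the numbers from the example later in this doc:

   ```rust
   // Hypothetical shapes; the real tracker and snapshot details carry more state.
   #[derive(Debug, Clone, Copy)]
   struct WalPeriod {
       wal_file_number: u64,
       max_time: i64,
   }

   #[derive(Debug, Clone, Copy)]
   struct SnapshotDetails {
       last_wal_file_number: u64, // data up to and including this wal file is covered
       end_time_marker: i64,      // the queryable buffer evicts data older than this
   }

   fn maybe_snapshot(
       periods: &mut Vec<WalPeriod>,
       snapshot_size: usize,
       force: bool,
   ) -> Option<SnapshotDetails> {
       if force {
           // Force snapshot: take every tracked period; the marker is the max time
           // of the most recent wal period, and the tracker is emptied.
           let last = *periods.last()?;
           periods.clear();
           return Some(SnapshotDetails {
               last_wal_file_number: last.wal_file_number,
               end_time_marker: last.max_time,
           });
       }
       // Normal snapshot: only once enough wal periods have accumulated.
       if periods.len() < 2 || (periods.len() as f64) <= 1.5 * snapshot_size as f64 {
           return None;
       }
       // Leave the most recent period behind; the one before it supplies the marker.
       let covered = periods[periods.len() - 2];
       // Periods whose max time is below the marker are dropped from the tracker.
       periods.retain(|p| p.max_time >= covered.max_time);
       Some(SnapshotDetails {
           last_wal_file_number: covered.wal_file_number,
           end_time_marker: covered.max_time,
       })
   }

   fn main() {
       let mut periods = vec![
           WalPeriod { wal_file_number: 1, max_time: 50 },
           WalPeriod { wal_file_number: 2, max_time: 70 },
           WalPeriod { wal_file_number: 3, max_time: 90 },
           WalPeriod { wal_file_number: 4, max_time: 110 },
       ];
       // Force snapshot: covers wal file 4, end_time_marker 110, tracker emptied.
       let details = maybe_snapshot(&mut periods, 600, true).unwrap();
       println!("{details:?}, {} wal periods left in tracker", periods.len());
   }
   ```
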
4. Then the query buffer is notified to update itself, and it does the following things in sequence:
   - updates the buffer with the incoming rows
   - writes parquet files
   - writes the snapshot summary file
   - clears the buffer (using the `end_time_marker` that has been passed along)

   It is useful to visualize how the query buffer holds data internally to understand how the buffer is cleared.
   - the query buffer holds data as a mapping between chunk time and MutableTableChunks (these are col id -> arrow array builder
     mappings with min/max times for that chunk), which looks roughly like below
     ```
       │
       ├───10.20───────────►┌────────────────────────────┐
       │                    │    chunk 10.20 - 10.29     │
       │                    │                            │
       ├───10.30───────────►├────────────────────────────┤
       │                    │    chunk 10.30 - 10.39     │
       │                    │                            │
       ├───10.40───────────►└────────────────────────────┘
       │
       ▼
      time
     ```
     The crucial thing to note here is that `10.20` and `10.30` are the keys, and they are taken from the write batches derived from
     the newly added rows. These are added to the query buffer, which manages an arrow-backed buffer holding all the newly added rows.

   - the mutable chunks above are added on each wal flush; when snapshotting, the `end_time_marker` is used to evict data. Say 10.30
     is the `end_time_marker`: the query buffer evicts all data before 10.30 and holds it as a snapshot chunk, which has converted the
     arrow arrays into a record batch that is ready to be sorted/deduped before being written to a parquet file. Then a snapshot
     summary file is written which tracks the wal file number that was last snapshotted, along with pointers to the parquet files
     created. This data is also updated in the persisted files so that any new query which spans that time can find the data in the
     parquet files, as the buffer will no longer have it. A sketch of this eviction follows below.

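   A minimal sketch of that chunk-time keyed layout and the eviction split. A plain `BTreeMap` with row strings stands in for the
   real arrow-builder backed MutableTableChunks, and chunk times are written as minutes since midnight for readability:

   ```rust
   use std::collections::BTreeMap;

   fn main() {
       // Chunk time -> buffered rows for that chunk. The real buffer keys
       // MutableTableChunks (col id -> arrow array builders) the same way.
       let mut buffer: BTreeMap<i64, Vec<&str>> = BTreeMap::new();
       buffer.entry(620).or_default().push("row at 10.29"); // chunk 10.20
       buffer.entry(630).or_default().push("row at 10.35"); // chunk 10.30
       buffer.entry(640).or_default().push("row at 10.41"); // chunk 10.40

       // Snapshot with end_time_marker = 10.30: split_off keeps everything at or
       // after the marker in memory; what remains (strictly before the marker) is
       // the evicted part that gets sorted/deduped and written to parquet.
       let end_time_marker = 630;
       let keep = buffer.split_off(&end_time_marker);
       let evicted = buffer; // only the 10.20 chunk

       println!("evicted chunk times: {:?}", evicted.keys().collect::<Vec<_>>());
       println!("still buffered:      {:?}", keep.keys().collect::<Vec<_>>());
   }
   ```
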
5. Once the snapshotting process is complete, wal files are deleted, keeping only a configured number of wal files to retain. When
   replaying, the last snapshotted wal file is looked up from the snapshot summary file that was written, so none of the wal files
   that have already been snapshotted are loaded into the buffer.
6. If the server crashed and restarted, wal replay happens. Before that, all snapshots are loaded; however, even though a snapshot
   file has been written out, that does not guarantee that the wal files covered by the snapshot have been removed.

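Steps 5 and 6 boil down to a filter on wal file numbers at replay time: anything at or below the last snapshotted wal file number is
skipped. A sketch with hypothetical types (the real snapshot summary records more than just this number):

```rust
// Hypothetical shape for the replay-time decision in steps 5 and 6.
struct SnapshotSummary {
    last_snapshotted_wal_file: u64,
}

/// Decide which wal files to replay. `existing` is every wal file still present
/// in object store; deletion after a snapshot is not guaranteed to have happened
/// before a crash, so overlap with the snapshot is expected.
fn wal_files_to_replay(existing: &[u64], summary: Option<&SnapshotSummary>) -> Vec<u64> {
    let cutoff = summary.map(|s| s.last_snapshotted_wal_file).unwrap_or(0);
    existing
        .iter()
        .copied()
        .filter(|&n| n > cutoff) // anything already snapshotted is skipped
        .collect()
}

fn main() {
    // The snapshot covered wal files 1 and 2, but they had not been deleted yet
    // when the server stopped: only files 3 and 4 are replayed into the buffer.
    let summary = SnapshotSummary { last_snapshotted_wal_file: 2 };
    let existing = [1, 2, 3, 4];
    let to_replay = wal_files_to_replay(&existing, Some(&summary));
    assert_eq!(to_replay, vec![3, 4]);
    println!("replaying wal files {to_replay:?}");
}
```
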
## Example

This section walks through some of the nuances touched on in the process described above.

- These are the wal files (1-4), each holding data for a time range such as [20 - 50]. Notice that data can overlap between time periods.
```
1[20 - 50]
2[31 - 70]
3[51 - 90]
4[45 - 110] - snapshot! <= 2 (wal file num), <= 70 (end_time_marker). snapshot details hold a wal file number and an end_time_marker.
```

- This is a mapping between the queryable buffer's chunk times and the wal files the data comes from. This is not how the queryable
  buffer maps the data internally (it doesn't know about wal files), but the mapping helps visualise the dependency on wal files.
  The data for time slice 20 originates purely from wal file 1, slice 40 holds rows from wal files 1, 2 and 4, and so on.
```
20 - 1
30 - 1
40 - 1, 2, 4
50 - 2, 4
60 - 2, 3, 4
70 - 2, 3, 4
80 - 3, 4
90 - 3, 4
```

- At this point, let's say we need to snapshot as per the details above (`snapshot! <= 2 (wal file num), <= 70 (end_time_marker)`).
  The snapshot file and parquet files are written out, removing all items in the queryable buffer up to time 70. This is what is left
  over in the queryable buffer - note that the data from wal files 3 and 4 up to time 70 is already in parquet files.
```
80 - 3, 4
90 - 3, 4
```

- If there's been a restart at this point, data is loaded back from the wal files, so times 60 and 70 end up in memory again and the
  buffer may look like this,
```
60 - 3, 4
70 - 3, 4
80 - 3, 4
90 - 3, 4
```

- And if a further snapshot is kicked off by replaying the earlier snapshot from wal file 4, because the `end_time_marker` (=70) is
  still there in the wal file, times 60 and 70 are removed from the query buffer again, leaving it in the expected state.
```
80 - 3, 4
90 - 3, 4
```

Say instead we had force snapshotted rather than doing a normal snapshot:

- Everything will be removed from the query buffer. These are the same wal files as before, but now the snapshot covers everything up
  to wal file 4:
```
1[20 - 50]
2[31 - 70]
3[51 - 90]
4[45 - 110] - snapshot! <= 4 (wal file num), <= 110 (end_time_marker)
```

- And the query buffer looks like this,
```
20 - 1
30 - 1
40 - 1, 2, 4
50 - 2, 4
60 - 2, 3, 4
70 - 2, 3, 4
80 - 3, 4
90 - 3, 4
100 - 4
110 - 4
```

- At this point, the snapshot file and parquet files are written out, removing all items in the queryable buffer up to time 110. So
  all wal files 1, 2, 3 and 4 are ready for deletion.
- There should be nothing left in the queryable buffer since the end time marker is 110, and if there's been a restart at this point
  there are no wal files to load; everything is already in parquet files (snapshotted).