vignettes/data_review.Rmd
Lines changed: 18 additions & 18 deletions
@@ -146,7 +146,7 @@ If we take a hypothetical "xyz" domain, `dv.listings` will store the following f
146
146
- m `tracked_vars` variable types (see "Variable type encoding" below)
147
147
- 1 row count
148
148
- p (1 per "xyz" row) `hash_id(xyz[id_vars])`
149
-
- p (1 per "xyz" row, *m* bytes long) `hash_tracked(xyz[tracked_vars])`
149
+
- p (1 per "xyz" row, 2\*m bytes long) `hash_tracked(xyz[tracked_vars])`
150
150
</details>
151
151
152
152
<details><summary>`xyz_001.delta` (one per domain dataset update)</summary>
@@ -158,10 +158,10 @@ If we take a hypothetical "xyz" domain, `dv.listings` will store the following f
158
158
- 1 domain string ("xyz")
159
159
- 1 count of new rows
160
160
- n (1 per *new* "xyz" row) `hash_id(xyz[id_vars])`
161
-
- n (1 per *new* "xyz" row, *m* bytes long) `hash_tracked(xyz[tracked_vars])`
161
+
- n (1 per *new* "xyz" row, 2\*m bytes long) `hash_tracked(xyz[tracked_vars])`
162
162
- 1 count of modified rows
163
163
- p (1 per *modified* "xyz" row) row index
164
-
- p (1 per *modified* "xyz" row, *m* bytes long) `hash_tracked(xyz[tracked_vars])`
164
+
- p (1 per *modified* "xyz" row, 2\*m bytes long) `hash_tracked(xyz[tracked_vars])`
165
165
</details>
166
166
167
167
<details><summary>`xyz_<ROLE>.review` (one per domain and ROLE)</summary>
@@ -180,7 +180,7 @@ If we take a hypothetical "xyz" domain, `dv.listings` will store the following f
180
180
181
181
(Row indices refer to indices in the stored base+delta matrix, which is append-only. These _canonical_ indices are as good as identifiers).
182
182
183
-
The dominant factor governing the size of these files is the length of a hash, which is 16 bytes, as we discuss in the "Hashing" session below. Row indices and delta timestamps can be encoded in 4 bytes. Review indices take up 1 byte each. Estimating an upper bound of 1 million rows per dataset, a `.base` file would take around 32 MiB. A comprehensive `.review` file for such a dataset would take around 9 MiB.
183
+
The dominant factor governing the size of these files is the length of a hash, which is 16 bytes in the case of `hash_id` and 2\*m bytes (*m* being the number of tracked columns) in the case of `hash_tracked`, as we discuss in the "Hashing" section below. Row indices and delta timestamps can be encoded in 4 bytes. Review indices take up 1 byte each. Estimating an upper bound of 1 million rows per dataset, a `.base` file tracking 8 variables would take around 32 MiB. A comprehensive `.review` file for such a dataset would take around 9 MiB.
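The `.base` estimate above can be reproduced with a few lines of arithmetic. This is a minimal Python sketch, not part of `dv.listings`; the function and constant names are made up for illustration, and the short file header is ignored because it is negligible at this scale.

```python
HASH_ID_BYTES = 16         # length of one hash_id value, as stated in the text
BYTES_PER_TRACKED_VAR = 2  # hash_tracked takes 2*m bytes for m tracked variables


def base_payload_bytes(rows: int, tracked_vars: int) -> int:
    """Per-row hash payload of a .base file (hypothetical helper, header ignored)."""
    per_row = HASH_ID_BYTES + BYTES_PER_TRACKED_VAR * tracked_vars
    return rows * per_row


size = base_payload_bytes(rows=1_000_000, tracked_vars=8)
print(f"{size} bytes, {size / 2**20:.1f} MiB")
```

With 8 tracked variables each row costs 16 + 16 = 32 bytes, so one million rows come to 32,000,000 bytes (about 30.5 MiB), which agrees with the "around 32 MiB" figure.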
184
184
185
185
These file structures are designed so that they start with a short heterogeneous header that reiterates the information that can be gleaned from the file name. The rest of the records are all homogeneous and of known size. That allows them to be loaded into memory without the need for expensive parsing.
186
186
@@ -232,36 +232,36 @@ We opt instead for 128 bits, which makes the possibility of a collision extremel
232
232
#### Hashing of tracked variables (`hash_tracked()`)
233
233
We could apply the same reasoning behind the choice of the `hash_id()` function to the hashing of the variable parts of each row. We instead propose a more complex hashing scheme to provide partial information about *which variables of a row have been altered* when its hash changes.
234
234
235
-
Each hash value is *m* bytes long, where *m* is the number of variables tracked of a given dataset. Each of those bytes is an independent hash of three of the tracked variables of a dataset row. Each variable, in turn, contributes to three of the *m* byte-sized hashes. This mixing of variables makes it harder for an external adversarial observer of the `.base` and `.delta` files to brute-force the original values of the dataset by looking for collisions with the computed hash values.
235
+
Each hash value is 2\*m bytes long, where *m* is the number of tracked variables of a given dataset. Each of the *m* byte pairs is an independent hash of three of the tracked variables of a dataset row. Each variable, in turn, contributes to three of the *m* two-byte hashes. This mixing of variables makes it harder for an external adversarial observer of the `.base` and `.delta` files to brute-force the original values of the dataset by looking for collisions with the computed hash values.
236
236
237
-
To compute which variables contribute to which hash byte, we use the following scheme:
237
+
To compute which variables contribute to which hash byte pair, we use the following scheme:
238
238
239
-
- Byte *n*: Variables (*n*+0)%*n*, (*n*+2)%*n* and (*n*+3)%*n*
239
+
- Byte pair *n*: Variables (*n*+0)%*m*, (*n*+2)%*m* and (*n*+3)%*m*
240
240
241
241
Where `%` indicates the remainder of the integer division.
242
242
243
243
So, for an input dataset with seven tracked variables (zero through six), this would mean:
244
244
245
-
- Byte 0: Variables 0, 2 and 3
246
-
- Byte 1: Variables 1, 3 and 4
247
-
- Byte 2: Variables 2, 4 and 5
248
-
- Byte 3: Variables 3, 5 and 6
249
-
- Byte 4: Variables 4, 6 and 0
250
-
- Byte 5: Variables 5, 0 and 1
251
-
- Byte 6: Variables 6, 1 and 2
245
+
- Byte pair 0: Variables 0, 2 and 3
246
+
- Byte pair 1: Variables 1, 3 and 4
247
+
- Byte pair 2: Variables 2, 4 and 5
248
+
- Byte pair 3: Variables 3, 5 and 6
249
+
- Byte pair 4: Variables 4, 6 and 0
250
+
- Byte pair 5: Variables 5, 0 and 1
251
+
- Byte pair 6: Variables 6, 1 and 2
252
252
253
-
This scheme creates a unique mixtures of variables. Take, for instance, variable 0. It is combined with variables 2 and 3 on the zeroth byte, with variables 4 and 6 on the fourth byte and with 1 and 5 for the fifth byte.
253
+
This scheme creates unique mixtures of variables. Take, for instance, variable 0. It is combined with variables 2 and 3 on the zeroth byte pair, with variables 4 and 6 on the fourth byte pair and with variables 1 and 5 on the fifth byte pair.
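The assignment of variables to byte pairs can be checked mechanically. The following Python sketch assumes the remainder is taken modulo the number of tracked variables *m*, which is what the seven-variable listing implies; the function names are hypothetical and not part of `dv.listings`.

```python
def contributing_variables(n: int, m: int) -> tuple[int, int, int]:
    """Indices of the three tracked variables mixed into byte pair n (of m)."""
    return (n % m, (n + 2) % m, (n + 3) % m)


def byte_pairs_for_variable(v: int, m: int) -> list[int]:
    """Byte pairs that change when tracked variable v changes."""
    return [n for n in range(m) if v in contributing_variables(n, m)]


m = 7
for n in range(m):
    print(f"Byte pair {n}: variables {contributing_variables(n, m)}")

# Variable 0 feeds byte pairs 0, 4 and 5, as described in the text.
print(byte_pairs_for_variable(0, m))
```

Running the inverse lookup for every variable shows that each one touches exactly three byte pairs, which is the property that lets a changed hash point back to a small set of candidate variables.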
254
254
255
-
Each of these bytes is computed by:
255
+
Each of these byte pairs is computed by:
256
256
257
257
- Taking the three values to hash.
258
258
- Serializing them to text and concatenating them using the non-printable ASCII separator byte `0x1D` (also known as "group separator").
259
-
- Computing the `xxh32` hash and returning its most significant byte.
259
+
- Computing the `xxh32` hash and returning its two most significant bytes.
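The three steps above can be sketched in Python as follows. Several things here are assumptions rather than the package's behavior: `xxh32` is not in Python's standard library, so the 32-bit hash is passed in as a parameter and `zlib.crc32` serves as a stand-in (real `xxh32` output would differ); values are serialized with `str()`, which may not match the vignette's serialization; and the function names are hypothetical.

```python
import zlib
from typing import Callable, Sequence

GROUP_SEPARATOR = b"\x1d"  # ASCII "group separator" control character


def byte_pair(values: Sequence[object],
              hash32: Callable[[bytes], int] = zlib.crc32) -> bytes:
    """Hash three values into two bytes (stand-in hash, assumed serialization)."""
    payload = GROUP_SEPARATOR.join(str(v).encode("utf-8") for v in values)
    digest = hash32(payload) & 0xFFFFFFFF      # force an unsigned 32-bit value
    return (digest >> 16).to_bytes(2, "big")   # keep the two most significant bytes


def hash_tracked_sketch(row: Sequence[object],
                        hash32: Callable[[bytes], int] = zlib.crc32) -> bytes:
    """Concatenate one byte pair per tracked variable, 2*m bytes in total."""
    m = len(row)
    return b"".join(
        byte_pair([row[n % m], row[(n + 2) % m], row[(n + 3) % m]], hash32)
        for n in range(m)
    )


old = hash_tracked_sketch(["a", "b", "c", "d", "e", "f", "g"])
new = hash_tracked_sketch(["A", "b", "c", "d", "e", "f", "g"])  # variable 0 altered
changed = [n for n in range(7) if old[2 * n:2 * n + 2] != new[2 * n:2 * n + 2]]
print(len(old), changed)
```

Comparing the two hashes pair by pair can only flag byte pairs 0, 4 and 5, the ones variable 0 contributes to; the remaining pairs are computed from unchanged inputs and stay identical.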
260
260
261
261
Informal testing (refer to `tests/testthat/tests-hash_tracked.R` for more details) of this hashing scheme shows the following properties:
262
262
263
263
- It's capable of identifying up to four modified variables per row (after that, it's preferable to give up and report the whole row as modified).
264
-
- It has a very low **false negative rate** (a variable is modified without it being notified as such) of one for every 8 million row updates.
264
+
- It has a very low **false negative rate** (a variable is modified without being reported as such).
265
265
- It has a low **false positive rate** (a variable that retains its value is reported as modified). This only happens when there are actual changes to a row.
266
266
267
267
False positives are not critical, as they ask reviewers to consider a larger set of variables when re-reviewing a row that has been altered.