vignettes/review_design_notes.Rmd
Lines changed: 22 additions & 2 deletions
@@ -122,9 +122,9 @@ There will use a small collection of files for each input dataset configured for
122
122
- 1 complete hash of "ae" data.frame
123
123
- 1 domain string ("ae")
124
124
- n `id_vars` column names
125
-
-**MISSING**: n `id_vars` column types
125
+
- n `id_vars` column types (see "Variable type encoding" below)
126
126
- m `tracked_vars` column names
127
-
-**MISSING**: m `tracked_vars` column types
127
+
- m `tracked_vars` column types (see "Variable type encoding" below)
128
128
- 1 row count
129
129
- p (1 per "ae" row) `hash_id(ae[id_vars])`
130
130
- p (1 per "ae" row, *m* bytes long) `hash_tracked(ae[tracked_vars])`
@@ -160,6 +160,26 @@ These file structures are designed so that they start with a short heterogeneous
160
160
161
161
These files won't benefit much from compression since their main content (the hashes) is by construction statistically indistinguishable from noise.
162
162
163
+
#### Variable type encoding
164
+
The two "variable type" `.base` fields store one byte per variable. Each byte takes one of the following values:
165
+
166
+
- Date: 1
167
+
- POSIXct: 2
168
+
- POSIXlt: 3
169
+
- Logical: 10
170
+
- Factor: 11
171
+
- Integer: 13
172
+
- Numeric: 14
173
+
- Complex: 15
174
+
- Character: 16
175
+
- Raw: 24
176
+
177
+
Most of these values are taken from the base R `SEXPTYPE` enum definition (see `src/include/Rinternals.h` in any recent R source distribution).
178
+
179
+
The values assigned to the date and time types (Date, POSIXct and POSIXlt) are arbitrary: these are S3 classes layered on top of base types and thus lack dedicated `SEXPTYPE` values.
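
As an illustration only, the mapping listed above could be sketched in R roughly as follows. The names `type_codes` and `encode_var_type` are hypothetical and do not come from `dv.listings`; the sketch assumes that S3 classes are checked before falling back to `typeof()`.

```r
# Hypothetical lookup table mirroring the type codes listed above.
type_codes <- c(
  Date = 1L, POSIXct = 2L, POSIXlt = 3L,
  logical = 10L, factor = 11L, integer = 13L,
  numeric = 14L, complex = 15L, character = 16L, raw = 24L
)

# Hypothetical helper: returns the single-byte type code for one column.
encode_var_type <- function(x) {
  # S3 classes are checked first, because their underlying storage
  # (double, integer or list) would otherwise hide what they represent.
  kind <- if (inherits(x, "Date")) {
    "Date"
  } else if (inherits(x, "POSIXct")) {
    "POSIXct"
  } else if (inherits(x, "POSIXlt")) {
    "POSIXlt"
  } else if (is.factor(x)) {
    "factor"
  } else {
    switch(typeof(x),
      logical = "logical", integer = "integer", double = "numeric",
      complex = "complex", character = "character", raw = "raw",
      stop("unsupported column type: ", typeof(x))
    )
  }
  as.raw(type_codes[[kind]])
}

# Example: one byte per column of a small data.frame.
df <- data.frame(
  id = 1:2,
  when = as.Date(c("2024-01-01", "2024-01-02")),
  sev = factor(c("MILD", "SEVERE"))
)
types <- vapply(df, encode_var_type, raw(1))
```

Here `types` is a raw vector with one element per column, which matches the "single byte per variable" layout described for the `.base` fields.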
180
+
181
+
The type of a `factor()` variable is not fully defined by its being tagged as such, since the levels and their internal encoding are also part of the type. For purposes of hashing, the review feature of `dv.listings` treats the content of factor columns as `character()` by mapping their values to their assigned string representation. The feature is also indifferent to whether a factor is ordered.
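
A small sketch of why this matters, assuming the string mapping behaves like base R's `as.character()` (an assumption, since the actual conversion used by `dv.listings` is not shown here):

```r
# Two factors with the same visible content but different level order;
# the second one is also ordered.
f_a <- factor(c("MILD", "SEVERE", "MILD"), levels = c("MILD", "SEVERE"))
f_b <- factor(c("MILD", "SEVERE", "MILD"),
  levels = c("SEVERE", "MILD"), ordered = TRUE
)

# The internal integer codes differ between the two factors...
same_codes <- identical(as.integer(f_a), as.integer(f_b)) # FALSE

# ...but their string representations are identical, so a hash computed
# over the strings would not depend on level order or orderedness.
same_strings <- identical(as.character(f_a), as.character(f_b)) # TRUE
```

Hashing the strings rather than the integer codes keeps the hash stable when a dataset update reorders levels without changing any displayed value.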
182
+
163
183
## Hashing
164
184
We store hashes for the values of `id_vars` and `tracked_vars` dataset columns. These hashes serve as content IDs.