|
| 1 | +--- |
| 2 | +title: "Automatic Generation of Data Dictionaries" |
| 3 | +engine: knitr |
| 4 | +--- |
| 5 | + |
| 6 | +**Note: This is an add-on to the Chapter "[Add Data and Data Dictionary](/data.qmd)". It describes how to (a) automatically generate data dictionaries with an R package and (b) create machine-readable documentation of your data.** |
| 7 | + |
| 8 | +## Automatic Generation of Data Dictionaries |
| 9 | + |
| 10 | +First, we will demonstrate how to create a simple data dictionary |
| 11 | +using the R package [`datawizard`][datawizard]. We will use the penguins data set, which is introduced in the Chapter "[Add Data and Data Dictionary](/data.qmd)". |
| 12 | +You can download it and put it into your project folder: |
| 13 | + |
| 14 | +[data.csv](../data.csv){.btn .btn-lg .btn-info download="data.csv"} |
| 15 | + |
| 16 | +You can install the `datawizard` package into your `renv` environment using: |
| 17 | + |
| 18 | +[datawizard]: https://easystats.github.io/datawizard/ |
| 19 | + |
| 20 | +```{.r filename="Console"} |
| 21 | +renv::install("datawizard") |
| 22 | +``` |
| 23 | + |
| 24 | +We create a separate Quarto file for the data dictionary. |
| 25 | +Create it by clicking on _File_ > _New File_ > _Quarto Document..._. |
| 26 | +Choose a title such as `Data Dictionary`, |
| 27 | +select _HTML_ as format, |
| 28 | +uncheck the use of the visual markdown editor, and click on _Create_. |
| 29 | +Remove everything except the YAML header (between the `---`). |
| 30 | +To make the HTML file self-contained, |
| 31 | +also set `embed-resources: true` such that the YAML header looks as follows: |
| 32 | + |
| 33 | +```{.yml filename="data_dictionary.qmd"} |
| 34 | +--- |
| 35 | +title: "Data Dictionary" |
| 36 | +format: |
| 37 | + html: |
| 38 | + embed-resources: true |
| 39 | +--- |
| 40 | +``` |
| 41 | + |
| 42 | +Then, save it as `data_dictionary.qmd` by clicking on _File_ > _Save_. |
| 43 | + |
| 44 | +To create the actual data dictionary, first write a description for all columns |
| 45 | +so others can understand what the variable names mean. |
| 46 | +Where necessary, also document their values |
| 47 | +-- this is especially important if their meaning is non-obvious. |
| 48 | +In the following, we demonstrate this by storing the penguins' binomial name |
| 49 | +along with the English name. |
| 50 | + |
| 51 | +``````{cat} |
| 52 | +#| engine.opts: { file: "_data_dictionary.qmd" } |
| 53 | +#| class.source: "md" |
| 54 | +#| filename: "data_dictionary.qmd" |
| 55 | +
|
| 56 | +```{r} |
| 57 | +#| echo: false |
| 58 | +
|
| 59 | +# Store the description of variables |
| 60 | +vars <- c( |
| 61 | + species = "a character string denoting penguin species", |
| 62 | + island = "a character string denoting island in Palmer Archipelago, Antarctica", |
| 63 | + bill_length_mm = "a number denoting bill length (millimeters)", |
| 64 | + bill_depth_mm = "a number denoting bill depth (millimeters)", |
| 65 | + flipper_length_mm = "an integer denoting flipper length (millimeters)", |
| 66 | + body_mass_g = "an integer denoting body mass (grams)", |
| 67 | + sex = "a character string denoting penguin sex", |
| 68 | + year = "an integer denoting the study year" |
| 69 | +) |
| 70 | +
|
| 71 | +# Store the description of variable values |
| 72 | +vals <- list( |
| 73 | + species = c( |
| 74 | + Adelie = "Pygoscelis adeliae", |
| 75 | + Gentoo = "Pygoscelis papua", |
| 76 | + Chinstrap = "Pygoscelis antarcticus" |
| 77 | + ) |
| 78 | +) |
| 79 | +``` |
| 80 | +`````` |
| 81 | + |
| 82 | +Then, load the data and use `datawizard` |
| 83 | +to add the descriptions to the `data.frame`:[^not-permanent] |
| 84 | + |
| 85 | +[^not-permanent]: Note that the code provided does not alter the data file |
| 86 | +-- no description will be added to `data.csv`. |
| 87 | +The descriptions are only added to a (temporary) copy of the data set within R |
| 88 | +to create the data dictionary. |
| 89 | + |
| 90 | +::: {.column-margin} |
| 91 | +{width=250px} |
| 92 | +::: |
| 93 | + |
| 94 | +``````{cat} |
| 95 | +#| engine.opts: { file: "_data_dictionary.qmd", append: TRUE } |
| 96 | +#| class.source: "md" |
| 97 | +#| filename: "data_dictionary.qmd" |
| 98 | +
|
| 99 | +```{r} |
| 100 | +#| echo: false |
| 101 | +
|
| 102 | +dat <- read.csv("data.csv") |
| 103 | +
|
| 104 | +for (x in names(vars)) { |
| 105 | + if (x %in% names(vals)) { |
| 106 | + dat <- datawizard::assign_labels( |
| 107 | + dat, |
| 108 | + select = I(x), |
| 109 | + variable = vars[[x]], |
| 110 | + values = vals[[x]] |
| 111 | + ) |
| 112 | + } else { |
| 113 | + dat <- datawizard::assign_labels( |
| 114 | + dat, |
| 115 | + select = I(x), |
| 116 | + variable = vars[[x]] |
| 117 | + ) |
| 118 | + } |
| 119 | +} |
| 120 | +``` |
| 121 | +
|
| 122 | +`````` |
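
Because the loop above iterates over the names in `vars`, it can help to check beforehand that the descriptions and the columns of the data actually match. The following is a minimal sketch in base R, not part of `data_dictionary.qmd`. The objects `dat_example` and `vars_example` are hypothetical stand-ins for `dat` and `vars`, shortened so that the mismatch is easy to see.

```r
# Hypothetical stand-in for the data read from data.csv
dat_example <- data.frame(
  species = c("Adelie", "Gentoo"),
  island = c("Torgersen", "Biscoe"),
  body_mass_g = c(3750L, 5000L)
)

# Shortened description vector (assumption for illustration);
# "sex" has no matching column on purpose
vars_example <- c(
  species = "a character string denoting penguin species",
  body_mass_g = "an integer denoting body mass (grams)",
  sex = "a character string denoting penguin sex"
)

# Columns in the data that lack a description
undocumented <- setdiff(names(dat_example), names(vars_example))

# Descriptions that do not match any column in the data
unmatched <- setdiff(names(vars_example), names(dat_example))

print(undocumented)
print(unmatched)
```

If both vectors are empty for your own `dat` and `vars`, every column is documented and every description refers to an existing column.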
| 123 | + |
| 124 | +Then, you can create the data dictionary containing the descriptions, |
| 125 | +but also some other information about each variable |
| 126 | +(e.g., the number of missing values) and print it. |
| 127 | + |
| 128 | +``````{cat} |
| 129 | +#| engine.opts: { file: "_data_dictionary.qmd", append: TRUE } |
| 130 | +#| class.source: "md" |
| 131 | +#| filename: "data_dictionary.qmd" |
| 132 | +
|
| 133 | +```{r} |
| 134 | +#| echo: false |
| 135 | +#| column: "body-outset" |
| 136 | +#| classes: plain |
| 137 | +
|
| 138 | +datawizard::data_codebook(dat) |> |
| 139 | + datawizard::data_select(exclude = ID) |> |
| 140 | + datawizard::data_filter(N != "") |> |
| 141 | + datawizard::print_md() |
| 142 | +``` |
| 143 | +
|
| 144 | +`````` |
| 145 | + |
| 146 | +```{r} |
| 147 | +#| child: "_data_dictionary.qmd" |
| 148 | +
|
| 149 | +``` |
| 150 | + |
| 151 | +Depending on the type of data, it may also be necessary |
| 152 | +to describe sampling procedures (e.g., selection criteria), |
| 153 | +measurement instruments (e.g., questionnaires), |
| 154 | +appropriate weighting, |
| 155 | +already applied preprocessing steps, or contact information. |
| 156 | +In our case, as the data has already been published, |
| 157 | +we only store a reference to its source. |
| 158 | + |
| 159 | +The data set is from the R package `palmerpenguins`. |
| 160 | +If you had it installed, |
| 161 | +you could use the function `citation()` to create such a reference: |
| 162 | + |
| 163 | +```{r} |
| 164 | +#| label: "data-citation" |
| 165 | +#| eval: false |
| 166 | +
|
| 167 | +citation("palmerpenguins", auto = TRUE) |> |
| 168 | + format(bibtex = FALSE, style = "text") |
| 169 | +``` |
| 170 | + |
| 171 | +Without the package `palmerpenguins` installed, |
| 172 | +you can find a [suggested citation on its website][palmerpenguins-citation] |
| 173 | +and add that to your data dictionary: |
| 174 | + |
| 175 | +[palmerpenguins-citation]: https://allisonhorst.github.io/palmerpenguins/#citation |
| 176 | + |
| 177 | +```{r} |
| 178 | +#| ref.label = "data-citation", |
| 179 | +#| render = function(x, options) gsub("\\n", " ", x = x), |
| 180 | +#| echo = FALSE, |
| 181 | +#| class.output = "md code-overflow-wrap", |
| 182 | +#| attr.output = 'filename="data_dictionary.qmd"' |
| 183 | +
|
| 184 | +# This chunk takes the output from the chunk "data-citation" |
| 185 | +# and renders it with all newlines replaced by whitespaces. |
| 186 | +``` |
| 187 | + |
| 188 | +Finally, you can render the data dictionary by running the following: |
| 189 | + |
| 190 | +```{.bash filename="Terminal"} |
| 191 | +quarto render data_dictionary.qmd |
| 192 | +``` |
| 193 | + |
| 194 | +This should create the file `data_dictionary.html`, |
| 195 | +which you can open and view in your web browser. |
| 196 | + |
| 197 | +If you want to learn more about the sharing of research data, |
| 198 | +have a look at the tutorial "[FAIR research data management][fair-tutorial]". |
| 199 | + |
| 200 | +[fair-tutorial]: https://lmu-osc.github.io/FAIR-Data-Management/ |
| 201 | + |
| 202 | +## Create Machine-Readable Variable Documentation |
| 203 | + |
| 204 | +One could go even further by making the information machine-readable in a standardized way. |
| 205 | + |
| 206 | +This section demonstrates how to store the title and description of the data set, |
| 207 | +the descriptions of the variables, and their valid values in a machine-readable way. |
| 208 | +We'll reuse the descriptions we already created[^value-labels] and add a few others. |
| 209 | + |
| 210 | +[^value-labels]: Unfortunately, the descriptions of values are not reused in this example, |
| 211 | +as they are [not supported][enum-labels] by the specification we are using. |
| 212 | + |
| 213 | +[enum-labels]: https://specs.frictionlessdata.io/patterns/#table-schema-enum-labels-and-ordering |
| 214 | + |
| 215 | +First, store the title and description of the data set as a whole: |
| 216 | + |
| 217 | +```{.r filename="Console"} |
| 218 | +table_info <- c( |
| 219 | + title = "penguins data set", |
| 220 | + description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica" |
| 221 | +) |
| 222 | +``` |
| 223 | + |
| 224 | +As before, also provide a reference to the source. |
| 225 | + |
| 226 | +```{r} |
| 227 | +#| echo: false |
| 228 | +#| class-output: "r code-overflow-wrap" |
| 229 | +#| attr-output: 'filename="Console"' |
| 230 | +
|
| 231 | +# We have provided the data set as CSV file to the readers. |
| 232 | +# Therefore, we cannot assume or require that readers have |
| 233 | +# the R package palmerpenguins installed. Instead, we create |
| 234 | +# the citation on our end and hide how we obtained it. |
| 235 | +
|
| 236 | +citation("palmerpenguins", auto = TRUE)$url |> |
| 237 | + paste0("dat_source <- \"", ... = _, "\"") |> |
| 238 | + cat() |
| 239 | +``` |
| 240 | + |
| 241 | +Next, create a list of the valid values for those variables that can only take a fixed set of values: |
| 242 | + |
| 243 | +```{.r filename="Console"} |
| 244 | +valid_vals <- list( |
| 245 | + species = c("Adelie", "Gentoo", "Chinstrap"), |
| 246 | + island = c("Torgersen", "Biscoe", "Dream"), |
| 247 | + sex = c("male", "female"), |
| 248 | + year = c(2007, 2008, 2009) |
| 249 | +) |
| 250 | +``` |
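
A list like this can also be used to check the data themselves before the valid values are recorded in the metadata. The following is a minimal sketch in base R under stated assumptions: `dat_example` and `valid_vals_example` are hypothetical stand-ins for `dat` and `valid_vals`, and the misspelled species name is deliberate.

```r
# Hypothetical stand-in for the data; "Adelei" is a deliberate typo
dat_example <- data.frame(
  species = c("Adelie", "Gentoo", "Adelei"),
  sex = c("male", NA, "female")
)

# Shortened list of valid values (assumption for illustration)
valid_vals_example <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", "female")
)

# For each variable, collect the observed non-missing values
# that are not listed as valid
invalid <- lapply(names(valid_vals_example), \(x) {
  observed <- unique(dat_example[[x]])
  setdiff(observed[!is.na(observed)], valid_vals_example[[x]])
})
names(invalid) <- names(valid_vals_example)

print(invalid)
```

Any non-empty element of `invalid` points to values that either need to be corrected in the data or added to the list of valid values.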
| 251 | + |
| 252 | +Finally, store the descriptions of the variables we already created earlier: |
| 253 | + |
| 254 | +```{.r filename="Console"} |
| 255 | +# Store the description of variables |
| 256 | +vars <- c( |
| 257 | + species = "a character string denoting penguin species", |
| 258 | + island = "a character string denoting island in Palmer Archipelago, Antarctica", |
| 259 | + bill_length_mm = "a number denoting bill length (millimeters)", |
| 260 | + bill_depth_mm = "a number denoting bill depth (millimeters)", |
| 261 | + flipper_length_mm = "an integer denoting flipper length (millimeters)", |
| 262 | + body_mass_g = "an integer denoting body mass (grams)", |
| 263 | + sex = "a character string denoting penguin sex", |
| 264 | + year = "an integer denoting the study year" |
| 265 | +) |
| 266 | +``` |
| 267 | + |
| 268 | +Generally, metadata are either embedded in the data or stored externally, |
| 269 | +for example, in a separate file. |
| 270 | +We will use the "[frictionless data](https://frictionlessdata.io/)" standard, |
| 271 | +where metadata are stored separately. |
| 272 | +An alternative would be [RO-Crate](https://www.researchobject.org/ro-crate/). |
| 273 | + |
| 274 | +Specifically, one can use the R package [`frictionless`][frictionless] |
| 275 | +to create a _schema_ which describes the structure of the data.[^frictionless-note] |
| 276 | +For the purpose of the following code, |
| 277 | +it is just a nested list that we edit to include our own information. |
| 278 | +We also explicitly record in the schema |
| 279 | +that missing values are stored in the data file as `NA` or as empty strings, |
| 280 | +and we note in the metadata of the data set that the data are licensed under CC0\ 1.0. |
| 281 | +Finally, the package is used to create a metadata file that contains the schema. |
| 282 | + |
| 283 | +[frictionless]: https://docs.ropensci.org/frictionless/ |
| 284 | + |
| 285 | +[^frictionless-note]: In June 2024, [version 2](https://datapackage.org/) |
| 286 | +of the frictionless data standard was released. |
| 287 | +As of November 2024, the R package `frictionless` only supports the first version, |
| 288 | +though support for v2 is [planned](https://github.com/frictionlessdata/frictionless-r/labels/datapackage%3Av2). |
| 289 | + |
| 290 | +```{.r filename="Console"} |
| 291 | +# Install {frictionless} and the required dependency {stringi} |
| 292 | +renv::install(c( |
| 293 | + "frictionless", |
| 294 | + "stringi" |
| 295 | +)) |
| 296 | + |
| 297 | +# Read data and create schema |
| 298 | +dat_filename <- "data.csv" |
| 299 | +dat <- read.csv(dat_filename) |
| 300 | +dat_schema <- frictionless::create_schema(dat) |
| 301 | + |
| 302 | +# Add descriptions to the fields |
| 303 | +dat_schema$fields <- lapply(dat_schema$fields, \(x) { |
| 304 | + c(x, description = vars[[x$name]]) |
| 305 | +}) |
| 306 | + |
| 307 | +# Record valid values |
| 308 | +dat_schema$fields <- lapply(dat_schema$fields, \(x) { |
| 309 | + if (x[["name"]] %in% names(valid_vals)) { |
| 310 | + modifyList(x, list(constraints = list(enum = valid_vals[[x$name]]))) |
| 311 | + } else { |
| 312 | + x |
| 313 | + } |
| 314 | +}) |
| 315 | + |
| 316 | +# Define missing values |
| 317 | +dat_schema$missingValues <- c("", "NA") |
| 318 | + |
| 319 | +# Create package with license info and write it |
| 320 | +dat_package <- frictionless::create_package() |> |
| 321 | + frictionless::add_resource( |
| 322 | + resource_name = "penguins", |
| 323 | + data = dat_filename, |
| 324 | + schema = dat_schema, |
| 325 | + title = table_info[["title"]], |
| 326 | + description = table_info[["description"]], |
| 327 | + licenses = list(list( |
| 328 | + name = "CC0-1.0", |
| 329 | + path = "https://creativecommons.org/publicdomain/zero/1.0/", |
| 330 | + title = "CC0 1.0 Universal" |
| 331 | + )), |
| 332 | + sources = list(list( |
| 333 | + title = "CRAN", |
| 334 | + path = dat_source |
| 335 | + )) |
| 336 | + ) |
| 337 | +frictionless::write_package(dat_package, directory = ".") |
| 338 | +``` |
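
To see what the step "Record valid values" does to the nested list, it can be tried on a small list without `frictionless`. The following is a minimal sketch in base R. The object `fields_example` is a hand-written, hypothetical stand-in for `dat_schema$fields`, and it is an assumption that each field is a list with at least a `name` and a `type`.

```r
# Hypothetical stand-in for dat_schema$fields
fields_example <- list(
  list(name = "species", type = "string"),
  list(name = "body_mass_g", type = "integer")
)

# Shortened list of valid values (assumption for illustration)
valid_vals_example <- list(
  species = c("Adelie", "Gentoo", "Chinstrap")
)

# Add an enum constraint only to fields that have valid values listed;
# all other fields are returned unchanged
fields_example <- lapply(fields_example, \(x) {
  if (x[["name"]] %in% names(valid_vals_example)) {
    modifyList(x, list(constraints = list(enum = valid_vals_example[[x$name]])))
  } else {
    x
  }
})

str(fields_example)
```

The same pattern, `lapply()` over the fields combined with `modifyList()`, can be used to add other properties to individual fields.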
| 339 | + |
| 340 | +This creates the metadata file `datapackage.json` in the current directory. |
| 341 | +Make sure it is located in the same folder as `data.csv`, |
| 342 | +as together they comprise a [data package](https://specs.frictionlessdata.io/data-package/). |