Commit 4453c67

Shorten the data chapter
1 parent 13f8c93 commit 4453c67

9 files changed

+1102
-871
lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -62,5 +62,5 @@ _outline.md
62 62

63 63
# Other files created by the project
64 64
grateful-refs.bib
65-
data.csv
65+
#data.csv # we keep this file in the repo
66 66
_data_dictionary.qmd

_quarto.yml

Lines changed: 2 additions & 0 deletions
@@ -55,6 +55,8 @@ website:
55 55
contents:
56 56
- href: in_depth_material/introduction_copyright.qmd
57 57
text: "Introduction to Copyright and Licensing"
58+
- href: in_depth_material/data_dic_generation.qmd
59+
text: "Automatic Generation of Data Dictionaries"
58 60
# End Website
59 61

60 62

copyright.qmd

Lines changed: 1 addition & 1 deletion
@@ -518,7 +518,7 @@ Also, their commercial use may require the consent of the depicted person.
518 518

519 519
[commons-photographs]: https://commons.wikimedia.org/wiki/Commons:Photographs_of_identifiable_people
520 520

521-
## Practical Exercise: Adding an Image
521+
## ✍️ Practical Exercise: Adding an Image
522 522

523 523
Let's practice what you learned by adding an image to the manuscript.
524 524
We'll use [this picture of a penguin from Flickr](https://flic.kr/p/2pEKnUr).

data.csv

Lines changed: 345 additions & 0 deletions
Large diffs are not rendered by default.

data.qmd

Lines changed: 65 additions & 370 deletions
Large diffs are not rendered by default.

in_depth_material/data.csv

Lines changed: 345 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 342 additions & 0 deletions
@@ -0,0 +1,342 @@
1+
---
2+
title: "Automatic Generation of Data Dictionaries"
3+
engine: knitr
4+
---
5+
6+
**Note: This is an add-on to the Chapter "[Add Data and Data Dictionary](/data.qmd)". It describes how you can (a) automatically generate data dictionaries with an R package and (b) create machine-readable documentation of your data.**
7+
8+
## Automatic Generation of Data Dictionaries
9+
10+
First, we will demonstrate how to create a simple data dictionary
11+
using the R package [`datawizard`][datawizard]. We will use the penguin data set which is introduced in the Chapter "[Add Data and Data Dictionary](/data.qmd)".
12+
You can download it and put it into your project folder:
13+
14+
[data.csv](../data.csv){.btn .btn-lg .btn-info download="data.csv"}
15+
16+
You can install the `datawizard` package into your `renv` environment using:
17+
18+
[datawizard]: https://easystats.github.io/datawizard/
19+
20+
```{.r filename="Console"}
21+
renv::install("datawizard")
22+
```
23+
24+
We create a separate Quarto file for the data dictionary.
25+
Create it by clicking on _File_ > _New File_ > _Quarto Document..._.
26+
Choose a title such as `Data Dictionary`,
27+
select _HTML_ as format,
28+
uncheck the use of the visual markdown editor, and click on _Create_.
29+
Remove everything except the YAML header (between the `---`).
30+
To make the HTML file self-contained,
31+
also set `embed-resources: true` such that the YAML header looks as follows:
32+
33+
```{.yml filename="data_dictionary.qmd"}
34+
---
35+
title: "Data Dictionary"
36+
format:
37+
html:
38+
embed-resources: true
39+
---
40+
```
41+
42+
Then, save it as `data_dictionary.qmd` by clicking on _File_ > _Save_.
43+
44+
To create the actual data dictionary, first write a description for all columns
45+
so others can understand what the variable names mean.
46+
Where necessary, also document their values
47+
-- this is especially important if their meaning is non-obvious.
48+
In the following, we demonstrate this by storing the penguins' binomial name
49+
along with the English name.
50+
51+
``````{cat}
52+
#| engine.opts: { file: "_data_dictionary.qmd" }
53+
#| class.source: "md"
54+
#| filename: "data_dictionary.qmd"
55+
56+
```{r}
57+
#| echo: false
58+
59+
# Store the description of variables
60+
vars <- c(
61+
species = "a character string denoting penguin species",
62+
island = "a character string denoting island in Palmer Archipelago, Antarctica",
63+
bill_length_mm = "a number denoting bill length (millimeters)",
64+
bill_depth_mm = "a number denoting bill depth (millimeters)",
65+
flipper_length_mm = "an integer denoting flipper length (millimeters)",
66+
body_mass_g = "an integer denoting body mass (grams)",
67+
sex = "a character string denoting penguin sex",
68+
year = "an integer denoting the study year"
69+
)
70+
71+
# Store the description of variable values
72+
vals <- list(
73+
species = c(
74+
Adelie = "Pygoscelis adeliae",
75+
Gentoo = "Pygoscelis papua",
76+
Chinstrap = "Pygoscelis antarcticus"
77+
)
78+
)
79+
```
80+
``````
81+
82+
Then, load the data and use `datawizard`
83+
to add the descriptions to the `data.frame`:[^not-permanent]
84+
85+
[^not-permanent]: Note that the code provided does not alter the data file
86+
-- no description will be added to `data.csv`.
87+
The descriptions are only added to a (temporary) copy of the data set within R
88+
to create the data dictionary.
89+
90+
::: {.column-margin}
91+
![datawizard: Easy Data Wrangling and Statistical Transformations](../images/datawizard.png){width=250px}
92+
:::
93+
94+
``````{cat}
95+
#| engine.opts: { file: "_data_dictionary.qmd", append: TRUE }
96+
#| class.source: "md"
97+
#| filename: "data_dictionary.qmd"
98+
99+
```{r}
100+
#| echo: false
101+
102+
dat <- read.csv("data.csv")
103+
104+
for (x in names(vars)) {
105+
if (x %in% names(vals)) {
106+
dat <- datawizard::assign_labels(
107+
dat,
108+
select = I(x),
109+
variable = vars[[x]],
110+
values = vals[[x]]
111+
)
112+
} else {
113+
dat <- datawizard::assign_labels(
114+
dat,
115+
select = I(x),
116+
variable = vars[[x]]
117+
)
118+
}
119+
}
120+
```
121+
122+
``````
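That the descriptions never reach the CSV file can be checked directly. The following is a minimal base-R sketch and not part of `data_dictionary.qmd`. It assumes that a variable description is kept as a `label` attribute on the column, and it uses a small made-up data frame (`toy`) in place of `data.csv`:

```{.r filename="Console"}
# Made-up rows standing in for the penguin data (not the real data.csv)
toy <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750L, 5000L)
)

# Attach a description as an attribute (assumed storage convention)
attr(toy$species, "label") <- "a character string denoting penguin species"

# Write the labelled copy to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)
toy_reloaded <- read.csv(tmp)

attr(toy$species, "label")          # the description, held in memory only
attr(toy_reloaded$species, "label") # NULL, as CSV files cannot store attributes
```

This is why the descriptions live in the Quarto file rather than in the data: they have to be re-applied every time the data dictionary is rendered.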
123+
124+
Then, you can create the data dictionary, which contains the descriptions
125+
as well as some other information about each variable
126+
(e.g., the number of missing values), and print it.
127+
128+
``````{cat}
129+
#| engine.opts: { file: "_data_dictionary.qmd", append: TRUE }
130+
#| class.source: "md"
131+
#| filename: "data_dictionary.qmd"
132+
133+
```{r}
134+
#| echo: false
135+
#| column: "body-outset"
136+
#| classes: plain
137+
138+
datawizard::data_codebook(dat) |>
139+
datawizard::data_select(exclude = ID) |>
140+
datawizard::data_filter(N != "") |>
141+
datawizard::print_md()
142+
```
143+
144+
``````
145+
146+
```{r}
147+
#| child: "_data_dictionary.qmd"
148+
149+
```
150+
151+
Depending on the type of data, it may also be necessary
152+
to describe sampling procedures (e.g., selection criteria),
153+
measurement instruments (e.g., questionnaires),
154+
appropriate weighting,
155+
already applied preprocessing steps, or contact information.
156+
In our case, as the data has already been published,
157+
we only store a reference to its source.
158+
159+
The data set is from the R package `palmerpenguins`.
160+
If you had it installed
161+
you could use the function `citation()` to create such a reference:
162+
163+
```{r}
164+
#| label: "data-citation"
165+
#| eval: false
166+
167+
citation("palmerpenguins", auto = TRUE) |>
168+
format(bibtex = FALSE, style = "text")
169+
```
170+
171+
Without the package `palmerpenguins` installed,
172+
you can find a [suggested citation on its website][palmerpenguins-citation]
173+
and add that to your data dictionary:
174+
175+
[palmerpenguins-citation]: https://allisonhorst.github.io/palmerpenguins/#citation
176+
177+
```{r}
178+
#| ref.label = "data-citation",
179+
#| render = function(x, options) gsub("\\n", " ", x = x),
180+
#| echo = FALSE,
181+
#| class.output = "md code-overflow-wrap",
182+
#| attr.output = 'filename="data_dictionary.qmd"'
183+
184+
# This chunk takes the output from the chunk "data-citation"
185+
# and renders it with all newlines replaced by whitespaces.
186+
```
187+
188+
Finally, you can render the data dictionary by running the following:
189+
190+
```{.bash filename="Terminal"}
191+
quarto render data_dictionary.qmd
192+
```
193+
194+
This should create the file `data_dictionary.html`
195+
which you can open and view in your web browser.
196+
197+
If you want to learn more about the sharing of research data,
198+
have a look at the tutorial "[FAIR research data management][fair-tutorial]".
199+
200+
[fair-tutorial]: https://lmu-osc.github.io/FAIR-Data-Management/
201+
202+
## Create Machine-Readable Variable Documentation
203+
204+
One could go even further by making the information machine-readable in a standardized way.
205+
206+
This section demonstrates how the title and description of the data set,
207+
the description of the variables and their valid values are stored in a machine-readable way.
208+
We'll reuse the descriptions we already created[^value-labels] and add a few others.
209+
210+
[^value-labels]: Unfortunately, the descriptions of values are not reused in this example,
211+
as they are [not supported][enum-labels] by the specification we are using.
212+
213+
[enum-labels]: https://specs.frictionlessdata.io/patterns/#table-schema-enum-labels-and-ordering
214+
215+
First, store the title and description of the data set as a whole:
216+
217+
```{.r filename="Console"}
218+
table_info <- c(
219+
title = "penguins data set",
220+
description = "Size measurements for adult foraging penguins near Palmer Station, Antarctica"
221+
)
222+
```
223+
224+
As before, also provide a reference to the source.
225+
226+
```{r}
227+
#| echo: false
228+
#| class-output: "r code-overflow-wrap"
229+
#| attr-output: 'filename="Console"'
230+
231+
# We have provided the data set as CSV file to the readers.
232+
# Therefore, we cannot assume or require that readers have
233+
# the R package palmerpenguins installed. Instead, we create
234+
# the citation on our end and hide how we obtained it.
235+
236+
citation("palmerpenguins", auto = TRUE)$url |>
237+
paste0("dat_source <- \"", ... = _, "\"") |>
238+
cat()
239+
```
240+
241+
Next, create a list of the categorical variables' valid values:
242+
243+
```{.r filename="Console"}
244+
valid_vals <- list(
245+
species = c("Adelie", "Gentoo", "Chinstrap"),
246+
island = c("Torgersen", "Biscoe", "Dream"),
247+
sex = c("male", "female"),
248+
year = c(2007, 2008, 2009)
249+
)
250+
```
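A list of valid values can also double as a quick consistency check on the data. The sketch below is only an illustration: `check_valid()` is a hypothetical helper (not part of any package), the data frame `toy` is made up and stands in for `data.csv`, and `NA` is treated as missing rather than invalid:

```{.r filename="Console"}
valid_vals <- list(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  sex = c("male", "female")
)

# Made-up rows; "Emperor" is deliberately not a valid value
toy <- data.frame(
  species = c("Adelie", "Emperor", "Gentoo"),
  sex = c("male", NA, "female")
)

# Hypothetical helper: for each documented column, collect unexpected values
check_valid <- function(dat, valid_vals) {
  cols <- intersect(names(valid_vals), names(dat))
  unexpected <- lapply(cols, \(x) {
    observed <- unique(dat[[x]][!is.na(dat[[x]])])
    setdiff(observed, valid_vals[[x]])
  })
  names(unexpected) <- cols
  # Keep only the columns that actually contain unexpected values
  Filter(length, unexpected)
}

check_valid(toy, valid_vals)
```

An empty result would mean that every non-missing value in the documented columns is covered by the list.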
251+
252+
Finally, store the descriptions of the variables we already created earlier:
253+
254+
```{.r filename="Console"}
255+
# Store the description of variables
256+
vars <- c(
257+
species = "a character string denoting penguin species",
258+
island = "a character string denoting island in Palmer Archipelago, Antarctica",
259+
bill_length_mm = "a number denoting bill length (millimeters)",
260+
bill_depth_mm = "a number denoting bill depth (millimeters)",
261+
flipper_length_mm = "an integer denoting flipper length (millimeters)",
262+
body_mass_g = "an integer denoting body mass (grams)",
263+
sex = "a character string denoting penguin sex",
264+
year = "an integer denoting the study year"
265+
)
266+
```
267+
268+
Generally, metadata are stored either embedded in the data or externally,
269+
for example, in a separate file.
270+
We will use the "[frictionless data](https://frictionlessdata.io/)" standard,
271+
where metadata are stored separately.
272+
Another alternative would be [RO-Crate](https://www.researchobject.org/ro-crate/).
273+
274+
Specifically, one can use the R package [`frictionless`][frictionless]
275+
to create a _schema_ which describes the structure of the data.[^frictionless-note]
276+
For the purpose of the following code,
277+
it is just a nested list that we edit to include our own information.
278+
We also explicitly record in the schema
279+
that missing values are stored in the data file as `NA`
280+
and that the data are licensed under CC0\ 1.0.
281+
Finally, the package is used to create a metadata file that contains the schema.
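To make "just a nested list" more concrete, the sketch below builds a small list by hand that mimics the shape of a schema (a `fields` element holding one list per column, each with a `name` and a `type`) and appends a description to every field. The objects `toy_schema` and `toy_vars` are illustrative assumptions; the real schema comes from `frictionless::create_schema()`:

```{.r filename="Console"}
# Hand-written stand-in for a schema, with only two fields for brevity
toy_schema <- list(
  fields = list(
    list(name = "species", type = "string"),
    list(name = "body_mass_g", type = "integer")
  )
)

toy_vars <- c(
  species = "a character string denoting penguin species",
  body_mass_g = "an integer denoting body mass (grams)"
)

# Append the matching description to every field
toy_schema$fields <- lapply(toy_schema$fields, \(x) {
  c(x, description = toy_vars[[x$name]])
})

# Inspect the first field, which now has three elements
str(toy_schema$fields[[1]])
```

Because the schema is an ordinary list, base R functions such as `lapply()` and `modifyList()` are enough to edit it.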
282+
283+
[frictionless]: https://docs.ropensci.org/frictionless/
284+
285+
[^frictionless-note]: In June 2024, [version 2](https://datapackage.org/)
286+
of the frictionless data standard was released.
287+
As of November 2024, the R package `frictionless` only supports the first version,
288+
though support for v2 is [planned](https://github.com/frictionlessdata/frictionless-r/labels/datapackage%3Av2).
289+
290+
```{.r filename="Console"}
291+
# Install {frictionless} and the required dependency {stringi}
292+
renv::install(c(
293+
"frictionless",
294+
"stringi"
295+
))
296+
297+
# Read data and create schema
298+
dat_filename <- "data.csv"
299+
dat <- read.csv(dat_filename)
300+
dat_schema <- frictionless::create_schema(dat)
301+
302+
# Add descriptions to the fields
303+
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
304+
c(x, description = vars[[x$name]])
305+
})
306+
307+
# Record valid values
308+
dat_schema$fields <- lapply(dat_schema$fields, \(x) {
309+
if (x[["name"]] %in% names(valid_vals)) {
310+
modifyList(x, list(constraints = list(enum = valid_vals[[x$name]])))
311+
} else {
312+
x
313+
}
314+
})
315+
316+
# Define missing values
317+
dat_schema$missingValues <- c("", "NA")
318+
319+
# Create package with license info and write it
320+
dat_package <- frictionless::create_package() |>
321+
frictionless::add_resource(
322+
resource_name = "penguins",
323+
data = dat_filename,
324+
schema = dat_schema,
325+
title = table_info[["title"]],
326+
description = table_info[["description"]],
327+
licenses = list(list(
328+
name = "CC0-1.0",
329+
path = "https://creativecommons.org/publicdomain/zero/1.0/",
330+
title = "CC0 1.0 Universal"
331+
)),
332+
sources = list(list(
333+
title = "CRAN",
334+
path = dat_source
335+
))
336+
)
337+
frictionless::write_package(dat_package, directory = ".")
338+
```
339+
340+
This creates the metadata file `datapackage.json` in the current directory.
341+
Make sure it is located in the same folder as `data.csv`,
342+
as together they comprise a [data package](https://specs.frictionlessdata.io/data-package/).

in_depth_material/introduction_copyright.qmd

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ format:
16 16
</style>
17 17
---
18 18

19-
**Note: This is an expanded version of the Chapter "[Sharing work of others: Copyright](/copyright.qmd)", where you find more details on the choice of licenses.**
19+
**Note: This is an expanded version of the Chapter "[Sharing work of others: Copyright](/copyright.qmd)". Here you find more details on the choice of licenses.**
20 20

21 21
By default, everything you put into the project folder will be shared publicly.
22 22
In many instances, this will also include works by others than yourself or your co-authors,
