Commit 31d7d32

Merge pull request #56 from thomvolker/main
Final changes and section on publishing synthetic data
2 parents 367c65c + 1a590f3 commit 31d7d32

File tree

10 files changed

+274
-270
lines changed

_quarto.yml

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -52,6 +52,8 @@ website:
5252
href: generation/index.qmd
5353
- text: "Evaluating synthetic data quality"
5454
href: evaluation/index.qmd
55+
- text: "Publishing and using synthetic data"
56+
href: publishing/index.qmd
5557
# End Website
5658

5759

data/boys.RDS

-27 Bytes
Binary file not shown.

evaluation/index.qmd

Lines changed: 117 additions & 226 deletions
Large diffs are not rendered by default.

generation/index.qmd

Lines changed: 25 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -211,11 +211,15 @@ That is, skewness or other deviations are preserved along with linear relationsh
211211
Specifically, `normrank` first transforms the observed data to standard normal quantiles based on the relative ranks of the values.
212212
Subsequently, a linear regression model is fitted on these quantiles using all predictors that were synthesized previously.
213213
Finally, synthetic values are drawn from the observed data by backtransforming the synthetic quantiles to ranks, and matching the ranks with the observed data.
214+
This approach better approximates the marginal (univariate) distribution of the observed data in case of non-normality, but distorts linear relationships between variables.
215+
For this reason, we set the default parametric method to `"norm"` in the script below.
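The three `normrank` steps described above can be sketched in base `R`. This is a simplified illustration under our own assumptions, not the code that `synthpop` runs: the simulated variables `x` and `y` are hypothetical, and the predictor is reused as-is rather than synthesized first. A plain normal linear model is fitted alongside it for comparison.

```r
set.seed(123)
n <- 500
x <- rnorm(n)                                  # hypothetical predictor
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))     # strictly positive, right-skewed outcome

# Step 1: transform observed values to standard normal quantiles via relative ranks
z <- qnorm(rank(y) / (n + 1))

# Step 2: fit a linear regression on the quantiles and draw synthetic quantiles
fit_rank <- lm(z ~ x)
z_syn <- predict(fit_rank) + rnorm(n, sd = sigma(fit_rank))

# Step 3: back-transform synthetic quantiles to ranks and match with observed values
r_syn <- pmin(pmax(round(pnorm(z_syn) * (n + 1)), 1), n)
y_rank <- sort(y)[r_syn]

# Comparison: a normal linear model fitted directly on the skewed outcome
fit_norm <- lm(y ~ x)
y_norm <- predict(fit_norm) + rnorm(n, sd = sigma(fit_norm))

range(y_rank)   # always within the observed range, because values are resampled
range(y_norm)   # not bounded by the observed range
```

Because the rank-based draws are sampled from the observed values, they cannot leave the observed range, whereas draws from the normal model can.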
216+
217+
When the data is not numerical, `synthpop` treats it as a categorical variable.
214218
Categorical variables with two categories are synthesized with a logistic regression model and categorical variables with more than two categories are synthesized using ordered or unordered polytomous regression (which is an extension to logistic regression for more than two categories), depending on whether the categories are ordered or not, respectively.
215219

216220
---
217221

218-
__4. Use `synthpop()` to create a synthetic data set in an object called `syn_param` using `method = "parametric"`. Use `seed = 123` if you want to replicate our results.__
222+
__4. Use `syn()` from the `synthpop` package to create a synthetic data set in an object called `syn_param` using `method = "parametric"`, but set `default.method = c("norm", "logreg", "polyreg", "polr")`. Use `seed = 123` if you want to replicate our results.__
219223

220224
:::{.callout-tip title="Show solution" collapse=true}
221225

@@ -225,6 +229,7 @@ __4. Use `synthpop()` to create a synthetic data set in an object called `syn_pa
225229
syn_param <- syn(
226230
data = data,
227231
method = "parametric",
232+
default.method = c("norm", "logreg", "polyreg", "polr"),
228233
seed = 123,
229234
print.flag = FALSE
230235
)
@@ -250,9 +255,11 @@ Also, it shows for every variable the method that was used to synthesize the dat
250255
If you want to know more about a specific synthesis method, say, for example, `logreg`, you can call `?syn.logreg` to get more information.
251256

252257

253-
If all is well, all continuous variables are strictly positive, which is due to matching the synthetic ranks with observed ranks and sampling the corresponding values from this (as explained above).
254-
However, there is a problem that you might have noticed.
255-
The variable `bmi` is not equal to `wgt / (hgt/100)^2`.
258+
Because we assumed that the synthetic variables are drawn from a normal distribution, there might be values outside the typical range.
259+
For example, the fifth value for the variable `wgt` is negative, which is impossible.
260+
One way to deal with this is to revert to the `"normrank"` method.
261+
Another approach is to use a different transformation or a different model (which we will do later in this tutorial).
262+
Another problem that you might notice is that the variable `bmi` is not equal to `wgt / (hgt/100)^2`.
256263
This issue can be fixed using _passive_ synthesis, which will be demonstrated in the next section.
257264

258265
:::
@@ -284,11 +291,21 @@ method["bmi"] <- "~I(wgt/(hgt/100)^2)"
284291
method
285292
```
286293

287-
With this specification, `synthpop` knows that it should not use an imputation model, but rather use the synthetic `hgt` and `wgt` values to construct the `bmi` values deterministically.
294+
With this specification, `synthpop` knows that it should not use a statistical model, but rather use the synthetic `hgt` and `wgt` values to construct the `bmi` values deterministically.
295+
Note that passively synthesized variables must be synthesized after the variables they are derived from, because otherwise the equation cannot be applied.
296+
Moreover, it can improve utility to perform passive synthesis only after all other variables have been synthesized.
297+
Specifically, if some relationships between the variables are not captured adequately before passive synthesis is applied, relationships with variables synthesized later may also be distorted.
298+
This can be taken into account by specifying the `visit.sequence` argument, which defines the sequence in which variables are synthesized.
299+
300+
```{r}
301+
#| label: visit-sequence
302+
303+
visit <- c("age", "hgt", "wgt", "hc", "gen", "phb", "tv", "reg", "bmi")
304+
```
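The ordering requirement can be checked before running the synthesis. The helper below is a hypothetical sketch in base `R` (`check_passive_order()` is not part of `synthpop`): it treats every method entry that starts with `~` as a passive formula, extracts the variables used in it, and checks that they all come earlier in the visit sequence. The small `method_example` and `visit_example` vectors are made up for illustration.

```r
# Hypothetical helper, not part of synthpop: TRUE for every passive variable
# whose inputs all appear earlier in the visit sequence.
check_passive_order <- function(method, visit) {
  passive <- names(method)[startsWith(method, "~")]
  vapply(passive, function(v) {
    inputs <- all.vars(as.formula(method[[v]]))   # e.g. "wgt" and "hgt" for bmi
    isTRUE(all(match(inputs, visit) < match(v, visit)))
  }, logical(1))
}

# Made-up method vector with one passive variable
method_example <- c(
  age = "norm", hgt = "norm", wgt = "norm",
  bmi = "~I(wgt/(hgt/100)^2)"
)
visit_example <- c("age", "hgt", "wgt", "bmi")

check_passive_order(method_example, visit_example)                 # bmi comes last
check_passive_order(method_example, c("age", "bmi", "hgt", "wgt")) # bmi comes too early
```

A `FALSE` entry signals that the passive variable would be computed before its inputs exist in the synthetic data.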
288305

289306
---
290307

291-
__6. Use `synthpop()` to create a synthetic data set in an object called `syn_passive` using the adjusted `method` vector. Again, use `seed = 123` if you want to replicate our results. Inspect the output.__
308+
__6. Use `syn()` from the `synthpop` package to create a synthetic data set in an object called `syn_passive` using the adjusted `method` vector and `visit.sequence`. Again, use `seed = 123` if you want to replicate our results. Inspect the output.__
292309

293310
:::{.callout-tip title="Show solution" collapse=true}
294311

@@ -298,6 +315,7 @@ __6. Use `synthpop()` to create a synthetic data set in an object called `syn_pa
298315
syn_passive <- syn(
299316
data = data,
300317
method = method,
318+
visit.sequence = visit,
301319
seed = 123,
302320
print.flag = FALSE
303321
)
@@ -378,7 +396,7 @@ If you want to learn how this happens in practice, it is probably easiest to che
378396
Using this procedure, you can combine modelling functions that are self-written or stem from alternative packages with the synthesis procedure in `synthpop`.
379397
This also makes it possible to model, for example, a hierarchical structure using the `R` packages `lme4` [@lme4], `glmmTMB` [@glmmTMB] or, if you prefer a Bayesian procedure, `brms` [@brms].
380398
In such instances, one can incorporate the hierarchical structure by replacing the single-level models with their hierarchical counterparts from these packages.
381-
Since no one has done this for synthetic data yet, relevant resources do not seem to exist, but feel free to reach out if you plan on doing this.
399+
Since, as far as we are aware, no one has done this for synthetic data yet, relevant resources do not seem to be widely documented, but feel free to reach out if you plan on doing this.
382400
To read more about modelling hierarchical data in the context of imputation for missing data, see @yucel_sequential_2017 and @speidel_hmi_2020.
383401

384402
# Conclusion

index.qmd

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -13,12 +13,13 @@ Moreover, realistic synthetic data can be used in teaching, or for starting with
1313
All in all, synthetic data makes open science practices easier and might spark collaborations with potential data users.
1414

1515

16-
The tutorial is intended to take approximately 2-3 hours to complete, and is split into the following sections:
16+
The tutorial focuses on the generation of tabular synthetic data. This excludes the generation of text data or other high-dimensional data sources (fMRI, intensive longitudinal data, etc.). Some extensions to multilevel data are discussed, but these are not the main focus of the materials. In total, the tutorial is intended to take approximately 2-3 hours to complete, and is split into the following sections:
1717

1818
1. [Statistical Disclosure Control](sdc/) provides a very brief introduction to methods used to protect individuals’ private information, while keeping the released data useful for analysis.
1919
2. [Synthetic data: The general idea](synthetic/) conceptually introduces the idea of synthetic data and contains an optional section on coding your own simple synthesizer.
2020
3. [Generating synthetic data](generation/) introduces the idea of synthetic data and outlines how it can be generated in `R`.
2121
4. [Evaluating synthetic data quality](evaluation/) addresses the privacy-utility trade-off, and discusses how the quality of synthetic data can be evaluated from both sides of this trade-off.
22+
5. [Publishing and using synthetic data](publishing/) discusses when and how to publish synthetic data safely.
2223

2324
At the end of this tutorial, you will know what synthetic data is and why it is useful, have experience with generating synthetic data, and know how to think about whether the data is fit for release.
2425

publishing/index.qmd

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,80 @@
1+
---
2+
title: "Publishing and using synthetic data"
3+
bibliography: ../references.bib
4+
page-navigation: true
5+
---
6+
7+
# Publishing synthetic data
8+
9+
_There are currently (beginning of 2026) no widely accepted guidelines on publishing synthetic data, so the recommendations below reflect our own convictions. You may disagree with them, and any criticism is much appreciated. If you have suggestions or concerns, please feel free to create a [GitHub issue here](https://github.com/lmu-osc/synthetic-data-tutorial/issues)._
10+
11+
## Safe dissemination
12+
13+
For data disseminators, it is essential that privacy protection is embedded in the synthetic data generation pipeline.
14+
First, the data needs to be anonymized appropriately (see [our other course on this topic](https://lmu-osc.github.io/data-anonymization/3_PROTECTION/3_1_Data_security_fundamentals.html)).
15+
Additionally, it is important that no sensitive information is accidentally leaked through the synthetic data.
16+
The most obvious place to start is to evaluate what information is present in the synthesis models.
17+
For complex, non-parametric models, this might be hard to determine, but for simpler parametric models, this is often doable.
18+
For example, when synthetic data is created using linear regression models with normal errors, all information that flows into the model is contained in the regression coefficients, which are based on the means, variances and covariances of the observed data.
19+
If no individual has too large an effect on any of these parameters, then there is typically sufficient noise to protect the privacy of the sampled individuals.
20+
For more complex, or non-parametric models, it is good to consider how flexible the used model is.
21+
As we have seen, overly flexible models may reproduce the original data, thus producing synthetic data that is too close to the original.
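For a linear synthesis model, the influence of individual records on the regression coefficients can be inspected with standard leave-one-out diagnostics. The sketch below uses simulated data and `dfbetas()` from base `R`. Using this diagnostic for disclosure checks, and the cut-off of 1, are our own illustrative assumptions rather than an established rule for synthetic data.

```r
set.seed(123)
n <- 200

# Simulated stand-in for observed data (hypothetical values)
dat <- data.frame(hgt = rnorm(n, mean = 150, sd = 20))
dat$wgt <- -40 + 0.6 * dat$hgt + rnorm(n, sd = 8)

# Make the first record unusual in both height and weight
dat$hgt[1] <- 200
dat$wgt[1] <- 20

# A linear synthesis model for wgt given hgt
fit <- lm(wgt ~ hgt, data = dat)

# Standardized change in each coefficient when a single record is left out
std_infl <- dfbetas(fit)

# Records whose removal shifts a coefficient by more than one standard error
# (illustrative threshold, chosen for this example only)
flagged <- which(apply(abs(std_infl), 1, max) > 1)
flagged
```

Records that are flagged deserve a closer look before synthesis, for example by checking whether they are outliers that should be removed or top-coded.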
22+
23+
When in doubt about potentially remaining privacy risks, think about the sensitivity of the data and whether there are any pre-processing steps that you want to take (such as removing outliers, or removing or collapsing very rare categories).
24+
For very sensitive data, you might also consider employing differentially private synthesis techniques (see the [section on statistical disclosure control here](evaluation/)).
25+
It is always advisable to complete all steps of the statistical disclosure control tools (e.g., the `sdc()` function) that are implemented in `synthpop`.
26+
This can flag potential problems, which in turn provide a starting point for revisiting your synthesis models.
27+
Finally, errors may occur, as in any data analysis pipeline, and it is important to remain critical to spot these.
28+
Always prioritize privacy over data utility, as data users can always request access to the observed data (potentially through a secure server or by planning a research visit).
29+
However, once unsafe data is published, it is impossible to fix the mistake.
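Collapsing very rare categories, one of the pre-processing steps mentioned above, can be done in a few lines of base `R`. The function `collapse_rare()`, its threshold `min_n` and the example variable are hypothetical and serve only as a sketch. What counts as "rare" depends on the data and its sensitivity.

```r
# Hypothetical helper: merge all factor levels with fewer than `min_n`
# observations into a single level called `other`.
collapse_rare <- function(x, min_n = 10, other = "other") {
  x <- as.factor(x)
  counts <- table(x)
  rare <- names(counts)[counts < min_n]
  # Assigning the same label to several levels merges them
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Made-up example: two regions occur only a handful of times
reg <- factor(c(rep("north", 30), rep("south", 25), rep("east", 3), rep("west", 2)))
reg_collapsed <- collapse_rare(reg, min_n = 5)
table(reg_collapsed)
```

Applying such a step before synthesis means the rare categories never enter the synthesis model, so they cannot be reproduced in the synthetic data.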
30+
31+
## Practical advice on sharing synthetic data
32+
33+
For every release of synthetic data, data users would want to know what the synthetic data can be used for.
34+
In principle, synthetic data suffers from the same problems as the collected data (in terms of sampling bias and measurement error).
35+
On top of this comes additional uncertainty regarding the quality of the synthesis procedure.
36+
It is thus important to be explicit on how the synthetic data was generated, as this will provide some guidance on what the synthetic data can be reasonably used for.
37+
If you generated synthetic data using solely linear models, then say so explicitly, because researchers should then not attempt to evaluate non-linear effects that were not modelled.
38+
By explicitly modelling certain non-linear effects, you will increase the likelihood that these are indeed present in the synthetic data.
39+
When using non-parametric models, you probably cannot be sure which effects are preserved in the synthetic data.
40+
This is okay: a data user should not expect that everything is possible with synthetic data.
41+
In any case, do not publish the synthesis models.
42+
That is, do not store the synthesis model and disseminate it, because it will contain additional information about the original data that is not contained in the synthetic data.
43+
However, the code used to generate the synthetic data can be safely published, because this documents how the data were generated, but does not reveal what was learned from the original data.
44+
If you do this, make sure that there is no information on individual subjects hidden in the code file (e.g., code to remove subject X with address Y and sensitive attribute Z).
45+
46+
47+
In principle, one should not publish more than two or three files.
48+
First, of course, is the synthetic data itself.
49+
Flag this file clearly as a synthetic data set.
50+
It is good practice to state it in the file name and the meta-data or a readme file, so that users do not accidentally confuse it with the real data.
51+
You might further consider prefixing all variables with the string "synthetic".
52+
There is only so much you can do in terms of enhancing transparency, and any ill-intentioned user can remove this information and further disseminate it.
53+
Second is any meta-data that applies to the data at hand.
54+
This can be a relevant codebook, or another file that is required to interpret the data at hand.
55+
Even though the underlying data is synthetic, users should be able to understand it before they can use it.
56+
If the collected data can be made accessible to trusted parties, then make this explicit in the meta-data as well.
57+
For example, there might be a secure server to which these parties can request access, or perhaps someone can request a research visit to work with the source data.
58+
Finally, the script used to generate the synthetic data can be disseminated, but make sure that you do not accidentally leak information from the original data (either through the code file itself or through automatically saved history files).
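The flagging advice above (file name, variable prefix, readme) could look as follows in `R`. This is a minimal sketch: the data frame, the file names and the readme text are made up, and the files are written to a temporary directory.

```r
# Made-up stand-in for a synthetic data set
syn_data <- data.frame(age = c(10.2, 14.1, 17.8), hgt = c(140.3, 165.8, 181.0))

# Prefix every variable name so the synthetic origin survives in any subset
names(syn_data) <- paste0("synthetic_", names(syn_data))

# State the synthetic nature in the file name as well
out_dir <- tempdir()
data_file <- file.path(out_dir, "boys_synthetic.csv")
write.csv(syn_data, data_file, row.names = FALSE)

# A short readme that repeats the flag and points to the access procedure
readme_file <- file.path(out_dir, "README.txt")
writeLines(
  c(
    "This file set contains SYNTHETIC data, not the collected data.",
    "See the codebook for variable definitions.",
    "Access to the collected data can be requested from the data owners."
  ),
  readme_file
)
```

Repeating the flag in several places makes accidental confusion with the collected data less likely, although it cannot prevent deliberate removal.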
59+
60+
61+
# Using synthetic data
62+
63+
64+
Synthetic data is not real data, and all results obtained from the synthetic data should be interpreted with care.
65+
Hence, do not publish results that are based solely on synthetic data (in applied research projects).
66+
For empirical illustrations, for instance in methodological research, this problem seems less severe, but it would be hard to state something about the accuracy of the results.
67+
Typically, the synthetic data is merely an intermediate data source, and you might want to consider running your analysis script on the actual data.
68+
If so, you might request access to the data, or prepare your analysis script on the synthetic data and request that the original authors run it on the observed data.
69+
If you do this, please keep the following in mind.
70+
71+
While synthetic data is not real data, it is based on the real data, which comes with some responsibilities.
72+
It might be tempting to think that because the data is not real, questionable research practices such as $p$-hacking and HARKing (hypothesizing after the results are known) cannot be a problem.
73+
However, noise that leads to spurious effects in the observed data may be reproduced in the synthetic data.
74+
The extent to which data synthesis prevents problems related to questionable research practices still needs to be investigated.
75+
Hence, if you want to use synthetic data to pre-register your own study, use it to determine whether your planned analysis can be reasonably executed, but not to determine which hypotheses to evaluate.
76+
If you do use the synthetic data for exploratory purposes, there is no way to circumvent the necessity of having to collect new data to do confirmatory tests of the hypotheses of interest.
77+
78+
Finally, remember that the generation of synthetic data is still a laborious process.
79+
If you use synthetic data, please provide proper attribution to the creators, and please, __please__, inform them if you suspect any privacy issues.
80+

references.bib

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,17 @@
1+
@article{bowen_dp_synthesis_2020,
2+
author = {Claire McKay Bowen and Fang Liu},
3+
journal = {Statistical Science},
4+
number = {2},
5+
doi={10.1214/19-STS742},
6+
pages = {280--307},
7+
publisher = {Institute of Mathematical Statistics},
8+
title = {Comparative Study of Differentially Private Data Synthesis Methods},
9+
urldate = {2026-01-19},
10+
volume = {35},
11+
year = {2020}
12+
}
13+
14+
115
@Article{brms,
216
title = {{brms}: An {R} Package for {Bayesian} Multilevel Models Using {Stan}},
317
author = {Paul-Christian Bürkner},
@@ -27,6 +41,15 @@ @Misc{densityratio
2741
doi = {10.5281/zenodo.13881689},
2842
}
2943

44+
@misc{desfontainesblog20180816,
45+
title = {Differential privacy in (a bit) more detail},
46+
author = {Damien Desfontaines},
47+
howpublished = {\url{https://desfontain.es/blog/differential-privacy-in-more-detail.html}},
48+
note = {Ted is writing things (personal blog)},
49+
year = {2018},
50+
month = {08}
51+
}
52+
3053
@misc{Desfontaines_2023,
3154
title={A friendly, non-technical introduction to Differential Privacy},
3255
url={https://desfontain.es/blog/friendly-intro-to-differential-privacy.html},
