Commit 3b2b0a4

Update v1.7.0 (#21)
The 1.7.0 update introduces the new "rich" console mode, based on the Rich library, and refines the metrics output interface, allowing dynamic switching between modes without excess tech debt. The analysis target variable pattern has been completely overhauled to use a configurable object that handles all labels relevant to downstream task analyses, along with confounders, with finer-grained control. In addition, we introduce a timeout feature that allows long-running metrics to be interrupted and skipped, we improve error handling and configurability, and we add Maximum Mean Discrepancy, along with Feature Importance Overlap, to the metric library.

**Refactor:**
* Changed the `format_output` method in all metric classes to return a list of tuples suitable for rich console output, replacing the previous string-based formatting. This affects the metric template, the core metric class, and all metric implementations.
* Changed the `analysis_target_var` to an `analysis_target` object: parsing is handled automatically to retain the simple interface, but the target can also be passed as an object for finer-grained control. See the new section in `guides/syntheval_guide.ipynb` for a tutorial on how this object can be used.

**New Features:**
* Added a `console` argument for the main class, which selects `rich` (new) or `ascii` (legacy) formatting for the console print, or turns it `off` entirely. We added a check that automatically switches from `rich` to `ascii` in a notebook environment, to prevent crashing the terminal.
* Added a timeout feature based on `asyncio` for the main evaluation loop, which allows metrics that take too long to complete to be interrupted and skipped. By default, the timeout is *not* enabled.
* Added a `plot_figures` attribute to metric classes, allowing users to control figure plotting separately from verbosity.
* Added a corresponding argument `enable_plots` to the SynthEval class, to control plotting in the main evaluation loop.
* Added a `missing_directive` argument to the SynthEval class, so that the user can control whether SynthEval should raise a warning, drop rows with missingness, or ignore the missingness and carry on. We added a short discussion of missingness to the `guides/preprocessing.md` guide, for users interested in other solutions.
* Added the Maximum Mean Discrepancy (MMD) metric, recording both the biased and unbiased versions of the statistic. A detailed explanation and reference are added in `guides/metrics_references.md`.
* Added the Feature Importance Overlap (FIO) metric, which checks two properties related to feature selection/ranking tasks: that the importance values assigned by a predictive model match (mean absolute error and weighted mean absolute error), and that the features recovered in a ranking at 5%, 10%, 25% and 50% of features are the same. The metric can also plot the top feature importance scores. A description of this metric is added in `guides/metrics_references.md`.
* The increased flexibility of the new `analysis_target` object allowed some important changes in the classification accuracy, AUROC difference, attribute disclosure risk, and statistical parity metrics, which now account for multiple potential labels and dynamically remove confounders in prediction tasks (this is also considered in the new FIO metric).

**Documentation:**
* Minor details were added to the `README.md`.
* Major reformatting of the metrics overview in `README.md` into tables with metric keywords and links to the method documentation and, in the few instances where applicable, to `guides/metrics_references.md`.
* Guide codebooks were refreshed with the latest features, new metrics, and nicer printing. `guides/syntheval_guide.ipynb` now includes a new part on the `analysis_target` object.
* New `guides/preprocessing.md` added to document preprocessing steps.
* Adjusted a number of the metric docstrings.

**Changes:**
* Improved error handling in several metric scripts by raising `ValueError` instead of printing warnings or passing on failed assertions, ensuring clearer feedback for users and that errors are raised in the active console.
* Changed the default ranking system for the benchmark method from `linear` (min-max sum) to `summation` (flat sum).
* Changed the default preprocessing for the PCA metric from "mean" to "std" for consistency.
* Changed the default F1 setting in the classification metric from `micro` to `weighted`, for more saturation-aware behaviour on imbalanced classification problems.
* Changed the default confidence interval unit for the CIO and DWM metrics from `sem` to `std`. The new option `ci` can be used to switch back to the old behaviour if needed.
* The attribute disclosure metric had the `sensitive` argument removed; the sensitive attributes are now parsed through the `analysis_target` object.
* The statistical parity metric similarly had the `protected_attribute` argument removed, and now also uses the `sensitive_vars` attribute in the `analysis_target` object. In addition, the statistical parity metric can now evaluate multiple protected attributes.

**Bug fixes:**
* The Hellinger distance metric had a division-by-zero error when determining the bin width when the interquartile range was 0: this case is now caught, and binning in the non-monotonic case is handled using variable bin widths.
* Adding the timeout feature caused a number of warnings from plotting outside of the main thread on certain versions of matplotlib with tkinter; this is fixed by using the 'Agg' backend for plotting. In addition, `asyncio` caused a notebook crash on some versions, so we added a catch that replaces direct event-loop calls with a helper using coroutines if a loop is already running.
* Improved type casting to meet NumPy 2.x scalar handling (tests broke because of previously lazy type handling).
* The attribute disclosure risk, membership inference attack, and Kolmogorov-Smirnov test metrics would return an undefined standard deviation when computing it on lists of length 1 (not a major problem, but it would throw warnings); we added catches to avoid this.
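As an illustration of the timeout feature described under New Features, the following is a minimal sketch of how a blocking metric can be skipped with `asyncio`. It is not the SynthEval implementation: `toy_metric`, `run_with_timeout`, and `evaluate_with_timeout` are hypothetical names, and returning `None` for a skipped metric is an assumption made for the example.

```python
import asyncio
import time


def toy_metric(seconds: float) -> str:
    """Hypothetical stand-in for a blocking metric computation."""
    time.sleep(seconds)
    return "done"


async def run_with_timeout(func, *args, timeout=None):
    """Run a blocking function in a worker thread and stop waiting after `timeout` seconds.

    timeout=None waits indefinitely, i.e. the timeout is disabled.
    Note: the worker thread itself is not killed on timeout; we only stop
    waiting for it, so the abandoned call still runs to completion.
    """
    try:
        return await asyncio.wait_for(asyncio.to_thread(func, *args), timeout=timeout)
    except asyncio.TimeoutError:
        return None  # assumption: None marks a skipped metric


async def _evaluate_all(metrics, timeout):
    results = {}
    for name, (func, args) in metrics.items():
        results[name] = await run_with_timeout(func, *args, timeout=timeout)
    return results


def evaluate_with_timeout(metrics, timeout=None):
    """Synchronous entry point; assumes no event loop is already running."""
    return asyncio.run(_evaluate_all(metrics, timeout))


if __name__ == "__main__":
    metrics = {"fast": (toy_metric, (0.01,)), "slow": (toy_metric, (0.5,))}
    print(evaluate_with_timeout(metrics, timeout=0.2))
```

Calling `asyncio.run` fails when an event loop is already running, as in a notebook, which is the situation the bug fix above addresses with a separate code path.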
1 parent 995b43b commit 3b2b0a4

43 files changed

Lines changed: 2857 additions & 1304 deletions


.github/workflows/doctests.yml

Lines changed: 5 additions & 1 deletion
@@ -25,7 +25,11 @@ jobs:
 - uses: actions/checkout@v4
 - uses: actions/setup-python@v5
   with:
-    python-version: ">=3.9"
+    python-version: ">=3.9"
+- name: Install OpenBLAS
+  run: |
+    sudo apt-get update
+    sudo apt-get install -y libopenblas-dev
 - name: Install dependencies
   run: |
     python -m pip install --upgrade pip

.github/workflows/release.yml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ name: Upload Python Package

 on:
   release:
-    types: [published]
+    types: [released, prereleased]

 permissions:
   contents: read

README.md

Lines changed: 36 additions & 32 deletions
@@ -40,9 +40,9 @@ from syntheval import SynthEval
 evaluator = SynthEval(df_real, holdout_dataframe = df_test, cat_cols = class_cat_col)
 evaluator.evaluate(df_fake, class_lab_col, presets_file = "full_eval", **kwargs)
 ```
-Where the user supplies <code>df_real, df_test, df_fake</code> as pandas dataframes, the <code>class_cat_col</code> is a complete list of column names (which can be omitted for categoricals to be automatically inferred). Some metrics require a target class, so <code>class_lab_col</code> is a string for designating one column with discrete values as a target for usability predictions and colouration. In the evaluate function, a presets file can be chosen ("full_eval", "fast_eval", or "privacy") or alternatively, a filepath can be supplied to a json file with select metrics keywords. Finally, instead of (or in addition to), keyword arguments can be added in the end with additional metrics and their options.
+Where the user supplies <code>df_real, df_test, df_fake</code> as pandas dataframes, the <code>class_cat_col</code> is a complete list of column names (which can be omitted for categoricals to be automatically inferred). Some metrics require a target class, so <code>class_lab_col</code> is a string (or list or [AnalysisConfig](guides/syntheval_guide.ipynb#Analysis-Target-Configuration) object) for designating variables as a target for downstream task prediction and plotting colouration. In the evaluate function, a presets file can be chosen ("full_eval", "fast_eval", or "privacy") or alternatively, a filepath can be supplied to a json file with select metrics keywords. Finally, instead of (or in addition to), keyword arguments can be added in the end with additional metrics and their options.

-New in version 1.4 is the benchmark module, that allows a directory of synthetic datasets to be specified for evaluation (or a dictionary of dataframes). All datasets in the folder are evaluated against the training (and test) data on the selected metrics. Three types of rank-derived scoring are available to choose between ("linear", "normal", or "quantile"), assisting in identifying datasets that perform well overall, and on utility and privacy dimensions.
+Version 1.4 introduced the benchmark module, that allows a directory of synthetic datasets to be specified for evaluation (or a dictionary of dataframes). All datasets in the folder are evaluated against the training (and test) data on the selected metrics. Three types of rank-derived scoring are available to choose between ("linear", "normal", or "quantile"), assisting in identifying datasets that perform well overall, and on utility and privacy dimensions.
 ```python
 evaluator.benchmark('local/path_to/target_dir/', class_lab_col, presets_file = "full_eval", rank_strategy='normal', **kwargs)
 ```
@@ -54,6 +54,9 @@ For more details on how to use the library, see the codebooks below;
 | [Tutorial 1](guides/syntheval_guide.ipynb) | Get started, basic examples |
 | [Tutorial 2](guides/syntheval_benchmark.ipynb) | Dataset benchmark, evaluating and ranking synthetic datasets in bulk |
 | [Tutorial 3](https://github.com/schneiderkamplab/syntheval-model-benchmark-example/blob/main/syntheval_model_benchmark.ipynb) | Model benchmark example, evaluating and ranking models |
+| --- | --- |
+| [Preprocessing Reference](guides/preprocessing.md) | Documentation on data preprocessing steps |
+| [Metrics Reference](guides/metrics_references.md) | Documentation of the newer metrics that are not covered in the SynthEval paper |

 ### Command line interface
 SynthEval can also be run from the commandline with the following syntax:
@@ -76,46 +79,47 @@ Options:
 ## Included metrics overview
 The SynthEval library comes equipped with a broad selection of metrics to evaluate various aspects of synthetic tabular data. One of the more interesting properties that makes SynthEval stand out is that many of the metrics have been carefully adapted to accept heterogeneous data. Distances between datapoints are (by default) handled using Gower's distance/similarity measure rather than the Euclidean distance, which negates any requirement of special data encoding.

+The metrics are divided into three categories, utility, privacy and fairness, and the results are reported in a standardised format with an average value and an error estimate (where applicable). In addition to using the default preset files, users can select specific metrics and options by using the metric keywords in the evaluate function. The following table gives an overview of the available metrics, their keywords, and links to documentation for each metric.
+
 ### Utility Metrics
 Utility analysis entails resemblance, quality and usability metrics testing how well the synthetic data looks like, behaves like, and substitutes like the real data.

-In the code we implemented:
-- Dimension-Wise Means (nums. only, avg. value and plot)
-- Principal Components Analysis (nums. only, plot of first two components)
-- Confidence Interval Overlap (nums. only, number and fraction of significant tests)
-- Correlation Matrix Difference (mixed correlation)
-- Mutual Information Matrix Difference
-- Kolmogorov–Smirnov / Total Variation Distance test (avg. distance, avg. p-value and number and fraction of significant tests)
-- Hellinger Distance (avg. distance)
-- Propensity Mean Squared Error (pMSE and accuracy)
-- Prediction AUROC difference (for binary target variables only)
-- Nearest Neighbour Adversarial Accuracy (NNAA)
-
-### classification accuracy
-In this tool we test useability by training four different <code>sklearn</code> classifiers on real and synthetic data with 5-fold cross-validation (testing both models on the real validation fold).
-- DecisionTreeClassifier
-- AdaBoostClassifier
-- RandomForestClassifier
-- LogisticRegression
-
-The average accuracy is reported together with the accuracy difference from models trained on real and synthetic data. If a test set is provided, the classifiers are also trained once on the entire training set, and again the accuracy and accuracy differences are reported, but now on the test data.
-
-By default the results are given in terms of accuracy (micro F1 scores). To change, use ‘micro’, ‘macro’ or ‘weighted’ in the preset file or in kwargs.
+| keyword | metric name | link to docs | description |
+| --- | --- | --- | --- |
+| `dwm` | Dimension-Wise Means | [DimensionWiseMeans](src\syntheval\metrics\utility\metric_dimensionwise_means.py) | nums. only, avg. value and plot |
+| `pca` | Principal Components Analysis | [PrincipalComponentsAnalysis](src\syntheval\metrics\utility\metric_principal_component_analysis.py) | [Text Documentation](guides/metrics_references.md#Inter-Dataset-Similarity-Metric-Based-on-PCA) |
+| `cio` | Confidence Interval Overlap | [ConfidenceIntervalOverlap](src\syntheval\metrics\utility\metric_confidence_interval_overlap.py) | nums. only, number and fraction of significant tests |
+| `corr_diff` | Correlation Matrix Difference | [MixedCorrelation](src\syntheval\metrics\utility\metric_mixed_correlation.py) | mixed correlation |
+| `mi_diff` | Mutual Information Matrix Difference | [MutualInformation](src\syntheval\metrics\utility\metric_mutual_information.py) | mixed correlation |
+| `ks_test` | Kolmogorov–Smirnov / Total Variation Distance test | [KolmogorovSmirnov](src\syntheval\metrics\utility\metric_kolmogorov_smirnov.py) | avg. distance, avg. p-value and number and fraction of significant tests |
+| `h_dist` | Hellinger Distance | [HellingerDistance](src\syntheval\metrics\utility\metric_hellinger_distance.py) | avg. distance |
+| `p_MSE` | Propensity Mean Squared Error | [PropensityMeanSquaredError](src\syntheval\metrics\utility\metric_propensity_mse.py) | pMSE and accuracy |
+| `auroc_diff` | Prediction AUROC difference | [PredictionAUROCDifference](src\syntheval\metrics\utility\metric_auroc_difference.py) | for binary target variables only |
+| `cls_acc`| Classification Accuracy | [ClassificationAccuracy](src\syntheval\metrics\utility\metric_accuracy_difference.py) | avg. TRTR, TSTR across four classifiers, with optional holdout data and 5-fold cross-validation |
+| `fio` | Feature Importance Overlap | [FeatureImportanceOverlap](src\syntheval\metrics\utility\metric_feature_importance_overlap.py) | [Text Documentation](guides/metrics_references.md#Feature-Importance-Overlap-(FIO)) |
+| `nnaa` | Nearest Neighbour Adversarial Accuracy | [NearestNeighbourAdversarialAccuracy](src\syntheval\metrics\privacy\metric_nn_adversarial_accuracy.py) | avg. NNAA across all records |
+| `q_mse` | Quantile MSE | [QuantileMSE](src\syntheval\metrics\utility\metric_quantile_mse.py) | [Text Documentation](guides/metrics_references.md#Quantile-MSE) |
+| `mmd` | Maximum Mean Discrepancy (MMD) | [MaximumMeanDiscrepancy](src\syntheval\metrics\utility\metric_max_mean_discrepancy.py) | [Text Documentation](guides/metrics_references.md#Maximum-Mean-Discrepancy-(MMD)) |
+

 ### Privacy Metrics
 Privacy is a crucial aspect of evaluating synthetic data, we include only three highlevel metrics with more to be added in the future.
-- Nearest Neighbour Distance Ratio (NNDR)
-- Privacy Losses (difference in NNAA and NNDR between test and training sets, good for checking overfitting too.)
-- Median Distance to Closest Record (normalised by internal NN distance.)
-- Hitting Rate (for numericals defined to be within the attribute range / 30)
-- Epsilon identifiability risk (calculated using weighted NN distance)
-- Membership Inference Attack
-- Attribute Disclosure Risk (with or without holdout data)
+
+| keyword | metric name | link to docs | description |
+| --- | --- | --- | --- |
+| `nndr` | Nearest Neighbour Distance Ratio | [NearestNeighbourDistanceRatio](src\syntheval\metrics\privacy\metric_nn_distance_ratio.py) | avg. NNDR across all records |
+| `dcr` | Median Distance to Closest Record | [MedianDistanceToClosestRecord](src\syntheval\metrics\privacy\metric_distance_closest_record.py) | normalised by internal NN distance |
+| `hit_rate` | Hitting Rate | [HittingRate](src\syntheval\metrics\privacy\metric_hitting_rate.py) | hits on numericals are within attribute range / 30 |
+| `eps_risk` | Epsilon Identifiability Risk | [EpsilonIdentifiabilityRisk](src\syntheval\metrics\privacy\metric_epsilon_identifiability.py) | calculated using weighted NN distance |
+| `mia` | Membership Inference Attack | [MIAClassifier](src\syntheval\metrics\privacy\metric_MIA_classification.py) | worst case adversarial knowledge attack |
+| `att_discl` | Attribute Disclosure Risk | [AttributeDisclosure](src\syntheval\metrics\privacy\metric_AttrDis.py) | with or without holdout data |

 ### Fairness Metrics
 Fairness is an emerging property of synthetic data, we recently added support to evaluate this aspect, and include for now:
-- Statistical Parity Difference (Also known as Demographic Parity)

+| keyword | metric name | link to docs | description |
+| --- | --- | --- | --- |
+| `statistical_parity` | Statistical Parity Difference | [StatisticalParity](src\syntheval\metrics\fairness\metric_statistical_parity.py) | also known as Demographic Parity |

 ## Creating new metrics
 SynthEval is designed with modularity in mind. Creating new, custom metrics is as easy as copying the [metrics template file](https://github.com/schneiderkamplab/syntheval/blob/main/src/syntheval/metrics/metric_template.py), and filling in the five required functions. Because SynthEval has very little hardcoding wrt. the metrics, making new metrics work locally should require no changes other than adding the metrics script in the metrics folder.

Lines changed: 30 additions & 1 deletion
@@ -26,4 +26,33 @@ $$
 where $x_j$ is the estimated probability in each of the $N_{quant}$ quantiles. The metric is calculated for each column in the data, and the average is taken over all columns. The metric is used to evaluate the distribution of the synthetic data, and a low value indicates that the synthetic data is a good representation of the real data.

 Reference:
-> Butter, A., Diefenbacher, S., Kasieczka, G., Nachman, B., & Plehn, T. (2021). GANplifying event samples. SciPost Physics, 10(6), 139. [10.21468/SciPostPhys.10.6.139](https://doi.org/10.21468/SciPostPhys.10.6.139)
+> Butter, A., Diefenbacher, S., Kasieczka, G., Nachman, B., & Plehn, T. (2021). GANplifying event samples. SciPost Physics, 10(6), 139. [10.21468/SciPostPhys.10.6.139](https://doi.org/10.21468/SciPostPhys.10.6.139)
+
+### Maximum Mean Discrepancy (MMD)
+MMD is a kernel-based distance measure between two distributions (in our case the real and synthetic data sets). It comes in two flavors: the biased V-statistic and the unbiased U-statistic. The biased V-statistic is calculated as follows:
+
+$$
+\text{MMD}_b^2 = \frac{1}{n^2} \sum_{i,j} k(x_i, x_j) + \frac{1}{m^2} \sum_{i,j} k(y_i, y_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j),
+$$
+
+where $k$ is a kernel function, and $x_i$ and $y_j$ are samples from the two distributions. The unbiased U-statistic is calculated as follows:
+
+$$
+\text{MMD}_u^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{m(m-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{nm} \sum_{i,j} k(x_i, y_j),
+$$
+
+where the sums are taken over all pairs of samples, excluding the diagonal terms. Both measures can be negative at finite sample sizes due to variance, so it is clipped at $0$ for stability, i.e., $\max(MMD^2,0)$. A lower value of MMD indicates that the two distributions are similar (yet negative values cannot be compared by magnitude).
+
+As a test statistic, MMD can be used to perform a two-sample test to determine if the two distributions are significantly different. In this case, a value higher than some threshold (determined by the distribution of MMD under the null hypothesis) would indicate that the two distributions are significantly different. In the context of synthetic data evaluation, a low MMD value would indicate that the synthetic data is not unreasonably different.
+
+Reference:
+> Gretton, A., Borgwardt, K.M., Rasch, M.J., Smola, A., Schölkopf, B., & Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25), 723–773. [http://jmlr.org/papers/v13/gretton12a.html](http://jmlr.org/papers/v13/gretton12a.html)
+
+### Feature Importance Overlap (FIO)
+FIO is a generic metric that measures the overlap in top-k selected features between a model trained on real data and a model trained on synthetic data in predicting the target analysis variable. The metric is calculated as follows:
+
+$$
+\text{FIO}_k = \frac{|\text{Top-$k$ features from real data} \cap \text{Top-$k$ features from synthetic data}|}{k},
+$$
+
+In the actual implementation we select top-5%, 10%, 25%, and 50% of the features (where possible), and return all successful results. A higher value of FIO indicates that the synthetic data is a good representation of the real data in terms of feature selection. However, for the lowest top-k selections (e.g., 5% or 10%), it should be *very close* to 1 for a good synthetic data set, while for higher top-k selections (e.g., 25% or 50%) it can be more divergent and still indicate a good synthetic data set.
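
The FIO formula in the hunk above can be illustrated with a small sketch that takes two vectors of feature importances (one from a model trained on real data, one from a model trained on synthetic data) and computes the top-$k$ overlap at a given fraction of the features. This is not the SynthEval code: the function name `fio_at_fraction`, rounding $k$ down, and returning `None` when fewer than one feature would be selected are assumptions made for the example.

```python
import math

import numpy as np


def fio_at_fraction(imp_real, imp_syn, fraction):
    """Top-k overlap of two importance vectors, with k = floor(fraction * n_features).

    Returns None when k < 1 (assumption: such a selection is "not possible").
    """
    imp_real = np.asarray(imp_real, dtype=float)
    imp_syn = np.asarray(imp_syn, dtype=float)
    if imp_real.shape != imp_syn.shape:
        raise ValueError("importance vectors must have the same length")

    k = math.floor(fraction * imp_real.size)  # assumption: round down
    if k < 1:
        return None

    # Indices of the k largest importances in each vector.
    top_real = set(np.argsort(-imp_real, kind="stable")[:k].tolist())
    top_syn = set(np.argsort(-imp_syn, kind="stable")[:k].tolist())
    return len(top_real & top_syn) / k


# Toy importances for 8 features (made-up numbers, for illustration only).
real = [0.30, 0.25, 0.15, 0.10, 0.08, 0.06, 0.04, 0.02]
syn = [0.28, 0.12, 0.26, 0.11, 0.05, 0.09, 0.06, 0.03]
scores = {f: fio_at_fraction(real, syn, f) for f in (0.05, 0.10, 0.25, 0.50)}
```

With only 8 features, the 5% and 10% selections contain no feature and are skipped, the top-25% sets share one of two features, and the top-50% sets coincide.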

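The two MMD estimators defined in the Maximum Mean Discrepancy section above can be sketched in NumPy as follows. The Gaussian (RBF) kernel, the `gamma` bandwidth parameter, and the function names are assumptions made for the example; the section does not state which kernel the metric uses. Both values are clipped at zero, as described there.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))  # guard against tiny negatives


def mmd_squared(x, y, gamma=1.0):
    """Return (biased, unbiased) squared MMD between samples x and y, clipped at 0."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)

    k_xx = rbf_kernel(x, x, gamma)
    k_yy = rbf_kernel(y, y, gamma)
    k_xy = rbf_kernel(x, y, gamma)

    # Biased V-statistic: plain means over all pairs, diagonal included.
    biased = k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()

    # Unbiased U-statistic: drop the diagonal (i == j) in the within-sample terms.
    unbiased = (
        (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
        + (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
        - 2.0 * k_xy.mean()
    )
    return max(biased, 0.0), max(unbiased, 0.0)


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))
similar = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution as `real`
shifted = rng.normal(2.0, 1.0, size=(200, 2))  # mean shifted by 2 in each dimension
```

Comparing `mmd_squared(real, similar)` with `mmd_squared(real, shifted)` should give a clearly larger value for the shifted sample under either estimator.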