Skip to content

Latest commit

 

History

History
151 lines (112 loc) · 9.71 KB

File metadata and controls

151 lines (112 loc) · 9.71 KB

synthesize

Generate a synthetic CSV that is statistically faithful to a source CSV. Runs stats + frequency on the source so synthesized columns reproduce its per-column attributes — frequency-weighted sampling for categorical columns, quartile-bucketed numeric/date generation, null-ratio preservation. With a Data Dictionary from describegpt --dictionary --infer-content-type, semantic Content Types pick realistic fake-rs fakers (names, emails, addresses, UUIDs, etc.) for non-enumerable columns. A dictionary relationships array preserves inter-column structure within each row — joint (functional dependencies like city/state/zip), ordered (monotonic chains like created_date ≤ closed_date) and correlated (numeric correlation via a Gaussian copula). Fully reproducible with --seed.

Table of Contents | Source: src/cmd/synthesize/mod.rs | 📇🎲🤖

Description | Examples | Usage | Synthesize Options | Common Options

Description

Generates a synthetic CSV that is statistically faithful to a source CSV.

synthesize analyzes with stats and frequency, then emits N rows of fake data that reproduce the source's per-column attributes:

  • Categorical / low-cardinality columns are reproduced by frequency-weighted sampling of their real value set — cardinality, weights and repetition structure are preserved exactly.
  • Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just its [min,max] range) is preserved.
  • Null ratios are reproduced per column.

When a Data Dictionary is supplied (via --dictionary, or generated on the fly with --infer-content-type), each column's semantic Content Type picks a realistic faker (names, emails, addresses, UUIDs, etc.) for columns that are NOT fully enumerated by frequency. For bounded-cardinality faker columns (cardinality < requested rows and below an internal cap of 100,000), a fixed pool of distinct fake values is pre-generated and sampled from, so the column's cardinality is preserved. For very high cardinality columns above this cap, a fresh fake value is generated per row instead — distinct count is approximate in that case.

When stats provides string-length statistics (min_length / max_length / avg_length / stddev_length) AND the column is routed to an unstructured text generator (lorem_*, free_text, or the no-faker fallback), synthesized values are truncated so their character lengths follow Normal(avg_length, stddev_length) clamped to [min_length, max_length]. This applies to unstructured pooled values as well — a low-cardinality free-text column still gets its generated pool entries truncated. Structured semantic fakers (email, name, uuid, phone, address parts, etc.) ignore these stats — truncating them would corrupt their format, so their pools are reproduced verbatim. Frequency- enumerated values are always reproduced verbatim and are never truncated.

When the Data Dictionary declares relationships, the named columns are generated jointly so inter-column structure survives into each output row:

  • joint — categorical / functional-dependency groups (e.g. city/state/zip). Whole value-tuples are sampled from the source by frequency, so only real co-occurring combinations are emitted.
  • ordered — columns that must keep a monotonic order within a row (e.g. created_date <= closed_date). The anchor column is generated from its own distribution; each later column is the anchor plus a non-negative gap drawn from the gap distribution learned from the source.
  • correlated — numeric columns whose correlation should be preserved. A Gaussian copula couples the columns while leaving each column's own distribution unchanged.

Relationships are read from the dictionary's relationships array — inferred by describegpt or hand-authored. Columns not named by any relationship are still generated independently. Pass --no-relationships to disable relationship modeling entirely.

With --seed, output is fully reproducible.

Examples

Pure statistical synthesis — no dictionary needed

qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

First, generate the Data Dictionary with describegpt

qsv describegpt data.csv --dictionary --infer-content-type --format JSON -o dict.json

Then layer in semantic fakers from the dictionary

qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

Let synthesize build the dictionary itself (needs an LLM API key)

qsv synthesize data.csv --infer-content-type -n 1000 > synthetic.csv

Preserve inter-column relationships declared in the dictionary (e.g. city/state/zip, created_date <= closed_date)

qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

For more examples, see tests.

See also https://github.com/dathere/qsv/wiki/AI-and-Documentation#synthesize

Usage

qsv synthesize [options] <input>
qsv synthesize --help

Synthesize Options

         Option           Type Description Default
 ‑‑dictionary  string Data Dictionary JSON file produced by describegpt --dictionary --infer-content-type --format JSON. Layers semantic Content Types onto generation. If omitted, generation is purely type/frequency-based.
 ‑‑infer‑content‑type  flag Generate the Data Dictionary on the fly by invoking describegpt --dictionary --infer-content-type on . Requires an LLM API key (QSV_LLM_APIKEY). Ignored if --dictionary is given.
 ‑n,
‑‑rows 
integer Number of synthetic rows to generate. 100
 ‑‑seed  integer RNG seed for fully reproducible output.
 ‑‑locale  string Locale for faker-backed columns. Case-insensitive. Supported: en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr. Sparse locales (those without per-category data in fake-rs) silently fall back to en data for the missing categories — e.g. lorem text under a non-en locale is still English, since only zh_cn has localized lorem data. en
 ‑‑freq‑limit  integer Frequency pool depth passed to the internal frequency run as --limit. A column is reproduced via exact frequency-weighted sampling only when its cardinality is fully captured within this limit; higher values reproduce more columns verbatim. 0 means unlimited. 100
 ‑‑stats‑options  string Extra options appended to the internal stats run. Note: cardinality, quartiles and date inference are always enabled — do not re-specify them here.
 ‑‑consistent‑fakes  flag For structured-faker columns with bounded cardinality (cardinality fully captured by frequency), build a stable source-value -> fake-value mapping so the same source value always produces the same fake in the output. Preserves the source frequency distribution and overrides the default "emit real values when frequency-enumerated" behavior for structured fakers (names, emails, addresses, etc.). Has no effect on unstructured columns (lorem_*, free_text, unknown), all-unique columns, or non-faker columns. Useful for deidentified synthesis where you want stable joins on the faked columns.
 ‑‑no‑relationships  flag Disable inter-column relationship modeling. Every column is generated independently even when the dictionary declares a relationships array.
 ‑‑joint‑cardinality‑cap  integer Maximum number of distinct value-tuples a joint relationship may have. A joint group above this cap falls back to independent generation (or aborts under --strict-relationships). 0 means unlimited. 100000
 ‑‑correlation‑threshold  float Minimum absolute Spearman correlation for a pair of columns to stay in a correlated relationship. Weakly-correlated members are dropped. 0.3
 ‑‑strict‑relationships  flag Abort instead of warning-and-degrading when a declared relationship fails validation.
 ‑j,
‑‑jobs 
integer Number of jobs to use for the internal stats and frequency runs.

Common Options

     Option      Type Description Default
 ‑h,
‑‑help 
flag Display this message
 ‑o,
‑‑output 
string Write output to instead of stdout.
 ‑d,
‑‑delimiter 
string The field delimiter for reading the input CSV. Must be a single character. (default: ,)

Source: src/cmd/synthesize/mod.rs | Table of Contents | README