synthesize

Generate a synthetic CSV that is statistically faithful to a source CSV. Runs stats + frequency on the source so synthesized columns reproduce its per-column attributes — frequency-weighted sampling for categorical columns, quartile-bucketed numeric/date generation, null-ratio preservation. With a Data Dictionary from describegpt --dictionary --infer-content-type, semantic Content Types pick realistic fake-rs fakers (names, emails, addresses, UUIDs, etc.) for non-enumerable columns. A dictionary relationships array preserves inter-column structure within each row — joint (functional dependencies like city/state/zip), ordered (monotonic chains like created_date ≤ closed_date) and correlated (numeric correlation via a Gaussian copula). Fully reproducible with --seed.

Table of Contents | Source: src/cmd/synthesize/mod.rs | 📇🎲🤖

Description | Examples | Usage | Synthesize Options | Common Options

Description ↩

Generates a synthetic CSV that is statistically faithful to a source CSV.

synthesize analyzes <input> with stats and frequency, then emits N rows of fake data that reproduce the source's per-column attributes:

Categorical / low-cardinality columns are reproduced by frequency-weighted sampling of their real value set — cardinality, weights and repetition structure are preserved exactly.
Numeric and date/datetime columns are reproduced with quartile buckets, so the shape of the distribution (not just its [min,max] range) is preserved.
Null ratios are reproduced per column.
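The three per-column rules above can be sketched in a few lines of Python. This is an illustration only, not the qsv implementation: the function names and the sample statistics are hypothetical, and drawing uniformly inside each quartile bucket is an assumption (the text only says that quartile buckets are used).

```python
import random

def sample_categorical(freq, rng):
    """Frequency-weighted draw from the observed value set."""
    values = list(freq)
    weights = [freq[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]

def sample_numeric(minimum, q1, median, q3, maximum, rng):
    """Pick one of the four quartile buckets (25% each), then draw inside it.

    Assumption: values are uniform within a bucket.
    """
    edges = [minimum, q1, median, q3, maximum]
    bucket = rng.randrange(4)
    return rng.uniform(edges[bucket], edges[bucket + 1])

def with_nulls(value, null_ratio, rng):
    """Replace the value with an empty field at the source's null ratio."""
    return "" if rng.random() < null_ratio else value

rng = random.Random(42)  # a fixed seed makes the run repeatable
status_freq = {"open": 70, "closed": 25, "pending": 5}  # hypothetical frequency table
rows = [
    (
        with_nulls(sample_categorical(status_freq, rng), 0.0, rng),
        with_nulls(round(sample_numeric(0, 10, 20, 50, 1000, rng), 2), 0.1, rng),
    )
    for _ in range(1000)
]
```

With these hypothetical statistics, about three quarters of the non-null amounts land at or below 50 even though the range extends to 1000, which is the "shape, not just range" property the quartile buckets are meant to keep.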

When a Data Dictionary is supplied (via --dictionary, or generated on the fly with --infer-content-type), each column's semantic Content Type picks a realistic faker (names, emails, addresses, UUIDs, etc.) for columns that are NOT fully enumerated by frequency. For bounded-cardinality faker columns (cardinality < requested rows and below an internal cap of 100,000), a fixed pool of distinct fake values is pre-generated and sampled from, so the column's cardinality is preserved. For very high cardinality columns above this cap, a fresh fake value is generated per row instead — distinct count is approximate in that case.
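A minimal sketch of the pool-versus-per-row decision described above. The `fake_email` helper is a hypothetical stand-in for a real faker, and treating every column that fails the pool condition as "fresh value per row" is an assumption; the text only spells out the behavior above the cap.

```python
import random

POOL_CAP = 100_000  # internal cap quoted in the text

def fake_email(rng):
    # Hypothetical stand-in for a semantic faker.
    return f"user{rng.randrange(10**9):09d}@example.com"

def build_column(cardinality, n_rows, rng):
    if cardinality < n_rows and cardinality < POOL_CAP:
        # Pre-generate exactly `cardinality` distinct fakes, then sample from them.
        pool = set()
        while len(pool) < cardinality:
            pool.add(fake_email(rng))
        pool = sorted(pool)  # fixed order so a seeded run repeats
        return [rng.choice(pool) for _ in range(n_rows)]
    # Otherwise generate a fresh fake per row; the distinct count is approximate.
    return [fake_email(rng) for _ in range(n_rows)]
```

Calling `build_column(5, 1000, random.Random(1))` yields 1000 values drawn from a pool of five distinct addresses, so the output column cannot exceed the source cardinality.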

When stats provides string-length statistics (min_length / max_length / avg_length / stddev_length) AND the column is routed to an unstructured text generator (lorem_*, free_text, or the no-faker fallback), synthesized values are truncated so their character lengths follow Normal(avg_length, stddev_length) clamped to [min_length, max_length]. This applies to unstructured pooled values as well: a low-cardinality free-text column still gets its generated pool entries truncated. Structured semantic fakers (email, name, uuid, phone, address parts, etc.) ignore these stats, because truncating them would corrupt their format, so their pools are reproduced verbatim. Frequency-enumerated values are always reproduced verbatim and are never truncated.
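The length rule can be illustrated as follows. This is a sketch under stated assumptions, not the actual code: the function names are hypothetical, and rounding the drawn length to the nearest integer is a guess at a detail the text does not specify.

```python
import random

def target_length(min_len, max_len, avg_len, stddev_len, rng):
    """Draw a length from Normal(avg, stddev) and clamp it to [min, max]."""
    drawn = rng.gauss(avg_len, stddev_len)
    return max(min_len, min(max_len, round(drawn)))

def truncate_to_stats(text, min_len, max_len, avg_len, stddev_len, rng):
    """Cut generated free text to a sampled length.

    Truncation can only shorten: text shorter than the target is returned as-is.
    """
    return text[:target_length(min_len, max_len, avg_len, stddev_len, rng)]
```

Python slices strings by code point, which matches the "character lengths" wording above; a structured value such as an email address would never be passed through a function like this.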

When the Data Dictionary declares relationships, the named columns are generated jointly so inter-column structure survives into each output row:

joint — categorical / functional-dependency groups (e.g. city/state/zip). Whole value-tuples are sampled from the source by frequency, so only real co-occurring combinations are emitted.
ordered — columns that must keep a monotonic order within a row (e.g. created_date <= closed_date). The anchor column is generated from its own distribution; each later column is the anchor plus a non-negative gap drawn from the gap distribution learned from the source.
correlated — numeric columns whose correlation should be preserved. A Gaussian copula couples the columns while leaving each column's own distribution unchanged.
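The joint and ordered kinds lend themselves to a short sketch. Everything here is illustrative: the source rows and function names are made up, and sampling the anchor and the gap directly from the observed values is an assumption standing in for whatever distributions the tool actually fits.

```python
import random
from collections import Counter
from datetime import date, timedelta

# Hypothetical source rows.
source = [
    {"city": "Austin", "state": "TX", "created": date(2024, 1, 1), "closed": date(2024, 1, 5)},
    {"city": "Austin", "state": "TX", "created": date(2024, 2, 1), "closed": date(2024, 2, 2)},
    {"city": "Portland", "state": "OR", "created": date(2024, 3, 1), "closed": date(2024, 3, 11)},
    {"city": "Portland", "state": "ME", "created": date(2024, 4, 1), "closed": date(2024, 4, 1)},
]

def sample_joint(rows, cols, rng):
    """Draw a whole value-tuple by frequency, so only real combinations appear."""
    tuples = Counter(tuple(r[c] for c in cols) for r in rows)
    keys = list(tuples)
    return rng.choices(keys, weights=[tuples[k] for k in keys], k=1)[0]

def sample_ordered(rows, anchor, later, rng):
    """Generate the anchor, then add a non-negative gap learned from the source."""
    gaps = [max(0, (r[later] - r[anchor]).days) for r in rows]
    start = rng.choice([r[anchor] for r in rows])  # stand-in for the anchor's own distribution
    return start, start + timedelta(days=rng.choice(gaps))

rng = random.Random(0)
city, state = sample_joint(source, ["city", "state"], rng)
created, closed = sample_ordered(source, "created", "closed", rng)
```

Independent sampling of city and state could emit ("Austin", "ME"), a pair that never occurs in the source; tuple sampling cannot. Likewise, adding a non-negative gap guarantees `created <= closed` in every generated row.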

Relationships are read from the dictionary's relationships array — inferred by describegpt or hand-authored. Columns not named by any relationship are still generated independently. Pass --no-relationships to disable relationship modeling entirely.
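For the correlated kind, a two-column Gaussian copula can be sketched with the standard library alone. This is a generic textbook construction, not qsv's code: the function names are hypothetical, the empirical quantile lookup is a simplification, and converting a Spearman correlation to the latent normal correlation with `2 * sin(pi * r / 6)` is the standard identity for bivariate normals rather than something the source states.

```python
import math
import random
from statistics import NormalDist

def empirical_quantile(sorted_vals, u):
    """Map u in [0, 1] to an observed value, keeping the column's own distribution."""
    idx = min(int(u * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

def copula_pairs(xs, ys, spearman, n, rng):
    """Couple two columns through correlated normals; marginals stay empirical."""
    rho = 2 * math.sin(math.pi * spearman / 6)  # latent normal correlation
    std = NormalDist()
    xs, ys = sorted(xs), sorted(ys)
    pairs = []
    for _ in range(n):
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = g1
        z2 = rho * g1 + math.sqrt(1 - rho * rho) * g2  # correlated with z1
        pairs.append((
            empirical_quantile(xs, std.cdf(z1)),
            empirical_quantile(ys, std.cdf(z2)),
        ))
    return pairs

# Hypothetical source columns with very different shapes.
xs = list(range(100))
ys = [v * v for v in range(100)]
pairs = copula_pairs(xs, ys, spearman=0.8, n=2000, rng=random.Random(11))
```

Each output value is taken from its own column's observed values, so the per-column distributions are untouched; only the pairing of high-with-high and low-with-low is introduced by the shared normal draw.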

With --seed, output is fully reproducible.

Examples ↩

Pure statistical synthesis — no dictionary needed

qsv synthesize data.csv -n 1000 --seed 42 > synthetic.csv

First, generate the Data Dictionary with describegpt

qsv describegpt data.csv --dictionary --infer-content-type --format JSON -o dict.json

Then layer in semantic fakers from the dictionary

qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

Let synthesize build the dictionary itself (needs an LLM API key)

qsv synthesize data.csv --infer-content-type -n 1000 > synthetic.csv

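Preserve inter-column relationships declared in the dictionary (e.g. city/state/zip, created_date <= closed_date). No extra flag is needed: a relationships array in the dictionary is applied unless --no-relationships is passed.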

qsv synthesize data.csv --dictionary dict.json -n 1000 > synthetic.csv

For more examples, see tests.

Usage ↩

qsv synthesize [options] <input>
qsv synthesize --help

Synthesize Options ↩

Option	Type	Description	Default
`‑‑dictionary`	string	Data Dictionary JSON file produced by `describegpt --dictionary --infer-content-type --format JSON`. Layers semantic Content Types onto generation. If omitted, generation is purely type/frequency-based.
`‑‑infer‑content‑type`	flag	Generate the Data Dictionary on the fly by invoking `describegpt --dictionary --infer-content-type` on <input>. Requires an LLM API key (QSV_LLM_APIKEY). Ignored if --dictionary is given.
`‑n,` `‑‑rows`	integer	Number of synthetic rows to generate.	`100`
`‑‑seed`	integer	RNG seed for fully reproducible output.
`‑‑locale`	string	Locale for faker-backed columns. Case-insensitive. Supported: en, fr_fr, de_de, it_it, pt_br, pt_pt, ja_jp, zh_cn, zh_tw, ar_sa, cy_gb, fa_ir, nl_nl, tr_tr. Sparse locales (those without per-category data in fake-rs) silently fall back to en data for the missing categories — e.g. lorem text under a non-en locale is still English, since only zh_cn has localized lorem data.	`en`
`‑‑freq‑limit`	integer	Frequency pool depth passed to the internal `frequency` run as --limit. A column is reproduced via exact frequency-weighted sampling only when its cardinality is fully captured within this limit; higher values reproduce more columns verbatim. 0 means unlimited.	`100`
`‑‑stats‑options`	string	Extra options appended to the internal `stats` run. Note: cardinality, quartiles and date inference are always enabled — do not re-specify them here.
`‑‑consistent‑fakes`	flag	For structured-faker columns with bounded cardinality (cardinality fully captured by `frequency`), build a stable source-value -> fake-value mapping so the same source value always produces the same fake in the output. Preserves the source frequency distribution and overrides the default "emit real values when frequency-enumerated" behavior for structured fakers (names, emails, addresses, etc.). Has no effect on unstructured columns (lorem_*, free_text, unknown), all-unique columns, or non-faker columns. Useful for deidentified synthesis where you want stable joins on the faked columns.
`‑‑no‑relationships`	flag	Disable inter-column relationship modeling. Every column is generated independently even when the dictionary declares a `relationships` array.
`‑‑joint‑cardinality‑cap`	integer	Maximum number of distinct value-tuples a `joint` relationship may have. A joint group above this cap falls back to independent generation (or aborts under --strict-relationships). 0 means unlimited.	`100000`
`‑‑correlation‑threshold`	float	Minimum absolute Spearman correlation for a pair of columns to stay in a `correlated` relationship. Weakly-correlated members are dropped.	`0.3`
`‑‑strict‑relationships`	flag	Abort instead of warning-and-degrading when a declared relationship fails validation.
`‑j,` `‑‑jobs`	integer	Number of jobs to use for the internal `stats` and `frequency` runs.
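The stable mapping described for --consistent-fakes in the table above can be pictured like this. It is a sketch only: `fake_name` is a hypothetical stand-in for a structured faker, and how the real option builds or stores its mapping is not documented here.

```python
import random
from collections import Counter

def fake_name(rng):
    # Hypothetical stand-in for a structured faker.
    first = rng.choice(["Ana", "Ben", "Chloe", "Dev", "Elif", "Farid"])
    last = rng.choice(["Ito", "Khan", "Lopez", "Meyer", "Novak", "Okoye"])
    return f"{first} {last}"

def build_fake_map(source_values, make_fake, rng):
    """Assign each distinct source value one fake, never reusing a fake."""
    mapping, used = {}, set()
    for value in sorted(set(source_values)):
        fake = make_fake(rng)
        while fake in used:  # keep the mapping one-to-one
            fake = make_fake(rng)
        used.add(fake)
        mapping[value] = fake
    return mapping

def synthesize_column(source_values, n_rows, make_fake, rng):
    """Sample source values by frequency, then emit their mapped fakes."""
    freq = Counter(source_values)
    mapping = build_fake_map(source_values, make_fake, rng)
    keys = sorted(freq)
    drawn = rng.choices(keys, weights=[freq[k] for k in keys], k=n_rows)
    return [mapping[k] for k in drawn], mapping

source_col = ["alice"] * 6 + ["bob"] * 3 + ["carol"]  # hypothetical source column
column, mapping = synthesize_column(source_col, 1000, fake_name, random.Random(7))
```

Because every occurrence of a source value maps to the same fake, the output keeps the source's frequency distribution and still joins consistently on the faked column, without exposing the real values.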

Common Options ↩

Option	Type	Description
`‑h,` `‑‑help`	flag	Display this message
`‑o,` `‑‑output`	string	Write output to <file> instead of stdout.
`‑d,` `‑‑delimiter`	string	The field delimiter for reading the input CSV. Must be a single character. (default: ,)

Source: src/cmd/synthesize/mod.rs | Table of Contents | README
