Commit 7b63034

Add 'Mixing Multiple Countries' guidance
1 parent ded5e66 commit 7b63034

1 file changed

Lines changed: 56 additions & 5 deletions

docs/user-guide/test-data-generation.qmd

@@ -369,6 +369,57 @@ You can use either ISO 3166-1 alpha-2 codes (e.g., `"US"`) or alpha-3 codes (e.g
 
 Additional countries and expanded coverage are planned for future releases.
 
+### Mixing Multiple Countries
+
+When you need test data that spans multiple locales (e.g., simulating an international customer
+base), you can pass a list or dict to the `country=` parameter instead of a single string.
+
+Passing a list of country codes splits rows equally across those countries. Here, 200 rows are
+divided evenly among the US, Germany, and Japan (~67 each):
+
+```{python}
+schema = pb.Schema(
+    name=pb.string_field(preset="name"),
+    city=pb.string_field(preset="city"),
+    postcode=pb.string_field(preset="postcode"),
+)
+
+pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))
+```
+
+To control the proportion of rows per country, pass a dict mapping country codes to weights. The
+following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
+
+```{python}
+pb.preview(
+    pb.generate_dataset(
+        schema, n=200, seed=23,
+        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
+    )
+)
+```
+
+Weights are auto-normalized, so `{"US": 7, "DE": 2, "FR": 1}` is equivalent to the example above.
+Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly
+`n`.
+
+By default, rows from different countries are interleaved randomly (`shuffle=True`). Set
+`shuffle=False` to keep rows grouped by country in the order the countries are listed:
+
+```{python}
+pb.preview(
+    pb.generate_dataset(
+        schema, n=120, seed=23,
+        country=["US", "DE", "JP"], shuffle=False,
+    )
+)
+```
+
+All coherence systems (address, person, business) work correctly within each country's batch of
+rows. A French row will have a French name with a matching French email; a Japanese row will have a
+Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates)
+are generated independently for each batch but still respect their field constraints.
+
 ## Output Formats
 
 The `generate_dataset()` function supports multiple output formats via the `output=` parameter,
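
A note on the hunk above: the added "Mixing Multiple Countries" text says that weights are auto-normalized and that row counts come from largest-remainder apportionment, so they always sum to exactly `n`. The sketch below illustrates that general technique in plain Python. The helper name `allocate_rows` and its tie-breaking rule (earlier-listed countries win ties) are assumptions made for illustration; the commit does not show Pointblank's actual implementation, which may differ.

```python
import math


def allocate_rows(weights: dict[str, float], n: int) -> dict[str, int]:
    """Split n rows across keys in proportion to weights (largest-remainder method).

    Hypothetical helper for illustration only; not part of the Pointblank API.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")

    # Dividing by the total normalizes the weights, so {"US": 7, ...}
    # and {"US": 0.7, ...} describe the same proportions.
    quotas = {code: n * w / total for code, w in weights.items()}

    # Each country first receives the integer part of its quota.
    counts = {code: math.floor(q) for code, q in quotas.items()}

    # The rows still unassigned go, one each, to the largest fractional remainders.
    # sorted() is stable, so ties fall to the country listed first (an assumption).
    leftover = n - sum(counts.values())
    by_remainder = sorted(
        quotas, key=lambda code: quotas[code] - counts[code], reverse=True
    )
    for code in by_remainder[:leftover]:
        counts[code] += 1
    return counts


# Equal weights stand in for passing a list of country codes.
print(allocate_rows({"US": 1, "DE": 1, "JP": 1}, 200))        # {'US': 67, 'DE': 67, 'JP': 66}
print(allocate_rows({"US": 0.7, "DE": 0.2, "FR": 0.1}, 200))  # {'US': 140, 'DE': 40, 'FR': 20}
```

Under this sketch, the "~67 each" split of 200 rows works out to 67, 67, and 66. Which country receives the smaller share depends on the tie-breaking rule, which the documentation text does not specify.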
@@ -381,20 +432,20 @@ schema = pb.Schema(
 )
 ```
 
-The default output is a **Polars DataFrame**, which offers excellent performance and a modern API
-for data manipulation:
+The default output is a Polars DataFrame, which offers excellent performance and a modern API for
+data manipulation:
 
 ```{python}
-# Polars DataFrame (default)
 polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")
+
 pb.preview(polars_df)
 ```
 
 If your workflow uses Pandas, simply specify `output="pandas"` to get a **Pandas DataFrame**:
 
 ```{python}
-# Pandas DataFrame
 pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")
+
 pb.preview(pandas_df)
 ```
 
@@ -592,7 +643,7 @@ By incorporating test data generation into your process, you can:
 - create reproducible test fixtures for automated testing and CI/CD pipelines
 - generate locale-specific data for internationalization testing across 55 countries
 - ensure coherent relationships between related fields like names, emails, addresses, jobs, and
-license plates
+license plates
 - produce datasets of any size with consistent, realistic values
 
 Whether you're building validation logic, testing data pipelines, or simply need sample data for
