Commit 0e00fc2

Merge pull request #363 from posit-dev/feat-data-gen-country-mixing
feat: allow for country data mixing in `generate_dataset()`
2 parents 6f7ef86 + 7b63034 commit 0e00fc2

23 files changed

Lines changed: 5679 additions & 5051 deletions

docs/user-guide/test-data-generation.qmd

Lines changed: 56 additions & 5 deletions
@@ -369,6 +369,57 @@ You can use either ISO 3166-1 alpha-2 codes (e.g., `"US"`) or alpha-3 codes (e.g

 Additional countries and expanded coverage are planned for future releases.

+### Mixing Multiple Countries
+
+When you need test data that spans multiple locales (e.g., simulating an international customer
+base), you can pass a list or dict to the `country=` parameter instead of a single string.
+
+Passing a list of country codes splits rows equally across those countries. Here, 200 rows are
+divided evenly among the US, Germany, and Japan (~67 each):
+
+```{python}
+schema = pb.Schema(
+    name=pb.string_field(preset="name"),
+    city=pb.string_field(preset="city"),
+    postcode=pb.string_field(preset="postcode"),
+)
+
+pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))
+```
+
+To control the proportion of rows per country, pass a dict mapping country codes to weights. The
+following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
+
+```{python}
+pb.preview(
+    pb.generate_dataset(
+        schema, n=200, seed=23,
+        country={"US": 0.7, "DE": 0.2, "FR": 0.1},
+    )
+)
+```
+
+Weights are auto-normalized, so `{"US": 7, "DE": 2, "FR": 1}` is equivalent to the example above.
+Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly
+`n`.
+
+By default, rows from different countries are interleaved randomly (`shuffle=True`). Set
+`shuffle=False` to keep rows grouped by country in the order the countries are listed:
+
+```{python}
+pb.preview(
+    pb.generate_dataset(
+        schema, n=120, seed=23,
+        country=["US", "DE", "JP"], shuffle=False,
+    )
+)
+```
+
+All coherence systems (address, person, business) work correctly within each country's batch of
+rows. A French row will have a French name with a matching French email; a Japanese row will have a
+Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates)
+are generated independently for each batch but still respect their field constraints.
+
 ## Output Formats

 The `generate_dataset()` function supports multiple output formats via the `output=` parameter,
@@ -381,20 +432,20 @@ schema = pb.Schema(
 )
 ```

-The default output is a **Polars DataFrame**, which offers excellent performance and a modern API
-for data manipulation:
+The default output is a Polars DataFrame, which offers excellent performance and a modern API for
+data manipulation:

 ```{python}
-# Polars DataFrame (default)
 polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")
+
 pb.preview(polars_df)
 ```

 If your workflow uses Pandas, simply specify `output="pandas"` to get a **Pandas DataFrame**:

 ```{python}
-# Pandas DataFrame
 pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")
+
 pb.preview(pandas_df)
 ```

@@ -592,7 +643,7 @@ By incorporating test data generation into your process, you can:
 - create reproducible test fixtures for automated testing and CI/CD pipelines
 - generate locale-specific data for internationalization testing across 55 countries
 - ensure coherent relationships between related fields like names, emails, addresses, jobs, and
-license plates
+license plates
 - produce datasets of any size with consistent, realistic values

 Whether you're building validation logic, testing data pipelines, or simply need sample data for

pointblank/countries/__init__.py

Lines changed: 9 additions & 18 deletions
@@ -1,11 +1,3 @@
-"""
-Country-based data generation for synthetic test data.
-
-This module provides country-specific data generation without external dependencies.
-It supports generating realistic names, addresses, emails, and other data types
-with proper localization based on ISO 3166-1 country codes.
-"""
-
 from __future__ import annotations

 import json
@@ -587,9 +579,8 @@ def seed(self, seed: int) -> None:
     def _get_person(self, gender: str | None = None) -> dict[str, str]:
         """Get a coherent person (first_name, last_name, gender) from the data.

-        If person data has ``ethnic_groups``, picks a group first (weighted by population
-        share) then draws first and last names from within that group so they remain
-        ethnically coherent.
+        If person data has `ethnic_groups`, picks a group first (weighted by population share) then
+        draws first and last names from within that group so they remain ethnically coherent.
         """
         # If no gender specified, randomly select one (weighted toward male/female)
         if gender is None:
@@ -678,8 +669,8 @@ def _generate_first_name(self, gender: str | None = None) -> str:
     def _generate_last_name(self, gender: str | None = None) -> str:
         """Generate a random last name (internal, no caching).

-        If last_names is a dict with 'male'/'female' keys (e.g., IS patronymics),
-        picks from the gender-appropriate list.
+        If last_names is a dict with 'male'/'female' keys (e.g., IS patronymics), picks from the
+        gender-appropriate list.
         """
         names = self._data.person.get("last_names", ["Smith"])

@@ -702,9 +693,9 @@ def init_row_persons(self, n_rows: int) -> None:
         """
         Pre-generate person data for multiple rows to ensure coherence across columns.

-        This should be called before generating a dataset with person-related columns.
-        When active, first_name(), last_name(), name(), email() will use the person
-        for the current row (set via set_row()).
+        This should be called before generating a dataset with person-related columns. When active,
+        `first_name()`, `last_name()`, `name()`, `email()` will use the person for the current row
+        (set via `set_row()`).

         Parameters
         ----------
@@ -721,8 +712,8 @@ def new_person(self, gender: str | None = None) -> dict[str, str]:
         """
         Select a new random person and cache it for coherent generation.

-        Call this before generating related person components (first_name, last_name, email)
-        to ensure they all refer to the same person.
+        Call this before generating related person components (first_name, last_name, email) to
+        ensure they all refer to the same person.

         Returns
         -------
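
Note on the `_get_person` docstring in this file: it describes a two-stage draw, picking an ethnic group first (weighted by population share) and then taking both names from inside that group. A minimal sketch of that idea follows. The `PERSON` layout, its key names, the placeholder names, and the `pick_person` helper are all hypothetical; the real person data structure is not shown in this diff.

```python
import random

# Hypothetical data layout, for illustration only.
PERSON = {
    "ethnic_groups": {
        "group_a": {
            "weight": 0.8,
            "first_names": ["a_first_1", "a_first_2"],
            "last_names": ["a_last_1", "a_last_2"],
        },
        "group_b": {
            "weight": 0.2,
            "first_names": ["b_first_1", "b_first_2"],
            "last_names": ["b_last_1", "b_last_2"],
        },
    }
}


def pick_person(person: dict, rng: random.Random) -> dict[str, str]:
    """Pick a group by weight, then draw both names from that same group."""
    groups = person["ethnic_groups"]
    keys = list(groups)
    weights = [groups[k]["weight"] for k in keys]

    # Stage 1: weighted choice of the group.
    chosen = groups[rng.choices(keys, weights=weights, k=1)[0]]

    # Stage 2: first and last name come from the chosen group only,
    # so the two can never be drawn from different groups.
    return {
        "first_name": rng.choice(chosen["first_names"]),
        "last_name": rng.choice(chosen["last_names"]),
    }


rng = random.Random(23)
print(pick_person(PERSON, rng))
```

Drawing the group once and reusing it for both names is what keeps the pair coherent; sampling first and last names independently from pooled lists would not give that guarantee.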

pointblank/countries/data/AT/address.json

Lines changed: 1 addition & 1 deletion
@@ -1569,7 +1569,7 @@
 "{street} {building_number}, {postcode} {city}",
 "{street} {building_number}/{unit}, {postcode} {city}"
 ],
-"country": "Österreich",
+"country": "Austria",
 "country_code": "AT",
 "phone_area_codes": {
 "Wien": [

pointblank/countries/data/BR/address.json

Lines changed: 1 addition & 1 deletion
@@ -1748,7 +1748,7 @@
 "{street}, {building_number}, {postcode} {city} - {state_abbr}",
 "{street}, {building_number}, Apto {unit}, {postcode} {city} - {state_abbr}"
 ],
-"country": "Brasil",
+"country": "Brazil",
 "country_code": "BR",
 "phone_area_codes": {
 "São Paulo": [

pointblank/countries/data/CH/address.json

Lines changed: 1 addition & 1 deletion
@@ -3063,7 +3063,7 @@
 "{street} {building_number}, CH-{postcode} {city}",
 "{street} {building_number}, {postcode} {city} ({state})"
 ],
-"country": "Schweiz",
+"country": "Switzerland",
 "country_code": "CH",
 "phone_area_codes": {
 "Zürich": [

pointblank/countries/data/DE/address.json

Lines changed: 1 addition & 1 deletion
@@ -4324,7 +4324,7 @@
 "{street} {building_number}, {postcode} {city}",
 "{street} {building_number}, Whg. {unit}, {postcode} {city}"
 ],
-"country": "Deutschland",
+"country": "Germany",
 "country_code": "DE",
 "phone_area_codes": {
 "Berlin": [

pointblank/countries/data/ES/address.json

Lines changed: 1 addition & 1 deletion
@@ -4002,7 +4002,7 @@
 "{street}, {building_number}, {unit}º, {postcode} {city}",
 "{street}, {building_number}, Piso {unit}, {postcode} {city}, {state}"
 ],
-"country": "España",
+"country": "Spain",
 "country_code": "ES",
 "phone_area_codes": {
 "Comunidad de Madrid": [

pointblank/countries/data/FI/address.json

Lines changed: 1 addition & 1 deletion
@@ -1568,7 +1568,7 @@
 "{street} {building_number}, {postcode} {city}",
 "{street} {building_number} {unit}, {postcode} {city}"
 ],
-"country": "Suomi",
+"country": "Finland",
 "country_code": "FI",
 "phone_area_codes": {
 "Uusimaa": [

pointblank/countries/data/HR/address.json

Lines changed: 1 addition & 1 deletion
@@ -849,7 +849,7 @@
 "{street} {building_number}, {city}",
 "{city}, {street} {building_number}"
 ],
-"country": "Hrvatska",
+"country": "Croatia",
 "country_code": "HR",
 "phone_area_codes": {
 "Grad Zagreb": [

pointblank/countries/data/IT/address.json

Lines changed: 1 addition & 1 deletion
@@ -3476,7 +3476,7 @@
 "{street}, {building_number}, {postcode} {city} ({state})",
 "{street} {building_number}, {postcode} {city}"
 ],
-"country": "Italia",
+"country": "Italy",
 "country_code": "IT",
 "phone_area_codes": {
 "Lazio": [
