@@ -15744,7 +15744,7 @@ Generate synthetic test data based on schema definitions. Use
1574415744`generate_dataset()` to create data from a `Schema` object. The helper functions define typed fields
1574515745with constraints for realistic test data generation.
1574615746
15747- generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, output: "Literal['polars', 'pandas', 'dict']" = 'polars', country: 'str' = 'US') -> 'Any'
15747+ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, output: "Literal['polars', 'pandas', 'dict']" = 'polars', country: 'str | list[str] | dict[str, float]' = 'US', shuffle: 'bool' = True, weighted: 'bool' = True) -> 'Any'
1574815748
1574915749 Generate synthetic test data from a schema.
1575015750
@@ -15769,10 +15769,22 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
1576915769 Polars DataFrame, (2) `"pandas"` returns a Pandas DataFrame, and (3) `"dict"` returns
1577015770 a dictionary of lists.
1577115771 country
15772- Country code for locale-aware generation when using presets. Accepts ISO 3166-1 alpha-2
15773- codes (e.g., `"US"`, `"DE"`, `"FR"`) or alpha-3 codes (e.g., `"USA"`, `"DEU"`, `"FRA"`).
15774- This affects the format and content of preset-generated data such as addresses, phone
15775- numbers, names, and postal codes. The default is `"US"`.
15772+ Country code(s) for locale-aware generation when using presets. Accepts a single
15773+ ISO 3166-1 alpha-2 or alpha-3 code (e.g., `"US"`, `"DEU"`), a list of codes for
15774+ uniform mixing (e.g., `["US", "DE", "JP"]`), or a dict mapping codes to positive
15775+ weights (e.g., `{"US": 60, "DE": 25, "JP": 15}`). See the *Locale Mixing* section
15776+ below for details. The default is `"US"`.
15777+ shuffle
15778+ When `country=` is a list or dict (multi-country mixing), controls whether rows from
15779+ different countries are interleaved randomly (`True`, the default) or grouped by country
15780+ in the order the countries are specified (`False`). Ignored when `country=` is a single
15781+ string.
15782+ weighted
15783+ When `True` (the default), names and locations are sampled according to real-world
15784+ frequency tiers: common names such as "James" and "Smith" appear far more often than
15785+ rare ones, and large cities such as New York and Los Angeles dominate over small towns.
15786+ This only affects data files that have been migrated to the tiered format; flat-list
15787+ data is always sampled uniformly.
1577615788
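To make the `weighted=` description concrete, here is a minimal sketch of tier-based sampling. It is not the library's implementation: the tier layout, the tier weights, every name other than "James", and the helper `sample_names()` are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical tiered data: (relative tier weight, names in that tier).
# The layout and the numbers are illustrative, not the library's data format.
TIERED_FIRST_NAMES = [
    (0.70, ["James", "Mary", "John"]),
    (0.25, ["Silas", "Thea"]),
    (0.05, ["Peregrine", "Ottoline"]),
]

# The same names as a flat list, which can only be sampled uniformly.
FLAT_FIRST_NAMES = [name for _, names in TIERED_FIRST_NAMES for name in names]


def sample_names(k, weighted=True, seed=None):
    """Draw k names, either by frequency tier or uniformly."""
    rng = random.Random(seed)
    if not weighted:
        return rng.choices(FLAT_FIRST_NAMES, k=k)
    # Pick a tier according to its weight, then a name uniformly within it.
    tier_weights = [weight for weight, _ in TIERED_FIRST_NAMES]
    tiers = rng.choices(TIERED_FIRST_NAMES, weights=tier_weights, k=k)
    return [rng.choice(names) for _, names in tiers]


counts = Counter(sample_names(10_000, seed=0))
```

With these illustrative weights, a common-tier name such as "James" should turn up far more often in `counts` than a rare-tier name such as "Peregrine".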
1577715789 Returns
1577815790 -------
@@ -15814,9 +15826,25 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
1581415826 share the same location (e.g., the city matches the address), and business-related presets
1581515827 will share the same industry context.
1581615828
15829+ Locale Mixing
15830+ -------------
15831+ The `country=` parameter accepts three input forms for flexible locale control:
15832+
15833+ (1) a **single string** (the default), such as `"US"` or `"DEU"`, which generates
15834+ all rows from one locale; (2) a **list of strings**, such as `["US", "DE", "JP"]`,
15835+ which splits rows equally across the listed countries; and (3) a **dict of weights**,
15836+ such as `{"US": 0.6, "DE": 0.3, "FR": 0.1}`, which allocates rows proportionally
15837+ (weights are auto-normalized, so `{"US": 6, "DE": 3, "FR": 1}` is equivalent).
15838+
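The three input forms above can all be reduced to one mapping from country code to normalized weight. The sketch below shows one way to do that; `normalize_country_spec()` is a hypothetical helper written for this illustration, not a function the library is known to expose.

```python
def normalize_country_spec(country):
    """Turn a str, list, or dict `country=` value into weights that sum to 1."""
    if isinstance(country, str):
        # Single locale: every row comes from this country.
        return {country: 1.0}
    if isinstance(country, dict):
        if not country or any(weight <= 0 for weight in country.values()):
            raise ValueError("country weights must be positive")
        total = sum(country.values())
        # Auto-normalize, so {"US": 6, "DE": 3} and {"US": 0.6, "DE": 0.3} agree.
        return {code: weight / total for code, weight in country.items()}
    codes = list(country)
    if not codes:
        raise ValueError("country list must not be empty")
    # A plain list means uniform mixing across the listed countries.
    return {code: 1.0 / len(codes) for code in codes}
```

For example, `normalize_country_spec({"US": 6, "DE": 3, "FR": 1})` and `normalize_country_spec({"US": 0.6, "DE": 0.3, "FR": 0.1})` describe the same split.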
15839+ Row counts are distributed using largest-remainder apportionment so they always sum
15840+ to exactly `n=`. Each country's rows are generated as an independent batch (preserving
15841+ all cross-column coherence within each batch), then either interleaved randomly
15842+ (`shuffle=True`, the default) or left in contiguous country blocks
15843+ (`shuffle=False`).
15844+
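The largest-remainder step described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `apportion_rows()` and `country_order()` are hypothetical helpers, ties are broken in favor of the country listed first, and `country_order()` only lays out country labels rather than generating real batches.

```python
import math
import random


def apportion_rows(weights, n):
    """Split n rows across countries in proportion to positive weights."""
    total = sum(weights.values())
    # Exact (possibly fractional) share of rows for each country.
    quotas = {code: n * weight / total for code, weight in weights.items()}
    # Every country first receives the whole-number part of its share.
    counts = {code: math.floor(quota) for code, quota in quotas.items()}
    leftover = n - sum(counts.values())
    # Hand out the remaining rows one at a time, largest fractional part first.
    # sorted() is stable, so ties go to the country listed first.
    by_remainder = sorted(
        quotas, key=lambda code: quotas[code] - counts[code], reverse=True
    )
    for code in by_remainder[:leftover]:
        counts[code] += 1
    return counts


def country_order(counts, shuffle=True, seed=None):
    """Return one country label per row, interleaved or in contiguous blocks."""
    labels = [code for code, count in counts.items() for _ in range(count)]
    if shuffle:
        random.Random(seed).shuffle(labels)
    return labels
```

For instance, `apportion_rows({"US": 6, "DE": 3, "FR": 1}, 7)` gives exact shares of 4.2, 2.1, and 0.7 rows; after flooring to 4, 2, and 0, the one leftover row goes to `"FR"`, which has the largest remainder, so the counts sum to 7.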
1581715845 Supported Countries
1581815846 -------------------
15819- The `country=` parameter currently supports 55 countries with full locale data:
15847+ The `country=` parameter currently supports 71 countries with full locale data:
1582015848
1582115849 **Europe (32 countries):** Austria (`"AT"`), Belgium (`"BE"`), Bulgaria (`"BG"`),
1582215850 Croatia (`"HR"`), Cyprus (`"CY"`), Czech Republic (`"CZ"`), Denmark (`"DK"`),
@@ -15827,16 +15855,20 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
1582715855 Slovakia (`"SK"`), Slovenia (`"SI"`), Spain (`"ES"`), Sweden (`"SE"`),
1582815856 Switzerland (`"CH"`), United Kingdom (`"GB"`)
1582915857
15830- **Americas (7 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
15831- Chile (`"CL"`), Colombia (`"CO"`), Mexico (`"MX"`), United States (`"US"`)
15858+ **Americas (9 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
15859+ Chile (`"CL"`), Colombia (`"CO"`), Costa Rica (`"CR"`), Mexico (`"MX"`),
15860+ Peru (`"PE"`), United States (`"US"`)
1583215861
15833- **Asia-Pacific (12 countries):** Australia (`"AU"`), China (`"CN"`), Hong Kong (`"HK"`),
15834- India (`"IN"`), Indonesia (`"ID"`), Japan (`"JP"`), New Zealand (`"NZ"`),
15835- Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`), Taiwan (`"TW"`),
15836- Thailand (`"TH"`)
15862+ **Asia-Pacific (17 countries):** Australia (`"AU"`), Bangladesh (`"BD"`),
15863+ China (`"CN"`), Hong Kong (`"HK"`), India (`"IN"`), Indonesia (`"ID"`),
15864+ Japan (`"JP"`), Malaysia (`"MY"`), New Zealand (`"NZ"`), Pakistan (`"PK"`),
15865+ Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`),
15866+ Sri Lanka (`"LK"`), Taiwan (`"TW"`), Thailand (`"TH"`), Vietnam (`"VN"`)
1583715867
15838- **Middle East & Africa (4 countries):** Nigeria (`"NG"`), South Africa (`"ZA"`),
15839- Turkey (`"TR"`), United Arab Emirates (`"AE"`)
15868+ **Middle East & Africa (13 countries):** Algeria (`"DZ"`), Egypt (`"EG"`),
15869+ Ethiopia (`"ET"`), Ghana (`"GH"`), Kenya (`"KE"`), Morocco (`"MA"`),
15870+ Nigeria (`"NG"`), Senegal (`"SN"`), South Africa (`"ZA"`), Tunisia (`"TN"`),
15871+ Turkey (`"TR"`), Uganda (`"UG"`), United Arab Emirates (`"AE"`)
1584015872
1584115873 Pytest Fixture
1584215874 --------------
@@ -16263,7 +16295,8 @@ string_field(min_length: 'int | None' = None, max_length: 'int | None' = None, p
1626316295 `"2012-05-12 – 2015-11-22"`), `"future_date"` (up to 1 year ahead), `"past_date"`
1626416296 (up to 10 years back), `"time"`
1626516297
16266- **Miscellaneous:** `"color_name"`, `"file_name"`, `"file_extension"`, `"mime_type"`
16298+ **Miscellaneous:** `"color_name"`, `"file_name"`, `"file_extension"`, `"mime_type"`,
16299+ `"user_agent"` (browser user agent string with country-specific browser weighting)
1626716300
1626816301 Coherent Data Generation
1626916302 ------------------------