Commit f4304a0

Update llms-full.txt
1 parent 61f9b87 commit f4304a0

1 file changed

Lines changed: 48 additions & 15 deletions

docs/llms-full.txt

@@ -15744,7 +15744,7 @@ Generate synthetic test data based on schema definitions. Use
 `generate_dataset()` to create data from a `Schema` object. The helper functions define typed fields
 with constraints for realistic test data generation.

-generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, output: "Literal['polars', 'pandas', 'dict']" = 'polars', country: 'str' = 'US') -> 'Any'
+generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, output: "Literal['polars', 'pandas', 'dict']" = 'polars', country: 'str | list[str] | dict[str, float]' = 'US', shuffle: 'bool' = True, weighted: 'bool' = True) -> 'Any'

 Generate synthetic test data from a schema.

@@ -15769,10 +15769,22 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
 Polars DataFrame, (2) `"pandas"` returns a Pandas DataFrame, and (3) `"dict"` returns
 a dictionary of lists.
 country
-Country code for locale-aware generation when using presets. Accepts ISO 3166-1 alpha-2
-codes (e.g., `"US"`, `"DE"`, `"FR"`) or alpha-3 codes (e.g., `"USA"`, `"DEU"`, `"FRA"`).
-This affects the format and content of preset-generated data such as addresses, phone
-numbers, names, and postal codes. The default is `"US"`.
+Country code(s) for locale-aware generation when using presets. Accepts a single
+ISO 3166-1 alpha-2 or alpha-3 code (e.g., `"US"`, `"DEU"`), a list of codes for
+uniform mixing (e.g., `["US", "DE", "JP"]`), or a dict mapping codes to positive
+weights (e.g., `{"US": 60, "DE": 25, "JP": 15}`). See the *Locale Mixing* section
+below for details. The default is `"US"`.
+shuffle
+When `country=` is a list or dict (multi-country mixing), controls whether rows from
+different countries are interleaved randomly (`True`, the default) or grouped by country
+in the order the countries are specified (`False`). Ignored when `country=` is a single
+string.
+weighted
+When `True`, names and locations are sampled according to real-world frequency tiers.
+Common names like "James" and "Smith" appear far more often than rare names. Large
+cities like New York and Los Angeles dominate over small towns. Only affects data files
+that have been migrated to the tiered format; flat-list data always uses uniform
+sampling. Default is `True`.
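
Reviewer note: the tiered sampling described for `weighted=` can be pictured with a small standalone sketch. This is not the library's implementation. The tier boundaries, tier weights, and most of the names below are hypothetical, since the diff does not show the real data files; only the standard-library `random` module is used.

```python
import random

# Hypothetical frequency tiers: (names, relative weight of the whole tier).
# The real tier structure and data files are not shown in this diff.
NAME_TIERS = [
    (["James", "Mary", "John"], 0.70),        # very common
    (["Gordon", "Pamela", "Wesley"], 0.25),   # moderately common
    (["Thaddeus", "Ottoline"], 0.05),         # rare
]


def sample_names(k: int, weighted: bool = True, seed: "int | None" = None) -> list[str]:
    """Draw k names, tier-weighted or uniformly over the flattened list."""
    rng = random.Random(seed)
    if not weighted:
        # Flat-list behaviour: every name is equally likely.
        flat = [name for names, _ in NAME_TIERS for name in names]
        return [rng.choice(flat) for _ in range(k)]
    tier_names = [names for names, _ in NAME_TIERS]
    tier_weights = [weight for _, weight in NAME_TIERS]
    # Pick a tier per row by weight, then a name uniformly within that tier.
    chosen_tiers = rng.choices(tier_names, weights=tier_weights, k=k)
    return [rng.choice(names) for names in chosen_tiers]
```

With `weighted=True`, names from the first tier should dominate a large sample; with `weighted=False`, all eight names are drawn with equal probability.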

 Returns
 -------
@@ -15814,9 +15826,25 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
 share the same location (e.g., the city matches the address), and business-related presets
 will share the same industry context.

+Locale Mixing
+-------------
+The `country=` parameter accepts three input forms for flexible locale control:
+
+(1) a **single string** (the default), such as `"US"` or `"DEU"`, which generates
+all rows from one locale; (2) a **list of strings**, such as `["US", "DE", "JP"]`,
+which splits rows equally across the listed countries; and (3) a **dict of weights**,
+such as `{"US": 0.6, "DE": 0.3, "FR": 0.1}`, which allocates rows proportionally
+(weights are auto-normalized, so `{"US": 6, "DE": 3, "FR": 1}` is equivalent).
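
Reviewer note: a minimal sketch of how the three `country=` forms could be reduced to one normalized weight mapping. The function name `normalize_country_weights` and the decision to reject duplicate codes in a list are assumptions made for illustration; the diff does not show how the library handles these cases internally.

```python
def normalize_country_weights(country) -> dict[str, float]:
    """Turn a str, list, or dict `country=` value into {code: fraction}."""
    if isinstance(country, str):
        # Single locale: all rows come from one country.
        return {country: 1.0}
    if isinstance(country, dict):
        weights = dict(country)
    else:
        codes = list(country)
        # Assumption: duplicates in a list are treated as an error here.
        if len(set(codes)) != len(codes):
            raise ValueError("duplicate country codes in list")
        # A list means uniform mixing, i.e. equal weights.
        weights = {code: 1.0 for code in codes}
    if not weights:
        raise ValueError("at least one country is required")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    total = sum(weights.values())
    # Auto-normalization: fractions sum to 1 regardless of the input scale.
    return {code: w / total for code, w in weights.items()}
```

Under this sketch, `{"US": 0.6, "DE": 0.3, "FR": 0.1}` and `{"US": 6, "DE": 3, "FR": 1}` normalize to the same fractions (up to floating-point rounding), matching the equivalence stated in the docstring.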
+
+Row counts are distributed using largest-remainder apportionment so they always sum
+to exactly `n=`. Each country's rows are generated as an independent batch (preserving
+all cross-column coherence within each batch), then either interleaved randomly
+(`shuffle=True`, the default) or left in contiguous country blocks
+(`shuffle=False`).
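
Reviewer note: largest-remainder apportionment and the two row orderings can be sketched as follows. The helper names `apportion_rows` and `country_labels` are hypothetical, and breaking ties in input order is an assumption; the diff does not say how the library resolves equal remainders.

```python
import random
from math import floor


def apportion_rows(weights: dict[str, float], n: int) -> dict[str, int]:
    """Split n rows across countries with the largest-remainder method."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be non-empty and positive")
    total = sum(weights.values())
    # Ideal fractional share of rows for each country.
    quotas = {code: n * w / total for code, w in weights.items()}
    # Each country first receives the integer part of its quota.
    counts = {code: floor(q) for code, q in quotas.items()}
    leftover = n - sum(counts.values())
    # Leftover rows go to the largest fractional remainders.
    # sorted() is stable, so ties resolve in input order (an assumption).
    by_remainder = sorted(
        quotas, key=lambda code: quotas[code] - counts[code], reverse=True
    )
    for code in by_remainder[:leftover]:
        counts[code] += 1
    return counts


def country_labels(
    counts: dict[str, int], shuffle: bool = True, seed: "int | None" = None
) -> list[str]:
    """Return one country label per row, in blocks or randomly interleaved."""
    # Contiguous blocks in the order the countries were specified.
    labels = [code for code, count in counts.items() for _ in range(count)]
    if shuffle:
        random.Random(seed).shuffle(labels)
    return labels
```

For example, `apportion_rows({"US": 60, "DE": 25, "JP": 15}, 7)` has quotas 4.2, 1.75, and 1.05; the integer parts sum to 6, and the one leftover row goes to `"DE"`, which has the largest remainder.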
+
 Supported Countries
 -------------------
-The `country=` parameter currently supports 55 countries with full locale data:
+The `country=` parameter currently supports 71 countries with full locale data:

 **Europe (32 countries):** Austria (`"AT"`), Belgium (`"BE"`), Bulgaria (`"BG"`),
 Croatia (`"HR"`), Cyprus (`"CY"`), Czech Republic (`"CZ"`), Denmark (`"DK"`),
@@ -15827,16 +15855,20 @@ generate_dataset(schema: 'Schema', n: 'int' = 100, seed: 'int | None' = None, ou
 Slovakia (`"SK"`), Slovenia (`"SI"`), Spain (`"ES"`), Sweden (`"SE"`),
 Switzerland (`"CH"`), United Kingdom (`"GB"`)

-**Americas (7 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
-Chile (`"CL"`), Colombia (`"CO"`), Mexico (`"MX"`), United States (`"US"`)
+**Americas (9 countries):** Argentina (`"AR"`), Brazil (`"BR"`), Canada (`"CA"`),
+Chile (`"CL"`), Colombia (`"CO"`), Costa Rica (`"CR"`), Mexico (`"MX"`),
+Peru (`"PE"`), United States (`"US"`)

-**Asia-Pacific (12 countries):** Australia (`"AU"`), China (`"CN"`), Hong Kong (`"HK"`),
-India (`"IN"`), Indonesia (`"ID"`), Japan (`"JP"`), New Zealand (`"NZ"`),
-Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`), Taiwan (`"TW"`),
-Thailand (`"TH"`)
+**Asia-Pacific (17 countries):** Australia (`"AU"`), Bangladesh (`"BD"`),
+China (`"CN"`), Hong Kong (`"HK"`), India (`"IN"`), Indonesia (`"ID"`),
+Japan (`"JP"`), Malaysia (`"MY"`), New Zealand (`"NZ"`), Pakistan (`"PK"`),
+Philippines (`"PH"`), Singapore (`"SG"`), South Korea (`"KR"`),
+Sri Lanka (`"LK"`), Taiwan (`"TW"`), Thailand (`"TH"`), Vietnam (`"VN"`)

-**Middle East & Africa (4 countries):** Nigeria (`"NG"`), South Africa (`"ZA"`),
-Turkey (`"TR"`), United Arab Emirates (`"AE"`)
+**Middle East & Africa (13 countries):** Algeria (`"DZ"`), Egypt (`"EG"`),
+Ethiopia (`"ET"`), Ghana (`"GH"`), Kenya (`"KE"`), Morocco (`"MA"`),
+Nigeria (`"NG"`), Senegal (`"SN"`), South Africa (`"ZA"`), Tunisia (`"TN"`),
+Turkey (`"TR"`), Uganda (`"UG"`), United Arab Emirates (`"AE"`)

 Pytest Fixture
 --------------
@@ -16263,7 +16295,8 @@ string_field(min_length: 'int | None' = None, max_length: 'int | None' = None, p
 `"2012-05-12 – 2015-11-22"`), `"future_date"` (up to 1 year ahead), `"past_date"`
 (up to 10 years back), `"time"`

-**Miscellaneous:** `"color_name"`, `"file_name"`, `"file_extension"`, `"mime_type"`
+**Miscellaneous:** `"color_name"`, `"file_name"`, `"file_extension"`, `"mime_type"`,
+`"user_agent"` (browser user agent string with country-specific browser weighting)

 Coherent Data Generation
 ------------------------
