@@ -369,6 +369,57 @@ You can use either ISO 3166-1 alpha-2 codes (e.g., `"US"`) or alpha-3 codes (e.g
369369
370370Additional countries and expanded coverage are planned for future releases.
371371
372+ ### Mixing Multiple Countries
373+
374+ When you need test data that spans multiple locales (e.g., simulating an international customer
375+ base), you can pass a list or dict to the `country=` parameter instead of a single string.
376+
377+ Passing a list of country codes splits rows equally across those countries. Here, 200 rows are
378+ divided as evenly as possible among the US, Germany, and Japan (about 67 each):
379+
380+ ```{python}
381+ schema = pb.Schema(
382+ name=pb.string_field(preset="name"),
383+ city=pb.string_field(preset="city"),
384+ postcode=pb.string_field(preset="postcode"),
385+ )
386+
387+ pb.preview(pb.generate_dataset(schema, n=200, seed=23, country=["US", "DE", "JP"]))
388+ ```
389+
390+ To control the proportion of rows per country, pass a dict mapping country codes to weights. The
391+ following generates 200 rows with 70% from the US, 20% from Germany, and 10% from France:
392+
393+ ```{python}
394+ pb.preview(
395+ pb.generate_dataset(
396+ schema, n=200, seed=23,
397+ country={"US": 0.7, "DE": 0.2, "FR": 0.1},
398+ )
399+ )
400+ ```
401+
402+ Weights are auto-normalized, so `{"US": 7, "DE": 2, "FR": 1}` is equivalent to the example above.
403+ Row counts are allocated using largest-remainder apportionment, ensuring they always sum to exactly
404+ `n`.
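
The allocation described above can be sketched in plain Python. This is a minimal illustration of largest-remainder apportionment, not the library's actual implementation; the function name `apportion_rows` and the tie-breaking rule (earlier-listed countries win ties) are assumptions made for the example.

```python
from math import floor


def apportion_rows(weights: dict, n: int) -> dict:
    """Split n rows across countries in proportion to their weights.

    Hypothetical helper: illustrates largest-remainder apportionment only.
    """
    total = sum(weights.values())

    # Normalize weights into exact (possibly fractional) row quotas.
    quotas = {code: n * w / total for code, w in weights.items()}

    # Every country first receives the whole-number part of its quota.
    counts = {code: floor(q) for code, q in quotas.items()}

    # Rows still unassigned after flooring.
    leftover = n - sum(counts.values())

    # Give one extra row each to the countries with the largest fractional
    # remainders. Python's sort is stable, so ties fall back to listing
    # order (an assumption of this sketch).
    by_remainder = sorted(
        quotas, key=lambda code: quotas[code] - counts[code], reverse=True
    )
    for code in by_remainder[:leftover]:
        counts[code] += 1

    return counts


print(apportion_rows({"US": 7, "DE": 2, "FR": 1}, 200))
# {'US': 140, 'DE': 40, 'FR': 20}
print(apportion_rows({"US": 0.7, "DE": 0.2, "FR": 0.1}, 33))
# {'US': 23, 'DE': 7, 'FR': 3}
```

In the second call the exact quotas are 23.1, 6.6, and 3.3. Flooring assigns 32 rows, and the one remaining row goes to Germany because its remainder (0.6) is the largest, so the counts still sum to 33.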
405+
406+ By default, rows from different countries are interleaved randomly (`shuffle=True`). Set
407+ `shuffle=False` to keep rows grouped by country in the order the countries are listed:
408+
409+ ```{python}
410+ pb.preview(
411+ pb.generate_dataset(
412+ schema, n=120, seed=23,
413+ country=["US", "DE", "JP"], shuffle=False,
414+ )
415+ )
416+ ```
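
To make the difference between the two orderings concrete, here is a small plain-Python sketch that works on country labels only. It is an illustration under stated assumptions, not the library's internals: the helper `country_order` is hypothetical, and a seeded `random.Random` stands in for whatever shuffling the library actually performs.

```python
import random


def country_order(counts: dict, shuffle: bool = True, seed=None) -> list:
    """Return one country label per row (hypothetical helper for illustration)."""
    # Grouped layout: all rows for the first listed country, then the next, etc.
    labels = [code for code, k in counts.items() for _ in range(k)]

    if shuffle:
        # Interleave the rows with a seeded generator so the result is
        # reproducible for a given seed.
        random.Random(seed).shuffle(labels)

    return labels


counts = {"US": 40, "DE": 40, "JP": 40}

grouped = country_order(counts, shuffle=False)
mixed = country_order(counts, shuffle=True, seed=23)

print(grouped[:3], grouped[40:43], grouped[80:83])
print(mixed[:10])
```

Both lists contain the same 120 labels; only the row order differs. Grouped output is convenient when inspecting one locale at a time, while interleaved output looks more like a real mixed customer table.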
417+
418+ All coherence systems (address, person, business) work correctly within each country's batch of
419+ rows. A French row will have a French name with a matching French email; a Japanese row will have a
420+ Japanese name with a matching Japanese email. Non-preset columns (integers, floats, booleans, dates)
421+ are generated independently for each batch but still respect their field constraints.
422+
372423## Output Formats
373424
374425The `generate_dataset()` function supports multiple output formats via the `output=` parameter,
@@ -381,20 +432,20 @@ schema = pb.Schema(
381432)
382433```
383434
384- The default output is a **Polars DataFrame**, which offers excellent performance and a modern API
385- for data manipulation:
435+ The default output is a Polars DataFrame, which offers excellent performance and a modern API for
436+ data manipulation:
386437
387438```{python}
388- # Polars DataFrame (default)
389439polars_df = pb.generate_dataset(schema, n=100, seed=23, output="polars")
440+
390441pb.preview(polars_df)
391442```
392443
393444If your workflow uses Pandas, simply specify `output="pandas"` to get a **Pandas DataFrame**:
394445
395446```{python}
396- # Pandas DataFrame
397447pandas_df = pb.generate_dataset(schema, n=100, seed=23, output="pandas")
448+
398449pb.preview(pandas_df)
399450```
400451
@@ -592,7 +643,7 @@ By incorporating test data generation into your process, you can:
592643- create reproducible test fixtures for automated testing and CI/CD pipelines
593644- generate locale-specific data for internationalization testing across 55 countries
594645- ensure coherent relationships between related fields like names, emails, addresses, jobs, and
595- license plates
646+ license plates
596647- produce datasets of any size with consistent, realistic values
597648
598649Whether you're building validation logic, testing data pipelines, or simply need sample data for