|
1 | 1 | # TPC-H Data Generator CLI |
2 | 2 |
|
3 | | -See the main [README.md](https://github.com/clflushopt/tpchgen-rs) for full documentation. |
| 3 | +`tpchgen-cli` is a high-performance, parallel TPC-H data generator command-line
| 4 | +tool.
4 | 5 |
|
5 | | -## Installation |
| 6 | +This tool is more than 10x faster than the next fastest TPC-H generator we know
| 7 | +of (`duckdb`). On a 2023 Mac M3 Max laptop, it easily generates data faster than
| 8 | +it can be written to SSD. See [BENCHMARKS.md] for more details on performance and
| 9 | +benchmarking.
6 | 10 |
|
7 | | -### Install Using Python |
| 11 | +[BENCHMARKS.md]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md |
| 12 | + |
| 13 | +* See the tpchgen [README.md](https://github.com/clflushopt/tpchgen-rs) for |
| 14 | +project details |
| 15 | +* Watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) by |
| 16 | +[@alamb](https://github.com/alamb) to see `tpchgen-cli` in action |
| 17 | +* Read the companion blog post in the |
| 18 | +[DataFusion
| 19 | +blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/) to learn about the project's history |
| 20 | +* Try it yourself by following the instructions below |
| 21 | + |
| 22 | +## Install via `pip` |
8 | 23 |
|
9 | | -Install this tool with Python: |
10 | 24 | ```shell |
11 | 25 | pip install tpchgen-cli |
12 | 26 | ``` |
13 | 27 |
|
14 | | -### Install Using Rust |
| 28 | +## Install via Rust |
15 | 29 |
|
16 | | -[Install Rust](https://www.rust-lang.org/tools/install) and this tool: |
| 30 | +[Install Rust](https://www.rust-lang.org/tools/install) and compile the tool:
17 | 31 |
|
18 | 32 | ```shell |
19 | 33 | curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh |
20 | | -cargo install tpchgen-cli |
| 34 | +RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli |
21 | 35 | ``` |
22 | 36 |
|
23 | | -## CLI Usage |
24 | | - |
25 | | -We tried to make the `tpchgen-cli` experience as close to `dbgen` as possible for no other |
26 | | -reason than maybe make it easier for you to have a drop-in replacement. |
| 37 | +## Examples |
27 | 38 |
|
28 | 39 | ```shell |
29 | | -$ tpchgen-cli -h |
30 | | -TPC-H Data Generator |
31 | | - |
32 | | -Usage: tpchgen-cli [OPTIONS] |
33 | | - |
34 | | -Options: |
35 | | - -s, --scale-factor <SCALE_FACTOR> |
36 | | - Scale factor to address (default: 1) [default: 1] |
37 | | - -o, --output-dir <OUTPUT_DIR> |
38 | | - Output directory for generated files (default: current directory) [default: .] |
39 | | - -T, --tables <TABLES> |
40 | | - Which tables to generate (default: all) [possible values: region, nation, supplier, customer, part, partsupp, orders, lineitem] |
41 | | - -p, --parts <PARTS> |
42 | | - Number of parts to generate (manual parallel generation) [default: 1] |
43 | | - --part <PART> |
44 | | - Which part to generate (1-based, only relevant if parts > 1) [default: 1] |
45 | | - -f, --format <FORMAT> |
46 | | - Output format: tbl, csv, parquet (default: tbl) [default: tbl] [possible values: tbl, csv, parquet] |
47 | | - -n, --num-threads <NUM_THREADS> |
48 | | - The number of threads for parallel generation, defaults to the number of CPUs [default: 8] |
49 | | - -c, --parquet-compression <PARQUET_COMPRESSION> |
50 | | - Parquet block compression format. Default is SNAPPY [default: SNAPPY] |
51 | | - -v, --verbose |
52 | | - Verbose output (default: false) |
53 | | - --stdout |
54 | | - Write the output to stdout instead of a file |
55 | | - -h, --help |
56 | | - Print help (see more with '--help') |
57 | | -``` |
| 40 | +# Scale Factor 10, all tables, in Apache Parquet format in the current directory |
| 41 | +# (3.6GB, 8 files, 60M lineitem rows, in 5 seconds on a modern laptop) |
| 42 | +tpchgen-cli -s 10 --format=parquet |
58 | 43 |
|
59 | | -For example generating a dataset with a scale factor of 1 (1GB) can be done like this: |
60 | | -```shell |
61 | | -$ tpchgen-cli -s 1 --output-dir=/tmp/tpch |
| 44 | +# Scale Factor 10, all tables, in `tbl` (CSV-like) format in the `sf10` directory
| 45 | +# (10GB, 8 files, 60M lineitem rows) |
| 46 | +tpchgen-cli -s 10 --output-dir sf10 |
| 47 | + |
| 48 | +# Scale Factor 1000, lineitem table, in Apache Parquet format in the sf1000 directory,
| 49 | +# 20 parts (partitions), 100MB row groups
| 50 | +# (220GB, 20 files, 6B lineitem rows, 3.5 minutes on a modern laptop) |
| 51 | +tpchgen-cli -s 1000 --tables lineitem --parts 20 --format=parquet --parquet-row-group-bytes=100000000 --output-dir sf1000 |
| 52 | + |
| 53 | +# Scale Factor 10, partitions 2 and 3 of 10, in the `partitioned` directory
| 54 | +# |
| 55 | +# partitioned/ |
| 56 | +# ├── lineitem |
| 57 | +# │ ├── lineitem.2.tbl |
| 58 | +# │ └── lineitem.3.tbl |
| 59 | +# └── orders |
| 60 | +# ├── orders.2.tbl |
| 61 | +# └── orders.3.tbl |
| 62 | +# |
| 63 | +for PART in `seq 2 3`; do |
| 64 | + tpchgen-cli --tables lineitem,orders --scale-factor=10 --output-dir partitioned --parts 10 --part $PART |
| 65 | +done |
62 | 66 | ``` |
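
The partition loop above generalizes to any number of parts. The following is a minimal sketch, assuming a hypothetical helper named `tpchgen_part_cmds` (it is not part of `tpchgen-cli`). It only prints one command per partition, using the flags shown in the examples above, so the commands can be reviewed before anything is run.

```shell
# Hypothetical helper (not shipped with tpchgen-cli): print one
# tpchgen-cli command per partition, using only the flags shown above.
# Arguments: <tables> <scale-factor> <parts> <output-dir>
tpchgen_part_cmds() {
  tables=$1
  sf=$2
  parts=$3
  outdir=$4
  part=1
  while [ "$part" -le "$parts" ]; do
    echo "tpchgen-cli --tables $tables --scale-factor=$sf --output-dir $outdir --parts $parts --part $part"
    part=$((part + 1))
  done
}

# Print the four commands needed to generate lineitem and orders in 4 parts
tpchgen_part_cmds lineitem,orders 10 4 partitioned
```

Piping the output to `sh` would run the commands one after another. Printing them first keeps the sketch safe to try on a machine where `tpchgen-cli` is not yet installed.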
| 67 | + |
| 68 | +## Performance |
| 69 | + |
| 70 | +| Scale Factor | `tpchgen-cli` | DuckDB | DuckDB (proprietary) | |
| 71 | +| ------------ | ------------- | ---------- | -------------------- | |
| 72 | +| 1 | `0:02.24` | `0:12.29` | `0:10.68` | |
| 73 | +| 10 | `0:09.97` | `1:46.80` | `1:41.14` | |
| 74 | +| 100 | `1:14.22` | `17:48.27` | `16:40.88` | |
| 75 | +| 1000 | `10:26.26` | N/A (OOM) | N/A (OOM) | |
| 76 | + |
| 77 | +- DuckDB (proprietary) is the time required to create TPC-H data using the
| 78 | + proprietary DuckDB format
| 79 | +- Creating Scale Factor 1000 data in DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
| 80 | + which is why no DuckDB time is reported for it in the table above.
| 81 | + |
| 82 | +Times to create TPC-H tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
| 83 | + |