Commit 5a6b7da

Docs: Readme revamp (#179)
* docs: Update README
* updates
* update docs
1 parent 3ab33ee commit 5a6b7da

6 files changed

Lines changed: 87 additions & 114 deletions

README.md

Lines changed: 9 additions & 66 deletions
Original file line number | Diff line number | Diff line change
@@ -20,68 +20,18 @@ Blazing fast [TPCH] benchmark data generator, in pure Rust with zero dependencie
2020

2121
## Try it now
2222

23-
### Install Using Python
23+
The easiest way to use this software is via the [`tpchgen-cli`] tool.
2424

25-
Install this tool with Python:
26-
27-
```shell
28-
pip install tpchgen-cli
29-
```
30-
31-
```shell
32-
# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
33-
tpchgen-cli -s 10 --format=parquet
34-
```
35-
36-
### Install Using Rust
37-
38-
[Install Rust](https://www.rust-lang.org/tools/install) and this tool:
39-
40-
```shell
41-
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
42-
cargo install tpchgen-cli
43-
```
44-
45-
```shell
46-
# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
47-
tpchgen-cli -s 10 --format=parquet
48-
```
49-
50-
Or watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) recorded by [@alamb](https://github.com/alamb)
51-
and the companion blog post in the [Datafusion blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/).
52-
53-
### Examples
54-
55-
```shell
56-
57-
# Create a scale factor 10 dataset in the native table format.
58-
tpchgen-cli -s 10 --output-dir sf10
59-
60-
# Create a scale factor 1 dataset in Parquet format.
61-
tpchgen-cli -s 1 --output-dir sf1-parquet --format=parquet
25+
## Performance
6226

63-
# Create a scale factor 1 (default) partitioned dataset for the region, nation, orders
64-
# and customer tables.
65-
tpchgen-cli --tables region,nation,orders,customer --output-dir sf1-partitioned --parts 10 --part 2
27+
[`tpchgen-cli`] is more than 10x faster than the next fastest TPCH generator we
28+
know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than can
29+
be written to SSD. See [BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more
30+
details on performance and benchmarking.
6631

67-
# Create a scale factor 1 partitioned into separate folders.
68-
#
69-
# Each folder will have a single partition of rows, the partition size will depend on the scale
70-
# factor. For tables that have less rows than the minimum partition size like "nation" or "region"
71-
# the generator will produce the same file in each part.
72-
#
73-
# $ md5sum part-*/{nation,region}.tbl
74-
# 2f588e0b7fa72939b498c2abecd9fbbe part-1/nation.tbl
75-
# 2f588e0b7fa72939b498c2abecd9fbbe part-2/nation.tbl
76-
# c235841b00d29ad4f817771fcc851207 part-1/region.tbl
77-
# c235841b00d29ad4f817771fcc851207 part-2/region.tbl
78-
for PART in `seq 1 2`; do
79-
mkdir part-$PART
80-
tpchgen-cli --tables region,nation,orders,customer --output-dir part-$PART --parts 10 --part $PART
81-
done
82-
```
32+
[`tpchgen-cli`]: ./tpchgen-cli/README.md
8333

84-
## Performance
34+
Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
8535

8636
| Scale Factor | `tpchgen-cli` | DuckDB | DuckDB (proprietary) |
8737
| ------------ | ------------- | ---------- | -------------------- |
@@ -92,18 +42,11 @@ done
9242

9343
- DuckDB (proprietary) is the time required to create TPCH data using the
9444
proprietary DuckDB format
95-
- Creating Scale Factor 1000 data in DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
45+
- Creating Scale Factor 1000 using DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
9646
which is why it is not included in the table above.
9747

98-
Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
99-
10048
![Parquet Generation Performance](parquet-performance.png)
10149

102-
[`tpchgen-cli`](./tpchgen-cli/README.md) is more than 10x faster than the next
103-
fastest TPCH generator we know of. On a 2023 Mac M3 Max laptop, it easily
104-
generates data faster than can be written to SSD. See
105-
[BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more details on performance and
106-
benchmarking.
10750

10851
## Answers
10952

tpchgen-arrow/README.md

Lines changed: 2 additions & 1 deletion
@@ -1,8 +1,9 @@
11
# TPC-H Data Generator in Arrow format
22

3-
This crate generates TPCH data directly into [Apache Arrow] format using the [arrow] crate
3+
Generate TPCH data directly into [Apache Arrow] format using the [tpchgen] and [arrow] crates.
44

55
[Apache Arrow]: https://arrow.apache.org/
6+
[tpchgen]: https://crates.io/crates/tpchgen
67
[arrow]: https://crates.io/crates/arrow
78

89
# Example usage:

tpchgen-cli/README.md

Lines changed: 64 additions & 43 deletions
@@ -1,62 +1,83 @@
11
# TPC-H Data Generator CLI
22

3-
See the main [README.md](https://github.com/clflushopt/tpchgen-rs) for full documentation.
3+
`tpchgen-cli` is a high-performance, parallel TPC-H data generator command line
4+
tool.
45

5-
## Installation
6+
This tool is more than 10x faster than the next fastest TPCH generator we know
7+
of (`duckdb`). On a 2023 Mac M3 Max laptop, it easily generates data faster than
8+
can be written to SSD. See [BENCHMARKS.md] for more details on performance and
9+
benchmarking.
610

7-
### Install Using Python
11+
[BENCHMARKS.md]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
12+
13+
* See the tpchgen [README.md](https://github.com/clflushopt/tpchgen-rs) for
14+
project details
15+
* Watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) by
16+
[@alamb](https://github.com/alamb) to see `tpchgen-cli` in action
17+
* Read the companion blog post in the
18+
[DataFusion
19+
blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/) to learn about the project's history
20+
* Try it yourself by following the instructions below
21+
22+
## Install via `pip`
823

9-
Install this tool with Python:
1024
```shell
1125
pip install tpchgen-cli
1226
```
1327

14-
### Install Using Rust
28+
## Install via Rust
1529

16-
[Install Rust](https://www.rust-lang.org/tools/install) and this tool:
30+
[Install Rust](https://www.rust-lang.org/tools/install) and compile `tpchgen-cli` from source:
1731

1832
```shell
1933
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
20-
cargo install tpchgen-cli
34+
RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli
2135
```
2236

23-
## CLI Usage
24-
25-
We tried to make the `tpchgen-cli` experience as close to `dbgen` as possible for no other
26-
reason than maybe make it easier for you to have a drop-in replacement.
37+
## Examples
2738

2839
```shell
29-
$ tpchgen-cli -h
30-
TPC-H Data Generator
31-
32-
Usage: tpchgen-cli [OPTIONS]
33-
34-
Options:
35-
-s, --scale-factor <SCALE_FACTOR>
36-
Scale factor to address (default: 1) [default: 1]
37-
-o, --output-dir <OUTPUT_DIR>
38-
Output directory for generated files (default: current directory) [default: .]
39-
-T, --tables <TABLES>
40-
Which tables to generate (default: all) [possible values: region, nation, supplier, customer, part, partsupp, orders, lineitem]
41-
-p, --parts <PARTS>
42-
Number of parts to generate (manual parallel generation) [default: 1]
43-
--part <PART>
44-
Which part to generate (1-based, only relevant if parts > 1) [default: 1]
45-
-f, --format <FORMAT>
46-
Output format: tbl, csv, parquet (default: tbl) [default: tbl] [possible values: tbl, csv, parquet]
47-
-n, --num-threads <NUM_THREADS>
48-
The number of threads for parallel generation, defaults to the number of CPUs [default: 8]
49-
-c, --parquet-compression <PARQUET_COMPRESSION>
50-
Parquet block compression format. Default is SNAPPY [default: SNAPPY]
51-
-v, --verbose
52-
Verbose output (default: false)
53-
--stdout
54-
Write the output to stdout instead of a file
55-
-h, --help
56-
Print help (see more with '--help')
57-
```
40+
# Scale Factor 10, all tables, in Apache Parquet format in the current directory
41+
# (3.6GB, 8 files, 60M lineitem rows, in 5 seconds on a modern laptop)
42+
tpchgen-cli -s 10 --format=parquet
5843

59-
For example generating a dataset with a scale factor of 1 (1GB) can be done like this:
60-
```shell
61-
$ tpchgen-cli -s 1 --output-dir=/tmp/tpch
44+
# Scale Factor 10, all tables, in `tbl` (CSV-like) format in the `sf10` directory
45+
# (10GB, 8 files, 60M lineitem rows)
46+
tpchgen-cli -s 10 --output-dir sf10
47+
48+
# Scale Factor 1000, lineitem table, in Apache Parquet format in sf1000 directory,
49+
# 20 parts (partitions), 100MB row groups
50+
# (220GB, 20 files, 6B lineitem rows, 3.5 minutes on a modern laptop)
51+
tpchgen-cli -s 1000 --tables lineitem --parts 20 --format=parquet --parquet-row-group-bytes=100000000 --output-dir sf1000
52+
53+
# Scale Factor 10, lineitem and orders tables, partitions 2 and 3 of 10, in the `partitioned` directory
54+
#
55+
# partitioned/
56+
# ├── lineitem
57+
# │ ├── lineitem.2.tbl
58+
# │ └── lineitem.3.tbl
59+
# └── orders
60+
# ├── orders.2.tbl
61+
# └── orders.3.tbl
62+
#
63+
for PART in `seq 2 3`; do
64+
tpchgen-cli --tables lineitem,orders --scale-factor=10 --output-dir partitioned --parts 10 --part $PART
65+
done
6266
```
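
After a partitioned run like the last example, it can help to confirm that every expected part file was written. The sketch below is a hypothetical helper, not part of `tpchgen-cli`; it assumes the `<dir>/<table>/<table>.<part>.tbl` layout shown in the tree in that example, and the name `check_parts` is made up for illustration.

```shell
# Hypothetical helper (not shipped with tpchgen-cli): report missing part files.
# Assumes the <dir>/<table>/<table>.<part>.tbl layout from the example tree.
check_parts() {
  dir=$1; table=$2; first=$3; last=$4
  missing=0
  part=$first
  while [ "$part" -le "$last" ]; do
    file="$dir/$table/$table.$part.tbl"
    # -s is true only if the file exists and is non-empty
    if [ ! -s "$file" ]; then
      echo "missing or empty: $file" >&2
      missing=$((missing + 1))
    fi
    part=$((part + 1))
  done
  # Succeed only when every part in the range was found
  [ "$missing" -eq 0 ]
}
```

For the example above, `check_parts partitioned lineitem 2 3 && check_parts partitioned orders 2 3` would succeed only if all four files exist and are non-empty.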
67+
68+
## Performance
69+
70+
| Scale Factor | `tpchgen-cli` | DuckDB | DuckDB (proprietary) |
71+
| ------------ | ------------- | ---------- | -------------------- |
72+
| 1 | `0:02.24` | `0:12.29` | `0:10.68` |
73+
| 10 | `0:09.97` | `1:46.80` | `1:41.14` |
74+
| 100 | `1:14.22` | `17:48.27` | `16:40.88` |
75+
| 1000 | `10:26.26` | N/A (OOM) | N/A (OOM) |
76+
77+
- DuckDB (proprietary) is the time required to create TPCH data using the
78+
proprietary DuckDB format
79+
- Creating Scale Factor 1000 data in DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
80+
which is why it is not included in the table above.
81+
82+
Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
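
The speedup implied by the table can be worked out directly from the listed times. The following is a small sketch using the `tpchgen-cli` and DuckDB columns above; the helper names `to_seconds` and `speedup` are made up for illustration.

```shell
# Convert an "M:SS.ss" time from the table into seconds.
to_seconds() {
  echo "$1" | awk -F: '{ printf "%.2f", $1 * 60 + $2 }'
}

# Print (DuckDB time) / (tpchgen-cli time), rounded to two decimals.
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", b / a }'
}

speedup 0:02.24 0:12.29    # scale factor 1   -> 5.49
speedup 0:09.97 1:46.80    # scale factor 10  -> 10.71
speedup 1:14.22 17:48.27   # scale factor 100 -> 14.39
```

On these numbers the ratio grows with the scale factor, from roughly 5.5x at scale factor 1 to above 10x at scale factors 10 and 100.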
83+

tpchgen/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ name = "tpchgen"
33
authors = ["clflushopt", "alamb"]
44
description = "Blazing fast pure Rust no dependency TPC-H data generation library."
55
repository = "https://github.com/clflushopt/tpchgen-rs"
6-
readme = { workspace = true }
6+
readme = "README.md"
77
version = { workspace = true }
88
edition = { workspace = true }
99
homepage = { workspace = true }

tpchgen/README.md

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
1+
# TPC-H Data Generator Crate
2+
3+
This crate provides the core data generator logic for TPC-H. It has no
4+
dependencies and is easy to embed in any other Rust projects.
5+
6+
See the [docs.rs page](https://docs.rs/tpchgen/latest/tpchgen/) for the API and
7+
the tpchgen [README.md](https://github.com/clflushopt/tpchgen-rs) for more
8+
information on the project.

tpchgen/src/lib.rs

Lines changed: 3 additions & 3 deletions
@@ -43,14 +43,14 @@
4343
//! [`LineItem`]: generators::LineItem
4444
//! [`LineItemCsv`]: csv::LineItemCsv
4545
//!
46-
//!
4746
//! The library was designed to be easily integrated in existing Rust projects as
4847
//! such it avoids exposing a malleable API and purposely does not have any dependencies
49-
//! on other Rust crates. It is focused entire on the core
48+
//! on other Rust crates. It is focused entirely on the core
5049
//! generation logic.
5150
//!
5251
//! If you want an easy way to generate the TPC-H dataset for usage with external
53-
//! systems you can use CLI tool instead.
52+
//! systems, see the [`tpchgen-cli`](https://github.com/alamb/tpchgen-rs/tree/main/tpchgen-cli)
53+
//! tool instead.
5454
pub mod csv;
5555
pub mod dates;
5656
pub mod decimal;
