Docs: Readme revamp (#179)

alamb · web-flow · commit 5a6b7da1076d · 2025-08-26T06:09:19.000-04:00
* docs: Update README

* updates

* update docs
diff --git a/README.md b/README.md
@@ -20,68 +20,18 @@ Blazing fast [TPCH] benchmark data generator, in pure Rust with zero dependencie
 
 ## Try it now
 
-### Install Using Python
+The easiest way to use this software is via the [`tpchgen-cli`] tool.
 
-Install this tool with Python:
-
-```shell
-pip install tpchgen-cli
-```
-
-```shell
-# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
-tpchgen-cli -s 10 --format=parquet
-```
-
-### Install Using Rust
-
-[Install Rust](https://www.rust-lang.org/tools/install) and this tool:
-
-```shell
-curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-cargo install tpchgen-cli
-```
-
-```shell
-# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
-tpchgen-cli -s 10 --format=parquet
-```
-
-Or watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) recorded by [@alamb](https://github.com/alamb)
-and the companion blog post in the [Datafusion blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/).
-
-### Examples
-
-```shell
-
-# Create a scale factor 10 dataset in the native table format.
-tpchgen-cli -s 10 --output-dir sf10
-
-# Create a scale factor 1 dataset in Parquet format.
-tpchgen-cli -s 1 --output-dir sf1-parquet --format=parquet
+## Performance
 
-# Create a scale factor 1 (default) partitioned dataset for the region, nation, orders
-# and customer tables.
-tpchgen-cli --tables region,nation,orders,customer --output-dir sf1-partitioned --parts 10 --part 2
+[`tpchgen-cli`] is more than 10x faster than the next fastest TPCH generator we
+know of. On a 2023 Mac M3 Max laptop, it easily generates data faster than can
+be written to SSD. See [BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more
+details on performance and benchmarking.
 
-# Create a scale factor 1 partitioned into separate folders.
-#
-# Each folder will have a single partition of rows, the partition size will depend on the scale
-# factor. For tables that have less rows than the minimum partition size like "nation" or "region"
-# the generator will produce the same file in each part.
-#
-# $ md5sum part-*/{nation,region}.tbl
-# 2f588e0b7fa72939b498c2abecd9fbbe  part-1/nation.tbl
-# 2f588e0b7fa72939b498c2abecd9fbbe  part-2/nation.tbl
-# c235841b00d29ad4f817771fcc851207  part-1/region.tbl
-# c235841b00d29ad4f817771fcc851207  part-2/region.tbl
-for PART in `seq 1 2`; do
-  mkdir part-$PART
-  tpchgen-cli --tables region,nation,orders,customer --output-dir part-$PART --parts 10 --part $PART
-done
-```
+[`tpchgen-cli`]: ./tpchgen-cli/README.md
 
-## Performance
+Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
 
 | Scale Factor | `tpchgen-cli` | DuckDB     | DuckDB (proprietary) |
 | ------------ | ------------- | ---------- | -------------------- |
@@ -92,18 +42,11 @@ done
 
 - DuckDB (proprietary) is the time required to create TPCH data using the
   proprietary DuckDB format
-- Creating Scale Factor 1000 data in DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
+- Creating Scale Factor 1000 using DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
   which is why it is not included in the table above.
 
-Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
-
 ![Parquet Generation Performance](parquet-performance.png)
 
-[`tpchgen-cli`](./tpchgen-cli/README.md) is more than 10x faster than the next
-fastest TPCH generator we know of. On a 2023 Mac M3 Max laptop, it easily
-generates data faster than can be written to SSD. See
-[BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more details on performance and
-benchmarking.
 
 ## Answers
 
diff --git a/tpchgen-arrow/README.md b/tpchgen-arrow/README.md
@@ -1,8 +1,9 @@
 # TPC-H Data Generator in Arrow format
 
-This crate generates TPCH data directly into [Apache Arrow] format using the [arrow] crate
+Generate TPCH data directly into [Apache Arrow] format using the [tpchgen] and [arrow] crates.
 
 [Apache Arrow]: https://arrow.apache.org/
+[tpchgen]: https://crates.io/crates/tpchgen
 [arrow]: https://crates.io/crates/arrow
 
 # Example usage: 
diff --git a/tpchgen-cli/README.md b/tpchgen-cli/README.md
@@ -1,62 +1,83 @@
 # TPC-H Data Generator CLI
 
-See the main [README.md](https://github.com/clflushopt/tpchgen-rs) for full documentation.
+`tpchgen-cli` is a high-performance, parallel TPC-H data generator command line
+tool.
 
-## Installation
+This tool is more than 10x faster than the next fastest TPCH generator we know
+of (`duckdb`). On a 2023 Mac M3 Max laptop, it easily generates data faster than
+can be written to SSD. See [BENCHMARKS.md] for more details on performance and
+benchmarking.
 
-### Install Using Python
+[BENCHMARKS.md]: https://github.com/clflushopt/tpchgen-rs/blob/main/benchmarks/BENCHMARKS.md
+
+* See the tpchgen [README.md](https://github.com/clflushopt/tpchgen-rs) for
+  project details
+* Watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) by
+  [@alamb](https://github.com/alamb) to see `tpchgen-cli` in action
+* Read the companion blog post in the
+  [DataFusion blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/)
+  to learn about the project's history
+* Try it yourself by following the instructions below
+
+## Install via `pip`
 
-Install this tool with Python:
 ```shell
 pip install tpchgen-cli
 ```
 
-### Install Using Rust
+## Install via Rust
 
-[Install Rust](https://www.rust-lang.org/tools/install) and this tool:
+[Install Rust](https://www.rust-lang.org/tools/install) and compile `tpchgen-cli`:
 
 ```shell
 curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
-cargo install tpchgen-cli
+RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli
 ```
 
-## CLI Usage
-
-We tried to make the `tpchgen-cli` experience as close to `dbgen` as possible for no other
-reason than maybe make it easier for you to have a drop-in replacement.
+## Examples
 
 ```shell
-$ tpchgen-cli -h
-TPC-H Data Generator
-
-Usage: tpchgen-cli [OPTIONS]
-
-Options:
-  -s, --scale-factor <SCALE_FACTOR>
-          Scale factor to address (default: 1) [default: 1]
-  -o, --output-dir <OUTPUT_DIR>
-          Output directory for generated files (default: current directory) [default: .]
-  -T, --tables <TABLES>
-          Which tables to generate (default: all) [possible values: region, nation, supplier, customer, part, partsupp, orders, lineitem]
-  -p, --parts <PARTS>
-          Number of parts to generate (manual parallel generation) [default: 1]
-      --part <PART>
-          Which part to generate (1-based, only relevant if parts > 1) [default: 1]
-  -f, --format <FORMAT>
-          Output format: tbl, csv, parquet (default: tbl) [default: tbl] [possible values: tbl, csv, parquet]
-  -n, --num-threads <NUM_THREADS>
-          The number of threads for parallel generation, defaults to the number of CPUs [default: 8]
-  -c, --parquet-compression <PARQUET_COMPRESSION>
-          Parquet block compression format. Default is SNAPPY [default: SNAPPY]
-  -v, --verbose
-          Verbose output (default: false)
-      --stdout
-          Write the output to stdout instead of a file
-  -h, --help
-          Print help (see more with '--help')
-```
+# Scale Factor 10, all tables, in Apache Parquet format in the current directory
+# (3.6GB, 8 files, 60M lineitem rows, in 5 seconds on a modern laptop)
+tpchgen-cli -s 10 --format=parquet
 
-For example generating a dataset with a scale factor of 1 (1GB) can be done like this:
-```shell
-$ tpchgen-cli -s 1 --output-dir=/tmp/tpch
+# Scale Factor 10, all tables, in `tbl` (CSV-like) format in the `sf10` directory
+# (10GB, 8 files, 60M lineitem rows)
+tpchgen-cli -s 10 --output-dir sf10
+
+# Scale Factor 1000, lineitem table, in Apache Parquet format in the `sf1000` directory,
+# 20 parts (partitions), 100MB row groups
+# (220GB, 20 files, 6B lineitem rows, 3.5 minutes on a modern laptop)
+tpchgen-cli -s 1000 --tables lineitem --parts 20 --format=parquet --parquet-row-group-bytes=100000000 --output-dir sf1000
+
+# Scale Factor 10, partitions 2 and 3 of 10, in the `partitioned` directory
+#
+# partitioned/
+# ├── lineitem
+# │   ├── lineitem.2.tbl
+# │   └── lineitem.3.tbl
+# └── orders
+#    ├── orders.2.tbl
+#    └── orders.3.tbl
+#
+for PART in `seq 2 3`; do
+  tpchgen-cli --tables lineitem,orders --scale-factor=10 --output-dir partitioned --parts 10 --part $PART
+done
 ```
+
+## Performance
+
+| Scale Factor | `tpchgen-cli` | DuckDB     | DuckDB (proprietary) |
+| ------------ | ------------- | ---------- | -------------------- |
+| 1            | `0:02.24`     | `0:12.29`  | `0:10.68`            |
+| 10           | `0:09.97`     | `1:46.80`  | `1:41.14`            |
+| 100          | `1:14.22`     | `17:48.27` | `16:40.88`           |
+| 1000         | `10:26.26`    | N/A (OOM)  | N/A (OOM)            |
+
+- DuckDB (proprietary) is the time required to create TPCH data using the
+  proprietary DuckDB format
+- Creating Scale Factor 1000 data in DuckDB [required 647 GB of memory](https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator),
+  which is why no DuckDB time is reported for it in the table above.
+
+Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
+
diff --git a/tpchgen/Cargo.toml b/tpchgen/Cargo.toml
@@ -3,7 +3,7 @@ name = "tpchgen"
 authors = ["clflushopt", "alamb"]
 description = "Blazing fast pure Rust no dependency TPC-H data generation library."
 repository = "https://github.com/clflushopt/tpchgen-rs"
-readme = { workspace = true }
+readme = "README.md"
 version = { workspace = true }
 edition = { workspace = true }
 homepage = { workspace = true }
diff --git a/tpchgen/README.md b/tpchgen/README.md
@@ -0,0 +1,8 @@
+# TPC-H Data Generator Crate
+
+This crate provides the core data generator logic for TPC-H. It has no
+dependencies and is easy to embed in other Rust projects.
+
+See the [docs.rs page](https://docs.rs/tpchgen/latest/tpchgen/) for the API and
+the tpchgen [README.md](https://github.com/clflushopt/tpchgen-rs) for more
+information on the project.
diff --git a/tpchgen/src/lib.rs b/tpchgen/src/lib.rs
@@ -43,14 +43,14 @@
 //! [`LineItem`]: generators::LineItem
 //! [`LineItemCsv`]: csv::LineItemCsv
 //!
-//!
 //! The library was designed to be easily integrated in existing Rust projects as
 //! such it avoids exposing a malleable API and purposely does not have any dependencies
-//! on other Rust crates. It is focused entire on the core
+//! on other Rust crates. It is focused entirely on the core
 //! generation logic.
 //!
 //! If you want an easy way to generate the TPC-H dataset for usage with external
-//! systems you can use CLI tool instead.
+//! systems, see the [`tpchgen-cli`](https://github.com/alamb/tpchgen-rs/tree/main/tpchgen-cli)
+//! tool instead.
 pub mod csv;
 pub mod dates;
 pub mod decimal;