TPC-DS Data Generator Crate

This crate provides the core data generator logic for TPC-DS.

Usage

# Build the generator
cargo build --release

# Generate all tables at scale factor 1 (default)
./target/release/tpcdsgen

# Generate all tables at scale factor 10
./target/release/tpcdsgen --scale 10

# Generate a specific table
./target/release/tpcdsgen --table store_sales --scale 10

# Generate to a specific directory
./target/release/tpcdsgen --scale 10 --directory /path/to/output
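When only a few tables are needed, the flags shown above can be wrapped in a small loop. This is a minimal sketch, not part of the crate: the `generate_tables` function and the `TPCDSGEN` variable are hypothetical names, and it assumes `--table`, `--scale`, and `--directory` can be combined in one invocation.

```shell
# Hypothetical helper: generate a chosen set of tables into one directory.
# TPCDSGEN may be overridden to point at a different build of the binary.
TPCDSGEN="${TPCDSGEN:-./target/release/tpcdsgen}"

generate_tables() {
  scale="$1"; out_dir="$2"; shift 2
  # Create the output directory up front; harmless if it already exists.
  mkdir -p "$out_dir" || return 1
  for table in "$@"; do
    # Stop at the first failure so a partial run does not go unnoticed.
    "$TPCDSGEN" --table "$table" --scale "$scale" --directory "$out_dir" || return 1
  done
}

# Example: generate_tables 10 /path/to/output store_sales store_returns
```

Generating tables one at a time costs one process start per table, so for the full set the plain `--scale` invocation is likely the better choice.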

Generating Fixtures

Fixtures are pre-generated TPC-DS data files used for conformance testing.

Directory Structure

tests/fixtures/
├── scale-1-trino/    # Java reference fixtures (`--compat trino`)
├── scale-1-c/        # C dsdgen reference fixtures (`--compat c`)
└── scale-10-trino/   # higher scale factors as needed

Conformance Testing

tpcdsgen ships with two conformance suites, both implemented as shell scripts that do byte-for-byte (MD5) comparison of .dat output. See scripts/README.md for full details.
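An MD5 mismatch says that two files differ but not where. When debugging a failing table, a direct byte comparison can help. The sketch below is an illustration only and is not taken from the conformance scripts; `diff_dat` is a hypothetical helper built on the standard `cmp` utility.

```shell
# Hypothetical helper: report whether two .dat files are byte-identical and,
# if they are not, where they first diverge.
diff_dat() {
  generated="$1"; fixture="$2"
  if cmp -s "$generated" "$fixture"; then
    echo "identical"
  else
    # cmp prints the offset and line number of the first differing byte.
    cmp "$generated" "$fixture"
    return 1
  fi
}

# Example with placeholder paths:
# diff_dat /path/to/output/reason.dat tests/fixtures/scale-1-trino/reason.dat
```

The line number reported by `cmp` corresponds to a row in the `.dat` file, which narrows the search to a single generated record.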

vs. Java / Trino reference (default, --compat trino):

# One-time: clone & build the Java TPC-DS implementation.
./scripts/bootstrap-trino.sh

# Generate Java reference fixtures into tests/fixtures/scale-N-trino/.
./scripts/generate-fixtures.sh

# Compare Rust output byte-for-byte against the Java fixtures.
./scripts/test-all-tables.sh --scale 1

vs. C dsdgen reference (--compat c):

# One-time: download pre-generated C dsdgen data from
# https://github.com/alamb/tpcds-data into tests/fixtures/scale-N-c/.
./scripts/generate-fixtures.sh --compat c --scale 1

# Compare Rust --compat c output byte-for-byte against the C fixtures.
./scripts/test-all-tables.sh --compat c --scale 1

Both suites also support comparing a single table:

./scripts/compare-table.sh reason                # vs. Java
./scripts/compare-table.sh reason --compat c     # vs. C dsdgen

Verifying Fixtures with MD5SUMS

Each fixture directory contains an MD5SUMS file for verification.

On Linux:

cd tests/fixtures/scale-1-trino
md5sum -c MD5SUMS

On macOS:

cd tests/fixtures/scale-1-trino
while read -r hash file; do
  [[ $(md5 -q "$file") == "$hash" ]] && echo "$file: OK" || echo "$file: FAILED"
done < MD5SUMS
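The two variants above can be folded into one function that picks whichever tool is installed. This is a sketch under the assumption that `MD5SUMS` uses the usual `md5sum` format (hash, whitespace, file name); `verify_md5sums` is a hypothetical name, not a script shipped with the crate.

```shell
# Verify every entry of the MD5SUMS file in the given directory.
# Uses md5sum where available (Linux) and falls back to md5 -q (macOS).
verify_md5sums() {
  (
    # Run in a subshell so the caller's working directory is untouched.
    cd "$1" || exit 1
    if command -v md5sum >/dev/null 2>&1; then
      md5sum -c MD5SUMS
    else
      status=0
      while read -r hash file; do
        if [ "$(md5 -q "$file")" = "$hash" ]; then
          echo "$file: OK"
        else
          echo "$file: FAILED"
          status=1
        fi
      done < MD5SUMS
      exit "$status"
    fi
  )
}

# Example: verify_md5sums tests/fixtures/scale-1-trino
```

Unlike the inline macOS loop, the function returns a non-zero status when any file fails, so it can be used in scripts and CI steps.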

Known Bugs

The TPC-DS reference implementation contains several bugs that must be replicated for benchmark compliance. These bugs originated in the C implementation and were faithfully reproduced in the Java port. Our Rust implementation also replicates these bugs to ensure byte-for-byte compatibility with the reference implementation.

See BUGS.md for a detailed list of documented bugs. More will be added.

TPC-DS Reference MD5 Hashes

These are the canonical MD5 hashes for TPC-DS data generated by the Java reference implementation. The Rust implementation must produce byte-for-byte identical output.

Scale 1

Generated with: java -jar tpcds-*.jar --scale 1

| Table | MD5 Hash |
|---|---|
| call_center.dat | cc9aabc63eb8603bd7330b6735ed0961 |
| catalog_page.dat | 0bbac1b8bdcf8ce2d5f0034980ee0196 |
| catalog_returns.dat | 8460b5abd6b6ceaf6107f217b016fb23 |
| catalog_sales.dat | 51a0bc401b4b64d94736634b54068240 |
| customer.dat | 3672ffdefac3cf00413ecef71a753636 |
| customer_address.dat | abac2e3925ab9bf66cec3b527a0468ed |
| customer_demographics.dat | 8831872c6d56ea9d4f24701f2feaef48 |
| date_dim.dat | f3e77714328dcc57302777e72fd7747c |
| dbgen_version.dat | a430da74c2e44926c53deb74e35b23f1 * |
| household_demographics.dat | dccf2ff17c5e420021fbf92bf9a0a5ec |
| income_band.dat | db8e8012be51ef81cf215774bec95533 |
| inventory.dat | cfefc8724693ec9149f1d5b345fcecc2 |
| item.dat | bebbcfd1acecdea16a5a3feb5e4deb96 |
| promotion.dat | acb42558d0dc5e0ab6df5a664c1629cf |
| reason.dat | 57fe9b8688095bd345cc846ec4400be0 |
| ship_mode.dat | 791d16af982a67ad170a6b6527e25a35 |
| store.dat | 80082d03e1b01340e19db3187d8edbd6 |
| store_returns.dat | 9009d804c02ee839e0b2ecd5fb4ae03f |
| store_sales.dat | f003b3810e042d6dd47f48506616d88d |
| time_dim.dat | a68339c5720d25380b53f6e0f2f72333 |
| warehouse.dat | f56789e8b724b989d74e213e0686052f |
| web_page.dat | 6feef91675c336d6f25e55ebbdf8c13c |
| web_returns.dat | e45390d32d1698fef71f05f474a4d748 |
| web_sales.dat | 15f9d835727f3a39a096c346f56e51f7 |
| web_site.dat | de5fb00a80673cb44b4b508da75d4bcf |

Scale 10

Generated with: java -jar tpcds-*.jar --scale 10

| Table | MD5 Hash |
|---|---|
| call_center.dat | 235909679f4d125e769aa38eb16e9098 |
| catalog_page.dat | a5daa0d93ecde8bd9f6ed79cd3b63916 |
| catalog_returns.dat | 982a8b96fa0d9487015cd137136c8f68 |
| catalog_sales.dat | 97d5351b430d6c15e3906518315f0787 |
| customer.dat | 486a030a55d468ef15ff2ff01583e6dc |
| customer_address.dat | 860602fea368111009ef08b167e1e299 |
| customer_demographics.dat | 8831872c6d56ea9d4f24701f2feaef48 |
| date_dim.dat | f3e77714328dcc57302777e72fd7747c |
| dbgen_version.dat | 8553e926c33f4ad84e4d58fcfd20c48c * |
| household_demographics.dat | dccf2ff17c5e420021fbf92bf9a0a5ec |
| income_band.dat | db8e8012be51ef81cf215774bec95533 |
| inventory.dat | 4ad3640917c6567038f081bbe2cf0e3e |
| item.dat | bff29691c74ae66eb2dcc3af686fb2ba |
| promotion.dat | b8e8a7741f64edc5d09fdb0453c86705 |
| reason.dat | a1fdcd35ca0eddd0d5f37b0e5c2fddb3 |
| ship_mode.dat | 791d16af982a67ad170a6b6527e25a35 |
| store.dat | 430a01467a2d55d0e9a1bebad4f1c44b |
| store_returns.dat | 4ba001a6066db20066cd198242f92ca1 |
| store_sales.dat | ecff92350fa0466e9b9407a1b5ad4020 |
| time_dim.dat | a68339c5720d25380b53f6e0f2f72333 |
| warehouse.dat | e0c56fe622774d09c9dec42029881ad5 |
| web_page.dat | e55695fdb2b86f96cf46e2a55b6f3748 |
| web_returns.dat | ac0197593d3f4cc3bb46c8ad7e6cd735 |
| web_sales.dat | 4da375300bcb0ce8785e1f100fb72efe |
| web_site.dat | 4669d52e36cd112af10e137e5d8d7697 |

* dbgen_version.dat contains timestamps and will differ between runs.
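Because dbgen_version.dat changes on every run, a manual check of freshly generated output against a fixture's MD5SUMS file needs to skip that entry. The sketch below assumes GNU `md5sum` (Linux) and assumes that the MD5SUMS file lists dbgen_version.dat; `verify_stable_tables` is a hypothetical helper, not one of the shipped scripts.

```shell
# Check generated files in data_dir against a fixture MD5SUMS file,
# ignoring dbgen_version.dat, whose contents include timestamps.
verify_stable_tables() {
  data_dir="$1"; sums="$2"
  # Drop the dbgen_version.dat entry, then let md5sum check the rest
  # from standard input while running inside the data directory.
  grep -v 'dbgen_version\.dat$' "$sums" | (cd "$data_dir" && md5sum -c -)
}

# Example with placeholder paths:
# verify_stable_tables /path/to/output tests/fixtures/scale-1-trino/MD5SUMS
```

The function's exit status is that of `md5sum -c`, so it is non-zero if any of the remaining tables differs or is missing.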

Verification

To verify the Rust implementation matches:

# Verify at scale 1
./scripts/test-all-tables.sh --scale 1

# Verify at scale 10
./scripts/test-all-tables.sh --scale 10