TPC-DS Test Scripts

This directory contains scripts for testing the Rust TPC-DS implementation against two reference implementations:

  1. Java / Trino (default, --compat trino) — the Java port of dsdgen used by Trino. The Rust port was originally derived from this and is expected to be byte-for-byte identical.
  2. C dsdgen (--compat c) — the original TPC-supplied reference implementation. The --compat c mode corrects bugs in the Java port to match the C reference (see BUGS.md and the parent README).

Both conformance suites validate byte-for-byte identical output via MD5/diff comparison.
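
As a rough illustration of the MD5-only check, the sketch below compares one generated file against its entry in an MD5SUMS file. The function name `verify_table_md5` is hypothetical. The sketch assumes MD5SUMS uses the standard `md5sum` output format and that GNU `md5sum` is on the PATH; the real scripts may do this differently.

```shell
# Sketch only: verify one generated table against a checked-in MD5SUMS file.
# Assumption: MD5SUMS lines look like "<hash>  <file>" or "<hash> *<file>".
verify_table_md5() {
    local md5sums="$1" generated="$2"
    local name expected actual
    name="$(basename "$generated")"
    # Look up the expected hash for this file name (strip md5sum's binary marker).
    expected="$(awk -v f="$name" \
        '{ n = $2; sub(/^\*/, "", n); if (n == f) { print $1; exit } }' "$md5sums")"
    if [ -z "$expected" ]; then
        echo "no MD5SUMS entry for $name" >&2
        return 2
    fi
    actual="$(md5sum "$generated" | awk '{ print $1 }')"
    # Exit status 0 means the hashes match, 1 means a mismatch.
    [ "$expected" = "$actual" ]
}
```

A call such as `verify_table_md5 tests/fixtures/scale-1-trino/MD5SUMS /tmp/out/reason.dat` returns 0 on a match, where `/tmp/out` stands in for wherever the Rust output was written.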

Directory Structure

tpcdsgen/
├── tests/
│   └── fixtures/
│       ├── scale-1-trino/        # Java reference (`--compat trino`)
│       │   ├── MD5SUMS           # checked into git, used for output comparisons
│       │   ├── call_center.dat   # *.dat files are gitignored; generated by
│       │   ├── warehouse.dat     #   generate-fixtures.sh (only needed for --full)
│       │   └── ... (all 25 tables)
│       └── scale-1-c/           # C dsdgen reference (`--compat c`)
│           ├── MD5SUMS
│           ├── call_center.dat
│           ├── warehouse.dat
│           └── ... (all 25 tables)
└── scripts/
    ├── bootstrap-trino.sh       # Clone + build the Java TPC-DS impl
    ├── generate-fixtures.sh     # Generate/download reference fixtures
    │                            #   (Java via --compat trino; C via --compat c)
    ├── compare-table.sh         # Compare one table
    ├── test-all-tables.sh       # Compare all ported tables
    ├── clean-fixtures.sh        # Clean fixtures
    └── README.md                # This file

Quick Start — Java conformance (--compat trino)

# Default (MD5-only):
./scripts/test-all-tables.sh

If that fails, re-run byte-for-byte against the .dat fixtures — this needs a one-time Java bootstrap and fixture generation:

./scripts/bootstrap-trino.sh                      # first time only
./scripts/generate-fixtures.sh                    # produces .dat files
./scripts/test-all-tables.sh --full

Quick Start — C dsdgen conformance (--compat c)

The C reference data is pre-generated and published in the alamb/tpcds-data repository, one branch per scale factor (sf1, sf2, ...). The default MD5-only path needs only the checked-in MD5SUMS. Only --full needs the actual data: generate-fixtures.sh --compat c shallow-clones the matching branch (--depth 1) and extracts it into tests/fixtures/scale-N-c/.

# Default (MD5-only): no download needed.
./scripts/test-all-tables.sh --compat c

# Byte-for-byte (--full): download the C reference data first.
./scripts/generate-fixtures.sh --compat c              # sf1
./scripts/generate-fixtures.sh --compat c --scale 2    # sf2
./scripts/test-all-tables.sh --compat c --full

# Or compare a single table.
./scripts/compare-table.sh reason --compat c                 # MD5-only
./scripts/compare-table.sh reason --compat c --full          # byte-for-byte
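
The download step for the C reference data can be pictured roughly as follows. This is a sketch under stated assumptions, not the contents of generate-fixtures.sh: the helper names are hypothetical, the repository URL assumes alamb/tpcds-data lives on GitHub, and the `*.tar.bz2` archive layout is a guess based on the tar and bzip2 requirements.

```shell
# Sketch only. Assumption: alamb/tpcds-data is hosted on GitHub.
TPCDS_DATA_URL="${TPCDS_DATA_URL:-https://github.com/alamb/tpcds-data}"

# Branch that holds the data for a given scale factor (sf1, sf2, ...).
c_fixture_branch() { echo "sf$1"; }

# Directory the extracted reference files end up in.
c_fixture_dir() { echo "tests/fixtures/scale-$1-c"; }

# Shallow-clone the branch for one scale factor and unpack its archives.
fetch_c_fixtures() {
    local scale="${1:-1}" clone dest
    dest="$(c_fixture_dir "$scale")"
    clone="$(mktemp -d)"
    git clone --depth 1 --branch "$(c_fixture_branch "$scale")" \
        "$TPCDS_DATA_URL" "$clone" || return 1
    mkdir -p "$dest"
    # Assumption: the data ships as one or more bzip2-compressed tarballs.
    find "$clone" -name '*.tar.bz2' -exec tar -xjf {} -C "$dest" \;
    rm -rf "$clone"
}
```

The shallow clone of a single branch keeps the download limited to one scale factor, which is the point of publishing one branch per scale factor.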

Scripts

Each script is self-documenting — open it and read the header comment for full usage, flags, environment variables, output, and exit codes. The table below is just a roadmap.

| Script | Purpose |
|---|---|
| bootstrap-trino.sh | Clone and build the Java / Trino reference implementation into ../tpcds/. Run once before Java conformance. |
| generate-fixtures.sh | Populate tests/fixtures/scale-N-{trino,c}/ with reference data. --compat trino (default) runs the Java impl; --compat c downloads pre-generated C dsdgen data from alamb/tpcds-data. |
| compare-table.sh | Compare one table's Rust output against the selected reference. Default: MD5-only against MD5SUMS. --full: byte-for-byte against the .dat fixture (MD5 + diff). |
| test-all-tables.sh | Run the full conformance suite for one compat mode (the main CI entry point). Default: MD5-only. --full: byte-for-byte. Honors per-mode skip lists at the top of the script. |
| clean-fixtures.sh | Remove all generated fixtures under tests/fixtures/. |

Run any script with --help to print its usage block.


Typical Workflow

Default (MD5-only)

./scripts/compare-table.sh <table>                # one table, vs. Trino
./scripts/test-all-tables.sh                      # all tables, vs. Trino
./scripts/test-all-tables.sh --compat c           # all tables, vs. C dsdgen

No reference data download needed — the comparison reads MD5SUMS straight from the repo.

Byte-for-byte (--full)

Use when an MD5 mismatch needs a row-level diff.

# Java reference: generate fixtures, then compare.
./scripts/generate-fixtures.sh                    # one-time
./scripts/test-all-tables.sh --full

# C dsdgen reference: download fixtures, then compare.
./scripts/generate-fixtures.sh --compat c         # one-time
./scripts/test-all-tables.sh --compat c --full

Cleanup

./scripts/clean-fixtures.sh --yes      # remove all fixtures

Requirements

  • MD5-only (default): just a Cargo-built tpcdsgen binary at target/debug/tpcdsgen or target/release/tpcdsgen. No Java, no C reference data, no fixture download.
  • --full, Java: Maven-built TPC-DS JAR at ../tpcds/target/tpcds-*-jar-with-dependencies.jar (bootstrap-trino.sh handles this).
  • --full, C dsdgen reference: git, tar, bzip2 for generate-fixtures.sh --compat c. No C compiler required — data is pre-generated.
  • Disk space (--full): ~1 GB for SF1 Java fixtures; ~2.4 GB for SF1 C fixtures.
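
The requirements above lend themselves to a quick preflight check. The sketch below uses hypothetical helper names; checking target/release before target/debug is an arbitrary choice made for the example, not something the scripts are known to do.

```shell
# Sketch only: print the path of the first tpcdsgen binary found under
# <repo-root>/target. Release is checked before debug (arbitrary order).
find_tpcdsgen() {
    local root="${1:-.}" profile
    for profile in release debug; do
        if [ -x "$root/target/$profile/tpcdsgen" ]; then
            echo "$root/target/$profile/tpcdsgen"
            return 0
        fi
    done
    return 1
}

# Print each tool from the argument list that is not on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

For the C --full path, `missing_tools git tar bzip2` prints nothing when everything needed by generate-fixtures.sh --compat c is installed.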

Troubleshooting

Problem: Java JAR not found

cd ../tpcds
mvn clean package

Problem: Rust binary not found

cargo build --release

Problem: Fixture not found (Java path)

./scripts/generate-fixtures.sh X

Problem: Fixture not found (C path)

./scripts/generate-fixtures.sh --compat c --scale N

Problem: Tables don't match

  1. Check that the right compat mode is selected (--compat trino vs --compat c).
  2. Re-run with --full to get a row-level diff. This needs the reference fixtures, so run generate-fixtures.sh first if they are missing.
  3. Verify both sides use the same seed (the Rust generator is deterministic).
  4. Use the diff output to find the first difference and debug the specific row/column.
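
For the last step, a small helper can locate the first differing row without reading the whole diff. This is a minimal sketch with a hypothetical function name. It relies on the normal output format of `diff`, where the first hunk header (for example `2c2` or `5a6`) starts with a line number in the first file.

```shell
# Sketch only: print the line number (in the expected file) where the
# first difference starts. Returns 0 if the files are identical, 1 otherwise.
first_diff_line() {
    local expected="$1" actual="$2" header
    # cmp -s is a cheap byte-for-byte equality check with no output.
    cmp -s "$expected" "$actual" && return 0
    # First line of diff's normal output is the first hunk header.
    header="$(diff "$expected" "$actual" | sed -n '1p')" || true
    # Keep only the leading digits: the line number in the expected file.
    echo "${header%%[!0-9]*}"
    return 1
}
```

Given the line number `n`, something like `sed -n "${n}p" expected.dat | tr '|' '\n'` prints that row one column per line, assuming the .dat rows are pipe-delimited.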

Integration with CI/CD

These scripts are designed to be CI-friendly. The default (MD5-only) path skips the slow reference-data step entirely:

# Java conformance (MD5-only)
- run: ./scripts/test-all-tables.sh --quiet

# C dsdgen conformance (MD5-only)
- run: ./scripts/test-all-tables.sh --compat c --quiet

If a job needs a row-level diff on failure, add --full (and the matching fixture step):

# Java conformance (--full)
- run: ./scripts/bootstrap-trino.sh
- run: ./scripts/generate-fixtures.sh --quiet
- run: ./scripts/test-all-tables.sh --full --quiet

# C dsdgen conformance (--full)
- run: ./scripts/generate-fixtures.sh --compat c
- run: ./scripts/test-all-tables.sh --compat c --full --quiet

The scripts report mismatches through their exit codes, so a failed comparison fails the CI step without any output parsing.
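
One way to combine the two modes in a single job is to run the fast MD5-only pass first and fall back to --full only on failure. The wrapper below is a sketch: `run_conformance` and `SCRIPTS_DIR` are hypothetical names, and it assumes the scripts exit non-zero on a mismatch.

```shell
# Sketch only. SCRIPTS_DIR is a hypothetical override for the scripts location.
SCRIPTS_DIR="${SCRIPTS_DIR:-./scripts}"

# Run the MD5-only suite; on failure, prepare fixtures and re-run with --full.
# Assumption: a non-zero exit status from test-all-tables.sh means a mismatch.
run_conformance() {
    local compat="${1:-trino}"
    if "$SCRIPTS_DIR/test-all-tables.sh" --compat "$compat" --quiet; then
        return 0
    fi
    echo "MD5 mismatch for --compat $compat; re-running with --full" >&2
    if [ "$compat" = "trino" ]; then
        # The Java reference has to be built before fixtures can be generated.
        "$SCRIPTS_DIR/bootstrap-trino.sh" || return 1
    fi
    "$SCRIPTS_DIR/generate-fixtures.sh" --compat "$compat" || return 1
    "$SCRIPTS_DIR/test-all-tables.sh" --compat "$compat" --full --quiet
}
```

The passing case stays fast, and the slow fixture step runs only when a row-level diff is actually needed.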

Notes

  • Fixtures are gitignored: the .dat files are generated artifacts, not source code; only MD5SUMS is checked in.
  • Deterministic output: the same seed always produces the same data.
  • Byte-for-byte equality: the comparison requires a complete binary match, not just matching row counts.
  • Bug compatibility: the Rust port reproduces the quirks of the selected Java or C reference (e.g., the leap year bug).