Commit 5cde819

clflushopt, alamb, and claude authored
feat: dsdgen compatibility mode (with C) (#253)
* feat: dsdgen compatibility

* chore: Use pre-generated C data files, unify comparison scripts (#257)

* chore: Use pre-generated C data files, unify comparison scripts

* chore: make scripts self-documenting; collapse scripts/README

Move per-script docs (usage, flags, env vars, output, exit codes) into the top-of-file header comment of each script. The README's `## Scripts` section becomes a one-line-per-script roadmap pointing readers at the script files for details.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: fold bootstrap-c.sh into generate-fixtures.sh; centralize usage in print_usage()

- Add `--compat trino|c` to generate-fixtures.sh:
  - `--compat trino` (default): existing Java generation behavior.
  - `--compat c`: download alamb/tpcds-data sfN with `git clone --depth 1` and extract into tests/fixtures/scale-N-c/. Supports `--rebuild` and `--verify`. Replaces bootstrap-c.sh, which is removed.
- Per-script header is now a one-liner + "see print_usage() below for details"; print_usage() sits immediately after the header with the full usage block (flags, env vars, examples, exit codes). Renamed `usage` -> `print_usage` everywhere.
- Update tpcdsgen/README.md, scripts/README.md, and the CI workflow (`tpcdsgen-conformance.yml`) to call generate-fixtures.sh --compat c instead of bootstrap-c.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: only run tpcdsgen conformance on tpcdsgen/ or .github/ changes

Adds a `paths` filter to both the push and pull_request triggers so the suite no longer fires for unrelated changes (e.g. tpchgen-* edits, doc tweaks at the repo root).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: rename return_reasons.dst to return_reasons_trino.dst

Pair the inherited Java/Trino distribution with the corrected C variant so both filenames advertise their compat mode at a glance:

return_reasons_trino.dst <-- old return_reasons.dst (carries the "reason 30 missing, reason 31 duplicated" bug, kept for byte-for-byte Trino fixture stability)
return_reasons_c.dst <-- corrected, used by --compat c

Updates the embedded_data.rs lookup table, the distribution loader, and the doc-comment references in scaling.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: consolidate MD5SUMS into scale-N-{java,c}/ fixture dirs

Move the canonical Java reference hashes from tests/fixtures/java/scale-{1,10}/MD5SUMS to tests/fixtures/scale-{1,10}-java/MD5SUMS and generate a fresh tests/fixtures/scale-1-c/MD5SUMS from the current alamb/tpcds-data sf1 download (post-regeneration).

The old tests/fixtures/rust/scale-{1,10}/MD5SUMS files are removed: they were byte-identical to the Java set apart from dbgen_version, which contains a generation timestamp and is always excluded from comparison. The empty tests/fixtures/java/ parent directory is gone too.

README references already use the new scale-N-java/ paths (from the earlier rename); no further doc updates were needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: use 'trino' (not 'java') consistently for the Trino TPC-DS port

'java' is ambiguous: there may be multiple Java TPC-DS implementations. The reference we target is specifically the Trino library, so name everything after it for clarity:

- tests/fixtures/scale-N-java/ -> tests/fixtures/scale-N-trino/
- scripts/bootstrap-java.sh -> scripts/bootstrap-trino.sh
- TPCDS_JAVA_REPO env var -> TPCDS_TRINO_REPO
- JAVA_DIR / JAVA_REPO_URL vars -> TRINO_DIR / TRINO_REPO_URL
- find_java_jar / clone_java_repo / build_java / test_java -> find_trino_jar / clone_trino_repo / build_trino / test_trino
- CI artifact `test-fixtures-java` -> `test-fixtures-trino`
- "Java fixture" log labels -> "Trino fixture"
- Doc references throughout READMEs and script headers updated.

Kept as-is: `actions/setup-java@v5`, `Java 11+` requirement, `java -jar` / `java -version` invocations, and `mvn`/`openjdk` references. Those refer to the Java language/runtime, not the Trino implementation.

The CLI flag and Rust `CompatMode::Trino` were already named `trino`; this commit aligns the rest of the codebase.

Verified: `./scripts/test-all-tables.sh` passes 24/24 vs Trino, and `./scripts/test-all-tables.sh --compat c` passes 23/23 vs C dsdgen (customer.dat still skipped pending alamb/tpcds-data regeneration).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: add references to documented bugs

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 81b3c9b commit 5cde819

26 files changed

Lines changed: 1077 additions & 688 deletions

.github/workflows/tpcdsgen-conformance.yml

Lines changed: 63 additions & 4 deletions
@@ -3,17 +3,23 @@ name: TPC-DS Conformance
 on:
   push:
     branches: [ main, master ]
+    paths:
+      - 'tpcdsgen/**'
+      - '.github/**'
   pull_request:
     branches: [ main, master ]
+    paths:
+      - 'tpcdsgen/**'
+      - '.github/**'

 env:
   CARGO_TERM_COLOR: always
   RUST_BACKTRACE: 1

 jobs:
-  # Conformance testing against Java implementation
+  # Conformance testing against the Java / Trino reference implementation.
   conformance-tests:
-    name: Conformance Tests
+    name: Conformance Tests (Java)
     runs-on: ubuntu-latest

     steps:
@@ -45,7 +51,7 @@ jobs:
       - name: Bootstrap Java TPC-DS implementation
         run: |
           cd tpcdsgen
-          ./scripts/bootstrap-java.sh
+          ./scripts/bootstrap-trino.sh

       - name: Build Rust table generators
         run: |
@@ -65,7 +71,60 @@ jobs:
         if: failure() # Upload fixtures if tests fail for debugging
         uses: actions/upload-artifact@v7
         with:
-          name: test-fixtures
+          name: test-fixtures-trino
+          path: tpcdsgen/tests/fixtures/
+          retention-days: 7
+
+  # Conformance testing against the C dsdgen reference implementation.
+  #
+  # Reference data is pre-generated and lives in
+  # https://github.com/alamb/tpcds-data (branch sf1).
+  # `generate-fixtures.sh --compat c` clones it with --depth 1 and extracts
+  # into tpcdsgen/tests/fixtures/scale-1-c/. Rust is then run in
+  # --compat c mode and the .dat output is compared byte-for-byte (MD5/diff).
+  conformance-tests-c:
+    name: Conformance Tests (C dsdgen)
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v6
+
+      - name: Install Rust toolchain
+        uses: dtolnay/rust-toolchain@stable
+
+      - name: Cache Rust dependencies
+        uses: actions/cache@v5
+        with:
+          path: |
+            ~/.cargo/bin/
+            ~/.cargo/registry/index/
+            ~/.cargo/registry/cache/
+            ~/.cargo/git/db/
+            target/
+          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
+          restore-keys: |
+            ${{ runner.os }}-cargo-
+
+      - name: Download C dsdgen reference data
+        run: |
+          cd tpcdsgen
+          ./scripts/generate-fixtures.sh --compat c --scale 1
+
+      - name: Build Rust table generators
+        run: |
+          cargo build --release -p tpcdsgen
+
+      - name: Run conformance tests (Rust --compat c vs C dsdgen)
+        run: |
+          cd tpcdsgen
+          ./scripts/test-all-tables.sh --compat c
+
+      - name: Upload test fixtures as artifacts
+        if: failure()
+        uses: actions/upload-artifact@v7
+        with:
+          name: test-fixtures-c
           path: tpcdsgen/tests/fixtures/
           retention-days: 7
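
The comment on the new C job describes a byte-for-byte comparison of `.dat` output against the downloaded reference data. A minimal sketch of that idea follows. `compare_dat_dirs` is a hypothetical helper written for illustration, not the project's `test-all-tables.sh`, and the file name `dbgen_version.dat` is an assumption based on the commit message's note that `dbgen_version` embeds a generation timestamp and is excluded from comparison.

```bash
# Hypothetical helper (not part of this commit): compare each reference
# .dat file against the file of the same name in a generated directory.
compare_dat_dirs() {
  local ref_dir="$1" out_dir="$2"
  local failed=0 ref name
  for ref in "$ref_dir"/*.dat; do
    [ -e "$ref" ] || continue            # reference dir holds no .dat files
    name=$(basename "$ref")
    # Assumed file name: dbgen_version carries a timestamp, so it is skipped.
    [ "$name" = "dbgen_version.dat" ] && continue
    if cmp -s "$ref" "$out_dir/$name"; then
      echo "PASS $name"
    else
      echo "FAIL $name"                  # differing or missing output file
      failed=1
    fi
  done
  return "$failed"
}
```

Called as `compare_dat_dirs tests/fixtures/scale-1-c /path/to/rust-output`, it returns non-zero if any table differs, which is the kind of exit status a CI step can act on.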

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -3,4 +3,5 @@ target/
 __old/
 Cargo.lock
 .idea
-.venv/
+.venv/
+tpcds/

.gitmodules

Whitespace-only changes.

tpcdsgen/README.md

Lines changed: 28 additions & 42 deletions
@@ -29,75 +29,61 @@ Fixtures are pre-generated TPC-DS data files used for conformance testing.

 ```
 tests/fixtures/
-├── java/ # Java reference implementation output
-│   ├── scale-1/ # 25 tables, ~1.2GB
-│   └── scale-10/ # 25 tables, ~11GB
-└── rust/ # Rust implementation output
-    ├── scale-1/ # 25 tables, ~1.2GB
-    └── scale-10/ # 25 tables, ~11GB
+├── scale-1-trino/ # Java reference fixtures (`--compat trino`)
+├── scale-1-c/ # C dsdgen reference fixtures (`--compat c`)
+└── scale-10-trino/ # higher scale factors as needed
 ```

-### Generating Java Fixtures
-
-Requires the Java TPC-DS implementation to be built:
+### Conformance Testing

-```bash
-# Build Java implementation (if not already built)
-cd ../tpcds && mvn clean package -DskipTests && cd -
-
-# Generate Java fixtures for scale 1
-java -jar ../tpcds/target/tpcds-1.5-SNAPSHOT-jar-with-dependencies.jar \
-  --scale 1 \
-  --directory tests/fixtures/java/scale-1 \
-  --overwrite
-
-# Generate Java fixtures for scale 10
-java -jar ../tpcds/target/tpcds-1.5-SNAPSHOT-jar-with-dependencies.jar \
-  --scale 10 \
-  --directory tests/fixtures/java/scale-10 \
-  --overwrite
-```
+`tpcdsgen` ships with two conformance suites, both implemented as shell
+scripts that do byte-for-byte (MD5) comparison of `.dat` output. See
+[scripts/README.md](scripts/README.md) for full details.

-### Generating Rust Fixtures
+**vs. Java / Trino reference (default, `--compat trino`):**

 ```bash
-# Build Rust implementation
-cargo build --release
+# One-time: clone & build the Java TPC-DS implementation.
+./scripts/bootstrap-trino.sh

-# Generate Rust fixtures for scale 1
-./target/release/tpcdsgen --scale 1 --directory tests/fixtures/rust/scale-1
+# Generate Java reference fixtures into tests/fixtures/scale-N-trino/.
+./scripts/generate-fixtures.sh

-# Generate Rust fixtures for scale 10
-./target/release/tpcdsgen --scale 10 --directory tests/fixtures/rust/scale-10
+# Compare Rust output byte-for-byte against the Java fixtures.
+./scripts/test-all-tables.sh --scale 1
 ```

-### Conformance Testing
-
-To verify Rust output matches Java byte-for-byte:
+**vs. C dsdgen reference (`--compat c`):**

 ```bash
-# Run conformance tests at scale 1
-./scripts/test-all-tables.sh --scale 1
+# One-time: download pre-generated C dsdgen data from
+# https://github.com/alamb/tpcds-data into tests/fixtures/scale-N-c/.
+./scripts/generate-fixtures.sh --compat c --scale 1

-# Run conformance tests at scale 10
-./scripts/test-all-tables.sh --scale 10
+# Compare Rust --compat c output byte-for-byte against the C fixtures.
+./scripts/test-all-tables.sh --compat c --scale 1
 ```

-See [HASHES.md](HASHES.md) for the canonical MD5 hashes.
+Both suites also support comparing a single table:
+
+```bash
+./scripts/compare-table.sh reason # vs. Java
+./scripts/compare-table.sh reason --compat c # vs. C dsdgen
+```

 ### Verifying Fixtures with MD5SUMS

 Each fixture directory contains an `MD5SUMS` file for verification.

 **On Linux:**
 ```bash
-cd tests/fixtures/java/scale-1
+cd tests/fixtures/scale-1-trino
 md5sum -c MD5SUMS
 ```

 **On macOS:**
 ```bash
-cd tests/fixtures/java/scale-1
+cd tests/fixtures/scale-1-trino
 while read hash file; do
   [[ $(md5 -q "$file") == "$hash" ]] && echo "$file: OK" || echo "$file: FAILED"
 done < MD5SUMS
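
The separate Linux and macOS snippets in the README could be folded into one function that uses whichever hashing tool is installed. The sketch below is illustrative only: `verify_md5sums` is a hypothetical name, and it assumes `MD5SUMS` uses the usual `md5sum` output format (hash, whitespace, file name), which the README implies but the diff does not show.

```bash
# Hypothetical helper (not part of this commit): verify every entry of
# DIR/MD5SUMS using md5sum where available, falling back to macOS `md5 -q`.
verify_md5sums() {
  local dir="$1" failed=0 hash file actual
  while read -r hash file; do
    [ -n "$hash" ] || continue          # skip blank lines
    file=${file#\*}                     # md5sum marks binary-mode names with '*'
    actual=""
    if [ -f "$dir/$file" ]; then
      if command -v md5sum >/dev/null 2>&1; then
        actual=$(md5sum < "$dir/$file" | cut -d' ' -f1)
      else
        actual=$(md5 -q "$dir/$file")
      fi
    fi
    if [ "$actual" = "$hash" ]; then
      echo "$file: OK"
    else
      echo "$file: FAILED"              # hash mismatch or missing file
      failed=1
    fi
  done < "$dir/MD5SUMS"
  return "$failed"
}
```

For example, `verify_md5sums tests/fixtures/scale-1-trino` prints one line per file and returns non-zero if any entry fails, so it can gate a later step.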

tpcdsgen/data/return_reasons_c.dst

Lines changed: 82 additions & 0 deletions
@@ -0,0 +1,82 @@
+------
+-- return_reasons
+------
+-- values weights
+-- -----------------------
+-- 1. reason 1-6. not sure... none are ever used
+------
+Package was damaged: 1, 0, 0, 0, 0, 0
+Stopped working: 1, 0, 0, 0, 0, 0
+Did not get it on time: 1, 0, 0, 0, 0, 0
+Not the product that was ordred: 1, 0, 0, 0, 0, 0
+Parts missing: 1, 0, 0, 0, 0, 0
+Does not work with a product that I have: 1, 0, 0, 0, 0, 0
+Gift exchange: 1, 0, 0, 0, 0, 0
+Did not like the color: 1, 0, 0, 0, 0, 0
+Did not like the model: 1, 0, 0, 0, 0, 0
+Did not like the make: 1, 0, 0, 0, 0, 0
+Did not like the warranty: 1, 0, 0, 0, 0, 0
+No service location in my area: 1, 0, 0, 0, 0, 0
+Found a better price in a store: 1, 0, 0, 0, 0, 0
+Found a better extended warranty in a store: 1, 0, 0, 0, 0, 0
+Not working any more: 1, 0, 0, 0, 0, 0
+Did not fit: 1, 0, 0, 0, 0, 0
+Wrong size: 1, 0, 0, 0, 0, 0
+Lost my job: 1, 0, 0, 0, 0, 0
+unauthoized purchase: 1, 0, 0, 0, 0, 0
+duplicate purchase: 1, 0, 0, 0, 0, 0
+its is a boy: 1, 0, 0, 0, 0, 0
+it is a girl: 1, 0, 0, 0, 0, 0
+reason 23: 1, 0, 0, 0, 0, 0
+reason 24: 1, 0, 0, 0, 0, 0
+reason 25: 1, 0, 0, 0, 0, 0
+reason 26: 1, 0, 0, 0, 0, 0
+reason 27: 1, 0, 0, 0, 0, 0
+reason 28: 1, 0, 0, 0, 0, 0
+reason 29: 1, 0, 0, 0, 0, 0
+reason 30: 1, 0, 0, 0, 0, 0
+reason 31: 1, 0, 0, 0, 0, 0
+reason 32: 1, 0, 0, 0, 0, 0
+reason 33: 1, 0, 0, 0, 0, 0
+reason 34: 1, 0, 0, 0, 0, 0
+reason 35: 1, 0, 0, 0, 0, 0
+reason 36: 1, 1, 0, 0, 0, 0
+reason 37: 1, 1, 0, 0, 0, 0
+reason 38: 1, 1, 0, 0, 0, 0
+reason 39: 1, 1, 0, 0, 0, 0
+reason 40: 1, 1, 0, 0, 0, 0
+reason 41: 1, 1, 0, 0, 0, 0
+reason 42: 1, 1, 0, 0, 0, 0
+reason 43: 1, 1, 0, 0, 0, 0
+reason 44: 1, 1, 0, 0, 0, 0
+reason 45: 1, 1, 0, 0, 0, 0
+reason 46: 1, 1, 1, 0, 0, 0
+reason 47: 1, 1, 1, 0, 0, 0
+reason 48: 1, 1, 1, 0, 0, 0
+reason 49: 1, 1, 1, 0, 0, 0
+reason 50: 1, 1, 1, 0, 0, 0
+reason 51: 1, 1, 1, 0, 0, 0
+reason 52: 1, 1, 1, 0, 0, 0
+reason 53: 1, 1, 1, 0, 0, 0
+reason 54: 1, 1, 1, 0, 0, 0
+reason 55: 1, 1, 1, 0, 0, 0
+reason 56: 1, 1, 1, 1, 0, 0
+reason 57: 1, 1, 1, 1, 0, 0
+reason 58: 1, 1, 1, 1, 0, 0
+reason 59: 1, 1, 1, 1, 0, 0
+reason 60: 1, 1, 1, 1, 0, 0
+reason 61: 1, 1, 1, 1, 0, 0
+reason 62: 1, 1, 1, 1, 0, 0
+reason 63: 1, 1, 1, 1, 0, 0
+reason 64: 1, 1, 1, 1, 0, 0
+reason 65: 1, 1, 1, 1, 0, 0
+reason 66: 1, 1, 1, 1, 1, 0
+reason 67: 1, 1, 1, 1, 1, 0
+reason 68: 1, 1, 1, 1, 1, 0
+reason 69: 1, 1, 1, 1, 1, 0
+reason 70: 1, 1, 1, 1, 1, 0
+reason 71: 1, 1, 1, 1, 1, 1
+reason 72: 1, 1, 1, 1, 1, 1
+reason 73: 1, 1, 1, 1, 1, 1
+reason 74: 1, 1, 1, 1, 1, 1
+reason 75: 1, 1, 1, 1, 1, 1
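
The commit message attributes a "reason 30 missing, reason 31 duplicated" defect to the Trino variant of this distribution. A check of that kind can be sketched with `awk`, as below. `check_reason_sequence` is a hypothetical helper, not something this commit adds, and it assumes only the line shapes visible in the file above: `--` comment lines and `reason N: w1, w2, ...` entries. The Trino file itself is not shown in this diff, so the function is not tied to its exact contents.

```bash
# Hypothetical helper (not part of this commit): report gaps and repeats
# among the numbered "reason N" entries of a .dst file.
check_reason_sequence() {
  awk -F: '
    /^--/ { next }                        # comment and separator lines
    /^reason [0-9]+:/ {
      n = substr($1, 8) + 0               # digits after the 7-char "reason "
      if (seen[n]++) { print "duplicate reason " n; bad = 1 }
      if (prev != "" && n > prev + 1) {
        for (m = prev + 1; m < n; m++) { print "missing reason " m; bad = 1 }
      }
      if (prev == "" || n > prev) prev = n
    }
    END { exit bad + 0 }                  # non-zero exit if anything was reported
  ' "$1"
}
```

Running `check_reason_sequence tpcdsgen/data/return_reasons_c.dst` should print nothing for the file added here, since its numbered entries run from 23 to 75 without gaps or repeats.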
