chore: Use pre-generated C data files, unify comparison scripts#257
Merged
alamb merged 4 commits intoMay 12, 2026
Merged
Conversation
Move per-script docs (usage, flags, env vars, output, exit codes) into the top-of-file header comment of each script. The README's `## Scripts` section becomes a one-line-per-script roadmap pointing readers at the script files for details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e in print_usage()
- Add `--compat trino|c` to generate-fixtures.sh:
- `--compat trino` (default): existing Java generation behavior.
- `--compat c`: download alamb/tpcds-data sfN with `git clone --depth 1`
and extract into tests/fixtures/scale-N-c/. Supports `--rebuild` and
`--verify`. Replaces bootstrap-c.sh, which is removed.
- Per-script header is now a one-liner + "see print_usage() below for
details"; print_usage() sits immediately after the header with the
full usage block (flags, env vars, examples, exit codes). Renamed
`usage` -> `print_usage` everywhere.
- Update tpcdsgen/README.md, scripts/README.md, and the CI workflow
(`tpcdsgen-conformance.yml`) to call generate-fixtures.sh --compat c
instead of bootstrap-c.sh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `paths` filter to both the push and pull_request triggers so the suite no longer fires for unrelated changes (e.g. tpchgen-* edits, doc tweaks at the repo root). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
@clflushopt I am taking the liberty of merging this consolidation directly into your branch so we can get it ready to go |
alamb
added a commit
that referenced
this pull request
May 14, 2026
* feat: dsdgen compatibility * chore: Use pre-generated C data files, unify comparison scripts (#257) * chore: Use pre-generated C data files, unify comparison scripts * chore: make scripts self-documenting; collapse scripts/README Move per-script docs (usage, flags, env vars, output, exit codes) into the top-of-file header comment of each script. The README's `## Scripts` section becomes a one-line-per-script roadmap pointing readers at the script files for details. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: fold bootstrap-c.sh into generate-fixtures.sh; centralize usage in print_usage() - Add `--compat trino|c` to generate-fixtures.sh: - `--compat trino` (default): existing Java generation behavior. - `--compat c`: download alamb/tpcds-data sfN with `git clone --depth 1` and extract into tests/fixtures/scale-N-c/. Supports `--rebuild` and `--verify`. Replaces bootstrap-c.sh, which is removed. - Per-script header is now a one-liner + "see print_usage() below for details"; print_usage() sits immediately after the header with the full usage block (flags, env vars, examples, exit codes). Renamed `usage` -> `print_usage` everywhere. - Update tpcdsgen/README.md, scripts/README.md, and the CI workflow (`tpcdsgen-conformance.yml`) to call generate-fixtures.sh --compat c instead of bootstrap-c.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci: only run tpcdsgen conformance on tpcdsgen/ or .github/ changes Adds a `paths` filter to both the push and pull_request triggers so the suite no longer fires for unrelated changes (e.g. tpchgen-* edits, doc tweaks at the repo root). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: rename return_reasons.dst to return_reasons_trino.dst Pair the inherited Java/Trino distribution with the corrected C variant so both filenames advertise their compat mode at a glance: return_reasons_trino.dst <-- old return_reasons.dst (carries the "reason 30 missing, reason 31 duplicated" bug, kept for byte-for-byte Trino fixture stability) return_reasons_c.dst <-- corrected, used by --compat c Updates the embedded_data.rs lookup table, the distribution loader, and the doc-comment references in scaling.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: consolidate MD5SUMS into scale-N-{java,c}/ fixture dirs Move the canonical Java reference hashes from tests/fixtures/java/scale-{1,10}/MD5SUMS to tests/fixtures/scale-{1,10}-java/MD5SUMS and generate a fresh tests/fixtures/scale-1-c/MD5SUMS from the current alamb/tpcds-data sf1 download (post-regeneration). The old tests/fixtures/rust/scale-{1,10}/MD5SUMS files are removed: they were byte-identical to the Java set apart from dbgen_version, which contains a generation timestamp and is always excluded from comparison. The empty tests/fixtures/java/ parent directory is gone too. README references already use the new scale-N-java/ paths (from the earlier rename); no further doc updates were needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * refactor: use 'trino' (not 'java') consistently for the Trino TPC-DS port 'java' is ambiguous — there may be multiple Java TPC-DS implementations. The reference we target is specifically the Trino library, so name everything after it for clarity: - tests/fixtures/scale-N-java/ -> tests/fixtures/scale-N-trino/ - scripts/bootstrap-java.sh -> scripts/bootstrap-trino.sh - TPCDS_JAVA_REPO env var -> TPCDS_TRINO_REPO - JAVA_DIR / JAVA_REPO_URL vars -> TRINO_DIR / TRINO_REPO_URL - find_java_jar / clone_java_repo / build_java / test_java -> find_trino_jar / clone_trino_repo / build_trino / test_trino - CI artifact `test-fixtures-java` -> `test-fixtures-trino` - "Java fixture" log labels -> "Trino fixture" - Doc references throughout READMEs and script headers updated. Kept as-is: `actions/setup-java@v5`, `Java 11+` requirement, `java -jar` / `java -version` invocations, and `mvn`/`openjdk` references — those refer to the Java language/runtime, not the Trino implementation. The CLI flag and Rust `CompatMode::Trino` were already named `trino`; this commit aligns the rest of the codebase. Verified: `./scripts/test-all-tables.sh` passes 24/24 vs Trino, and `./scripts/test-all-tables.sh --compat c` passes 23/23 vs C dsdgen (customer.dat still skipped pending alamb/tpcds-data regeneration). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: add references to documented bugs --------- Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note: This is still very much WIP, not ready for review
This targets
C) #253I totally got nerd-sniped today trying to verify the C compat mode
Instead of downloading (slightly suspicious) parquet files, I think it would be better to use the raw, unencumbered output, from
dsgen(the original C program)