
Commit 2cef606

schenksj and claude committed
ci/docs: Delta CI workflow and documentation
- delta_spark_test.yml: CI workflow for Spark 3.4/3.5/4.0 matrix
- delta.md: user guide (features, config, limitations, tuning)
- delta-spark-tests.md: contributor guide for running Delta tests
- datasources.md: add COMET_DELTA_NATIVE_ENABLED config reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0a21380 commit 2cef606

4 files changed

Lines changed: 396 additions & 0 deletions


Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Delta Lake Native Scan Tests
+
+concurrency:
+  group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }}
+  cancel-in-progress: true
+
+on:
+  push:
+    branches:
+      - main
+    paths-ignore:
+      - "benchmarks/**"
+      - "doc/**"
+      - "docs/**"
+      - "**.md"
+      - "native/core/benches/**"
+      - "native/spark-expr/benches/**"
+      - "spark/src/test/scala/org/apache/spark/sql/benchmark/**"
+  pull_request:
+    paths-ignore:
+      - "benchmarks/**"
+      - "doc/**"
+      - "docs/**"
+      - "**.md"
+      - "native/core/benches/**"
+      - "native/spark-expr/benches/**"
+      - "spark/src/test/scala/org/apache/spark/sql/benchmark/**"
+  workflow_dispatch:
+
+env:
+  RUST_VERSION: stable
+  RUST_BACKTRACE: 1
+
+jobs:
+  build-native:
+    name: Build Native Library
+    runs-on: ubuntu-24.04
+    container:
+      image: amd64/rust
+    steps:
+      - uses: actions/checkout@v6
+
+      - name: Setup Rust & Java toolchain
+        uses: ./.github/actions/setup-builder
+        with:
+          rust-version: ${{ env.RUST_VERSION }}
+          jdk-version: 17
+
+      - name: Restore Cargo cache
+        uses: actions/cache/restore@v5
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            native/target
+          key: ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-${{ hashFiles('native/**/*.rs') }}
+          restore-keys: |
+            ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-
+
+      - name: Build native library
+        run: |
+          cd native && cargo build --profile ci
+        env:
+          RUSTFLAGS: "-Ctarget-cpu=x86-64-v3"
+
+      - name: Save Cargo cache
+        uses: actions/cache/save@v5
+        if: github.ref == 'refs/heads/main'
+        with:
+          path: |
+            ~/.cargo/registry
+            ~/.cargo/git
+            native/target
+          key: ${{ runner.os }}-cargo-ci-${{ hashFiles('native/**/Cargo.lock', 'native/**/Cargo.toml') }}-${{ hashFiles('native/**/*.rs') }}
+
+      - name: Upload native library
+        uses: actions/upload-artifact@v7
+        with:
+          name: native-lib-delta
+          path: native/target/ci/libcomet.so
+          retention-days: 1
+
+  delta-native-suite:
+    needs: build-native
+    strategy:
+      matrix:
+        os: [ubuntu-24.04]
+        java-version: [17]
+        spark-version:
+          - {short: '3.4', full: '3.4.3'}
+          - {short: '3.5', full: '3.5.8'}
+          - {short: '4.0', full: '4.0.1'}
+      fail-fast: false
+    name: delta-native/${{ matrix.os }}/spark-${{ matrix.spark-version.full }}/java-${{ matrix.java-version }}
+    runs-on: ${{ matrix.os }}
+    container:
+      image: amd64/rust
+    env:
+      SPARK_LOCAL_IP: localhost
+    steps:
+      - uses: actions/checkout@v6
+      - name: Setup Rust & Java toolchain
+        uses: ./.github/actions/setup-builder
+        with:
+          rust-version: ${{ env.RUST_VERSION }}
+          jdk-version: ${{ matrix.java-version }}
+      - name: Download native library
+        uses: actions/download-artifact@v8
+        with:
+          name: native-lib-delta
+          path: native/target/debug/
+      - name: Run CometDeltaNativeSuite
+        run: |
+          ./mvnw -Pspark-${{ matrix.spark-version.short }} -pl spark -am test \
+            -Dsuites=org.apache.comet.CometDeltaNativeSuite \
+            -Dmaven.gitcommitid.skip
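
Reviewer sketch (not part of the commit): GitHub Actions expands a `strategy.matrix` as the Cartesian product of its keys, so the `delta-native-suite` job above should fan out into three jobs, one per Spark version. The Python below mimics that expansion and prints each job name with the Maven command it would run. The helper names `expand_matrix`, `job_name`, and `maven_command` are hypothetical; the matrix values, job-name pattern, and Maven flags are copied from the workflow.

```python
from itertools import product

# Values copied from the workflow's strategy.matrix block.
MATRIX = {
    "os": ["ubuntu-24.04"],
    "java-version": [17],
    "spark-version": [
        {"short": "3.4", "full": "3.4.3"},
        {"short": "3.5", "full": "3.5.8"},
        {"short": "4.0", "full": "4.0.1"},
    ],
}


def expand_matrix(matrix):
    """Return one dict per matrix combination (Cartesian product of the keys)."""
    keys = list(matrix)
    return [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]


def job_name(entry):
    """Mirror the workflow's `name:` template for a single matrix entry."""
    spark = entry["spark-version"]
    return f"delta-native/{entry['os']}/spark-{spark['full']}/java-{entry['java-version']}"


def maven_command(entry):
    """Mirror the `Run CometDeltaNativeSuite` step as an argument list."""
    return [
        "./mvnw",
        f"-Pspark-{entry['spark-version']['short']}",
        "-pl", "spark", "-am", "test",
        "-Dsuites=org.apache.comet.CometDeltaNativeSuite",
        "-Dmaven.gitcommitid.skip",
    ]


if __name__ == "__main__":
    for entry in expand_matrix(MATRIX):
        print(job_name(entry))
        print("  " + " ".join(maven_command(entry)))
```

Because `fail-fast` is `false`, a failure in one of these jobs should not cancel the other two, which makes version-specific regressions easier to isolate.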
Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Running Delta Lake Tests with Comet
+
+## Comet's own Delta test suite
+
+The primary test suite is `CometDeltaNativeSuite` with 34 test cases covering:
+
+- Basic reads (unpartitioned, multi-file, partitioned, multi-column partition)
+- Projection pushdown, filter pushdown, predicate variety
+- Complex types (arrays, maps, structs, deeply nested)
+- Schema evolution, time travel (version + timestamp)
+- Deletion vectors (pre-DELETE acceleration + post-DELETE native DV filter)
+- Column mapping (id mode, name mode, rename)
+- Aggregation, joins, window functions, UNION, DISTINCT
+- NULL handling, case insensitivity, ORDER BY + LIMIT
+
+### Running the suite
+
+```bash
+# Build native library first
+cd native && cargo build && cd ..
+
+# Run the Delta suite
+./mvnw -Pspark-3.5 -pl spark -am test \
+  -Dsuites=org.apache.comet.CometDeltaNativeSuite \
+  -Dmaven.gitcommitid.skip
+```
+
+### Test dependencies
+
+Delta tests require `io.delta:delta-spark_2.12:3.3.2` which is added as a
+test-scope dependency in `spark/pom.xml` under the `spark-3.5` profile. The
+dependency excludes Spark and Hadoop transitives so Comet's pinned versions
+stay authoritative.
+
+### Test harness notes
+
+The Delta test suite disables Spark's `DebugFilesystem` and Delta's test-only
+filename prefixes (`test%file%prefix-`, `test%dv%prefix-`) because
+delta-kernel-rs reads files by the names recorded in the transaction log, which
+don't include Spark/Delta's test prefixes. Production users are unaffected.
+
+The suite also sets `spark.databricks.delta.deletionVectors.useMetadataRowIndex=false`
+to use Delta's older DV read strategy that inserts a `Project -> Filter` subtree
+(which Comet's plan rewrite can detect and strip), rather than the default
+metadata-row-index strategy that's opaque to plan-level rewriting.
+
+## Benchmarks
+
+### Micro-benchmark
+
+```bash
+SPARK_GENERATE_BENCHMARK_FILES=1 make \
+  benchmark-org.apache.spark.sql.benchmark.CometDeltaReadBenchmark
+```
+
+Results are written to `spark/benchmarks/CometDeltaReadBenchmark-results.txt`.
+
+### TPC-DS / TPC-H
+
+1. Convert Parquet data to Delta:
+
+   ```bash
+   spark-submit \
+     --packages io.delta:delta-spark_2.12:3.3.2 \
+     --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
+     --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
+     benchmarks/tpc/create-delta-tables.py \
+     --benchmark tpcds \
+     --parquet-path /data/tpcds \
+     --warehouse /data/delta-tpcds
+   ```
+
+2. Run the benchmark:
+
+   ```bash
+   DELTA_JAR=/path/to/delta-spark.jar \
+   COMET_JAR=/path/to/comet-spark.jar \
+   DELTA_WAREHOUSE=/data/delta-tpcds \
+   python benchmarks/tpc/tpcbench.py \
+     --engine comet-delta \
+     --benchmark tpcds
+   ```
+
+Engine configs are in `benchmarks/tpc/engines/comet-delta.toml` and
+`benchmarks/tpc/engines/comet-delta-hashjoin.toml`.
+
+## TPC-DS Plan Stability Fixtures
+
+To generate the `q*.native_delta_compat/` plan stability golden files:
+
+1. Add `CometConf.SCAN_NATIVE_DELTA_COMPAT` to the `scanImpls` list in
+   `CometPlanStabilitySuite.scala` (line 66).
+
+2. Ensure TPC-DS data exists as Delta tables (see `create-delta-tables.py`
+   in the Benchmarks section above).
+
+3. Generate golden files:
+
+   ```bash
+   SPARK_GENERATE_GOLDEN_FILES=1 ./mvnw -pl spark \
+     -Dsuites=org.apache.spark.sql.comet.CometTPCDSV1_4_PlanStabilitySuite \
+     test -Pspark-3.5 -Dmaven.gitcommitid.skip
+   ```
+
+4. Commit the generated `q*.native_delta_compat/extended.txt` files under
+   `spark/src/test/resources/tpcds-plan-stability/approved-plans-*`.

docs/source/user-guide/latest/datasources.md

Lines changed: 8 additions & 0 deletions
@@ -34,6 +34,14 @@ Comet accelerates Iceberg scans of Parquet files. See the [Iceberg Guide] for mo
 
 [Iceberg Guide]: iceberg.md
 
+### Delta Lake
+
+Comet accelerates Delta Lake scans of Parquet files using
+[delta-kernel-rs](https://github.com/delta-io/delta-kernel-rs) for transaction log replay.
+See the [Delta Guide] for more information.
+
+[Delta Guide]: delta.md
+
 ### CSV
 
 Comet provides experimental native CSV scan support. When `spark.comet.scan.csv.v2.enabled` is enabled, CSV files
