Commit fbc3d2f

docs: remove references to native_datafusion and native_iceberg_compat scans (#4362)
1 parent 7203fb5 commit fbc3d2f

6 files changed

Lines changed: 51 additions & 87 deletions

docs/source/contributor-guide/adding_a_new_spark_version.md

Lines changed: 4 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -138,9 +138,7 @@ own test suites under the new profile.
138138

139139
Promote the new Spark version from the compile-only job to the main test
140140
jobs in `.github/workflows/pr_build_linux.yml` (and `pr_build_macos.yml` if
141-
capacity allows). Use `scan_impl: "auto"` so both `native_datafusion` and
142-
`native_iceberg_compat` get exercised, matching how earlier versions are
143-
configured.
141+
capacity allows). Match how earlier versions are configured.
144142

145143
### Run the Suite Locally First
146144

@@ -256,14 +254,12 @@ new-version bring-up are:
256254
### CI for the Spark SQL Tests
257255

258256
Spark SQL tests do not run from the main PR build workflows. They have
259-
their own dedicated workflow files:
257+
their own dedicated workflow file:
260258

261259
- `.github/workflows/spark_sql_test.yml`
262-
- `.github/workflows/spark_sql_test_native_iceberg_compat.yml`
263260

264-
Add the new version to the matrix in each of these files (`spark-short`,
265-
`spark-full`, `java`, `scan-impl`). Use the closest existing entry as a
266-
template.
261+
Add the new version to the matrix (`spark-short`, `spark-full`, `java`).
262+
Use the closest existing entry as a template.
267263

268264
Before merging, run `make format`, run clippy
269265
(`cd native && cargo clippy --all-targets --workspace -- -D warnings`), and

docs/source/user-guide/latest/compatibility/index.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@ Comet aims to provide consistent results with the version of Apache Spark that i
2323

2424
This guide documents areas where Comet's behavior is known to differ from Spark. Topics are grouped by subsystem:
2525

26-
- **Parquet**: limitations when reading Parquet files (both scan implementations, shared and per-implementation).
26+
- **Parquet**: limitations when reading Parquet files.
2727
- **Floating-point comparison**: NaN and signed-zero handling in comparisons.
2828
- **Regular expressions**: differences between the Rust regexp crate and Java's regex engine.
2929
- **Operators**: operator-level compatibility notes, including window functions and round-robin partitioning.

docs/source/user-guide/latest/compatibility/scans.md

Lines changed: 24 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -19,22 +19,13 @@ under the License.
1919

2020
# Parquet Compatibility
2121

22-
Comet currently has two distinct implementations of the Parquet scan operator.
22+
Comet's Parquet scan offloads decoding to native code and produces Arrow batches for the rest of
23+
the plan. Comet falls back to Spark when the scan cannot be converted (for example, due to one of
24+
the unsupported features listed below).
2325

24-
| Scan Implementation | Notes |
25-
| ----------------------- | ---------------------- |
26-
| `native_datafusion` | Fully native scan |
27-
| `native_iceberg_compat` | Hybrid JVM/native scan |
26+
## Parquet Scan Limitations
2827

29-
The configuration property `spark.comet.scan.impl` is used to select an implementation. The default setting is
30-
`spark.comet.scan.impl=auto`, which attempts to use `native_datafusion` first, and falls back to Spark if the scan
31-
cannot be converted (e.g., due to unsupported features). Most users should not need to change this setting. However,
32-
it is possible to force Comet to use a particular implementation for all scan operations by setting this
33-
configuration property to one of the following implementations. For example: `--conf spark.comet.scan.impl=native_datafusion`.
34-
35-
## Shared Limitations
36-
37-
The following features are not supported by either scan implementation, and Comet will fall back to Spark in these scenarios:
28+
The following features are not supported and cause Comet to fall back to Spark:
3829

3930
- Decimals encoded in binary format.
4031
- `ShortType` columns, by default. When reading Parquet files written by systems other than Spark that contain
@@ -46,17 +37,30 @@ The following features are not supported by either scan implementation, and Come
4637
columns are always safe because they can only come from signed `INT8`, where truncation preserves the signed value.
4738
- Default values that are nested types (e.g., maps, arrays, structs). Literal default values are supported.
4839
- Spark's Datasource V2 API. When `spark.sql.sources.useV1SourceList` does not include `parquet`, Spark uses the
49-
V2 API for Parquet scans. The DataFusion-based implementations only support the V1 API.
40+
V2 API for Parquet scans. Comet's Parquet scan only supports the V1 API.
5041
- Spark metadata columns (e.g., `_metadata.file_path`)
42+
- No support for row indexes
43+
- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
44+
Comet's Parquet scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
45+
- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
46+
- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns)
47+
are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`,
48+
matching Spark's behavior.
49+
- `spark.sql.parquet.enableVectorizedReader=false`. Disabling the vectorized reader opts into
50+
Spark's parquet-mr semantics (silent overflow, null-on-narrowing), which Comet's native reader
51+
does not replicate. By default Comet falls back to Spark in this case. Set
52+
`spark.comet.scan.allowDisabledParquetVectorizedReader=true` to opt in to running the
53+
Comet Parquet scan regardless. See
54+
[#4352](https://github.com/apache/datafusion-comet/issues/4352).
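
The Datasource V2 fallback in the list above depends on whether `parquet` appears in `spark.sql.sources.useV1SourceList`. A minimal sketch of that check follows, assuming the setting is a comma-separated list compared case-insensitively. The function name `parquet_uses_v1` is hypothetical and is not part of Comet or Spark.

```python
def parquet_uses_v1(use_v1_source_list: str) -> bool:
    """Return True when 'parquet' is listed, meaning Parquet scans use the V1 API."""
    # Assumption: entries are comma-separated; surrounding whitespace and
    # letter case are ignored when matching the source name.
    sources = {entry.strip().lower() for entry in use_v1_source_list.split(",")}
    return "parquet" in sources
```

In a PySpark session this could be called as `parquet_uses_v1(spark.conf.get("spark.sql.sources.useV1SourceList"))`. A `False` result suggests the Parquet scan would go through the V2 API and therefore fall back to Spark.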
5155

52-
The following shared limitation may produce incorrect results without falling back to Spark:
56+
The following limitation may produce incorrect results without falling back to Spark:
5357

5458
- No support for datetime rebasing. When reading Parquet files containing dates or timestamps written before
5559
Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they were
5660
written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
5761
October 15, 1582.
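
One way to gauge exposure to the rebasing limitation is to compare the earliest date in a dataset against the Gregorian cutover named above. This is a minimal sketch, not Comet's logic: `GREGORIAN_CUTOVER` and `may_need_rebase` are hypothetical names, and the check only matters for files written with the hybrid calendar.

```python
from datetime import date

# First day of the Gregorian calendar. Python's date type is proleptic
# Gregorian, so earlier values are representable but may differ from what a
# hybrid Julian/Gregorian writer intended.
GREGORIAN_CUTOVER = date(1582, 10, 15)


def may_need_rebase(value: date) -> bool:
    """Return True for dates that fall before the Gregorian cutover."""
    return value < GREGORIAN_CUTOVER
```

If the minimum date in a column passes this check and the file was written by a pre-3.0 Spark version, results read through Comet may differ from Spark's.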
5862

59-
The following shared limitation raises an error at scan time rather than falling back to Spark:
63+
The following limitation raises an error at scan time rather than falling back to Spark:
6064

6165
- Invalid UTF-8 bytes in `STRING` columns. Spark permits arbitrary byte sequences in a `STRING`
6266
column (for example from `CAST(X'C1' AS STRING)`), but Comet's native execution path is built on
@@ -65,28 +69,7 @@ The following shared limitation raises an error at scan time rather than falling
6569
query, or cast the column to `BINARY` before persisting, if you need to preserve non-UTF-8 bytes.
6670
See [#4121](https://github.com/apache/datafusion-comet/issues/4121).
6771

68-
## `native_datafusion` Limitations
69-
70-
The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
71-
cause Comet to fall back to Spark (including when using `auto` mode). Note that the `native_datafusion` scan
72-
requires `spark.comet.exec.enabled=true` because the scan node must be wrapped by `CometExecRule`.
73-
74-
- No support for row indexes
75-
- No support for reading Parquet field IDs
76-
- No support for `input_file_name()`, `input_file_block_start()`, or `input_file_block_length()` SQL functions.
77-
The `native_datafusion` scan does not use Spark's `FileScanRDD`, so these functions cannot populate their values.
78-
- No support for `ignoreMissingFiles` or `ignoreCorruptFiles` being set to `true`
79-
- Duplicate field names in case-insensitive mode (e.g., a Parquet file with both `B` and `b` columns)
80-
are detected at read time and raise a `SparkRuntimeException` with error class `_LEGACY_ERROR_TEMP_2093`,
81-
matching Spark's behavior.
82-
- `spark.sql.parquet.enableVectorizedReader=false`. Disabling the vectorized reader opts into
83-
Spark's parquet-mr semantics (silent overflow, null-on-narrowing), which Comet's native reader
84-
does not replicate. By default Comet falls back to Spark in this case. Set
85-
`spark.comet.scan.allowDisabledParquetVectorizedReader=true` to opt in to running the
86-
`native_datafusion` scan regardless. See
87-
[#4352](https://github.com/apache/datafusion-comet/issues/4352).
88-
89-
The following `native_datafusion` limitations may produce incorrect results on Spark versions prior to 4.0
72+
The following limitation may produce incorrect results on Spark versions prior to 4.0
9073
without falling back to Spark:
9174

9275
- Reading `TimestampLTZ` as `TimestampNTZ`. On Spark 3.x, Spark raises an error per
@@ -112,8 +95,8 @@ Schema mismatch happens in two real-world scenarios:
11295
table types at read time.
11396

11497
Spark's vectorized Parquet reader fully validates these conversions in `ParquetVectorUpdaterFactory.getUpdater`
115-
and throws `SchemaColumnConvertNotSupportedException` for unsupported pairs. `native_datafusion` mirrors
116-
that validation in its schema adapter; the entries below are the remaining gaps.
98+
and throws `SchemaColumnConvertNotSupportedException` for unsupported pairs. Comet's Parquet scan
99+
mirrors that validation in its schema adapter; the entries below are the remaining gaps.
117100

118101
Note that the exact set of accepted conversions has changed between Spark versions
119102
(for example, Spark 3.x's `schemaEvolution.enabled` flag gates `INT32 → INT64`, `FLOAT → DOUBLE`,
@@ -136,14 +119,3 @@ SchemaColumnConvertNotSupportedException`) instead of the one-level chain Spark'
136119
`SparkException` instead. Walk the cause chain to recover the
137120
`SchemaColumnConvertNotSupportedException`. Spark 4.0+ produces a single-level chain, matching
138121
vanilla Spark. See [#4354](https://github.com/apache/datafusion-comet/issues/4354).
139-
140-
## `native_iceberg_compat` Limitations
141-
142-
The `native_iceberg_compat` scan has the following additional limitation that may produce incorrect results
143-
without falling back to Spark:
144-
145-
- Some Spark configuration values are hard-coded to their defaults rather than respecting user-specified values.
146-
This may produce incorrect results when non-default values are set. The affected configurations are
147-
`spark.sql.parquet.binaryAsString`, `spark.sql.parquet.int96AsTimestamp`, `spark.sql.caseSensitive`,
148-
`spark.sql.parquet.inferTimestampNTZ.enabled`, and `spark.sql.legacy.parquet.nanosAsLong`. See
149-
[issue #1816](https://github.com/apache/datafusion-comet/issues/1816) for more details.

docs/source/user-guide/latest/compatibility/spark-versions.md

Lines changed: 8 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -31,13 +31,13 @@ Spark 3.4.3 is supported with Java 11/17 and Scala 2.12/2.13.
3131
### Known Limitations
3232

3333
- **Reading `TimestampLTZ` as `TimestampNTZ`**: Spark 3.4 raises an error for this operation
34-
(SPARK-36182), but Comet's `native_datafusion` scan silently returns the raw UTC value instead.
35-
See [Parquet Compatibility](scans.md#native_datafusion-limitations) for details.
34+
(SPARK-36182), but Comet's Parquet scan silently returns the raw UTC value instead.
35+
See [Parquet Compatibility](scans.md#parquet-scan-limitations) for details.
3636

3737
- **Unsupported Parquet type conversions**: Spark 3.4 raises schema incompatibility errors for
3838
certain type mismatches (e.g., reading INT32 as BIGINT, decimal precision changes), but Comet's
39-
`native_datafusion` scan may not detect these and could return unexpected values.
40-
See [Parquet Compatibility](scans.md#native_datafusion-limitations) for details.
39+
Parquet scan may not detect these and could return unexpected values.
40+
See [Parquet Compatibility](scans.md#parquet-scan-limitations) for details.
4141

4242
## Spark 3.5
4343

@@ -46,13 +46,13 @@ Spark 3.5.8 is supported with Java 11/17 and Scala 2.12/2.13.
4646
### Known Limitations
4747

4848
- **Reading `TimestampLTZ` as `TimestampNTZ`**: Spark 3.5 raises an error for this operation
49-
(SPARK-36182), but Comet's `native_datafusion` scan silently returns the raw UTC value instead.
50-
See [Parquet Compatibility](scans.md#native_datafusion-limitations) for details.
49+
(SPARK-36182), but Comet's Parquet scan silently returns the raw UTC value instead.
50+
See [Parquet Compatibility](scans.md#parquet-scan-limitations) for details.
5151

5252
- **Unsupported Parquet type conversions**: Spark 3.5 raises schema incompatibility errors for
5353
certain type mismatches (e.g., reading INT32 as BIGINT, decimal precision changes), but Comet's
54-
`native_datafusion` scan may not detect these and could return unexpected values.
55-
See [Parquet Compatibility](scans.md#native_datafusion-limitations) for details.
54+
Parquet scan may not detect these and could return unexpected values.
55+
See [Parquet Compatibility](scans.md#parquet-scan-limitations) for details.
5656

5757
## Spark 4.0
5858

docs/source/user-guide/latest/datasources.md

Lines changed: 8 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -169,10 +169,11 @@ Or use `spark-shell` with HDFS support as described [above](#building-comet-with
169169

170170
## S3
171171

172-
The `native_datafusion` and `native_iceberg_compat` Parquet scan implementations completely offload data loading
173-
to native code. They use the [`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and
174-
support configuring S3 access using standard [Hadoop S3A configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration) by translating them to
175-
the `object_store` crate's format.
172+
Comet's Parquet scan completely offloads data loading to native code. It uses the
173+
[`object_store` crate](https://crates.io/crates/object_store) to read data from S3 and supports
174+
configuring S3 access using standard
175+
[Hadoop S3A configurations](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#General_S3A_Client_configuration)
176+
by translating them to the `object_store` crate's format.
176177

177178
This implementation maintains compatibility with existing Hadoop S3A configurations, so existing code will
178179
continue to work as long as the configurations are supported and can be translated without loss of functionality.
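
The examples later in this section pass S3A settings to Spark with the `spark.hadoop.` prefix. The sketch below assembles such `--conf` arguments from a dictionary of S3A options. The helper name `s3a_conf_args` is hypothetical, and the credential values are placeholders.

```python
def s3a_conf_args(options: dict[str, str]) -> list[str]:
    """Build spark-shell/spark-submit arguments for the given S3A options."""
    args: list[str] = []
    # Sort keys so the generated command line is deterministic.
    for key, value in sorted(options.items()):
        # Spark copies spark.hadoop.* properties into the Hadoop configuration.
        args += ["--conf", f"spark.hadoop.{key}={value}"]
    return args


simple_credentials = s3a_conf_args(
    {
        "fs.s3a.access.key": "my-access-key",
        "fs.s3a.secret.key": "my-secret-key",
    }
)
```

The resulting list can be appended to a `spark-shell` or `spark-submit` invocation, for example through `subprocess.run`.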
@@ -206,8 +207,7 @@ Multiple credential providers can be specified in a comma-separated list using t
206207

207208
### Additional S3 Configuration Options
208209

209-
Beyond credential providers, the `native_datafusion` and `native_iceberg_compat` implementations support additional
210-
S3 configuration options:
210+
Beyond credential providers, Comet's Parquet scan supports additional S3 configuration options:
211211

212212
| Option | Description |
213213
| ------------------------------- | -------------------------------------------------------------------------------------------------- |
@@ -220,8 +220,7 @@ All configuration options support bucket-specific overrides using the pattern `f
220220

221221
### Examples
222222

223-
The following examples demonstrate how to configure S3 access with the `native_datafusion` and `native_iceberg_compat`
224-
Parquet scan implementations using different authentication methods.
223+
The following examples demonstrate how to configure S3 access using different authentication methods.
225224

226225
**Example 1: Simple Credentials**
227226

@@ -230,7 +229,6 @@ This example shows how to access a private S3 bucket using an access key and sec
230229
```shell
231230
$SPARK_HOME/bin/spark-shell \
232231
...
233-
--conf spark.comet.scan.impl=native_datafusion \
234232
--conf spark.hadoop.fs.s3a.access.key=my-access-key \
235233
--conf spark.hadoop.fs.s3a.secret.key=my-secret-key
236234
...
@@ -243,7 +241,6 @@ This example demonstrates using an assumed role credential to access a private S
243241
```shell
244242
$SPARK_HOME/bin/spark-shell \
245243
...
246-
--conf spark.comet.scan.impl=native_datafusion \
247244
--conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider \
248245
--conf spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/my-role \
249246
--conf spark.hadoop.fs.s3a.assumed.role.session.name=my-session \
@@ -253,7 +250,7 @@ $SPARK_HOME/bin/spark-shell \
253250

254251
### Limitations
255252

256-
The S3 support of `native_datafusion` and `native_iceberg_compat` has the following limitations:
253+
Comet's S3 support has the following limitations:
257254

258255
1. **Partial Hadoop S3A configuration support**: Not all Hadoop S3A configurations are currently supported. Only the configurations listed in the tables above are translated and applied to the underlying `object_store` crate.
259256

docs/source/user-guide/latest/understanding-comet-plans.md

Lines changed: 6 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -145,13 +145,12 @@ by role. Names match what is shown in the plan output.
145145

146146
### Scans
147147

148-
| Node | Description |
149-
| ------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
150-
| `CometScan` | V1 Parquet scan driven by Spark's file-source path through Comet's Parquet reader. Decoding runs in native code; the resulting Arrow batches cross JNI into the native plan. The active scan implementation is shown in brackets, e.g. `CometScan [native_iceberg_compat]`. |
151-
| `CometBatchScan` | DataSource V2 scan, including Iceberg Parquet, that produces Arrow batches consumed by Comet. |
152-
| `CometNativeScan` | Fully native Parquet scan that runs entirely in DataFusion (no JVM Parquet reader involvement). |
153-
| `CometIcebergNativeScan` | Fully native Iceberg Parquet scan. |
154-
| `CometCsvNativeScan` | Fully native CSV scan (experimental). |
148+
| Node | Description |
149+
| ------------------------ | --------------------------------------------------------------------------------------------- |
150+
| `CometBatchScan` | DataSource V2 scan, including Iceberg Parquet, that produces Arrow batches consumed by Comet. |
151+
| `CometNativeScan` | Fully native Parquet scan that runs entirely in DataFusion. |
152+
| `CometIcebergNativeScan` | Fully native Iceberg Parquet scan. |
153+
| `CometCsvNativeScan` | Fully native CSV scan (experimental). |
155154

156155
### Native Execution Operators
157156
