Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim by res-life · Pull Request #14881 · NVIDIA/spark-rapids

res-life · 2026-05-26T03:58:48Z

Stacked work for #14853 (1/3) — common-code preparation for adding iceberg-1-11-x.

Depends on

Fix Iceberg package-private access after shim isolation #14866. This PR will be rebased after #14866 is merged.

Description

Refactors iceberg/common so the SparkScan / SparkCopyOnWriteScan / SparkBatch / DataWriteResult APIs that diverge between Iceberg 1.10.x and 1.11.x are hidden behind a small interface, with per-Iceberg-version implementations in iceberg-1-6-x / iceberg-1-9-x / iceberg-1-10-x. No behavior change for the Iceberg versions this PR ships; sets the stage for the follow-up PR that adds iceberg-1-11-x.

Common changes:

GpuSparkCopyOnWriteScan → renamed to GpuSparkCopyOnWriteScanBase (abstract). The runtime-filter trait + filter method live in a per-version concrete subclass (1.6/1.9/1.10 mix in SupportsRuntimeFiltering with filter(Filter[]); 1.11 will mix in SupportsRuntimeV2Filtering with filter(Predicate[])).
GpuSparkScan: rewrite hasNestedType via Spark's readSchema() + Spark types so it no longer depends on the 1.10-only cpuScan.expectedSchema(). Dispatch SparkCopyOnWriteScan construction through the new ShimUtils.newCopyOnWriteScan factory.
GpuSparkBatchQueryScan.toString uses cpuScan.description() (available in both 1.10 and 1.11) instead of branch / expectedSchema / filterExpressions (1.11 removed these).
GpuSparkBatchQueryScan.runtimeFilterExpressions reflective field-read tolerates both the 1.10 name (runtimeFilterExpressions) and the 1.11 name (runtimeFilters).
GpuSparkBatch: same tolerance for expectedSchema (1.10) vs projection (1.11).
GpuSparkWrite: type-annotate new Array[DataFile](0) so Scala 2.13 doesn't infer Array[Nothing] under 1.11's wildcarded DataWriteResult.dataFiles().
IcebergShimUtils / ShimUtils: add newCopyOnWriteScan(Scan, RapidsConf, Boolean): GpuScan factory. The parameter is Spark's public Scan because Iceberg's SparkCopyOnWriteScan is package-private — cross-package callers cannot reference it directly.
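The split between the abstract base and the per-version subclasses can be pictured with a minimal Java sketch. This is not the PR's code (the real classes are Scala): `LegacyFilter`, `V2Predicate`, the two `*RuntimeFiltering` interfaces, and the `CowScan*` classes are hypothetical stand-ins for Spark's `Filter`, `Predicate`, `SupportsRuntimeFiltering`, `SupportsRuntimeV2Filtering`, and the GPU scan classes. It only illustrates why `filter` cannot stay in the shared base: the two traits require different parameter types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Spark's Filter (V1) and Predicate (V2) types.
final class LegacyFilter {
    final String text;
    LegacyFilter(String text) { this.text = text; }
}

final class V2Predicate {
    final String text;
    V2Predicate(String text) { this.text = text; }
}

// Hypothetical stand-ins for SupportsRuntimeFiltering / SupportsRuntimeV2Filtering.
interface LegacyRuntimeFiltering {
    void filter(LegacyFilter[] filters);
}

interface V2RuntimeFiltering {
    void filter(V2Predicate[] predicates);
}

// Shared logic only; deliberately declares no filter(...) method.
abstract class CowScanBase {
    private final List<String> applied = new ArrayList<>();

    protected void record(String entry) { applied.add(entry); }

    List<String> appliedFilters() { return applied; }
}

// What a 1.6 / 1.9 / 1.10 module would provide in this sketch.
final class CowScanLegacy extends CowScanBase implements LegacyRuntimeFiltering {
    @Override
    public void filter(LegacyFilter[] filters) {
        for (LegacyFilter f : filters) record("filter:" + f.text);
    }
}

// What a 1.11 module would provide in this sketch.
final class CowScanV2 extends CowScanBase implements V2RuntimeFiltering {
    @Override
    public void filter(V2Predicate[] predicates) {
        for (V2Predicate p : predicates) record("predicate:" + p.text);
    }
}
```

Each version module compiles only the subclass whose interface exists in its Iceberg/Spark combination, so common code never has to name either filtering trait.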

Per-Iceberg-version module changes (1.6 / 1.9 / 1.10, all identical for the V1 path):

New GpuSparkCopyOnWriteScan in org.apache.iceberg.spark.source (so it can reference the package-private SparkCopyOnWriteScan). Companion object exposes create(Scan, ...): GpuScan for cross-package callers.
ShimUtilsImpl.java implements newCopyOnWriteScan via GpuSparkCopyOnWriteScan.create.
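A rough Java sketch of the dispatch path, under stated assumptions: every type name ending in `Sketch` is hypothetical, the `RapidsConf` parameter is omitted, and the way the real `ShimUtils` selects its implementation is not described here, so the sketch simply hard-wires one. Because a single file cannot enforce package boundaries, the comments mark which type would be package-private in practice.

```java
// Stand-in for Spark's public Scan interface.
interface ScanSketch {}

// Stand-in for Iceberg's package-private SparkCopyOnWriteScan.
final class CowScanSketch implements ScanSketch {
    final String table;
    CowScanSketch(String table) { this.table = table; }
}

// Stand-in for the plugin's GpuScan.
interface GpuScanSketch {
    String describe();
}

// Would live in the same package as CowScanSketch, so it may name that type.
final class GpuCowScanSketch implements GpuScanSketch {
    private final CowScanSketch cpuScan;
    private final boolean queryUsesInputFile;

    private GpuCowScanSketch(CowScanSketch cpuScan, boolean queryUsesInputFile) {
        this.cpuScan = cpuScan;
        this.queryUsesInputFile = queryUsesInputFile;
    }

    // Cross-package entry point: accepts the public supertype and downcasts.
    static GpuScanSketch create(ScanSketch scan, boolean queryUsesInputFile) {
        return new GpuCowScanSketch((CowScanSketch) scan, queryUsesInputFile);
    }

    @Override
    public String describe() {
        return "GpuCopyOnWriteScan(" + cpuScan.table + ", inputFile=" + queryUsesInputFile + ")";
    }
}

// Per-version shim contract.
interface IcebergShimSketch {
    GpuScanSketch newCopyOnWriteScan(ScanSketch scan, boolean queryUsesInputFile);
}

// Per-version implementation; delegates to the factory above.
final class ShimImplSketch implements IcebergShimSketch {
    @Override
    public GpuScanSketch newCopyOnWriteScan(ScanSketch scan, boolean queryUsesInputFile) {
        return GpuCowScanSketch.create(scan, queryUsesInputFile);
    }
}

// Static wrapper called from common code (implementation hard-wired for the sketch).
final class ShimUtilsSketch {
    private static final IcebergShimSketch IMPL = new ShimImplSketch();

    private ShimUtilsSketch() {}

    static GpuScanSketch newCopyOnWriteScan(ScanSketch scan, boolean queryUsesInputFile) {
        return IMPL.newCopyOnWriteScan(scan, queryUsesInputFile);
    }
}
```

The downcast inside `create` is only safe when the caller has already matched the concrete scan type, which is why the factory is meant to be reached from a pattern match rather than called with arbitrary scans.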

The two try/catch field-name fallbacks (in GpuSparkBatchQueryScan and GpuSparkBatch) are tactical and will be pushed behind proper per-version IcebergShimUtils methods in a later cleanup PR.
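For illustration, a field-name fallback of this kind might look like the following Java sketch. It uses plain `java.lang.reflect`, which is an assumption; the PR's Scala code may rely on a different reflection helper and exception type. `FieldFallbackSketch`, `ScanOldSketch`, and `ScanNewSketch` are hypothetical names, with the two scan classes standing in for the 1.10 and 1.11 layouts.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.List;

final class FieldFallbackSketch {
    private FieldFallbackSketch() {}

    // Returns the value of the first candidate field that exists on the target.
    static Object readFirstPresent(Object target, String... candidates) {
        for (String name : candidates) {
            Field field = findField(target.getClass(), name);
            if (field != null) {
                try {
                    field.setAccessible(true);
                    return field.get(target);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        throw new IllegalArgumentException(
            "None of " + Arrays.toString(candidates) + " found on " + target.getClass().getName());
    }

    // Walks the class hierarchy; returns null when no class declares the field.
    private static Field findField(Class<?> start, String name) {
        for (Class<?> c = start; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException e) {
                // Not declared here; try the superclass.
            }
        }
        return null;
    }
}

// Stand-in for the 1.10 field layout.
final class ScanOldSketch {
    private final List<String> runtimeFilterExpressions = Arrays.asList("id > 5");
}

// Stand-in for the 1.11 field layout.
final class ScanNewSketch {
    private final List<String> runtimeFilters = Arrays.asList("id > 5");
}
```

A caller would pass the newer name first and the older one second, for example `readFirstPresent(scan, "runtimeFilters", "runtimeFilterExpressions")`. A per-version shim method would remove the probing entirely, since each module knows its own field name at compile time.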

Checklists

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(3.5.x + 4.0.x iceberg integration tests in `integration_tests/src/main/python/iceberg/` — exercises the new dispatch path with no behavior change vs. before this PR.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

gerashegalov

This should ideally build on top of #14866

res-life · 2026-05-28T02:49:15Z

build

greptile-apps · 2026-05-28T02:54:54Z

Greptile Summary

This PR refactors the Iceberg integration's common layer to hide version-divergent scan APIs (SparkScan, SparkCopyOnWriteScan, SparkBatch, DataWriteResult) behind a per-version shim, preparing for Iceberg 1.11.x support without changing behavior for existing 1.6/1.9/1.10 users.

GpuSparkCopyOnWriteScan is split into an abstract GpuSparkCopyOnWriteScanBase (common) and per-version concrete subclasses (1.6/1.9/1.10) that mix in the appropriate SupportsRuntimeFiltering / SupportsRuntimeV2Filtering trait, dispatched via a new ShimUtils.newCopyOnWriteScan factory.
Tactical try/catch fallbacks handle renamed fields between Iceberg 1.10.x (runtimeFilterExpressions, expectedSchema) and 1.11.x (runtimeFilters, projection); hasNestedType is rewritten against Spark's readSchema() to avoid the 1.10-only cpuScan.expectedSchema(); Array[DataFile] type annotation is added in GpuSparkWrite to prevent Scala 2.13 Array[Nothing] inference under 1.11.x's wildcarded return type.

Confidence Score: 4/5

Safe to merge; no behavior change for existing Iceberg 1.6/1.9/1.10 users, and the new dispatch infrastructure is straightforward.

The refactoring is mechanically correct: the class hierarchy ensures the GpuSparkScan cast is safe, the hasNestedType rewrite maps Iceberg types to Spark equivalents correctly, and the Array[DataFile] type annotation fixes a real Scala 2.13 inference gap. The two tactical try/catch fallbacks (projection vs expectedSchema in GpuSparkBatch, runtimeFilters vs runtimeFilterExpressions in GpuSparkBatchQueryScan) each throw-and-catch an exception on every invocation for all current supported Iceberg versions, which is a minor hot-path overhead explicitly acknowledged in the PR as temporary pending a follow-up cleanup.

GpuSparkBatch and GpuSparkBatchQueryScan carry the tactical field-name fallbacks that should be moved to IcebergShimUtils in the follow-up PR.
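One possible way to avoid paying for a thrown exception on every invocation, short of the planned shim methods, is to resolve the field once per class and cache it. The Java sketch below is an assumption-laden illustration and not something the PR contains: `CachedFieldProbeSketch` and `BatchOldSketch` are hypothetical, the `probes` counter exists only to make the caching observable, and only fields declared directly on the target class are considered.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

final class CachedFieldProbeSketch {
    private final String[] candidates;
    private final ConcurrentMap<Class<?>, Field> resolved = new ConcurrentHashMap<>();

    // Counts how many times the name probing actually ran.
    final AtomicInteger probes = new AtomicInteger();

    CachedFieldProbeSketch(String... candidates) {
        this.candidates = candidates.clone();
    }

    // Reads the field, probing candidate names only on the first call per class.
    Object read(Object target) {
        Field field = resolved.computeIfAbsent(target.getClass(), this::probe);
        try {
            return field.get(target);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
    }

    private Field probe(Class<?> cls) {
        probes.incrementAndGet();
        for (String name : candidates) {
            try {
                Field field = cls.getDeclaredField(name);
                field.setAccessible(true);
                return field;
            } catch (NoSuchFieldException e) {
                // Expected when this name belongs to the other Iceberg version.
            }
        }
        throw new IllegalArgumentException(
            "None of " + Arrays.toString(candidates) + " declared on " + cls.getName());
    }
}

// Stand-in for a 1.10-style batch that only has the older field name.
final class BatchOldSketch {
    private final String expectedSchema = "struct<id:int>";
}
```

With a probe built as `new CachedFieldProbeSketch("projection", "expectedSchema")`, the `NoSuchFieldException` for the missing newer name is raised once per class rather than once per call.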

Important Files Changed

| Filename | Overview |
| --- | --- |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScanBase.scala | Renamed from GpuSparkCopyOnWriteScan to abstract base; removes SupportsRuntimeFiltering mixin and filter()/withInputFile(), both delegated to per-version subclasses. filterAttributes() retained without override keyword (correct, no parent interface at this level). |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkScan.scala | hasNestedType rewritten from Iceberg schema types to Spark readSchema() types; the mapping is semantically equivalent. SparkCopyOnWriteScan dispatch now goes through ShimUtils.newCopyOnWriteScan with a GpuSparkScan cast that is safe via the class hierarchy. |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkBatchQueryScan.scala | Tactical try/catch probes runtimeFilters (1.11.x) then falls back to runtimeFilterExpressions (1.10.x); for current versions (1.6/1.9/1.10) this always throws-and-catches an IllegalArgumentException at object construction. toString simplified to use cpuScan.description(). |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkBatch.scala | planInputPartitions probes projection (1.11.x) then falls back to expectedSchema (1.10.x) via try/catch, the same tactical approach as GpuSparkBatchQueryScan; always throws-and-catches for current Iceberg versions on every invocation. |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkWrite.scala | Correct Scala 2.13 fix: explicit Array[DataFile] type annotation prevents Array[Nothing] inference under Iceberg 1.11.x's wildcarded DataWriteResult.dataFiles(). |
| iceberg/iceberg-1-10-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | New per-version concrete class; correctly mixes in SupportsRuntimeFiltering, implements filter(Array[Filter]) and withInputFile(). Companion object create() safely downcasts Scan to SparkCopyOnWriteScan (guaranteed by caller's pattern match). |
| iceberg/iceberg-1-6-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | Identical in structure to the 1.10.x version; correctly placed in org.apache.iceberg.spark.source to access package-private SparkCopyOnWriteScan. |
| iceberg/iceberg-1-9-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | Identical to 1.6.x and 1.10.x versions; all three V1-path versions share the same SupportsRuntimeFiltering + filter(Array[Filter]) implementation. |
| iceberg/common/src/main/java/com/nvidia/spark/rapids/iceberg/IcebergShimUtils.java | New newCopyOnWriteScan interface method added with clear Javadoc explaining the cross-package indirection and version-divergence rationale. |
| iceberg/common/src/main/java/com/nvidia/spark/rapids/iceberg/ShimUtils.java | Static delegating wrapper for newCopyOnWriteScan; consistent with existing ShimUtils pattern. |

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class GpuSparkScan {
        <<abstract>>
        +cpuScan: SparkScan
        +rapidsConf: RapidsConf
        +queryUsesInputFile: Boolean
        +hasNestedType() Boolean
        +readSchema() StructType
        +toBatch() Batch
    }

    class GpuSparkPartitioningAwareScan {
        <<abstract>>
        +outputPartitioning() Partitioning
        +groupingKeyType() Types.StructType
        +taskGroups() Seq
    }

    class GpuSparkBatchQueryScan {
        +cpuScan: SparkBatchQueryScan
        -runtimeFilterExpressions: List~Expression~
        +filter(predicates: Array~Predicate~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScanBase {
        <<abstract>>
        +cpuScan: SparkCopyOnWriteScan
        +filterAttributes() Array~NamedReference~
        +estimateStatistics() Statistics
    }

    class GpuSparkCopyOnWriteScan_16 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScan_19 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScan_110 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class ShimUtils {
        +newCopyOnWriteScan(Scan, RapidsConf, Boolean) GpuScan$
    }

    class IcebergShimUtils {
        <<interface>>
        +newCopyOnWriteScan(Scan, RapidsConf, Boolean) GpuScan
    }

    GpuSparkScan <|-- GpuSparkPartitioningAwareScan
    GpuSparkPartitioningAwareScan <|-- GpuSparkBatchQueryScan
    GpuSparkPartitioningAwareScan <|-- GpuSparkCopyOnWriteScanBase
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_16 : iceberg-1-6-x
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_19 : iceberg-1-9-x
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_110 : iceberg-1-10-x
    ShimUtils --> IcebergShimUtils : delegates
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_16 : create()
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_19 : create()
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_110 : create()

Reviews (1): Last reviewed commit: "Iceberg: extract version-divergent scan ..."

Refactors iceberg/common so the {SparkScan, SparkBatchQueryScan, SparkCopyOnWriteScan, SparkBatch, DataWriteResult} APIs that diverge between Iceberg 1.10.x and 1.11.x are hidden behind a small interface, with per-version implementations in iceberg-1-6-x / iceberg-1-9-x / iceberg-1-10-x. No behavior change for the existing Iceberg versions this PR ships; sets the stage for a follow-up that adds iceberg-1-11-x.

Common:
- GpuSparkCopyOnWriteScan -> renamed to GpuSparkCopyOnWriteScanBase (abstract); per-version concrete subclass mixes in the right runtime-filter trait (SupportsRuntimeFiltering vs SupportsRuntimeV2Filtering) and the matching filter() signature.
- GpuSparkScan: rewrite hasNestedType via Spark's readSchema() + Spark types so it no longer depends on the Iceberg 1.10-only cpuScan.expectedSchema(); dispatch SparkCopyOnWriteScan construction through ShimUtils.newCopyOnWriteScan.
- GpuSparkBatchQueryScan: toString uses cpuScan.description() (public, available in both Iceberg 1.10 and 1.11) instead of branch / expectedSchema / filterExpressions which 1.11 removed. runtimeFilterExpressions field read tolerates both 1.10 name (runtimeFilterExpressions) and 1.11 name (runtimeFilters); this is a tactical fallback to be replaced with proper per-version shim methods.
- GpuSparkBatch: same tolerance for expectedSchema (1.10) vs projection (1.11).
- GpuSparkWrite: type-annotate `new Array[DataFile](0)` so Scala 2.13 doesn't infer Array[Nothing] under 1.11's wildcarded DataWriteResult.dataFiles().
- IcebergShimUtils / ShimUtils: add newCopyOnWriteScan(Scan, ...) factory whose parameter is Spark's public Scan because Iceberg's SparkCopyOnWriteScan is package-private, so cross-package callers cannot reference it directly.

Per-Iceberg-version module:
- New GpuSparkCopyOnWriteScan in org.apache.iceberg.spark.source (so it can reference the package-private SparkCopyOnWriteScan). Companion object exposes create(Scan, ...): GpuScan for cross-package callers. 1.6/1.9/1.10 mix in SupportsRuntimeFiltering + filter(Filter[]).
- ShimUtilsImpl.java: implement newCopyOnWriteScan via GpuSparkCopyOnWriteScan.create.

Signed-off-by: Chong Gao <res_life@163.com>

This was referenced May 26, 2026

Iceberg: add iceberg-1-11-x module wired to release411 (Spark 4.1) #14882

Draft

Iceberg 1.11: accelerate SparkIncrementalAppendScan on GPU #14883

Draft

gerashegalov reviewed May 27, 2026

View reviewed changes

res-life requested a review from a team May 28, 2026 02:48

res-life marked this pull request as ready for review May 28, 2026 02:49

res-life marked this pull request as draft May 28, 2026 02:59

res-life force-pushed the iceberg-1.11/pr1-common-shim branch from ffb4086 to 4647fc3 Compare May 29, 2026 03:11

res-life force-pushed the iceberg-1.11/pr1-common-shim branch from 4647fc3 to ac790ba Compare May 29, 2026 03:41

res-life changed the title ~~Iceberg: extract version-divergent scan APIs behind a shim~~ Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim May 30, 2026

res-life wants to merge 1 commit into NVIDIA:main from res-life:iceberg-1.11/pr1-common-shim
