
Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim #14881

Draft
res-life wants to merge 1 commit into NVIDIA:main from res-life:iceberg-1.11/pr1-common-shim


@res-life (Collaborator) commented May 26, 2026

Stacked work for #14853 (1/3) — common-code preparation for adding iceberg-1-11-x.

Depends on

Description

Refactors iceberg/common so the SparkScan / SparkCopyOnWriteScan / SparkBatch / DataWriteResult APIs that diverge between Iceberg 1.10.x and 1.11.x are hidden behind a small interface, with per-Iceberg-version implementations in iceberg-1-6-x / iceberg-1-9-x / iceberg-1-10-x. No behavior change for the Iceberg versions this PR ships; sets the stage for the follow-up PR that adds iceberg-1-11-x.

Common changes:

  • GpuSparkCopyOnWriteScan → renamed to GpuSparkCopyOnWriteScanBase (abstract). The runtime-filter trait + filter method live in a per-version concrete subclass (1.6/1.9/1.10 mix in SupportsRuntimeFiltering with filter(Filter[]); 1.11 will mix in SupportsRuntimeV2Filtering with filter(Predicate[])).
  • GpuSparkScan: rewrite hasNestedType via Spark's readSchema() + Spark types so it no longer depends on the 1.10-only cpuScan.expectedSchema(). Dispatch SparkCopyOnWriteScan construction through the new ShimUtils.newCopyOnWriteScan factory.
  • GpuSparkBatchQueryScan.toString uses cpuScan.description() (available in both 1.10 and 1.11) instead of branch / expectedSchema / filterExpressions (1.11 removed these).
  • GpuSparkBatchQueryScan.runtimeFilterExpressions reflective field-read tolerates both the 1.10 name (runtimeFilterExpressions) and the 1.11 name (runtimeFilters).
  • GpuSparkBatch: same tolerance for expectedSchema (1.10) vs projection (1.11).
  • GpuSparkWrite: type-annotate new Array[DataFile](0) so Scala 2.13 doesn't infer Array[Nothing] under 1.11's wildcarded DataWriteResult.dataFiles().
  • IcebergShimUtils / ShimUtils: add newCopyOnWriteScan(Scan, RapidsConf, Boolean): GpuScan factory. The parameter is Spark's public Scan because Iceberg's SparkCopyOnWriteScan is package-private — cross-package callers cannot reference it directly.
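The factory indirection described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: every type here (`ScanLike`, `GpuScanLike`, `ConfLike`, `ShimContract`, `Shim110`, `ShimFacade`) is a hypothetical stand-in for Spark's `Scan`, the plugin's `GpuScan`, `RapidsConf`, `IcebergShimUtils`, a per-version `ShimUtilsImpl`, and `ShimUtils` respectively, and the explicit `install` step is an assumption, since the PR does not show how the active implementation is selected.

```java
import java.util.Objects;

// All names below are hypothetical stand-ins, not the plugin's real classes.

/** Stands in for Spark's public Scan, the only type cross-package callers can name. */
interface ScanLike {
    String description();
}

/** Stands in for the plugin's GpuScan. */
interface GpuScanLike {
    String describe();
}

/** Stands in for RapidsConf. */
final class ConfLike {}

/** Plays the role of IcebergShimUtils: one method per version-divergent construction. */
interface ShimContract {
    GpuScanLike newCopyOnWriteScan(ScanLike cpuScan, ConfLike conf, boolean queryUsesInputFile);
}

/** What a per-version module (here 1.10.x) might provide. */
final class Shim110 implements ShimContract {
    @Override
    public GpuScanLike newCopyOnWriteScan(ScanLike cpuScan, ConfLike conf, boolean queryUsesInputFile) {
        // A real implementation would downcast cpuScan inside the Iceberg package.
        return () -> "gpu-cow-scan[1.10.x](" + cpuScan.description() + ")";
    }
}

/** Plays the role of ShimUtils: a static facade that delegates to the installed shim. */
final class ShimFacade {
    private static volatile ShimContract impl;

    private ShimFacade() {}

    // Assumption: explicit installation; the PR does not show the real selection mechanism.
    static void install(ShimContract shim) {
        impl = Objects.requireNonNull(shim, "shim");
    }

    static GpuScanLike newCopyOnWriteScan(ScanLike cpuScan, ConfLike conf, boolean queryUsesInputFile) {
        ShimContract shim = impl;
        if (shim == null) {
            throw new IllegalStateException("no Iceberg shim installed");
        }
        return shim.newCopyOnWriteScan(cpuScan, conf, queryUsesInputFile);
    }
}
```

In this sketch, common code only ever calls `ShimFacade.newCopyOnWriteScan(...)` with the public scan type, so it never has to name the package-private class or the version-specific runtime-filter trait.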

Per-Iceberg-version module changes (1.6 / 1.9 / 1.10, all identical for the V1 path):

  • New GpuSparkCopyOnWriteScan in org.apache.iceberg.spark.source (so it can reference the package-private SparkCopyOnWriteScan). Companion object exposes create(Scan, ...): GpuScan for cross-package callers.
  • ShimUtilsImpl.java implements newCopyOnWriteScan via GpuSparkCopyOnWriteScan.create.

The two try/catch field-name fallbacks (in GpuSparkBatchQueryScan and GpuSparkBatch) are tactical and will be pushed behind proper per-version IcebergShimUtils methods in a later cleanup PR.
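For illustration only, a field-name fallback of this kind might look like the sketch below. It is not the code in this PR: `FieldFallback`, `OldNameScan`, and `NewNameScan` are hypothetical names, the two stand-in classes merely mimic a private field that was renamed between versions, and the exception types are a choice made for this sketch.

```java
import java.lang.reflect.Field;

/** Hypothetical helper: reads the first candidate field that exists on the target. */
final class FieldFallback {
    private FieldFallback() {}

    static Object readFirstPresent(Object target, String... candidateNames) {
        for (String name : candidateNames) {
            // Walk up the class hierarchy, since the field may live on a superclass.
            for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
                try {
                    Field f = c.getDeclaredField(name);
                    f.setAccessible(true);
                    return f.get(target);
                } catch (NoSuchFieldException e) {
                    // Not declared here; try the superclass, then the next candidate name.
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("cannot read field " + name, e);
                }
            }
        }
        throw new IllegalArgumentException(
            "none of these fields exist: " + String.join(", ", candidateNames));
    }
}

/** Stand-in for a scan class that uses the older field name. */
final class OldNameScan {
    private final Object runtimeFilterExpressions = "old-name";
}

/** Stand-in for a scan class that uses the newer field name. */
final class NewNameScan {
    private final Object runtimeFilters = "new-name";
}
```

Calling `FieldFallback.readFirstPresent(scan, "runtimeFilters", "runtimeFilterExpressions")` probes the newer name first and falls back to the older one, which is the shape of tolerance described above.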

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (3.5.x + 4.0.x iceberg integration tests in `integration_tests/src/main/python/iceberg/` — exercises the new dispatch path with no behavior change vs. before this PR.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov (Collaborator) left a comment

This should ideally build on top of #14866

@res-life res-life requested a review from a team May 28, 2026 02:48
@res-life res-life marked this pull request as ready for review May 28, 2026 02:49
@res-life (Collaborator, Author) commented

build

@greptile-apps (Bot, Contributor) commented May 28, 2026

Greptile Summary

This PR refactors the Iceberg integration's common layer to hide version-divergent scan APIs (SparkScan, SparkCopyOnWriteScan, SparkBatch, DataWriteResult) behind a per-version shim, preparing for Iceberg 1.11.x support without changing behavior for existing 1.6/1.9/1.10 users.

  • GpuSparkCopyOnWriteScan is split into an abstract GpuSparkCopyOnWriteScanBase (common) and per-version concrete subclasses (1.6/1.9/1.10) that mix in the appropriate SupportsRuntimeFiltering / SupportsRuntimeV2Filtering trait, dispatched via a new ShimUtils.newCopyOnWriteScan factory.
  • Tactical try/catch fallbacks handle renamed fields between Iceberg 1.10.x (runtimeFilterExpressions, expectedSchema) and 1.11.x (runtimeFilters, projection); hasNestedType is rewritten against Spark's readSchema() to avoid the 1.10-only cpuScan.expectedSchema(); Array[DataFile] type annotation is added in GpuSparkWrite to prevent Scala 2.13 Array[Nothing] inference under 1.11.x's wildcarded return type.

Confidence Score: 4/5

Safe to merge; no behavior change for existing Iceberg 1.6/1.9/1.10 users, and the new dispatch infrastructure is straightforward.

The refactoring is mechanically correct: the class hierarchy ensures the GpuSparkScan cast is safe, the hasNestedType rewrite maps Iceberg types to Spark equivalents correctly, and the Array[DataFile] type annotation fixes a real Scala 2.13 inference gap. The two tactical try/catch fallbacks (projection vs expectedSchema in GpuSparkBatch, runtimeFilters vs runtimeFilterExpressions in GpuSparkBatchQueryScan) each throw and catch an exception on every invocation for all currently supported Iceberg versions, which is a minor hot-path overhead that the PR explicitly acknowledges as temporary, pending a follow-up cleanup.

GpuSparkBatch and GpuSparkBatchQueryScan carry the tactical field-name fallbacks that should be moved to IcebergShimUtils in the follow-up PR.
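One possible way to limit the repeated throw-and-catch cost until the fallbacks move into the shim is to resolve the field once per class and cache the result. The sketch below is an assumption for illustration, not something proposed in this PR: `CachedFieldReader` and `LegacyBatchLike` are hypothetical names, and the resolution step still catches `NoSuchFieldException` internally, just once per class instead of on every read.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: resolves one of several candidate field names once per class. */
final class CachedFieldReader {
    private final String[] candidates;
    private final Map<Class<?>, Field> resolved = new ConcurrentHashMap<>();
    private final AtomicInteger resolutions = new AtomicInteger();

    CachedFieldReader(String... candidates) {
        this.candidates = candidates.clone();
    }

    Object read(Object target) {
        // The reflective lookup (and any exception it catches) runs only on a cache miss.
        Field f = resolved.computeIfAbsent(target.getClass(), this::resolve);
        try {
            return f.get(target);
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("cannot read field " + f.getName(), e);
        }
    }

    /** Number of times a class had to be resolved reflectively; exposed for demonstration. */
    int resolutionCount() {
        return resolutions.get();
    }

    private Field resolve(Class<?> type) {
        resolutions.incrementAndGet();
        for (String name : candidates) {
            for (Class<?> c = type; c != null; c = c.getSuperclass()) {
                try {
                    Field f = c.getDeclaredField(name);
                    f.setAccessible(true);
                    return f;
                } catch (NoSuchFieldException e) {
                    // Try the superclass, then the next candidate name.
                }
            }
        }
        throw new IllegalArgumentException(
            "none of these fields exist on " + type.getName() + ": " + String.join(", ", candidates));
    }
}

/** Stand-in for a batch class that still uses the older field name. */
final class LegacyBatchLike {
    private final Object expectedSchema = "schema-v1";
}
```

With a shared `new CachedFieldReader("projection", "expectedSchema")`, repeated reads against the same class hit the cache after the first call. Per-version shim methods, as planned for the follow-up, would remove the reflection probe entirely.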

Important Files Changed

| Filename | Overview |
| --- | --- |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScanBase.scala | Renamed from GpuSparkCopyOnWriteScan to abstract base; removes SupportsRuntimeFiltering mixin and filter()/withInputFile(), both delegated to per-version subclasses. filterAttributes() retained without override keyword (correct, no parent interface at this level). |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkScan.scala | hasNestedType rewritten from Iceberg schema types to Spark readSchema() types; the mapping is semantically equivalent. SparkCopyOnWriteScan dispatch now goes through ShimUtils.newCopyOnWriteScan with a GpuSparkScan cast that is safe via the class hierarchy. |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkBatchQueryScan.scala | Tactical try/catch probes runtimeFilters (1.11.x) then falls back to runtimeFilterExpressions (1.10.x); for current versions (1.6/1.9/1.10) this always throws and catches an IllegalArgumentException at object construction. toString simplified to use cpuScan.description(). |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkBatch.scala | planInputPartitions probes projection (1.11.x) then falls back to expectedSchema (1.10.x) via try/catch, the same tactical approach as GpuSparkBatchQueryScan; always throws and catches for current Iceberg versions on every invocation. |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkWrite.scala | Correct Scala 2.13 fix: explicit Array[DataFile] type annotation prevents Array[Nothing] inference under Iceberg 1.11.x's wildcarded DataWriteResult.dataFiles(). |
| iceberg/iceberg-1-10-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | New per-version concrete class; correctly mixes in SupportsRuntimeFiltering, implements filter(Array[Filter]) and withInputFile(). Companion object create() safely downcasts Scan to SparkCopyOnWriteScan (guaranteed by caller's pattern match). |
| iceberg/iceberg-1-6-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | Identical in structure to the 1.10.x version; correctly placed in org.apache.iceberg.spark.source to access package-private SparkCopyOnWriteScan. |
| iceberg/iceberg-1-9-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | Identical to 1.6.x and 1.10.x versions; all three V1-path versions share the same SupportsRuntimeFiltering + filter(Array[Filter]) implementation. |
| iceberg/common/src/main/java/com/nvidia/spark/rapids/iceberg/IcebergShimUtils.java | New newCopyOnWriteScan interface method added with clear Javadoc explaining the cross-package indirection and version-divergence rationale. |
| iceberg/common/src/main/java/com/nvidia/spark/rapids/iceberg/ShimUtils.java | Static delegating wrapper for newCopyOnWriteScan; consistent with existing ShimUtils pattern. |

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class GpuSparkScan {
        <<abstract>>
        +cpuScan: SparkScan
        +rapidsConf: RapidsConf
        +queryUsesInputFile: Boolean
        +hasNestedType() Boolean
        +readSchema() StructType
        +toBatch() Batch
    }

    class GpuSparkPartitioningAwareScan {
        <<abstract>>
        +outputPartitioning() Partitioning
        +groupingKeyType() Types.StructType
        +taskGroups() Seq
    }

    class GpuSparkBatchQueryScan {
        +cpuScan: SparkBatchQueryScan
        -runtimeFilterExpressions: List~Expression~
        +filter(predicates: Array~Predicate~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScanBase {
        <<abstract>>
        +cpuScan: SparkCopyOnWriteScan
        +filterAttributes() Array~NamedReference~
        +estimateStatistics() Statistics
    }

    class GpuSparkCopyOnWriteScan_16 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScan_19 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class GpuSparkCopyOnWriteScan_110 {
        +filter(filters: Array~Filter~)
        +withInputFile() GpuScan
    }

    class ShimUtils {
        +newCopyOnWriteScan(Scan, RapidsConf, Boolean) GpuScan$
    }

    class IcebergShimUtils {
        <<interface>>
        +newCopyOnWriteScan(Scan, RapidsConf, Boolean) GpuScan
    }

    GpuSparkScan <|-- GpuSparkPartitioningAwareScan
    GpuSparkPartitioningAwareScan <|-- GpuSparkBatchQueryScan
    GpuSparkPartitioningAwareScan <|-- GpuSparkCopyOnWriteScanBase
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_16 : iceberg-1-6-x
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_19 : iceberg-1-9-x
    GpuSparkCopyOnWriteScanBase <|-- GpuSparkCopyOnWriteScan_110 : iceberg-1-10-x
    ShimUtils --> IcebergShimUtils : delegates
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_16 : create()
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_19 : create()
    IcebergShimUtils <|.. GpuSparkCopyOnWriteScan_110 : create()

Reviews (1): Last reviewed commit: "Iceberg: extract version-divergent scan ..."

@res-life res-life marked this pull request as draft May 28, 2026 02:59
@res-life res-life force-pushed the iceberg-1.11/pr1-common-shim branch from ffb4086 to 4647fc3 Compare May 29, 2026 03:11
Refactors iceberg/common so the {SparkScan, SparkBatchQueryScan,
SparkCopyOnWriteScan, SparkBatch, DataWriteResult} APIs that diverge
between Iceberg 1.10.x and 1.11.x are hidden behind a small interface,
with per-version implementations in iceberg-1-6-x / iceberg-1-9-x /
iceberg-1-10-x. No behavior change for the existing Iceberg versions
this PR ships; sets the stage for a follow-up that adds iceberg-1-11-x.

Common:
- GpuSparkCopyOnWriteScan -> renamed to GpuSparkCopyOnWriteScanBase
  (abstract); per-version concrete subclass mixes in the right runtime-
  filter trait (SupportsRuntimeFiltering vs SupportsRuntimeV2Filtering)
  and the matching filter() signature.
- GpuSparkScan: rewrite hasNestedType via Spark's readSchema() + Spark
  types so it no longer depends on the Iceberg 1.10-only
  cpuScan.expectedSchema(); dispatch SparkCopyOnWriteScan construction
  through ShimUtils.newCopyOnWriteScan.
- GpuSparkBatchQueryScan: toString uses cpuScan.description() (public,
  available in both Iceberg 1.10 and 1.11) instead of branch /
  expectedSchema / filterExpressions which 1.11 removed.
  runtimeFilterExpressions field read tolerates both 1.10 name
  (runtimeFilterExpressions) and 1.11 name (runtimeFilters) — a tactical
  fallback to be replaced with proper per-version shim methods.
- GpuSparkBatch: same tolerance for expectedSchema (1.10) vs projection
  (1.11).
- GpuSparkWrite: type-annotate `new Array[DataFile](0)` so Scala 2.13
  doesn't infer Array[Nothing] under 1.11's wildcarded
  DataWriteResult.dataFiles().
- IcebergShimUtils / ShimUtils: add newCopyOnWriteScan(Scan, ...) factory
  whose parameter is Spark's public Scan because Iceberg's
  SparkCopyOnWriteScan is package-private — cross-package callers cannot
  reference it directly.

Per-Iceberg-version module:
- New GpuSparkCopyOnWriteScan in org.apache.iceberg.spark.source (so it
  can reference the package-private SparkCopyOnWriteScan). Companion
  object exposes create(Scan, ...): GpuScan for cross-package callers.
  1.6/1.9/1.10 mix in SupportsRuntimeFiltering + filter(Filter[]).
- ShimUtilsImpl.java: implement newCopyOnWriteScan via
  GpuSparkCopyOnWriteScan.create.

Signed-off-by: Chong Gao <res_life@163.com>
@res-life res-life force-pushed the iceberg-1.11/pr1-common-shim branch from 4647fc3 to ac790ba Compare May 29, 2026 03:41
@res-life res-life changed the title from "Iceberg: extract version-divergent scan APIs behind a shim" to "Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim" May 30, 2026