Iceberg 1.11 support for Spark 411, part (3/3): accelerate SparkIncrementalAppendScan; enable 4.1 iceberg CI

Chong Gao · Chong Gao · commit 085db633b4a8 · 2026-06-25T18:10:04.000+08:00
Iceberg 1.11 split SparkBatchQueryScan into a new class hierarchy, and the incremental-append query path (.option("start-snapshot-id", ...) + .option("end-snapshot-id", ...)) moved into a brand-new class, org.apache.iceberg.spark.source.SparkIncrementalAppendScan (package-private, 1.11-only, extends SparkRuntimeFilterableScan). Before 1.11 the same path went through SparkBatchQueryScan and was matched by the existing batch-query ScanRule in IcebergProviderBase, so without a rule for the new class the leaf falls back to CPU.

Adds:
- GpuSparkIncrementalAppendScan in org.apache.iceberg.spark.source: mirrors GpuSparkBatchQueryScan, since both CPU scans extend SparkRuntimeFilterableScan (a SparkPartitioningAwareScan<PartitionScanTask>). It takes the public Scan type and reaches Iceberg internals only through the root-loadable GpuSparkScanAccess bridge and the public SupportsRuntimeV2Filtering interface. It does NOT reference the package-private SparkIncrementalAppendScan directly, so it works under extraClassPath (system-classpath), where the Iceberg classes load in the app classloader and this shimmed class loads in Spark's MutableURLClassLoader (issue #14959).
- iceberg111x.IcebergProviderImpl overrides getScans to register a third ScanRule for SparkIncrementalAppendScan on top of the base provider's two rules. The CPU class is loaded by string (ShimReflectionUtils.loadClass) because it is package-private.

CI: enable the full Iceberg integration suite on Spark 4.1 / Iceberg 1.11.0. run_iceberg_tests() in jenkins/spark-tests.sh now handles Spark 4.1 (Scala 2.13, Iceberg 1.11.0), matching the run_iceberg_version_detect_tests() change in part (2/3), so the "must stay in sync" invariant between the two CI runners holds and the new 1.11 code paths are exercised by nightly CI.
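
For readers who want the CI matrix at a glance, the following is a minimal Python sketch of the Spark-to-Iceberg selection that run_iceberg_tests() appears to implement, based only on the comment table and branches visible in this diff. The function name `supported_iceberg_versions` is hypothetical, the real logic lives in shell, and the Scala check is modeled only for 4.0 and 4.1 because the 3.5 branch is not shown in the hunks.

```python
from typing import List


def supported_iceberg_versions(spark_ver: str, scala_binary_ver: str) -> List[str]:
    """Return the Iceberg versions to test; an empty list means "skip".

    Hypothetical helper mirroring the table in jenkins/spark-tests.sh:
      Spark 3.5.0-3.5.3 -> 1.6.1
      Spark 3.5.4+      -> 1.9.2, 1.10.1
      Spark 4.0.x       -> 1.10.1
      Spark 4.1.x       -> 1.11.0
    Assumes spark_ver is a plain dotted version such as "4.1.1".
    """
    parts = spark_ver.split(".")
    major_minor = ".".join(parts[:2])
    patch = int(parts[2]) if len(parts) > 2 else 0

    if major_minor == "4.1":
        # Spark 4.1 Iceberg tests require Scala 2.13.
        return ["1.11.0"] if scala_binary_ver == "2.13" else []
    if major_minor == "4.0":
        # Spark 4.0 Iceberg tests require Scala 2.13.
        return ["1.10.1"] if scala_binary_ver == "2.13" else []
    if major_minor == "3.5":
        # Assumption: no Scala check is modeled here; that branch is not in the diff.
        return ["1.6.1"] if patch <= 3 else ["1.9.2", "1.10.1"]
    # Any other Spark line: Iceberg GPU acceleration is not supported, so skip.
    return []
```

Returning an empty list for the unsupported cases corresponds to the script's early `return 0` with a "Skipping Iceberg tests" message, so a skipped combination never fails the build.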
Review follow-ups deferred from part (2/3):
- Assert the selected shim package, not just detectedVersion(): expose IcebergProvider.shimPackage() through IcebergProviderAccess and have iceberg_version_detection_test.py check it (e.g. 1.11.0 -> iceberg111x). A correct version paired with a wrong version->shim mapping now fails.
- Add a Spark 4.1 version-detection smoke to pre-merge: ci_scala213() boots Spark 4.1 and runs the detection test so the iceberg111x packaging / shim-selection path is exercised before merge (the main suite runs on 4.0.1).
- Narrow the run_iceberg_version_detect_tests() "must stay in sync" comment to reflect that the Spark 4.1 row is now covered in pre-merge.

Signed-off-by: Chong Gao <res_life@163.com>
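
The shim-package expectation used by the updated detection test can be sketched in isolation as below. This is a standalone illustration of the string derivation the test performs, not the plugin's own mapping code; the function name `expected_shim_package` is hypothetical, and only the 1.11.0 -> iceberg111x pairing is confirmed by this commit.

```python
def expected_shim_package(iceberg_version: str) -> str:
    """Derive the shim sub-package name from an Iceberg version string.

    Hypothetical helper: keeps major.minor, drops the dot, and wraps the
    digits as "iceberg<digits>x", the same derivation the test uses.
    """
    major_minor = iceberg_version.split(".")[:2]
    return "com.nvidia.spark.rapids.iceberg.iceberg{}x".format("".join(major_minor))


# The one pairing confirmed by this commit message.
print(expected_shim_package("1.11.0"))
```

Because the digits are simply concatenated, the derivation is not injective in general (a hypothetical "11.1" would also yield "111"). That is harmless for the 1.x versions listed in the CI matrix, but it is worth keeping in mind if the test is reused for a different version scheme.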
diff --git a/iceberg/iceberg-1-11-x/src/main/scala/com/nvidia/spark/rapids/iceberg/iceberg111x/IcebergProviderImpl.scala b/iceberg/iceberg-1-11-x/src/main/scala/com/nvidia/spark/rapids/iceberg/iceberg111x/IcebergProviderImpl.scala
@@ -16,6 +16,48 @@
 
 package com.nvidia.spark.rapids.iceberg.iceberg111x
 
+import scala.reflect.ClassTag
+import scala.util.Try
+
+import com.nvidia.spark.rapids.{GpuScan, ScanMeta, ScanRule, ShimReflectionUtils}
 import com.nvidia.spark.rapids.iceberg.IcebergProviderBase
+import org.apache.iceberg.spark.source.{GpuSparkIncrementalAppendScan, GpuSparkScan}
+
+import org.apache.spark.sql.connector.read.Scan
+
+class IcebergProviderImpl extends IcebergProviderBase {
+
+  /**
+   * Adds a {@code SparkIncrementalAppendScan} rule on top of the base provider's two rules
+   * ({@code SparkBatchQueryScan}, {@code SparkCopyOnWriteScan}). The incremental-append scan
+   * is a 1.11-only class — before 1.11 the same query path went through
+   * {@code SparkBatchQueryScan} and was matched by the base rule. The CPU class is loaded
+   * by string here because it is package-private and not directly referenceable from
+   * outside {@code org.apache.iceberg.spark.source}.
+   */
+  override def getScans: Map[Class[_ <: Scan], ScanRule[_ <: Scan]] = {
+    val cpuIncrementalAppendScanClass = ShimReflectionUtils.loadClass(
+      "org.apache.iceberg.spark.source.SparkIncrementalAppendScan")
+
+    val incrementalRule = new ScanRule[Scan](
+      (a, conf, p, r) => new ScanMeta[Scan](a, conf, p, r) {
+        private lazy val convertedScan: Try[GpuSparkScan] = Try(
+          GpuSparkIncrementalAppendScan.create(a, this.conf, false)
+            .asInstanceOf[GpuSparkScan])
+
+        override def supportsRuntimeFilters: Boolean = true
+
+        override def tagSelfForGpu(): Unit = {
+          GpuSparkScan.tagForGpu(this, convertedScan)
+        }
+
+        override def convertToGpu(): GpuScan = convertedScan.get
+      },
+      "Iceberg incremental append scan",
+      ClassTag(cpuIncrementalAppendScanClass)
+    )
 
-class IcebergProviderImpl extends IcebergProviderBase
+    super.getScans + (
+      cpuIncrementalAppendScanClass.asSubclass(classOf[Scan]) -> incrementalRule)
+  }
+}
diff --git a/iceberg/iceberg-1-11-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkIncrementalAppendScan.scala b/iceberg/iceberg-1-11-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkIncrementalAppendScan.scala
@@ -0,0 +1,97 @@
+/*
+ * Copyright (c) 2026, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.iceberg.spark.source
+
+import java.util.Objects
+
+import scala.collection.JavaConverters._
+
+import com.nvidia.spark.rapids.{GpuScan, RapidsConf}
+import org.apache.iceberg.expressions.Expression
+
+import org.apache.spark.sql.connector.expressions.NamedReference
+import org.apache.spark.sql.connector.expressions.filter.Predicate
+import org.apache.spark.sql.connector.read.{Scan, Statistics, SupportsRuntimeV2Filtering}
+
+/**
+ * GPU wrapper for Iceberg 1.11's {@code SparkIncrementalAppendScan} (the
+ * `.option("start-snapshot-id", ...)` / `.option("end-snapshot-id", ...)` read
+ * path). This class was introduced in Iceberg 1.11 — before 1.11 the same
+ * incremental-read path went through {@code SparkBatchQueryScan} (accelerated by
+ * {@link GpuSparkBatchQueryScan}). It mirrors {@link GpuSparkBatchQueryScan}
+ * because both CPU scans extend {@code SparkRuntimeFilterableScan} (a
+ * {@code SparkPartitioningAwareScan<PartitionScanTask>}).
+ *
+ * <p>Takes the public {@code Scan} type and reaches Iceberg internals only through
+ * the root-loadable {@link GpuSparkScanAccess} bridge and public Spark interfaces
+ * ({@code SupportsRuntimeV2Filtering}). It must NOT reference the package-private
+ * {@code SparkIncrementalAppendScan} directly: under {@code extraClassPath}
+ * (system-classpath) the Iceberg classes load in the app classloader while this
+ * shimmed class loads in Spark's MutableURLClassLoader, so any same-package access
+ * would fail with {@code IllegalAccessError} (see issue #14959).
+ */
+class GpuSparkIncrementalAppendScan(
+    override val cpuScan: Scan,
+    override val rapidsConf: RapidsConf,
+    override val queryUsesInputFile: Boolean) extends
+  GpuSparkPartitioningAwareScan[org.apache.iceberg.PartitionScanTask](
+    cpuScan, rapidsConf, queryUsesInputFile)
+  with SupportsRuntimeV2Filtering {
+
+  private val runtimeFilterExpressions: List[Expression] =
+    GpuSparkScanAccess.runtimeFilterExpressions(cpuScan)
+    .asScala
+    .toList
+
+  private def runtimeFilterScan: SupportsRuntimeV2Filtering =
+    cpuScan.asInstanceOf[SupportsRuntimeV2Filtering]
+
+  override def filterAttributes(): Array[NamedReference] = runtimeFilterScan.filterAttributes()
+
+  override def filter(predicates: Array[Predicate]): Unit = runtimeFilterScan.filter(predicates)
+
+  override def estimateStatistics(): Statistics = GpuSparkScanAccess.estimateStatistics(cpuScan)
+
+  override def equals(obj: Any): Boolean = obj match {
+    case that: GpuSparkIncrementalAppendScan =>
+      this.cpuScan == that.cpuScan &&
+        this.queryUsesInputFile == that.queryUsesInputFile
+    case _ => false
+  }
+
+  override def hashCode(): Int =
+    Objects.hash(cpuScan, Boolean.box(queryUsesInputFile))
+
+  override def toString: String =
+    s"GpuSparkIncrementalAppendScan(table=${GpuSparkScanAccess.table(cpuScan)}, " +
+      s"branch=${GpuSparkScanAccess.branch(cpuScan)}, " +
+      s"type=${GpuSparkScanAccess.expectedSchema(cpuScan).asStruct()}, " +
+      s"filters=${GpuSparkScanAccess.filterExpressions(cpuScan)}, " +
+      s"runtimeFilters=$runtimeFilterExpressions, " +
+      s"caseSensitive=${GpuSparkScanAccess.caseSensitive(cpuScan)}, " +
+      s"queryUseInputFile=$queryUsesInputFile)"
+
+  override def withInputFile(): GpuScan =
+    new GpuSparkIncrementalAppendScan(cpuScan, rapidsConf, true)
+}
+
+object GpuSparkIncrementalAppendScan {
+  /** Java-callable factory used by {@code IcebergProviderImpl}. Takes the public
+   *  {@code Scan} type — never the package-private {@code SparkIncrementalAppendScan}. */
+  def create(cpuScan: Scan, rapidsConf: RapidsConf, queryUsesInputFile: Boolean): GpuScan =
+    new GpuSparkIncrementalAppendScan(cpuScan, rapidsConf, queryUsesInputFile)
+}
diff --git a/integration_tests/src/main/python/iceberg/iceberg_version_detection_test.py b/integration_tests/src/main/python/iceberg/iceberg_version_detection_test.py
@@ -32,11 +32,24 @@ def test_iceberg_version_detection():
     if expected is None:
         pytest.skip("EXPECTED_ICEBERG_VERSION env var not set")
 
+    # Shim sub-package selected per Iceberg major.minor, e.g. 1.11.x -> iceberg111x.
+    # Mirrors IcebergProbeImpl.icebergVersionToShim.
+    major_minor = ".".join(expected.split(".")[:2])
+    expected_shim_package = \
+        "com.nvidia.spark.rapids.iceberg.iceberg{}x".format(major_minor.replace(".", ""))
+
     def check(spark):
-        jvm = spark.sparkContext._jvm
-        actual = jvm.com.nvidia.spark.rapids.iceberg.IcebergProviderAccess.detectedVersion()
+        access = spark.sparkContext._jvm.com.nvidia.spark.rapids.iceberg.IcebergProviderAccess
+        actual = access.detectedVersion()
         assert actual == expected, \
             "Iceberg version detection mismatch: expected '{}' on Spark {}, got '{}'".format(
                 expected, spark_version(), actual)
+        # Assert the shim package too: a correct detectedVersion() paired with a wrong
+        # version -> shim mapping (e.g. the 1.11 -> iceberg111x row) would otherwise
+        # still pass.
+        actual_shim_package = access.shimPackage()
+        assert actual_shim_package == expected_shim_package, \
+            "Iceberg shim package mismatch for {} on Spark {}: expected '{}', got '{}'".format(
+                expected, spark_version(), expected_shim_package, actual_shim_package)
 
     with_gpu_session(check)
diff --git a/jenkins/spark-premerge-build.sh b/jenkins/spark-premerge-build.sh
@@ -167,11 +167,11 @@ run_iceberg_version_detect_tests() {
         return 0
     fi
 
-    # Supported Iceberg versions per Spark version — must stay in sync with
-    # run_iceberg_tests() in spark-tests.sh. Note: the Spark 4.1 -> 1.11.0 row is
-    # listed here, but ci_scala213() currently uses SPARK_VER=4.0.1 so the 4.1 branch
-    # is not exercised in pre-merge CI. The full 4.1 integration suite is added in the
-    # stacked follow-up PR; until then, the 1.11.0 commit-ID mapping is nightly-only.
+    # Supported Iceberg versions per Spark version. The 3.5.x / 4.0.x rows mirror
+    # run_iceberg_tests() in spark-tests.sh. The Spark 4.1 -> 1.11.0 row is exercised
+    # in pre-merge by the dedicated Spark 4.1 version-detection smoke in ci_scala213()
+    # (which boots Spark 4.1); the full Spark 4.1 Iceberg integration suite runs in
+    # nightly via run_iceberg_tests().
     local iceberg_versions
     if [[ "$iceberg_spark_ver" == "4.1" ]]; then
         iceberg_versions="1.11.0"
@@ -269,6 +269,16 @@ ci_scala213() {
     # Moved out of spark-tests.sh DEFAULT mode where JDK 8 causes
     # UnsupportedClassVersionError for Iceberg 1.9+ runtime JARs.
     run_iceberg_version_detect_tests $SPARK_VER 2.13
+
+    # Spark 4.1 / Iceberg 1.11 shim-selection smoke. The main integration suite above
+    # runs on Spark 4.0.1, so without this the iceberg111x module and the 1.11.0 ->
+    # iceberg111x mapping added for Spark 4.1 would have no pre-merge coverage. Boot
+    # Spark 4.1 and assert both the detected version and the selected shim package.
+    local SPARK_VER_411=4.1.1
+    local buildver_411="${SPARK_VER_411//./}"
+    prepare_spark $SPARK_VER_411 2.13
+    $MVN -f scala2.13/ -U -B $MVN_URM_MIRROR -Dbuildver=$buildver_411 clean package $MVN_BUILD_ARGS -DskipTests=true
+    run_iceberg_version_detect_tests $SPARK_VER_411 2.13
 }
 
 prepare_spark() {
diff --git a/jenkins/spark-tests.sh b/jenkins/spark-tests.sh
@@ -268,7 +268,8 @@ run_iceberg_tests() {
   # get the patch version of Spark
   SPARK_PATCH_VER=$(echo "$SPARK_VER" | cut -d. -f3)
 
-  if [[ "$ICEBERG_SPARK_VER" != "3.5" && "$ICEBERG_SPARK_VER" != "4.0" ]]; then
+  if [[ "$ICEBERG_SPARK_VER" != "3.5" && "$ICEBERG_SPARK_VER" != "4.0" \
+        && "$ICEBERG_SPARK_VER" != "4.1" ]]; then
     echo "!!!! Skipping Iceberg tests. GPU acceleration of Iceberg is not supported on $ICEBERG_SPARK_VER"
     return 0
   fi
@@ -277,8 +278,15 @@ run_iceberg_tests() {
   # Spark 3.5.0-3.5.3 -> Iceberg 1.6.1
   # Spark 3.5.4+       -> Iceberg 1.9.2, 1.10.1
   # Spark 4.0.x        -> Iceberg 1.10.1
+  # Spark 4.1.x        -> Iceberg 1.11.0
   local supported_versions
-  if [[ "$ICEBERG_SPARK_VER" == "4.0" ]]; then
+  if [[ "$ICEBERG_SPARK_VER" == "4.1" ]]; then
+    if [[ "$SCALA_BINARY_VER" != "2.13" ]]; then
+      echo "!!!! Skipping Iceberg tests. Spark 4.1 Iceberg tests require Scala 2.13"
+      return 0
+    fi
+    supported_versions="1.11.0"
+  elif [[ "$ICEBERG_SPARK_VER" == "4.0" ]]; then
     if [[ "$SCALA_BINARY_VER" != "2.13" ]]; then
       echo "!!!! Skipping Iceberg tests. Spark 4.0 Iceberg tests require Scala 2.13"
       return 0
@@ -302,7 +310,9 @@ run_iceberg_tests() {
     echo "Using user-specified ICEBERG_VERSIONS=$ICEBERG_VERSIONS"
   else
     # Default: test one representative version per Spark patch range
-    if [[ "$ICEBERG_SPARK_VER" == "4.0" ]]; then
+    if [[ "$ICEBERG_SPARK_VER" == "4.1" ]]; then
+      ICEBERG_VERSIONS="1.11.0"
+    elif [[ "$ICEBERG_SPARK_VER" == "4.0" ]]; then
       ICEBERG_VERSIONS="1.10.1"
     elif [[ "$SPARK_PATCH_VER" -le 3 ]]; then
       ICEBERG_VERSIONS="1.6.1"
diff --git a/sql-plugin/src/main/java/com/nvidia/spark/rapids/iceberg/IcebergProviderAccess.java b/sql-plugin/src/main/java/com/nvidia/spark/rapids/iceberg/IcebergProviderAccess.java
@@ -24,4 +24,8 @@ private IcebergProviderAccess() {
   public static String detectedVersion() {
     return IcebergProvider$.MODULE$.detectedVersion();
   }
+
+  public static String shimPackage() {
+    return IcebergProvider$.MODULE$.shimPackage();
+  }
 }
