Support write executors for noop format DataFrame writes [databricks] by revans2 · Pull Request #13314 · NVIDIA/cudf-spark

revans2 · 2025-08-14T14:47:10Z

This is an experiment like #13234 was. This one used jules.google.com instead of copilot. This one also failed, but it got much closer to a workable solution and I was able to take it across the finish line.

This fixes #13074

Description

This adds in support for V2 writes for Spark 3.3.0+ for noop write support.

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1984 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 82 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 65 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 69 ms

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 4228 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1946 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1809 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1868 ms

Checklists

This PR has:

added documentation for new or modified features or behaviors.
updated the license in the source code files when it is required.
added new tests or modified existing tests to cover new code paths.

Please select one of the following options:

Performance testing has been performed and its results are added in the PR description.
An issue is filed for performance testing and its link is added in the PR description. (Select this if performance testing will not be completed before the PR is submitted.)

This change adds support for writing to the "noop" format on the GPU. This is implemented by adding GPU versions of `OverwriteByExpressionExec` and `AppendDataExec` that simply consume the input data without writing anything. The following save modes are supported: - overwrite - append - ignore - errorifexists Integration tests have been added to verify the new functionality.

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

Copilot

Pull Request Overview

This PR adds support for V2 write operations (AppendDataExec and OverwriteByExpressionExec) for Spark 3.3.0+ when using the noop format. The implementation only supports NoopWrite operations, which consume and discard data rather than writing to a real data source.

Adds GPU support for AppendDataExec and OverwriteByExpressionExec for Spark 3.3.0+
Implements GPU noop writer executors that consume and discard data
Updates support configuration files across all relevant Spark versions (3.3.0 to 3.5.6)

Reviewed Changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
Multiple supportedExecs.csv files	Adds `AppendDataExec` and `OverwriteByExpressionExec` entries with support flags
Multiple operatorsScore.csv files	Adds score entries (3.0) for the new executor types
Spark330PlusShims.scala	Registers GPU rules for the new V2 write executors
NoopWriteMeta.scala	Meta classes for tagging and converting V2 write operations
GpuNoopWriterExec.scala	GPU implementations that consume and discard data
noop_write_test.py	Integration test for noop write operations

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-08-14T14:56:15Z

+    val newExecs: Seq[(Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan])] = Seq(
+      (overwriteByExpressionRule.getClassFor.asSubclass(classOf[SparkPlan]),
+        overwriteByExpressionRule),
+      (appendDataRule.getClassFor.asSubclass(classOf[SparkPlan]), appendDataRule)


[nitpick] The comment explains why explicit casting is needed, but the implementation uses verbose type casting. Consider simplifying by using a helper method or extracting the pattern into a reusable function.

Suggested change

(appendDataRule.getClassFor.asSubclass(classOf[SparkPlan]), appendDataRule)

// Helper to cast the class to the required type

private def asSparkPlanClass[T <: SparkPlan](rule: ExecRule[T]): Class[_ <: SparkPlan] =

rule.getClassFor.asSubclass(classOf[SparkPlan])

val newExecs: Seq[(Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan])] = Seq(

(asSparkPlanClass(overwriteByExpressionRule), overwriteByExpressionRule),

(asSparkPlanClass(appendDataRule), appendDataRule)

Copilot · 2025-08-14T14:56:15Z

+  private val noopClassNames = Seq("org.apache.spark.sql.execution.datasources.noop.NoopWrite$")
+
+  def isNoopWrite(write: Write): Boolean = {
+    noopClassNames.contains(write.getClass.getName)


Using string-based class name comparison for detecting NoopWrite is fragile and could break if the class is renamed or moved. Consider using isInstanceOf or class comparison if possible.

Suggested change

noopClassNames.contains(write.getClass.getName)

private val noopWriteClassName = "org.apache.spark.sql.execution.datasources.noop.NoopWrite$"

def isNoopWrite(write: Write): Boolean = {

try {

val noopWriteClass = Class.forName(noopWriteClassName)

noopWriteClass.isInstance(write)

} catch {

case _: ClassNotFoundException => false

}

Copilot · 2025-08-14T14:56:15Z

+
+  override def run(): Seq[InternalRow] = {
+    child match {
+      case g: GpuExec => g.executeColumnar().foreach(_.close())


[nitpick] The pattern matching logic duplicates the consuming behavior. Consider extracting the consumption logic into a helper method to reduce duplication between run() and internalDoExecuteColumnar().

Suggested change

case g: GpuExec => g.executeColumnar().foreach(_.close())

case g: GpuExec => consumeAndCloseColumnarBatches(g.executeColumnar())

revans2 · 2025-08-14T18:04:10Z

build

gerashegalov · 2025-08-14T18:30:46Z

+
+trait NoopWriteHelper {
+  // NoopTable is a private class, so we have to use reflection
+  private val noopClassNames = Seq("org.apache.spark.sql.execution.datasources.noop.NoopWrite$")


Understandably not a performance issue since there is only one element, but logically it should be a Set

revans2 · 2025-08-15T18:06:50Z

We did run into an error on databricks

[2025-08-14T20:20:38.961Z] FAILED ../../src/main/python/noop_write_test.py::test_noop_write[overwrite][DATAGEN_SEED=1755196199, TZ=UTC, INJECT_OOM] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.datasources.v2.OverwriteByExpressionExec

[2025-08-14T20:20:38.962Z] OverwriteByExpression [num_affected_rows#167341L, num_inserted_rows#167342L], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$13584/1807377083@240167ff, org.apache.spark.sql.execution.datasources.noop.NoopWrite$@6baed76

[2025-08-14T20:20:38.962Z] +- Scan ExistingRDD[c1#167323L,c2#167324]

[2025-08-14T20:20:38.962Z] FAILED ../../src/main/python/noop_write_test.py::test_noop_write[append][DATAGEN_SEED=1755196199, TZ=UTC] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.datasources.v2.AppendDataExec

[2025-08-14T20:20:38.962Z] AppendData [num_affected_rows#167359L, num_inserted_rows#167360L], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$13590/2037470699@17bf4075, org.apache.spark.sql.execution.datasources.noop.NoopWrite$@6baed76

It looks like we fell back to the CPU for the plan in some way, but xdist + pytest did not capture/print enough information in the test to say WHY it fell back. As this is a lower priority and mostly a test to see if AI could handle this functionality I am not inclined to spend a lot of time on it with databricks. So we could adjust the tests so that it is clear it is not supported on databricks and come back later with a follow on issue, or we could just keep the PR up as a starting point for others.

The base branch was changed.

pxLi · 2025-11-17T01:42:47Z

NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.

nvauto · 2026-01-26T02:08:53Z

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

nvauto · 2026-03-30T06:29:42Z

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

nvauto · 2026-05-25T05:03:54Z

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

google-labs-jules Bot and others added 3 commits August 13, 2025 19:35

Some fixes, but still not compiling

3ae7d61

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

Add in NOOP write support

a603aa3

Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>

Copilot AI review requested due to automatic review settings August 14, 2025 14:47

Copilot AI reviewed Aug 14, 2025

View reviewed changes

Missed some files for scala 2.13

748ec8c

gerashegalov previously approved these changes Aug 14, 2025

View reviewed changes

amahussein mentioned this pull request Sep 18, 2025

Support AppendDataExec in Qual tool NVIDIA/spark-rapids-tools#1919

Closed

revans2 changed the base branch from branch-25.10 to branch-25.12 September 29, 2025 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Support write executors for noop format DataFrame writes [databricks]#13314

Support write executors for noop format DataFrame writes [databricks]#13314
revans2 wants to merge 4 commits into
NVIDIA:mainfrom
revans2:feat-noop-format-support

revans2 commented Aug 14, 2025 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Aug 14, 2025

Uh oh!

Copilot AI Aug 14, 2025

Uh oh!

Copilot AI Aug 14, 2025

Uh oh!

revans2 commented Aug 14, 2025

Uh oh!

gerashegalov Aug 14, 2025

Uh oh!

revans2 commented Aug 15, 2025

Uh oh!

pxLi commented Nov 17, 2025

Uh oh!

nvauto commented Jan 26, 2026

Uh oh!

nvauto commented Mar 30, 2026

Uh oh!

nvauto commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

-      (appendDataRule.getClassFor.asSubclass(classOf[SparkPlan]), appendDataRule)
+    // Helper to cast the class to the required type
+    private def asSparkPlanClass[T <: SparkPlan](rule: ExecRule[T]): Class[_ <: SparkPlan] =
+      rule.getClassFor.asSubclass(classOf[SparkPlan])
+    val newExecs: Seq[(Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan])] = Seq(
+      (asSparkPlanClass(overwriteByExpressionRule), overwriteByExpressionRule),
+      (asSparkPlanClass(appendDataRule), appendDataRule)

-    noopClassNames.contains(write.getClass.getName)
+  private val noopWriteClassName = "org.apache.spark.sql.execution.datasources.noop.NoopWrite$"
+  def isNoopWrite(write: Write): Boolean = {
+    try {
+      val noopWriteClass = Class.forName(noopWriteClassName)
+      noopWriteClass.isInstance(write)
+    } catch {
+      case _: ClassNotFoundException => false
+    }

	case g: GpuExec => g.executeColumnar().foreach(_.close())
	case g: GpuExec => consumeAndCloseColumnarBatches(g.executeColumnar())

Uh oh!

Conversation

revans2 commented Aug 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklists

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

revans2 commented Aug 14, 2025

Uh oh!

gerashegalov Aug 14, 2025

Choose a reason for hiding this comment

Uh oh!

revans2 commented Aug 15, 2025

Uh oh!

pxLi commented Nov 17, 2025

Uh oh!

nvauto commented Jan 26, 2026

Uh oh!

nvauto commented Mar 30, 2026

Uh oh!

nvauto commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

revans2 commented Aug 14, 2025 •

edited

Loading