Skip to content

Support write executors for noop format DataFrame writes [databricks]#13314

Open
revans2 wants to merge 4 commits into
NVIDIA:mainfrom
revans2:feat-noop-format-support
Open

Support write executors for noop format DataFrame writes [databricks]#13314
revans2 wants to merge 4 commits into
NVIDIA:mainfrom
revans2:feat-noop-format-support

Conversation

@revans2

@revans2 revans2 commented Aug 14, 2025

Copy link
Copy Markdown
Collaborator

This is an experiment like #13234 was. This one used jules.google.com instead of copilot. This one also failed, but it got much closer to a workable solution and I was able to take it across the finish line.

This fixes #13074

Description

This adds in support for V2 writes for Spark 3.3.0+ for noop write support.

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1984 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 82 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 65 ms

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 69 ms

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 4228 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1946 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1809 ms                                                             

scala> spark.time(spark.range(1000000000L).write.format("noop").mode("overwrite").save())
Time taken: 1868 ms         

Checklists

This PR has:

  • added documentation for new or modified features or behaviors.
  • updated the license in the source code files when it is required.
  • added new tests or modified existing tests to cover new code paths.

Please select one of the following options:

  • Performance testing has been performed and its results are added in the PR description.
  • An issue is filed for performance testing and its link is added in the PR description. (Select this if performance testing will not be completed before the PR is submitted.)

google-labs-jules Bot and others added 3 commits August 13, 2025 19:35
This change adds support for writing to the "noop" format on the GPU.
This is implemented by adding GPU versions of `OverwriteByExpressionExec` and `AppendDataExec` that simply consume the input data without writing anything.

The following save modes are supported:
- overwrite
- append
- ignore
- errorifexists

Integration tests have been added to verify the new functionality.
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Robert (Bobby) Evans <bobby@apache.org>
Copilot AI review requested due to automatic review settings August 14, 2025 14:47

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for V2 write operations (AppendDataExec and OverwriteByExpressionExec) for Spark 3.3.0+ when using the noop format. The implementation only supports NoopWrite operations, which consume and discard data rather than writing to a real data source.

  • Adds GPU support for AppendDataExec and OverwriteByExpressionExec for Spark 3.3.0+
  • Implements GPU noop writer executors that consume and discard data
  • Updates support configuration files across all relevant Spark versions (3.3.0 to 3.5.6)

Reviewed Changes

Copilot reviewed 42 out of 42 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
Multiple supportedExecs.csv files Adds AppendDataExec and OverwriteByExpressionExec entries with support flags
Multiple operatorsScore.csv files Adds score entries (3.0) for the new executor types
Spark330PlusShims.scala Registers GPU rules for the new V2 write executors
NoopWriteMeta.scala Meta classes for tagging and converting V2 write operations
GpuNoopWriterExec.scala GPU implementations that consume and discard data
noop_write_test.py Integration test for noop write operations

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

val newExecs: Seq[(Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan])] = Seq(
(overwriteByExpressionRule.getClassFor.asSubclass(classOf[SparkPlan]),
overwriteByExpressionRule),
(appendDataRule.getClassFor.asSubclass(classOf[SparkPlan]), appendDataRule)

Copilot AI Aug 14, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment explains why explicit casting is needed, but the implementation uses verbose type casting. Consider simplifying by using a helper method or extracting the pattern into a reusable function.

Suggested change
(appendDataRule.getClassFor.asSubclass(classOf[SparkPlan]), appendDataRule)
// Helper to cast the class to the required type
private def asSparkPlanClass[T <: SparkPlan](rule: ExecRule[T]): Class[_ <: SparkPlan] =
rule.getClassFor.asSubclass(classOf[SparkPlan])
val newExecs: Seq[(Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan])] = Seq(
(asSparkPlanClass(overwriteByExpressionRule), overwriteByExpressionRule),
(asSparkPlanClass(appendDataRule), appendDataRule)

Copilot uses AI. Check for mistakes.
private val noopClassNames = Seq("org.apache.spark.sql.execution.datasources.noop.NoopWrite$")

def isNoopWrite(write: Write): Boolean = {
noopClassNames.contains(write.getClass.getName)

Copilot AI Aug 14, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using string-based class name comparison for detecting NoopWrite is fragile and could break if the class is renamed or moved. Consider using isInstanceOf or class comparison if possible.

Suggested change
noopClassNames.contains(write.getClass.getName)
private val noopWriteClassName = "org.apache.spark.sql.execution.datasources.noop.NoopWrite$"
def isNoopWrite(write: Write): Boolean = {
try {
val noopWriteClass = Class.forName(noopWriteClassName)
noopWriteClass.isInstance(write)
} catch {
case _: ClassNotFoundException => false
}

Copilot uses AI. Check for mistakes.

override def run(): Seq[InternalRow] = {
child match {
case g: GpuExec => g.executeColumnar().foreach(_.close())

Copilot AI Aug 14, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The pattern matching logic duplicates the consuming behavior. Consider extracting the consumption logic into a helper method to reduce duplication between run() and internalDoExecuteColumnar().

Suggested change
case g: GpuExec => g.executeColumnar().foreach(_.close())
case g: GpuExec => consumeAndCloseColumnarBatches(g.executeColumnar())

Copilot uses AI. Check for mistakes.
@revans2

revans2 commented Aug 14, 2025

Copy link
Copy Markdown
Collaborator Author

build

gerashegalov
gerashegalov previously approved these changes Aug 14, 2025

trait NoopWriteHelper {
// NoopTable is a private class, so we have to use reflection
private val noopClassNames = Seq("org.apache.spark.sql.execution.datasources.noop.NoopWrite$")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understandably not a performance issue since there is only one element, but logically it should be a Set

@revans2

revans2 commented Aug 15, 2025

Copy link
Copy Markdown
Collaborator Author

We did run into an error on databricks

[2025-08-14T20:20:38.961Z] FAILED ../../src/main/python/noop_write_test.py::test_noop_write[overwrite][DATAGEN_SEED=1755196199, TZ=UTC, INJECT_OOM] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.datasources.v2.OverwriteByExpressionExec

[2025-08-14T20:20:38.962Z] OverwriteByExpression [num_affected_rows#167341L, num_inserted_rows#167342L], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$13584/1807377083@240167ff, org.apache.spark.sql.execution.datasources.noop.NoopWrite$@6baed76

[2025-08-14T20:20:38.962Z] +- Scan ExistingRDD[c1#167323L,c2#167324]

[2025-08-14T20:20:38.962Z] FAILED ../../src/main/python/noop_write_test.py::test_noop_write[append][DATAGEN_SEED=1755196199, TZ=UTC] - pyspark.errors.exceptions.captured.IllegalArgumentException: Part of the plan is not columnar class org.apache.spark.sql.execution.datasources.v2.AppendDataExec

[2025-08-14T20:20:38.962Z] AppendData [num_affected_rows#167359L, num_inserted_rows#167360L], org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strategy$$Lambda$13590/2037470699@17bf4075, org.apache.spark.sql.execution.datasources.noop.NoopWrite$@6baed76

It looks like we fell back to the CPU for the plan in some way, but xdist + pytest did not capture/print enough information in the test to say WHY it fell back. As this is a lower priority and mostly a test to see if AI could handle this functionality I am not inclined to spend a lot of time on it with databricks. So we could adjust the tests so that it is clear it is not supported on databricks and come back later with a follow on issue, or we could just keep the PR up as a starting point for others.

@revans2 revans2 changed the base branch from branch-25.10 to branch-25.12 September 29, 2025 17:58
@revans2 revans2 dismissed gerashegalov’s stale review September 29, 2025 17:58

The base branch was changed.

@pxLi

pxLi commented Nov 17, 2025

Copy link
Copy Markdown
Member

NOTE: release/25.12 has been created from main. Please retarget your PR to release/25.12 if it should be included in the release.

@nvauto

nvauto commented Jan 26, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.02 has been created from main. Please retarget your PR to release/26.02 if it should be included in the release.

@nvauto

nvauto commented Mar 30, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.04 has been created from main. Please retarget your PR to release/26.04 if it should be included in the release.

@nvauto

nvauto commented May 25, 2026

Copy link
Copy Markdown
Collaborator

NOTE: release/26.06 has been created from main. Please retarget your PR to release/26.06 if it should be included in the release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Support noop format

5 participants