[BUG] Dedup GpuBroadcastExchange across DPP subqueries in non-AQE mode (#14837)

wjxiz1992 · claude · web-flow · commit 143e9caca812 · 2026-05-29T15:01:09.000+08:00
Closes #14833

## Summary

In non-AQE mode with DPP, `GpuSubqueryBroadcastExec` builds its underlying `GpuBroadcastExchangeExec` directly during `GpuOverrides` (the first pass) via `exMeta.convertToGpu()`, bypassing `GpuTransitionOverrides`. The DPP-side broadcast therefore ends up with a structurally different child (missing the `GpuCoalesceBatches` that `insertCoalesce`/`optimizeCoalesce` add on the main plan) than the join-side broadcast for the same logical CPU exchange. The `cpuCanonical` field is also computed after GPU rewriting on the DPP side but before it on the join side, so the `ExchangeMappingCache` lookup by `cpuCanonical` also fails to match. Spark's `ReuseExchangeAndSubquery` rule cannot merge the two broadcasts, so the dim side is materialized and broadcast twice, defeating DPP's intended performance benefit.

This adds `fixupNonAdaptiveBroadcastReuse` to `GpuTransitionOverrides`, gated by a new `spark.rapids.sql.nonAqeBroadcastReuseFixup.enable` conf (internal, default true). The pass collects main-plan `GpuBroadcastExchangeExec` instances, indexes them by `(mode.canonicalized, stripGpuCoalesceBatches(child).canonicalized)` (where `stripGpuCoalesceBatches` removes `GpuCoalesceBatches` so the structural difference does not block matching), then walks subquery expressions and rewrites any matching DPP-side broadcast to `ReusedExchangeExec` referencing the main-plan instance. Mirrors `fixupAdaptiveExchangeReuse`, which already handles the equivalent gap in AQE mode.

The cross-runtime case (CPU `BroadcastHashJoin` + GPU DPP subquery when `array`/`struct` build keys force CPU fallback) is tracked separately by #14836 and not addressed here.

## Traceability

- Issue: #14833 (filed before this PR per the issue-first rule)
- Follow-on (proper root-cause fix tracked separately): #14892, which applies `GpuTransitionOverrides` to `GpuSubqueryBroadcast`'s broadcast child so the post-hoc fixup added here can be retired.
- Migration PR (independent, adds the end-to-end DPP suite that exposes this bug): #14781
- Related: #14836 (cross-runtime fallback case, distinct root cause)
- Existing analogue this fix mirrors: `GpuTransitionOverrides.fixupAdaptiveExchangeReuse`.

## Testing

### Unit coverage (added in this PR)

`tests/src/test/scala/com/nvidia/spark/rapids/NonAqeBroadcastReuseFixupSuite.scala` (added per @res-life's review request) directly exercises `fixupNonAdaptiveBroadcastReuse`. It hand-builds the #14833 structural divergence: a main-plan `GpuBroadcastExchangeExec(GpuCoalesceBatches(range))` and a DPP-side `GpuBroadcastExchangeExec(range)` sharing one dim-side `range` (so their broadcast modes canonicalize identically, but the children differ by exactly the `GpuCoalesceBatches` wrap):

- `fixupNonAdaptiveBroadcastReuse rewrites matching DPP broadcast to ReusedExchangeExec`: once `stripGpuCoalesceBatches` normalizes the structural difference, the DPP-side broadcast is rewritten to a `ReusedExchangeExec` pointing at the main-plan instance.
- `fixupNonAdaptiveBroadcastReuse leaves plans with no main-plan broadcast unchanged`: the `mainPlanBroadcasts.isEmpty` early-exit returns the input plan unmodified.
- `ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP conf accessor flips with the kill switch`: the accessor reflects the kill switch and defaults to `true` via `createWithDefault(true)`.
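
The divergence the unit suite hand-builds can be modeled without Spark at all. The following is a minimal sketch under stated assumptions, not the production code: `Plan`, `Scan`, `Project`, `Coalesce`, `Broadcast`, `Reused`, and `reuseOrKeep` are hypothetical stand-ins invented for illustration, and case-class equality plays the role that comparing `canonicalized` plans plays in the real pass.

```scala
// Toy model only: none of these types are the real Spark or RAPIDS classes.
object BroadcastReuseSketch {
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  final case class Project(child: Plan) extends Plan
  // Stands in for GpuCoalesceBatches.
  final case class Coalesce(child: Plan) extends Plan
  // Stands in for GpuBroadcastExchangeExec; `mode` is a simplified broadcast mode.
  final case class Broadcast(mode: String, child: Plan) extends Plan
  // Stands in for ReusedExchangeExec.
  final case class Reused(target: Broadcast) extends Plan

  // Drop every Coalesce wrapper, recursing through the other node types.
  def strip(plan: Plan): Plan = plan match {
    case Coalesce(child)        => strip(child)
    case Project(child)         => Project(strip(child))
    case Broadcast(mode, child) => Broadcast(mode, strip(child))
    case other                  => other
  }

  // Signature = (mode, child with Coalesce wrappers removed).
  def signature(b: Broadcast): (String, Plan) = (b.mode, strip(b.child))

  // Replace the DPP-side broadcast with a reuse of a main-plan broadcast
  // that has the same signature; otherwise return it untouched.
  def reuseOrKeep(mainPlan: Seq[Broadcast], dpp: Broadcast): Plan =
    if (mainPlan.isEmpty) dpp // early exit: nothing to match against
    else {
      // Keep the first main-plan instance seen for each signature.
      val bySig = mainPlan.groupBy(b => signature(b)).map {
        case (sig, instances) => sig -> instances.head
      }
      bySig.get(signature(dpp)) match {
        case Some(matched) if !(matched eq dpp) => Reused(matched)
        case _                                  => dpp
      }
    }
}
```

With `mainG = Broadcast("hashed(k)", Coalesce(Scan("dim")))` and `dppG = Broadcast("hashed(k)", Scan("dim"))`, the two nodes are unequal as written, their signatures are equal, and `reuseOrKeep(Seq(mainG), dppG)` yields `Reused(mainG)`. With an empty main-plan list the DPP-side node comes back as the same instance, which corresponds to the early-exit case in the second test.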

```
mvn package -pl tests -am -Dbuildver=330 -Dmaven.repo.local=./.mvn-repo \
  -DwildcardSuites=com.nvidia.spark.rapids.NonAqeBroadcastReuseFixupSuite \
  -Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0 -s jenkins/settings.xml -P mirror-apache-to-urm
```

→ **`Tests: succeeded 3, failed 0`** (verified locally on Apache Spark 3.3, buildver 330)

### End-to-end (DPP suite from #14781, applied locally)

Validated against Apache Spark 3.3 in a worktree, with the migrated `RapidsDynamicPartitionPruningV1Suite` from #14781 applied locally and the four `#14833` KNOWN_ISSUE excludes removed:

```
mvn package -pl tests -am -Dbuildver=330 -Dmaven.repo.local=./.mvn-repo \
  -DwildcardSuites=org.apache.spark.sql.rapids.suites.RapidsDynamicPartitionPruningV1SuiteAEOff,org.apache.spark.sql.rapids.suites.RapidsDynamicPartitionPruningV1SuiteAEOn \
  -Drapids.test.gpu.allocFraction=0.3 -Drapids.test.gpu.maxAllocFraction=0.3 \
  -Drapids.test.gpu.minAllocFraction=0 -s jenkins/settings.xml -P mirror-apache-to-urm
```

- `RapidsDynamicPartitionPruningV1SuiteAEOff`: **`Tests: succeeded 34, failed 1`** (the one failure is `SPARK-32659`, a known partial-fallback case from #14836, unrelated to this fix). The four previously-failing tests now pass:
  - `avoid reordering broadcast join keys to match input hash partitioning`
  - `Plan broadcast pruning only when the broadcast can be reused`
  - `SPARK-32817: DPP throws error when the broadcast side is empty`
  - `SPARK-38148: Do not add dynamic partition pruning if there exists static partition pruning`
- `RapidsDynamicPartitionPruningV1SuiteAEOn`: **`Tests: succeeded 31, failed 0`** (no AQE regression).
- `BroadcastHashJoinSuite` smoke: `Tests: succeeded 2, failed 0`.

### Cross-shim compile (per CLAUDE.md shim coverage rule)

- `mvn package -DskipTests -pl sql-plugin -am -Dbuildver=330`: BUILD SUCCESS
- `mvn package -DskipTests -pl sql-plugin -am -Dbuildver=340`: BUILD SUCCESS
- `mvn package -DskipTests -pl sql-plugin -am -Dbuildver=400` (via `scala2.13/`): BUILD SUCCESS

`scripts/check-shim-coverage.sh`: no shim files changed, no `Origin.context` leaks.

## Performance impact

Cold-path analysis. The pass runs once at the end of `GpuTransitionOverrides` per query plan:

- For queries without `GpuBroadcastExchangeExec` (the vast majority), the pass exits early after a single `SparkPlan.foreach` traversal (no `transformAllExpressions` invocation).
- For queries with broadcasts and DPP, the work is O(n) plan traversal plus O(m) canonical computation on small subquery trees, where m is the size of the DPP subquery (typically a few nodes).

This is negligible compared to query planning overhead and execution time. No benchmark required.

## Test recovery follow-up

The fix has direct unit coverage in this PR (`NonAqeBroadcastReuseFixupSuite`). The end-to-end DPP recovery additionally depends on #14781, which adds the migrated `RapidsDynamicPartitionPruningV1Suite`. After both this fix and #14781 merge, a small follow-up PR removes the four `#14833` KNOWN_ISSUE excludes in `RapidsTestSettings.scala` so the migrated suite exercises the fix in CI. The two PRs are independent but both required for end-to-end recovery.

Documentation

- [ ] Updated for new or modified user-facing features or behaviors
- [x] No user-facing change

Testing

- [x] Added or modified tests to cover new code paths
- [ ] Covered by existing tests (Please provide the names of the existing tests in the PR description.)
- [ ] Not required

Performance

- [ ] Tests ran and results are added in the PR description
- [ ] Issue filed with a link in the PR description
- [x] Not required

---------

Signed-off-by: Allen Xu <allxu@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuTransitionOverrides.scala
@@ -38,7 +38,7 @@ import org.apache.spark.sql.execution.datasources.v2.{DataSourceV2ScanExecBase,
 import org.apache.spark.sql.execution.exchange.{BroadcastExchangeLike, Exchange, ReusedExchangeExec, ShuffleExchangeLike}
 import org.apache.spark.sql.execution.joins.{BroadcastHashJoinExec, BroadcastNestedLoopJoinExec}
 import org.apache.spark.sql.rapids.{GpuDataSourceScanExec, GpuFileSourceScanExec, GpuShuffleEnv, GpuTaskMetrics}
-import org.apache.spark.sql.rapids.execution.{ExchangeMappingCache, GpuBroadcastExchangeExec, GpuBroadcastExchangeExecBase, GpuBroadcastToRowExec, GpuCustomShuffleReaderExec, GpuHashJoin, GpuShuffleExchangeExecBase}
+import org.apache.spark.sql.rapids.execution.{ExchangeMappingCache, GpuBroadcastExchangeExec, GpuBroadcastExchangeExecBase, GpuBroadcastToRowExec, GpuCustomShuffleReaderExec, GpuHashJoin, GpuShuffleExchangeExecBase, GpuSubqueryBroadcastExec}
 import org.apache.spark.sql.types.StructType
 
 /**
@@ -759,6 +759,77 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * In non-AQE mode with DPP, GpuSubqueryBroadcastExec builds its underlying
+   * GpuBroadcastExchangeExec directly in GpuOverrides without going through this rule.
+   * The DPP-side broadcast therefore has a structurally different child (e.g. missing
+   * GpuCoalesceBatches that this rule inserts on the main plan) than the join-side
+   * broadcast for the same logical CPU exchange, and its cpuCanonical is also computed
+   * after GPU rewriting so it does not match the join-side cpuCanonical. Spark's
+   * ReuseExchangeAndSubquery rule does not merge them, so the dim side is materialized
+   * twice and DPP loses its intended performance benefit.
+   *
+   * This pass walks subquery expressions in the final plan, identifies the DPP-side
+   * GpuBroadcastExchangeExec inside a GpuSubqueryBroadcastExec, and matches it against
+   * the main-plan GpuBroadcastExchangeExec instances by (mode, child canonical form with
+   * GpuCoalesceBatches stripped — see stripGpuCoalesceBatches below). When a match is found,
+   * the DPP-side broadcast is rewritten to ReusedExchangeExec referencing the join-side
+   * instance.
+   */
+  private[rapids] def fixupNonAdaptiveBroadcastReuse(p: SparkPlan): SparkPlan = {
+    // Normalize a plan for signature matching by removing GpuCoalesceBatches wraps. The main-plan
+    // broadcast picks these up from insertCoalesce / optimizeCoalesce but the DPP-side broadcast
+    // (built earlier in GpuOverrides without going through GpuTransitionOverrides) does not, so
+    // we have to strip them on both sides before comparing canonical forms. This is the only
+    // structural difference observed in practice; other transitions (host->device, etc.) live
+    // outside the broadcast subtree and never reach this helper.
+    def stripGpuCoalesceBatches(plan: SparkPlan): SparkPlan = plan match {
+      case g: GpuCoalesceBatches => stripGpuCoalesceBatches(g.child)
+      case other => other.withNewChildren(other.children.map(stripGpuCoalesceBatches))
+    }
+
+    def signature(g: GpuBroadcastExchangeExec): (Any, SparkPlan) =
+      (g.mode.canonicalized, stripGpuCoalesceBatches(g.child).canonicalized)
+
+    // Collect all main-plan GpuBroadcastExchangeExec instances. SparkPlan.foreach only walks
+    // the plan-tree children and does NOT descend into ExecSubqueryExpression plans, so
+    // DPP-side broadcasts (which live inside GpuSubqueryBroadcastExec under a subquery
+    // expression) are naturally excluded from this collection — exactly what we want, because
+    // those are the instances the transformAllExpressions pass below will rewrite.
+    val mainPlanBroadcasts = mutable.ArrayBuffer.empty[GpuBroadcastExchangeExec]
+    p.foreach {
+      case g: GpuBroadcastExchangeExec => mainPlanBroadcasts += g
+      case _ =>
+    }
+    if (mainPlanBroadcasts.isEmpty) return p
+
+    val bySig = mainPlanBroadcasts.groupBy(signature).map {
+      case (sig, instances) => sig -> instances.head
+    }
+
+    p.transformAllExpressions {
+      case sub: ExecSubqueryExpression if sub.plan.isInstanceOf[GpuSubqueryBroadcastExec] =>
+        val gsb = sub.plan.asInstanceOf[GpuSubqueryBroadcastExec]
+        gsb.child match {
+          case dpp: GpuBroadcastExchangeExec =>
+            bySig.get(signature(dpp)) match {
+              case Some(matched) if !(matched eq dpp) =>
+                // Use dpp.output (not matched.output) so the reused exchange exposes the
+                // DPP-side attributes that downstream subquery expressions reference, while
+                // reading from the matched main-plan exchange. The AQE fixup
+                // fixupAdaptiveExchangeReuse uses the same shape (its g.output is the
+                // DPP-side attributes from the in-pass collected map).
+                val reused = ReusedExchangeExec(dpp.output, matched)
+                val newGsb = gsb.withNewChildren(Seq(reused))
+                  .asInstanceOf[GpuSubqueryBroadcastExec]
+                sub.withNewPlan(newGsb)
+              case _ => sub
+            }
+          case _ => sub
+        }
+    }
+  }
+
   private def insertStageLevelMetrics(plan: SparkPlan): Unit = {
     val sc = SparkSession.active.sparkContext
     val gen = new AtomicInteger(0)
@@ -850,6 +921,10 @@ class GpuTransitionOverrides extends Rule[SparkPlan] {
             plan.conf.adaptiveExecutionEnabled && plan.conf.exchangeReuseEnabled) {
           updatedPlan = fixupAdaptiveExchangeReuse(updatedPlan)
         }
+        if (rapidsConf.isNonAqeBroadcastReuseFixupEnabled &&
+            !plan.conf.adaptiveExecutionEnabled && plan.conf.exchangeReuseEnabled) {
+          updatedPlan = fixupNonAdaptiveBroadcastReuse(updatedPlan)
+        }
 
         if (rapidsConf.isTagLoreIdEnabled) {
           updatedPlan = GpuLore.tagForLore(updatedPlan, rapidsConf)
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala
@@ -2748,6 +2748,18 @@ val SHUFFLE_COMPRESSION_LZ4_CHUNK_SIZE = conf("spark.rapids.shuffle.compression.
       .booleanConf
       .createWithDefault(true)
 
+  val ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP =
+    conf("spark.rapids.sql.nonAqeBroadcastReuseFixup.enable")
+      .doc("Option to turn on the fixup of broadcast exchange reuse for DPP " +
+          "subqueries when AQE is disabled. The DPP-side GpuBroadcastExchange is built " +
+          "during GpuOverrides and bypasses GpuTransitionOverrides, so it does not match " +
+          "the join-side broadcast canonically. This fixup builds a per-query signature map " +
+          "of join-side GpuBroadcastExchangeExec nodes in the main plan and rewrites a " +
+          "matching DPP-side broadcast to ReusedExchangeExec.")
+      .internal()
+      .booleanConf
+      .createWithDefault(true)
+
   val CHUNKED_PACK_POOL_SIZE = conf("spark.rapids.sql.chunkedPack.poolSize")
       .doc("Amount of GPU memory (in bytes) to set aside at startup for the chunked pack " +
            "scratch space, needed during spill from GPU to host memory. As a rule of thumb, each " +
@@ -4012,6 +4024,9 @@ class RapidsConf(conf: Map[String, String]) extends Logging {
 
   lazy val isAqeExchangeReuseFixupEnabled: Boolean = get(ENABLE_AQE_EXCHANGE_REUSE_FIXUP)
 
+  lazy val isNonAqeBroadcastReuseFixupEnabled: Boolean =
+    get(ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP)
+
   lazy val chunkedPackPoolSize: Long = get(CHUNKED_PACK_POOL_SIZE)
 
   lazy val chunkedPackBounceBufferSize: Long = get(CHUNKED_PACK_BOUNCE_BUFFER_SIZE)
diff --git a/tests/src/test/scala/com/nvidia/spark/rapids/NonAqeBroadcastReuseFixupSuite.scala b/tests/src/test/scala/com/nvidia/spark/rapids/NonAqeBroadcastReuseFixupSuite.scala
@@ -0,0 +1,178 @@
+/*
+ * Copyright (c) 2026, NVIDIA CORPORATION.
+ *
+ * Licensed under the Apache License, Version 2.0 (the "License");
+ * you may not use this file except in compliance with the License.
+ * You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.nvidia.spark.rapids
+
+import org.apache.spark.SparkConf
+import org.apache.spark.sql.catalyst.expressions.NamedExpression
+import org.apache.spark.sql.catalyst.plans.logical.Range
+import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan}
+import org.apache.spark.sql.execution.{InSubqueryExec => SparkInSubqueryExec}
+import org.apache.spark.sql.execution.exchange.{BroadcastExchangeExec, ReusedExchangeExec}
+import org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode
+import org.apache.spark.sql.rapids.execution.{GpuBroadcastExchangeExec, GpuSubqueryBroadcastExec}
+
+/**
+ * Unit coverage for `GpuTransitionOverrides.fixupNonAdaptiveBroadcastReuse`, the non-AQE
+ * counterpart of `fixupAdaptiveExchangeReuse` already exercised by `ReusedExchangeFixupSuite`.
+ *
+ * The non-AQE fixup only fires for a `GpuBroadcastExchangeExec` that lives inside a
+ * `GpuSubqueryBroadcastExec` referenced by an `ExecSubqueryExpression` (DPP). A faithful
+ * test plan therefore needs to wire up both the main-plan broadcast and a hand-built DPP
+ * subquery expression.
+ */
+class NonAqeBroadcastReuseFixupSuite extends SparkQueryCompareTestSuite {
+
+  private val baseConf: SparkConf = new SparkConf()
+      .set("spark.sql.adaptive.enabled", "false")
+      .set("spark.sql.exchange.reuse", "true")
+      .set(RapidsConf.ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP.key, "true")
+
+  private def newRange(): RangeExec = RangeExec(Range(1, 2, 1, Some(1)))
+
+  /** Wrap a plan in `GpuCoalesceBatches` so signature matching must strip the wrap first. */
+  private def coalesce(child: SparkPlan): GpuCoalesceBatches =
+    GpuCoalesceBatches(child, TargetSize(1L << 20))
+
+  /**
+   * Build a `GpuBroadcastExchangeExec` over `child`. The broadcast mode is keyed on
+   * `leafRange.output`, so callers can wrap `child` in `GpuCoalesceBatches` without affecting
+   * the mode keys — exactly mirroring real DPP, where the dim-side attributes are pinned by
+   * the join's build expression while only the main-plan broadcast's child gets a coalesce
+   * wrap from `insertCoalesce` / `optimizeCoalesce`.
+   */
+  private def newGpuBroadcast(child: SparkPlan, leafRange: RangeExec): GpuBroadcastExchangeExec = {
+    val mode = HashedRelationBroadcastMode(leafRange.output)
+    GpuBroadcastExchangeExec(mode, child)(BroadcastExchangeExec(mode, child))
+  }
+
+  private def newGpuSubqueryBroadcast(
+      child: GpuBroadcastExchangeExec,
+      leafRange: RangeExec): GpuSubqueryBroadcastExec = {
+    val keys = Seq(leafRange.output.head)
+    GpuSubqueryBroadcastExec("dpp", Seq(0), keys, child)(modeKeys = Some(keys))
+  }
+
+  /**
+   * Build a synthetic non-AQE DPP plan that mirrors the actual #14833 structural divergence:
+   *
+   *   FilterExec(
+   *     condition = InSubqueryExec(plan = GpuSubqueryBroadcastExec(child = dppG)),
+   *     child     = mainG  // GpuBroadcastExchangeExec(child = GpuCoalesceBatches(range))
+   *   )
+   *
+   * The dim-side `range` is shared between `mainG` and `dppG` so their broadcast modes
+   * (keyed on `range.output`) canonicalize identically — matching how real DPP wires both
+   * broadcasts off the same logical filter sub-plan. The children diverge structurally
+   * only in the `GpuCoalesceBatches` wrap that `insertCoalesce` / `optimizeCoalesce`
+   * applies on the main-plan side:
+   *   - `mainG.child` = `GpuCoalesceBatches(range)`
+   *   - `dppG.child`  = `range`
+   *
+   * Without the production `stripGpuCoalesceBatches` normalization, the canonical comparison
+   * fails because the two children differ by exactly that wrap. With the normalization, both
+   * reduce to `range.canonicalized`, the signatures match, and the DPP-side broadcast is
+   * rewritten to `ReusedExchangeExec` pointing at `mainG`.
+   */
+  private def buildDppPlan(): (SparkPlan, GpuBroadcastExchangeExec, GpuBroadcastExchangeExec) = {
+    val range = newRange()
+    val mainG = newGpuBroadcast(coalesce(range), range)
+    val dppG = newGpuBroadcast(range, range)
+    val gsb = newGpuSubqueryBroadcast(dppG, range)
+    val inSub = SparkInSubqueryExec(range.output.head, gsb, NamedExpression.newExprId)
+    (FilterExec(inSub, mainG), mainG, dppG)
+  }
+
+  private def reusedExchangeChildren(p: SparkPlan): Seq[SparkPlan] = {
+    val collected = scala.collection.mutable.ArrayBuffer.empty[SparkPlan]
+    p.foreach { node =>
+      node.expressions.foreach(_.foreach {
+        case sub: SparkInSubqueryExec =>
+          sub.plan match {
+            case gsb: GpuSubqueryBroadcastExec =>
+              gsb.child match {
+                case r: ReusedExchangeExec => collected += r.child
+                case _ =>
+              }
+            case _ =>
+          }
+        case _ =>
+      })
+    }
+    collected.toSeq
+  }
+
+  test("fixupNonAdaptiveBroadcastReuse rewrites matching DPP broadcast to ReusedExchangeExec") {
+    withGpuSparkSession(_ => {
+      val (plan, mainG, _) = buildDppPlan()
+      val updated = new GpuTransitionOverrides().fixupNonAdaptiveBroadcastReuse(plan)
+
+      val reusedChildren = reusedExchangeChildren(updated)
+      assert(reusedChildren.size == 1,
+        s"expected exactly one ReusedExchangeExec under the GpuSubqueryBroadcastExec, got " +
+          s"${reusedChildren.size} in:\n${updated.treeString}")
+      assert(reusedChildren.head eq mainG,
+        s"expected the ReusedExchangeExec to point at the main-plan broadcast mainG, got " +
+          s"${reusedChildren.head}")
+    }, baseConf)
+  }
+
+  test("fixupNonAdaptiveBroadcastReuse leaves plans with no main-plan broadcast unchanged") {
+    // No GpuBroadcastExchangeExec in the main plan; just a RangeExec wrapped in a Filter whose
+    // condition still references a DPP-side GpuSubqueryBroadcastExec. The early-exit at the
+    // `if (mainPlanBroadcasts.isEmpty) return p` line in fixupNonAdaptiveBroadcastReuse should
+    // return the input plan unmodified.
+    withGpuSparkSession(_ => {
+      val range = newRange()
+      val dppG = newGpuBroadcast(range, range)
+      val gsb = newGpuSubqueryBroadcast(dppG, range)
+      val inSub = SparkInSubqueryExec(range.output.head, gsb, NamedExpression.newExprId)
+      val plan = FilterExec(inSub, range)
+
+      val updated = new GpuTransitionOverrides().fixupNonAdaptiveBroadcastReuse(plan)
+      assert(updated eq plan,
+        s"expected the plan to be returned unchanged when mainPlanBroadcasts.isEmpty, got:\n" +
+          s"${updated.treeString}")
+    }, baseConf)
+  }
+
+  test("ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP conf accessor flips with the kill switch") {
+    // Scope: this test validates ONLY that the `RapidsConf.isNonAqeBroadcastReuseFixupEnabled`
+    // accessor reflects the conf key. The plan-level gate inside `GpuTransitionOverrides.apply`
+    // (the `if (rapidsConf.isNonAqeBroadcastReuseFixupEnabled ...)` block, identified by code
+    // shape rather than line number so this comment doesn't drift) reads this accessor;
+    // exercising the full gate against a real plan needs a GPU-routed end-to-end run and is
+    // covered by `RapidsDynamicPartitionPruningV1SuiteAEOff` rather than this unit suite.
+    val killSwitchConf = baseConf.clone()
+        .set(RapidsConf.ENABLE_NON_AQE_BROADCAST_REUSE_FIXUP.key, "false")
+    withGpuSparkSession(spark => {
+      val rapidsConf = new RapidsConf(spark.sessionState.conf)
+      assert(!rapidsConf.isNonAqeBroadcastReuseFixupEnabled,
+        "isNonAqeBroadcastReuseFixupEnabled should be false when the kill switch is set")
+    }, killSwitchConf)
+
+    // Use a SparkConf that does NOT set the kill-switch key, so the assertion really
+    // exercises the `createWithDefault(true)` default rather than a redundant "true" override.
+    val defaultConf = new SparkConf()
+        .set("spark.sql.adaptive.enabled", "false")
+        .set("spark.sql.exchange.reuse", "true")
+    withGpuSparkSession(spark => {
+      val rapidsConf = new RapidsConf(spark.sessionState.conf)
+      assert(rapidsConf.isNonAqeBroadcastReuseFixupEnabled,
+        "isNonAqeBroadcastReuseFixupEnabled should default to true (createWithDefault(true))")
+    }, defaultConf)
+  }
+}