Commit 5c83ea5

schenksj and claude committed
test: add end-to-end shuffle test for non-UTF-8 StringType bytes (#4521)
Address review feedback: add a Spark-level regression test demonstrating the bug.

cast(binary -> string) is a zero-copy reinterpret in Spark, so a StringType column can hold arbitrary non-UTF-8 bytes. The test disables Comet's Cast so those raw bytes reach Comet's columnar (JVM) shuffle inside a JVM UnsafeRow, exercising the native row->Arrow get_string path that used to panic via from_utf8(..).unwrap() and now decodes lossily.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 367f460 commit 5c83ea5

1 file changed

Lines changed: 30 additions & 0 deletions

File tree

spark/src/test/scala/org/apache/comet/exec/CometColumnarShuffleSuite.scala

@@ -761,6 +761,36 @@ abstract class CometColumnarShuffleSuite extends CometTestBase with AdaptiveSpar
     }
   }

+  // Regression test for https://github.com/apache/datafusion-comet/issues/4521.
+  //
+  // Spark's `cast(BinaryType -> StringType)` is a zero-copy reinterpret (and `UnsafeRow`'s
+  // string accessor performs no UTF-8 validation), so a `StringType` column can legitimately
+  // hold arbitrary non-UTF-8 bytes that Spark treats as opaque. Comet's columnar (JVM) shuffle
+  // converts those `UnsafeRow`s to Arrow natively (`process_sorted_row_partition` -> `get_string`),
+  // which used to decode with `from_utf8(..).unwrap()` and panic on such rows. It now decodes
+  // lossily (U+FFFD replacements), matching how Spark renders the same bytes.
+  test("columnar shuffle tolerates non-UTF-8 bytes in a StringType column") {
+    withParquetTable(
+      Seq(
+        // 0xFF and 0xFE are never valid UTF-8 lead bytes; each decodes to a single U+FFFD in
+        // both Spark and Comet (so the lossy results match exactly).
+        (1, Array[Byte](0xff.toByte, 0xfe.toByte, 'A'.toByte)),
+        // 0x80 is a stray continuation byte -> one U+FFFD, followed by valid ASCII.
+        (2, Array[Byte](0x80.toByte, 'B'.toByte)),
+        // A fully valid UTF-8 row exercises the zero-cost borrow path.
+        (3, "valid".getBytes("UTF-8"))),
+      "tbl") {
+      // Disable Comet's own Cast so the `cast(binary -> string)` runs in Spark and the raw bytes
+      // reach the shuffle inside a JVM UnsafeRow. (If Comet performed the cast it would produce a
+      // pre-sanitized Arrow string array and never exercise get_string.)
+      withSQLConf(CometConf.getExprEnabledConfigKey("Cast") -> "false") {
+        val df = sql("SELECT _1, CAST(_2 AS STRING) AS s FROM tbl")
+        val shuffled = df.repartition(2, $"_1")
+        checkShuffleAnswer(shuffled, 1)
+      }
+    }
+  }
+
   /**
    * Checks that `df` produces the same answer as Spark does, and has the `expectedNum` Comet
    * exchange operators.
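
The decoding behavior the test comments describe can be illustrated with a minimal Rust sketch using only the standard library. This is not Comet's actual `get_string` code; the function names `decode_strict`, `decode_lossy`, and `is_borrowed` are hypothetical and exist only for illustration. It assumes the fix is equivalent to replacing a strict `from_utf8(..).unwrap()` with `String::from_utf8_lossy`, which borrows valid input and allocates only when a replacement is needed.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the old behavior: strict decoding that
/// panics (via `unwrap`) if `bytes` is not valid UTF-8.
fn decode_strict(bytes: &[u8]) -> &str {
    std::str::from_utf8(bytes).unwrap()
}

/// Hypothetical stand-in for the new behavior: lossy decoding that
/// substitutes U+FFFD for invalid sequences instead of panicking.
fn decode_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Reports whether the decoded value borrows the input bytes (no copy),
/// which is what happens when the input is already valid UTF-8.
fn is_borrowed(s: &Cow<'_, str>) -> bool {
    matches!(s, Cow::Borrowed(_))
}
```

With the same byte patterns as the three test rows, `decode_lossy(&[0xFF, 0xFE, b'A'])` yields two U+FFFD characters followed by `A`, `decode_lossy(&[0x80, b'B'])` yields one U+FFFD followed by `B`, and `decode_lossy(b"valid")` returns a borrowed `Cow` without allocating. Calling `decode_strict` on either of the first two inputs would panic.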
