Commit 5c83ea5
test: add end-to-end shuffle test for non-UTF-8 StringType bytes (#4521)
Address review feedback: add a Spark-level regression test demonstrating
the bug. cast(binary -> string) is a zero-copy reinterpret in Spark, so a
StringType column can hold arbitrary non-UTF-8 bytes. The test disables
Comet's Cast so those raw bytes reach Comet's columnar (JVM) shuffle inside
a JVM UnsafeRow, exercising the native row->Arrow get_string path that used
to panic via from_utf8(..).unwrap() and now decodes lossily.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 367f460 commit 5c83ea5
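The difference between the old and new decoding behavior described in the commit message can be sketched in a few lines of Rust. This is a minimal illustration, not Comet's actual `get_string` code: the function names `decode_strict` and `decode_lossy` are hypothetical, and only the standard-library calls `std::str::from_utf8` and `String::from_utf8_lossy` are real APIs.

```rust
use std::str::Utf8Error;

/// Strict decode (hypothetical helper). Calling `.unwrap()` on this result
/// is the pattern that panics when the bytes are not valid UTF-8.
fn decode_strict(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Lossy decode (hypothetical helper). Invalid sequences are replaced with
/// U+FFFD (the Unicode replacement character) instead of causing an error.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

For example, the bytes `[b'a', 0xFF, b'b']` make `decode_strict` return an `Err` (so `.unwrap()` would panic), while `decode_lossy` yields `"a\u{FFFD}b"`. Valid UTF-8 input is returned unchanged by both.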
1 file changed
Lines changed: 30 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 761-763 | 761-763 | unchanged context |
| | 764-793 | + 30 added lines (content not captured) |
| 764-766 | 794-796 | unchanged context |