Description
Describe the bug, including details regarding any error messages, version, and platform.
The DataFusion Comet project's unit tests use the ExampleParquetWriter to create Parquet files - https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L432
This was inspired by similar unit test code in Spark - https://github.com/apache/spark/blob/ece14704cc083f17689d2e0b9ab8e31cf71a7a2d/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala#L871
The output files can contain uint8 and uint16 values that are illegal per the spec. For example in this file -
alltypes_extended_plain.parquet.zip
The columns _8 and _9 are uint_8 and uint_16 values and contain illegal negative values.
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 18, "_3": 10002, "_4": 10002, "_5": 10002, "_6": 10002.0, "_7": 10002.0, "_8": "100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002100021000210002", "_9": -18, "_10": -10002, "_11": -10002, "_12": -10002, "_13": "10002", "_14": [50, 50, 50], "_15": 10002, "_16": 10002, "_17": [50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50], "_18": 10002, "_19": 10002, "_20": 10002}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 20, "_3": 10004, "_4": 10004, "_5": 10004, "_6": 10004.0, "_7": 10004.0, "_8": "100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004100041000410004", "_9": -20, "_10": -10004, "_11": -10004, "_12": -10004, "_13": "10004", "_14": [52, 52, 52], "_15": 10004, "_16": 10004, "_17": [52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52, 52], "_18": 10004, "_19": 10004, "_20": 10004}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
{"_1": true, "_2": 24, "_3": 10008, "_4": 10008, "_5": 10008, "_6": 10008.0, "_7": 10008.0, "_8": "100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008100081000810008", "_9": -24, "_10": -10008, "_11": -10008, "_12": -10008, "_13": "10008", "_14": [56, 56, 56], "_15": 10008, "_16": 10008, "_17": [56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56], "_18": 10008, "_19": 10008, "_20": 10008}
{"_1": null, "_2": null, "_3": null, "_4": null, "_5": null, "_6": null, "_7": null, "_8": null, "_9": null, "_10": null, "_11": null, "_12": null, "_13": null, "_14": null, "_15": null, "_16": null, "_17": null, "_18": null, "_19": null, "_20": null}
Taking as an example the first value for column _8, the bit pattern written to the file is 0xffffffee, which is read as a negative value; that is illegal for an unsigned int.
The value originates in this line - https://github.com/apache/datafusion-comet/blob/996362e78d497c02542f1e29dbb7cba3ec16f64c/spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala#L520
where a negative value is cast to a byte and then written to Parquet. The Parquet writer needs to cast correctly to a larger type before writing to the file.
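To illustrate the cast in question, here is a minimal sketch of the difference between Java's default sign-extending widening and a zero-extending one. It is not the parquet-java or Comet code; the class and method names (UnsignedWidening, signExtend, widenUint8, widenUint16) are hypothetical and exist only for this example.

```java
// Hypothetical helper, for illustration only; not part of parquet-java or Comet.
final class UnsignedWidening {
    // Default Java widening: byte -> int sign-extends, so 0xEE becomes 0xFFFFFFEE.
    static int signExtend(byte b) {
        return b;
    }

    // Zero-extending widening: keeps only the low 8 bits (same as b & 0xFF).
    static int widenUint8(byte b) {
        return Byte.toUnsignedInt(b);
    }

    // Zero-extending widening for 16-bit values (same as s & 0xFFFF).
    static int widenUint16(short s) {
        return Short.toUnsignedInt(s);
    }

    public static void main(String[] args) {
        byte b = (byte) -18;
        System.out.println(Integer.toHexString(signExtend(b))); // ffffffee
        System.out.println(widenUint8(b));                      // 238

        short s = (short) -10002;
        System.out.println(widenUint16(s));                     // 55534
    }
}
```

With the zero-extending variant, the 32-bit value stored for the byte -18 would be 0x000000ee (238) rather than 0xffffffee.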
The values written can be read by the Parquet-java reader, but other implementations are free to return an error or null for such values, which is not desirable.
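As a rough sketch of why a stricter reader might reject these values: assuming the reader checks that the stored 32-bit value fits the annotated unsigned width (the reading of the spec taken in this report), the check could look like the following. The class and method names are hypothetical and do not correspond to any real reader's API.

```java
// Hypothetical range check, for illustration only; not taken from any Parquet reader.
final class UnsignedRangeCheck {
    // A value annotated as uint_8 should fit in 0..255 once stored as a 32-bit int.
    static boolean isValidUint8(int stored) {
        return stored >= 0 && stored <= 0xFF;
    }

    // A value annotated as uint_16 should fit in 0..65535.
    static boolean isValidUint16(int stored) {
        return stored >= 0 && stored <= 0xFFFF;
    }

    public static void main(String[] args) {
        int stored = 0xffffffee;                       // bit pattern from this report, i.e. -18
        System.out.println(stored);                    // -18
        System.out.println(isValidUint8(stored));      // false
        System.out.println(isValidUint8(0xee));        // true
        System.out.println(isValidUint16(-10002));     // false
    }
}
```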
Component(s)
Core