fix: reject Parquet TimestampLTZ as TimestampNTZ on Spark 3.x for native_datafusion

andygrove · andygrove · commit b5c3cea6ef8d · 2026-05-17T09:59:28.000-06:00
Spark versions before 4.0 reject reading a Parquet TimestampLTZ column as
TimestampNTZ (SPARK-36182); native_datafusion previously did not, and
silently returned the UTC instant. Plumb a per-Spark-version flag from
ShimCometConf through the NativeScan proto into SparkParquetOptions, and
gate a new rejection arm in the schema adapter on it.

INT96 remains a gap because DataFusion's coerce_int96 strips the source
timezone before the schema adapter runs, so it is indistinguishable from
a true TIMESTAMP_NTZ source. Compatibility guide updated to describe the
correctness implications.
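
The gating logic described above can be sketched in isolation. This is a minimal illustration, not the code in schema_adapter.rs: `TsType` is a hypothetical stand-in for Arrow's `DataType::Timestamp(_, tz)`, `Options` is a hypothetical stand-in for `SparkParquetOptions`, and `rejects_ltz_as_ntz` is an invented helper name. Only the flag name `allow_timestamp_ltz_to_ntz` and the shape of the `matches!` pattern come from this commit.

```rust
/// Hypothetical stand-in for Arrow's timestamp type: the only detail
/// modelled is whether a timezone annotation is present.
#[derive(Debug, Clone, PartialEq)]
enum TsType {
    /// Timestamp with a timezone (how an annotated LTZ column surfaces).
    WithTz(String),
    /// Timestamp without a timezone (TimestampNTZ, and also INT96 once
    /// the timezone has been stripped upstream).
    NoTz,
}

/// Hypothetical stand-in for the one option this commit adds.
struct Options {
    allow_timestamp_ltz_to_ntz: bool,
}

/// Returns true when the read should be rejected: the flag is off and the
/// physical column carries a timezone while the requested type does not.
fn rejects_ltz_as_ntz(opts: &Options, physical: &TsType, target: &TsType) -> bool {
    !opts.allow_timestamp_ltz_to_ntz
        && matches!((physical, target), (TsType::WithTz(_), TsType::NoTz))
}
```

With the flag set to false (the Spark 3.x constant), a `WithTz` physical type read as `NoTz` is rejected; with the flag set to true (Spark 4.x) it is allowed. A physical type that is already `NoTz` never matches the pattern, which is why an INT96 column whose timezone was stripped before the adapter runs slips through.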
diff --git a/common/src/main/spark-3.x/org/apache/comet/shims/ShimCometConf.scala b/common/src/main/spark-3.x/org/apache/comet/shims/ShimCometConf.scala
@@ -28,4 +28,11 @@ trait ShimCometConf {
    * SQL conf were removed in favor of this per-version constant; see #4298.
    */
   val COMET_SCHEMA_EVOLUTION_ENABLED: Boolean = false
+
+  /**
+   * Whether reading a Parquet TimestampLTZ column as TimestampNTZ is permitted. Spark 3.x rejects
+   * this read (SPARK-36182), so Comet matches by defaulting to false on 3.x; Spark 4.0
+   * (SPARK-47447) lifted the restriction. See #4219.
+   */
+  val COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ: Boolean = false
 }
diff --git a/common/src/main/spark-4.x/org/apache/comet/shims/ShimCometConf.scala b/common/src/main/spark-4.x/org/apache/comet/shims/ShimCometConf.scala
@@ -28,4 +28,11 @@ trait ShimCometConf {
    * per-version constant; see #4298.
    */
   val COMET_SCHEMA_EVOLUTION_ENABLED: Boolean = true
+
+  /**
+   * Whether reading a Parquet TimestampLTZ column as TimestampNTZ is permitted. Spark 4.0+
+   * (SPARK-47447) lifted the pre-4.0 SPARK-36182 rejection, so Comet matches with true. See
+   * #4219.
+   */
+  val COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ: Boolean = true
 }
diff --git a/docs/source/user-guide/latest/compatibility/scans.md b/docs/source/user-guide/latest/compatibility/scans.md
@@ -83,12 +83,18 @@ requires `spark.comet.exec.enabled=true` because the scan node must be wrapped b
 The following `native_datafusion` limitations may produce incorrect results on Spark versions prior to 4.0
 without falling back to Spark:
 
-- Reading `TimestampLTZ` as `TimestampNTZ`. On Spark 3.x, Spark raises an error per
-  [SPARK-36182](https://issues.apache.org/jira/browse/SPARK-36182) because LTZ encodes UTC-adjusted instants
-  that cannot be safely reinterpreted as timezone-free values. Comet does not raise this error and instead
-  returns the raw UTC instant as a `TimestampNTZ` value. This applies to all LTZ physical encodings (INT96,
-  TIMESTAMP_MICROS, TIMESTAMP_MILLIS). On Spark 4.0+, this read is permitted
-  ([SPARK-47447](https://issues.apache.org/jira/browse/SPARK-47447)) and Comet matches Spark's behavior.
+- Reading Parquet `INT96` as `TimestampNTZ` on Spark 3.x. Spark raises
+  `SchemaColumnConvertNotSupportedException` for this read per
+  [SPARK-36182](https://issues.apache.org/jira/browse/SPARK-36182); Comet does not, and silently
+  returns the column's UTC instant as the `TimestampNTZ` value (the Spark 4.0+ semantics from
+  [SPARK-47447](https://issues.apache.org/jira/browse/SPARK-47447)). This is a correctness
+  divergence on Spark 3.x: queries that Spark would have failed instead return values, and those
+  values reflect UTC rather than the session-local wall clock a `TimestampNTZ` is normally
+  understood as, so downstream filters, joins, and aggregations on the column may produce
+  different results than running the same query without Comet. The annotated LTZ encodings
+  (`TIMESTAMP_MICROS`, `TIMESTAMP_MILLIS` with `isAdjustedToUTC=true`) are rejected correctly.
+  INT96 is a gap because DataFusion's `coerce_int96` strips the source timezone before Comet's
+  schema adapter runs, leaving INT96 indistinguishable from a true `TIMESTAMP_NTZ` source.
   See [#4219](https://github.com/apache/datafusion-comet/issues/4219).
 
 ### Schema Mismatch Handling
diff --git a/native/core/src/execution/planner.rs b/native/core/src/execution/planner.rs
@@ -1373,6 +1373,7 @@ impl PhysicalPlanner {
                     common.case_sensitive,
                     common.return_null_struct_if_all_fields_missing,
                     common.allow_type_promotion,
+                    common.allow_timestamp_ltz_to_ntz,
                     self.session_ctx(),
                     common.encryption_enabled,
                     common.use_field_id,
diff --git a/native/core/src/parquet/mod.rs b/native/core/src/parquet/mod.rs
@@ -514,6 +514,7 @@ pub unsafe extern "system" fn Java_org_apache_comet_parquet_Native_initRecordBat
             case_sensitive != JNI_FALSE,
             return_null_struct_if_all_fields_missing != JNI_FALSE,
             true, // allow_type_promotion: JVM side already validated via TypeUtil.checkParquetType
+            true, // allow_timestamp_ltz_to_ntz: JVM side already validated via TypeUtil.checkParquetType
             session_ctx,
             encryption_enabled,
             // The iceberg-compat path resolves IDs in the JVM via NativeBatchReader,
diff --git a/native/core/src/parquet/parquet_exec.rs b/native/core/src/parquet/parquet_exec.rs
@@ -71,6 +71,7 @@ pub(crate) fn init_datasource_exec(
     case_sensitive: bool,
     return_null_struct_if_all_fields_missing: bool,
     allow_type_promotion: bool,
+    allow_timestamp_ltz_to_ntz: bool,
     session_ctx: &Arc<SessionContext>,
     encryption_enabled: bool,
     use_field_id: bool,
@@ -81,6 +82,7 @@ pub(crate) fn init_datasource_exec(
         case_sensitive,
         return_null_struct_if_all_fields_missing,
         allow_type_promotion,
+        allow_timestamp_ltz_to_ntz,
         &object_store_url,
         encryption_enabled,
     );
@@ -200,6 +202,7 @@ fn get_options(
     case_sensitive: bool,
     return_null_struct_if_all_fields_missing: bool,
     allow_type_promotion: bool,
+    allow_timestamp_ltz_to_ntz: bool,
     object_store_url: &ObjectStoreUrl,
     encryption_enabled: bool,
 ) -> (TableParquetOptions, SparkParquetOptions) {
@@ -214,6 +217,7 @@ fn get_options(
     spark_parquet_options.return_null_struct_if_all_fields_missing =
         return_null_struct_if_all_fields_missing;
     spark_parquet_options.allow_type_promotion = allow_type_promotion;
+    spark_parquet_options.allow_timestamp_ltz_to_ntz = allow_timestamp_ltz_to_ntz;
 
     if encryption_enabled {
         table_parquet_options.crypto.configure_factory(
diff --git a/native/core/src/parquet/parquet_support.rs b/native/core/src/parquet/parquet_support.rs
@@ -96,6 +96,11 @@ pub struct SparkParquetOptions {
     /// Whether type promotion (schema evolution) is allowed, e.g. INT32 -> INT64,
     /// FLOAT -> DOUBLE. Mirrors spark.comet.schemaEvolution.enabled.
     pub allow_type_promotion: bool,
+    /// When true, reading a Parquet TimestampLTZ column as TimestampNTZ is
+    /// permitted (Spark 4.0+, SPARK-47447); when false, it is rejected
+    /// (Spark 3.x, SPARK-36182). Mirrors Comet's per-Spark-version constant
+    /// in ShimCometConf.
+    pub allow_timestamp_ltz_to_ntz: bool,
 }
 
 impl SparkParquetOptions {
@@ -112,6 +117,7 @@ impl SparkParquetOptions {
             use_field_id: false,
             ignore_missing_field_id: false,
             allow_type_promotion: false,
+            allow_timestamp_ltz_to_ntz: false,
         }
     }
 
@@ -128,6 +134,7 @@ impl SparkParquetOptions {
             use_field_id: false,
             ignore_missing_field_id: false,
             allow_type_promotion: false,
+            allow_timestamp_ltz_to_ntz: false,
         }
     }
 }
diff --git a/native/core/src/parquet/schema_adapter.rs b/native/core/src/parquet/schema_adapter.rs
@@ -807,6 +807,31 @@ impl SparkPhysicalExprAdapter {
                 return Ok(Transformed::yes(rejection));
             }
 
+            // Spark 3.x refuses to read a Parquet TimestampLTZ column as
+            // TimestampNTZ (SPARK-36182); Spark 4.0 (SPARK-47447) lifted that.
+            // The flag tracks Comet's per-Spark-version constant in
+            // ShimCometConf. Deferred to runtime so empty files (SPARK-26709)
+            // still pass. See #4219.
+            //
+            // INT96 columns surface as `Timestamp(_, None)` after `coerce_int96`
+            // strips the timezone, so this pattern only catches TIMESTAMP_MICROS
+            // / TIMESTAMP_MILLIS reads. INT96 -> TimestampNTZ remains a known gap (#4219).
+            if !self.parquet_options.allow_timestamp_ltz_to_ntz
+                && matches!(
+                    (physical_type, target_type),
+                    (DataType::Timestamp(_, Some(_)), DataType::Timestamp(_, None))
+                )
+            {
+                let rejection = reject_on_non_empty_expr(
+                    Arc::clone(&child),
+                    cast.target_field(),
+                    cast.input_field().name(),
+                    physical_type,
+                    target_type,
+                );
+                return Ok(Transformed::yes(rejection));
+            }
+
             // Scalar/complex mismatch (e.g. TIMESTAMP read as ARRAY<TIMESTAMP>):
             // Spark's vectorized reader rejects with
             // SchemaColumnConvertNotSupportedException (SPARK-45604). Same-shape
diff --git a/native/proto/src/proto/operator.proto b/native/proto/src/proto/operator.proto
@@ -127,6 +127,11 @@ message NativeScanCommon {
   // with a disallowed promoted type throws an error matching Spark's
   // SchemaColumnConvertNotSupportedException behavior.
   bool allow_type_promotion = 17;
+  // When true, reading a Parquet TimestampLTZ column as TimestampNTZ is
+  // permitted (Spark 4.0+, SPARK-47447); when false, it is rejected with
+  // SchemaColumnConvertNotSupportedException (Spark 3.x, SPARK-36182). Set
+  // from Comet's per-Spark-version constant in ShimCometConf.
+  bool allow_timestamp_ltz_to_ntz = 18;
 }
 
 message NativeScan {
diff --git a/spark/src/main/scala/org/apache/comet/serde/operator/CometNativeScan.scala b/spark/src/main/scala/org/apache/comet/serde/operator/CometNativeScan.scala
@@ -212,6 +212,7 @@ object CometNativeScan extends CometOperatorSerde[CometScanExec] with Logging {
         scan.conf.getConf(SQLConf.IGNORE_MISSING_PARQUET_FIELD_ID))
 
       commonBuilder.setAllowTypePromotion(CometConf.COMET_SCHEMA_EVOLUTION_ENABLED)
+      commonBuilder.setAllowTimestampLtzToNtz(CometConf.COMET_ALLOW_TIMESTAMP_LTZ_AS_NTZ)
 
       // Collect S3/cloud storage configurations
       val hadoopConf = scan.relation.sparkSession.sessionState
diff --git a/spark/src/test/scala/org/apache/comet/parquet/ParquetTimestampLtzAsNtzSuite.scala b/spark/src/test/scala/org/apache/comet/parquet/ParquetTimestampLtzAsNtzSuite.scala
@@ -42,22 +42,29 @@ class ParquetTimestampLtzAsNtzSuite extends CometTestBase {
 
   private val tsTypes = Seq("INT96", "TIMESTAMP_MICROS", "TIMESTAMP_MILLIS")
 
-  tsTypes.foreach { tsType =>
-    test(s"read TimestampLTZ ($tsType) as TimestampNTZ throws pre-Spark 4") {
-      assume(!isSpark40Plus, "Spark 4.0+ allows reading TimestampLTZ as TimestampNTZ")
+  private val scanImpls =
+    Seq(CometConf.SCAN_NATIVE_ICEBERG_COMPAT, CometConf.SCAN_NATIVE_DATAFUSION)
 
-      val scanImpl = CometConf.COMET_NATIVE_SCAN_IMPL.get()
+  for {
+    tsType <- tsTypes
+    scanImpl <- scanImpls
+  } {
+    test(s"read TimestampLTZ ($tsType) as TimestampNTZ throws pre-Spark 4 ($scanImpl)") {
+      assume(!isSpark40Plus, "Spark 4.0+ allows reading TimestampLTZ as TimestampNTZ")
+      // INT96 cannot be detected on the native_datafusion path: DataFusion's coerce_int96
+      // strips the timezone, so by the time Comet's schema adapter runs, an INT96 column is
+      // indistinguishable from a TIMESTAMP_NTZ_MICROS column. Tracked separately under #4219.
       assume(
-        scanImpl != CometConf.SCAN_AUTO && scanImpl != CometConf.SCAN_NATIVE_DATAFUSION,
-        s"https://github.com/apache/datafusion-comet/issues/4219 ($scanImpl scan does not " +
-          "reject TimestampLTZ read as TimestampNTZ)")
+        !(tsType == "INT96" && scanImpl == CometConf.SCAN_NATIVE_DATAFUSION),
+        "https://github.com/apache/datafusion-comet/issues/4219 (INT96 + native_datafusion)")
 
       val sessionTz = "America/Los_Angeles"
 
       withSQLConf(
         SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz,
         SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key -> tsType,
-        SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
+        SQLConf.USE_V1_SOURCE_LIST.key -> "parquet",
+        CometConf.COMET_NATIVE_SCAN_IMPL.key -> scanImpl) {
         withTempPath { dir =>
           val path = dir.getCanonicalPath
           Seq(Timestamp.valueOf("2020-01-01 12:00:00")).toDF("ts").write.parquet(path)
