feat(temporal): add Spark-style timezone conversions#6919
Conversation
Implements from_utc_timestamp, to_utc_timestamp, and convert_timezone for Spark parity (Eventual-Inc#3798). FromUtcTimestamp interprets the input as a UTC instant and returns the wall-clock time in the given timezone as a tz-naive Timestamp. ToUtcTimestamp does the inverse. convert_timezone is a Python and SQL alias over the existing convert_time_zone with Spark's reversed argument order (target_tz, source_ts).
Greptile SummaryThis PR adds Spark-compatible
Confidence Score: 5/5Safe to merge — the three new temporal functions are self-contained additions with no changes to existing code paths. All daft-schema helpers correctly treat i64 epoch values as UTC, aligning with Arrow semantics and Spark-matching behavior. SQLConvertTimezone correctly routes to ConvertTimeZone using the to_timezone named argument. The previously flagged Int64Array/Timestamp dtype mismatch is resolved with typed Arrow arrays. No existing functionality is modified. No files require special attention. Important Files Changed
Reviews (3): Last reviewed commit: "chore(temporal): address Greptile review..." | Re-trigger Greptile |
| let arrow_arr: arrow_array::ArrayRef = Arc::new(Int64Array::from(values)); | ||
| Series::from_arrow( | ||
| Arc::new(Field::new( | ||
| input.name().to_string(), | ||
| DataType::Timestamp(time_unit, None), | ||
| )), | ||
| arrow_arr, | ||
| ) |
There was a problem hiding this comment.
Int64Array used instead of typed Arrow Timestamp array
Both FromUtcTimestamp and ToUtcTimestamp build an Int64Array and then wrap it in a Field whose dtype is Timestamp(time_unit, None). Arrow's physical type for Int64Array is DataType::Int64, which mismatches the Timestamp dtype declared in the Field. Series::from_arrow must silently reinterpret the Int64 as a Timestamp for the tests to pass.
If from_arrow ever adds a stricter type-check (or does so under certain configurations), this will fail at runtime. The safer approach is to use a typed TimestampArray via arrow_array::PrimitiveArray::<arrow_array::types::TimestampMicrosecondType> (or the appropriate unit variant). The same issue exists at line 293 in ToUtcTimestamp::call.
There was a problem hiding this comment.
Fixed with a typed-array helper. TimestampSecond/Milli/Micro/Nanosecond depending on the input TimeUnit. Arrow physical dtype now matches the Field.
| let naive_local = match &input_tz_parsed { | ||
| Some(in_tz) => timestamp_to_naive_local(ts, time_unit, in_tz), | ||
| None => timestamp_to_naive_datetime(ts, time_unit), | ||
| }; | ||
| let utc_ts = | ||
| naive_local_to_timestamp(naive_local, time_unit, &tz_parsed, &timezone)?; | ||
| values.push(Some(utc_ts)); |
There was a problem hiding this comment.
DST transition behavior diverges from Spark
naive_local_to_timestamp propagates DaftError::ValueError for both ambiguous times (DST fall-back) and nonexistent times (DST spring-forward). Spark instead silently resolves ambiguous times to the pre-transition offset and advances past the gap for nonexistent times.
A user migrating from Spark who has timestamps around a DST boundary will get a runtime error where Spark would have produced a result. The difference is worth documenting in the docstring even if the strict behavior is intentional.
There was a problem hiding this comment.
Documented in the to_utc_timestamp docstring. Keeping the strict behavior for now. silently picking a side on DST transitions can be just as surprising as erroring.
| def test_utc_timestamp_sql() -> None: | ||
| df = daft.from_pydict({"ts": [datetime(2017, 7, 14, 2, 40)]}) # noqa: F841 | ||
| from_result = daft.sql("SELECT from_utc_timestamp(ts, 'Europe/London') AS local FROM df").to_pydict() | ||
| assert from_result["local"] == [datetime(2017, 7, 14, 3, 40)] | ||
|
|
||
| to_result = daft.sql("SELECT to_utc_timestamp(ts, 'Europe/London') AS utc FROM df").to_pydict() | ||
| # The naive wall-clock 02:40 interpreted in BST is 01:40 UTC. | ||
| assert to_result["utc"] == [datetime(2017, 7, 14, 1, 40)] |
There was a problem hiding this comment.
No SQL test for
convert_timezone
SQLConvertTimezone is registered and tested in Python (test_convert_timezone_spark_alias) but there is no SQL execution test. A basic SELECT convert_timezone('America/New_York', ts) FROM df assertion would verify the SQL planner correctly builds the reversed-argument ConvertTimeZone call.
There was a problem hiding this comment.
Added test_convert_timezone_sql.
Use typed Timestamp{Second,Millisecond,Microsecond,Nanosecond}Array so the
Arrow physical dtype matches the Field's Timestamp dtype. Document the DST
divergence from Spark (strict ValueError on ambiguous/nonexistent local
times) in the to_utc_timestamp docstring. Add SQL test for convert_timezone.
|
@greptile re-review |
|
Thanks @BABTUNA . Overall looks good. 1 minor improvement + conflict resolution needed: No DST regression test — the docstring describes the spring-forward gap and fall-back ambiguity as errors, but there's no test pinning that contract. Worth adding: Without this, if someone later "fixes" naive_local_to_timestamp to silently resolve to Spark-style behavior, the docstring lies and no test catches it. |
…ersions # Conflicts: # src/daft-functions-temporal/src/lib.rs # src/daft-sql/src/modules/temporal.rs # tests/dataframe/test_temporals.py
|
Went ahead and added test_to_utc_timestamp_dst_gap_raises using the 2021-03-14 02:30 US/Eastern spring-forward case you suggested. Also resolved the three merge conflicts against main since add_months and months_between 👍 |
Summary
Implements three more functions from issue #3798 by adding Spark-style
from_utc_timestamp,to_utc_timestamp, andconvert_timezoneas native Daft temporal expressions.This PR adds two new scalar UDFs in the temporal module for UTC↔local conversions that return tz-naive timestamps (matching Spark semantics), plus a
convert_timezonealias over the existingconvert_time_zonethat reverses the argument order to match Spark. Python and SQL surfaces are both wired.Why
The issue asks for parity with PySpark's temporal functions. This PR focuses on:
from_utc_timestamp) producing tz-naive output.to_utc_timestamp) producing tz-naive output.convert_timezone(target_tz, source_ts)alias that matches Spark's argument order.Changes Made
FromUtcTimestampandToUtcTimestampscalar UDFs insrc/daft-functions-temporal/src/time.rs:daft_schema::time_unithelpers (parse_timezone,timestamp_to_naive_local,naive_local_to_timestamp,naive_datetime_to_timestamp).Timestamp(unit, None)regardless of input tz label.daft-schemaas a direct dependency insrc/daft-functions-temporal/Cargo.tomlso the helpers are available to the UDFs.mod timetopub mod timeinsrc/daft-functions-temporal/src/lib.rsso the SQL crate can register handlers.FromUtcTimestampandToUtcTimestampinTemporalFunctions.SQLFromUtcTimestamp,SQLToUtcTimestamp, andSQLConvertTimezoneinsrc/daft-sql/src/modules/temporal.rs. The convert_timezone handler delegates to the existingConvertTimeZoneUDF with Spark's reversed argument order.from_utc_timestamp,to_utc_timestamp, andconvert_timezoneindaft/functions/datetime.pyand export them fromdaft/functions/__init__.py.tests/dataframe/test_temporals.py:from_utc_timestampcoverage: named tz (Europe/London BST), fixed offset (+05:30), tz-aware input.to_utc_timestampcoverage: named tz.convert_timezoneSpark-style alias.Behavior
from_utc_timestamp('2017-07-14 02:40:00', 'Europe/London')returns2017-07-14 03:40:00(BST is UTC+1 in July).to_utc_timestamp('2017-07-14 03:40:00', 'Europe/London')returns2017-07-14 02:40:00.Timestamp(unit, None)matching Spark.from_utc_timestamptreats the i64 as a UTC instant regardless of any tz label on the input;to_utc_timestampextracts the wall-clock using the input's own tz label (or treats naive as UTC), then re-interprets that wall-clock in the supplied tz.convert_timezone(target_tz, source_ts)is equivalent toconvert_time_zone(source_ts, target_tz)and requires the source to be tz-aware (nofrom_timezoneargument)."Not/A/Zone") error at planning time with a clear message.Test Plan
cargo check -p daft-functions-temporal -p daft-sqlmake buildDAFT_RUNNER=native pytest -q tests/dataframe/test_temporals.py -k "utc_timestamp or convert_timezone or round_trip"Related Issues