feat(temporal): add Spark-style timezone conversions #6919

Open
BABTUNA wants to merge 12 commits into Eventual-Inc:main from BABTUNA:feat/temporal-tz-conversions

@BABTUNA BABTUNA commented May 12, 2026

Summary

Implements three more functions from issue #3798 by adding Spark-style from_utc_timestamp, to_utc_timestamp, and convert_timezone as native Daft temporal expressions.

This PR adds two new scalar UDFs in the temporal module for UTC↔local conversions that return tz-naive timestamps (matching Spark semantics), plus a convert_timezone alias over the existing convert_time_zone that reverses the argument order to match Spark. Python and SQL surfaces are both wired.

Why

The issue asks for parity with PySpark's temporal functions. This PR focuses on:

  • UTC → local wall-clock conversion (from_utc_timestamp) producing tz-naive output.
  • Local wall-clock → UTC conversion (to_utc_timestamp) producing tz-naive output.
  • A Spark-style convert_timezone(target_tz, source_ts) alias that matches Spark's argument order.

Changes Made

  • Add FromUtcTimestamp and ToUtcTimestamp scalar UDFs in src/daft-functions-temporal/src/time.rs:
    • Both reuse daft_schema::time_unit helpers (parse_timezone, timestamp_to_naive_local, naive_local_to_timestamp, naive_datetime_to_timestamp).
    • Output dtype is always Timestamp(unit, None) regardless of input tz label.
  • Add daft-schema as a direct dependency in src/daft-functions-temporal/Cargo.toml so the helpers are available to the UDFs.
  • Change mod time to pub mod time in src/daft-functions-temporal/src/lib.rs so the SQL crate can register handlers.
  • Register FromUtcTimestamp and ToUtcTimestamp in TemporalFunctions.
  • Add SQL handlers SQLFromUtcTimestamp, SQLToUtcTimestamp, and SQLConvertTimezone in src/daft-sql/src/modules/temporal.rs. The convert_timezone handler delegates to the existing ConvertTimeZone UDF with Spark's reversed argument order.
  • Add Python wrappers from_utc_timestamp, to_utc_timestamp, and convert_timezone in daft/functions/datetime.py and export them from daft/functions/__init__.py.
  • Add focused tests in tests/dataframe/test_temporals.py:
    • from_utc_timestamp coverage: named tz (Europe/London BST), fixed offset (+05:30), tz-aware input.
    • to_utc_timestamp coverage: named tz.
    • Round-trip identity for non-DST instants.
    • Null propagation, invalid timezone error path.
    • SQL integration for both UTC conversions.
    • convert_timezone Spark-style alias.
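To make the "output dtype is always Timestamp(unit, None)" point concrete, here is a minimal plain-Python sketch of what happens to the stored i64 value. It is not the Rust implementation in time.rs: the function name `shift_utc_epoch_to_local` and the unit table are hypothetical, and the standard-library `zoneinfo` module stands in for the daft-schema helpers. The input is read as a UTC instant, and the result is the local wall-clock time re-encoded as a tz-naive epoch value in the same unit.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional
from zoneinfo import ZoneInfo

# Hypothetical unit table: ticks per second for each timestamp unit.
TICKS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000, "ns": 1_000_000_000}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def shift_utc_epoch_to_local(value: Optional[int], unit: str, tz_name: str) -> Optional[int]:
    """Return the tz-naive epoch value whose wall-clock equals `value` seen in tz_name."""
    if value is None:
        return None  # nulls propagate
    ticks = TICKS_PER_SECOND[unit]
    # Floor division keeps pre-1970 (negative) values on the correct second.
    instant = EPOCH + timedelta(seconds=value // ticks)
    offset = instant.astimezone(ZoneInfo(tz_name)).utcoffset()
    # Shifting by the UTC offset yields the local wall-clock as a naive epoch value.
    return value + int(offset.total_seconds()) * ticks


# 1_500_000_000 s is 2017-07-14 02:40:00 UTC; London is on BST (UTC+1) then.
print(shift_utc_epoch_to_local(1_500_000_000, "s", "Europe/London"))  # 1500003600
```

The unit is carried through unchanged, which is why the sketch only ever adds a whole number of ticks.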

Behavior

  • from_utc_timestamp('2017-07-14 02:40:00', 'Europe/London') returns 2017-07-14 03:40:00 (BST is UTC+1 in July).
  • to_utc_timestamp('2017-07-14 03:40:00', 'Europe/London') returns 2017-07-14 02:40:00.
  • Both functions always return a tz-naive Timestamp(unit, None) matching Spark.
  • from_utc_timestamp treats the i64 as a UTC instant regardless of any tz label on the input; to_utc_timestamp extracts the wall-clock using the input's own tz label (or treats naive as UTC), then re-interprets that wall-clock in the supplied tz.
  • convert_timezone(target_tz, source_ts) is equivalent to convert_time_zone(source_ts, target_tz) and requires the source to be tz-aware (no from_timezone argument).
  • Invalid timezone strings (e.g. "Not/A/Zone") error at planning time with a clear message.
  • Null in the input row propagates to null in the output.
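The behavior listed above can be sketched with the standard library alone. This is an illustration of the described semantics, not Daft's code: the two functions below are plain-Python stand-ins that happen to share the names of the new expressions, and they operate on single `datetime` values rather than columns. One deliberate simplification: `zoneinfo` silently resolves DST gaps and overlaps, so this sketch does not reproduce the strict error behavior discussed in the review.

```python
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo


def from_utc_timestamp(ts: Optional[datetime], tz_name: str) -> Optional[datetime]:
    """Read ts as a UTC instant and return the tz-naive wall-clock in tz_name."""
    if ts is None:
        return None
    # Aware input already pins an instant; naive input is taken to be UTC.
    instant = ts.astimezone(timezone.utc) if ts.tzinfo else ts.replace(tzinfo=timezone.utc)
    return instant.astimezone(ZoneInfo(tz_name)).replace(tzinfo=None)


def to_utc_timestamp(ts: Optional[datetime], tz_name: str) -> Optional[datetime]:
    """Read ts's wall-clock as local time in tz_name and return tz-naive UTC."""
    if ts is None:
        return None
    wall = ts.replace(tzinfo=None)  # keep the wall-clock, drop any tz label
    return wall.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc).replace(tzinfo=None)


print(from_utc_timestamp(datetime(2017, 7, 14, 2, 40), "Europe/London"))  # 2017-07-14 03:40:00
print(to_utc_timestamp(datetime(2017, 7, 14, 3, 40), "Europe/London"))    # 2017-07-14 02:40:00
```

For instants away from a DST transition the two functions are inverses, which is the round-trip property the tests check.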

Test Plan

  • cargo check -p daft-functions-temporal -p daft-sql
  • make build
  • DAFT_RUNNER=native pytest -q tests/dataframe/test_temporals.py -k "utc_timestamp or convert_timezone or round_trip"

Related Issues

Implements from_utc_timestamp, to_utc_timestamp, and convert_timezone for
Spark parity (Eventual-Inc#3798). FromUtcTimestamp interprets the input as a UTC
instant and returns the wall-clock time in the given timezone as a tz-naive
Timestamp. ToUtcTimestamp does the inverse. convert_timezone is a Python
and SQL alias over the existing convert_time_zone with Spark's reversed
argument order (target_tz, source_ts).
@BABTUNA BABTUNA requested a review from a team as a code owner May 12, 2026 04:34
@github-actions github-actions Bot added the feat label May 12, 2026

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds Spark-compatible from_utc_timestamp, to_utc_timestamp, and convert_timezone to Daft's temporal module, wiring them through Python, SQL, and Rust. The implementation reuses existing daft-schema chrono helpers and correctly handles tz-naive/tz-aware inputs, DST errors, and null propagation.

  • FromUtcTimestamp and ToUtcTimestamp are new scalar UDFs in time.rs backed by timestamp_to_naive_local/naive_local_to_timestamp; build_timestamp_array ensures Arrow physical types match the declared Field dtype.
  • convert_timezone is a thin Python/SQL alias over ConvertTimeZone that reverses argument order to match Spark; the SQL handler correctly maps target_str to the to_timezone named argument expected by ConvertArgs.
  • Tests cover round-trips, fixed-offset timezones, null propagation, invalid timezone error paths, and SQL integration for all three functions.
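The reversed-argument alias is easy to picture in plain Python. The two functions below are hypothetical stand-ins built on `zoneinfo`, not the Daft wrappers in daft/functions/datetime.py. They only illustrate that the alias swaps positional order and forwards the target as a `to_timezone` keyword, and that a tz-naive source is rejected because there is no from_timezone argument.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def convert_time_zone(source_ts: datetime, to_timezone: str) -> datetime:
    """Stand-in for the existing function: (timestamp, target timezone)."""
    if source_ts.tzinfo is None:
        # Assumption for this sketch: a naive source is an error.
        raise ValueError("source timestamp must be tz-aware")
    return source_ts.astimezone(ZoneInfo(to_timezone))


def convert_timezone(target_tz: str, source_ts: datetime) -> datetime:
    """Spark-style alias: (target timezone, timestamp)."""
    return convert_time_zone(source_ts, to_timezone=target_tz)


ts = datetime(2017, 7, 14, 2, 40, tzinfo=timezone.utc)
print(convert_timezone("America/New_York", ts))  # 2017-07-13 22:40:00-04:00
```

The instant is unchanged by the conversion; only the timezone label and the displayed wall-clock differ.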

Confidence Score: 5/5

Safe to merge — the three new temporal functions are self-contained additions with no changes to existing code paths.

All daft-schema helpers correctly treat i64 epoch values as UTC, aligning with Arrow semantics and Spark-matching behavior. SQLConvertTimezone correctly routes to ConvertTimeZone using the to_timezone named argument. The previously flagged Int64Array/Timestamp dtype mismatch is resolved with typed Arrow arrays. No existing functionality is modified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-functions-temporal/src/time.rs | Adds FromUtcTimestamp and ToUtcTimestamp UDFs with correct semantics; typed Arrow arrays fix the previous Int64/Timestamp dtype mismatch; shared UtcConversionArgs struct is well-structured. |
| src/daft-sql/src/modules/temporal.rs | Three new SQL handlers registered; SQLConvertTimezone correctly maps the reversed Spark argument order to ConvertTimeZone's to_timezone named arg; argument count validation is consistent. |
| daft/functions/datetime.py | Three Python wrappers added with accurate docstrings; convert_timezone delegates to convert_time_zone with reversed args; from/to_utc_timestamp pass timezone as the expected named keyword. |
| tests/dataframe/test_temporals.py | Nine new tests covering BST/fixed-offset conversions, round-trip identity, null propagation, invalid timezone error, and SQL integration for all three functions. |
| daft/functions/__init__.py | Three new symbols imported and added to __all__ in alphabetically correct positions. |
| src/daft-functions-temporal/src/lib.rs | mod time changed from private to pub so SQL crate can import; FromUtcTimestamp and ToUtcTimestamp registered with TemporalFunctions. |
| src/daft-functions-temporal/Cargo.toml | daft-schema added as a direct dependency so time_unit helpers are available to the new UDFs. |

Reviews (3): Last reviewed commit: "chore(temporal): address Greptile review..."

Comment thread src/daft-functions-temporal/src/time.rs Outdated
Comment on lines +212 to +219
let arrow_arr: arrow_array::ArrayRef = Arc::new(Int64Array::from(values));
Series::from_arrow(
    Arc::new(Field::new(
        input.name().to_string(),
        DataType::Timestamp(time_unit, None),
    )),
    arrow_arr,
)

P2 Int64Array used instead of typed Arrow Timestamp array

Both FromUtcTimestamp and ToUtcTimestamp build an Int64Array and then wrap it in a Field whose dtype is Timestamp(time_unit, None). Arrow's physical type for Int64Array is DataType::Int64, which mismatches the Timestamp dtype declared in the Field. Series::from_arrow must silently reinterpret the Int64 as a Timestamp for the tests to pass.

If from_arrow ever adds a stricter type-check (or does so under certain configurations), this will fail at runtime. The safer approach is to use a typed TimestampArray via arrow_array::PrimitiveArray::<arrow_array::types::TimestampMicrosecondType> (or the appropriate unit variant). The same issue exists at line 293 in ToUtcTimestamp::call.

@BABTUNA BABTUNA May 12, 2026

Fixed with a typed-array helper that builds a TimestampSecond/Millisecond/Microsecond/Nanosecond array depending on the input TimeUnit. The Arrow physical dtype now matches the Field.

Comment on lines +282 to +288
let naive_local = match &input_tz_parsed {
    Some(in_tz) => timestamp_to_naive_local(ts, time_unit, in_tz),
    None => timestamp_to_naive_datetime(ts, time_unit),
};
let utc_ts =
    naive_local_to_timestamp(naive_local, time_unit, &tz_parsed, &timezone)?;
values.push(Some(utc_ts));

P2 DST transition behavior diverges from Spark

naive_local_to_timestamp propagates DaftError::ValueError for both ambiguous times (DST fall-back) and nonexistent times (DST spring-forward). Spark instead silently resolves ambiguous times to the pre-transition offset and advances past the gap for nonexistent times.

A user migrating from Spark who has timestamps around a DST boundary will get a runtime error where Spark would have produced a result. The difference is worth documenting in the docstring even if the strict behavior is intentional.

@BABTUNA BABTUNA May 12, 2026

Documented in the to_utc_timestamp docstring. Keeping the strict behavior for now: silently picking a side on DST transitions can be just as surprising as erroring.
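For readers unfamiliar with the two DST edge cases in this thread, the sketch below shows one way to detect them in plain Python and raise, mirroring the strict contract described here. It is an assumption-laden illustration, not the logic of naive_local_to_timestamp: the function name `to_utc_strict` is hypothetical and the detection relies on the `fold` attribute from PEP 495 as implemented by `zoneinfo`.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_utc_strict(wall: datetime, tz_name: str) -> datetime:
    """Convert a naive local wall-clock to naive UTC, raising on DST gaps and overlaps."""
    tz = ZoneInfo(tz_name)
    early = wall.replace(tzinfo=tz, fold=0)
    late = wall.replace(tzinfo=tz, fold=1)
    # The two folds disagree on the offset only inside a gap or an overlap.
    if early.utcoffset() != late.utcoffset():
        # A real (ambiguous) time survives a round trip through UTC; a gap time does not.
        round_trip = early.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
        kind = "ambiguous" if round_trip == wall else "nonexistent"
        raise ValueError(f"{kind} local time {wall} in {tz_name}")
    return early.astimezone(timezone.utc).replace(tzinfo=None)


print(to_utc_strict(datetime(2017, 7, 14, 3, 40), "Europe/London"))  # 2017-07-14 02:40:00
```

With this sketch, 2021-03-14 02:30 in America/New_York is reported as nonexistent (spring-forward gap) and 2021-11-07 01:30 as ambiguous (fall-back overlap). A lenient variant would return a result for `early` or `late` instead of raising.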

Comment on lines +1148 to +1155
def test_utc_timestamp_sql() -> None:
    df = daft.from_pydict({"ts": [datetime(2017, 7, 14, 2, 40)]})  # noqa: F841
    from_result = daft.sql("SELECT from_utc_timestamp(ts, 'Europe/London') AS local FROM df").to_pydict()
    assert from_result["local"] == [datetime(2017, 7, 14, 3, 40)]

    to_result = daft.sql("SELECT to_utc_timestamp(ts, 'Europe/London') AS utc FROM df").to_pydict()
    # The naive wall-clock 02:40 interpreted in BST is 01:40 UTC.
    assert to_result["utc"] == [datetime(2017, 7, 14, 1, 40)]

P2 No SQL test for convert_timezone

SQLConvertTimezone is registered and tested in Python (test_convert_timezone_spark_alias) but there is no SQL execution test. A basic SELECT convert_timezone('America/New_York', ts) FROM df assertion would verify the SQL planner correctly builds the reversed-argument ConvertTimeZone call.

@BABTUNA BABTUNA May 12, 2026

Added test_convert_timezone_sql.

Use typed Timestamp{Second,Millisecond,Microsecond,Nanosecond}Array so the
Arrow physical dtype matches the Field's Timestamp dtype. Document the DST
divergence from Spark (strict ValueError on ambiguous/nonexistent local
times) in the to_utc_timestamp docstring. Add SQL test for convert_timezone.

BABTUNA commented May 12, 2026

@greptile re-review

@madvart madvart self-requested a review May 12, 2026 20:00

madvart commented May 16, 2026

Thanks @BABTUNA. Overall looks good. One minor improvement and conflict resolution needed:

No DST regression test — the docstring describes the spring-forward gap and fall-back ambiguity as errors, but there's no test pinning that contract. Worth adding:
def test_to_utc_timestamp_dst_gap_raises():
    # US/Eastern spring-forward: 2021-03-14 02:30 doesn't exist
    df = daft.from_pydict({"ts": [datetime(2021, 3, 14, 2, 30)]})
    with pytest.raises(Exception, match="(?i)ambiguous|nonexistent|local"):
        df.with_column("utc", to_utc_timestamp(col("ts"), "America/New_York")).collect()

Without this, if someone later "fixes" naive_local_to_timestamp to silently resolve to Spark-style behavior, the docstring lies and no test catches it.

…ersions

# Conflicts:
#	src/daft-functions-temporal/src/lib.rs
#	src/daft-sql/src/modules/temporal.rs
#	tests/dataframe/test_temporals.py

BABTUNA commented May 20, 2026

Went ahead and added test_to_utc_timestamp_dst_gap_raises using the 2021-03-14 02:30 US/Eastern spring-forward case you suggested. Also resolved the three merge conflicts against main that came up since add_months and months_between were merged 👍

BABTUNA commented Jun 13, 2026

@colin-ho @madvart okok this one was updated with main and is ready
