
[SPARK-56594][SQL] Add time_bucket scalar function#55535

Open
vranes wants to merge 1 commit into apache:master from vranes:time-bucket

Conversation


@vranes vranes commented Apr 24, 2026

What changes were proposed in this pull request?

This PR adds a new scalar SQL function time_bucket(bucket_size, ts[, origin]) that aligns a timestamp to the start of a fixed-size interval bucket. Given a bucket size (day-time or year-month interval), a timestamp, and an optional origin, it returns the start of the half-open bucket [start, start + bucket_size) containing the timestamp.

Buckets are anchored at origin (default 1970-01-01 00:00:00, interpreted as UTC for TIMESTAMP), and the grid extends infinitely in both directions. All bucketing is performed on UTC micros; the session time zone does not affect bucket alignment. For local wall-clock alignment in a DST zone, users can cast the TIMESTAMP to TIMESTAMP_NTZ.
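As a minimal sketch of the semantics above (plain Python, not Spark's actual implementation; the function name and signature are illustrative), day-time bucketing on UTC micros reduces to floor division relative to the origin:

```python
# Illustrative sketch of fixed-size bucketing on UTC microseconds.
# Not Spark's actual code; names and signatures are hypothetical.

MICROS_PER_MINUTE = 60_000_000

def time_bucket_micros(bucket: int, ts: int, origin: int = 0) -> int:
    """Start of the half-open bucket [start, start + bucket) containing ts."""
    if bucket <= 0:
        raise ValueError("bucket size must be positive")
    # Python's // floors toward negative infinity, so timestamps
    # before the origin still land at the start of their bucket.
    return origin + ((ts - origin) // bucket) * bucket

# 11:27 with a 15-minute bucket and the default epoch origin -> 11:15
assert time_bucket_micros(15 * MICROS_PER_MINUTE,
                          687 * MICROS_PER_MINUTE) == 675 * MICROS_PER_MINUTE
# Shifting the origin by 5 minutes moves the whole grid: 11:27 -> 11:20
assert time_bucket_micros(15 * MICROS_PER_MINUTE,
                          687 * MICROS_PER_MINUTE,
                          5 * MICROS_PER_MINUTE) == 680 * MICROS_PER_MINUTE
```

Because floor division rounds toward negative infinity, pre-origin timestamps fall into the correct bucket without special-casing.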

Changes:

  • New TimeBucket expression in sql/catalyst/.../expressions/datetimeExpressions.scala with an ExpressionBuilder that dispatches to two- or three-argument forms.
  • Bucketing helpers timeBucketDTInterval / timeBucketYMInterval in DateTimeUtils.scala, with overflow checks (Math.subtractExact, Math.multiplyExact) on extreme timestamps and origins.
  • Registered in FunctionRegistry.
  • Scala API: functions.time_bucket(...).
  • PySpark API: pyspark.sql.functions.time_bucket + Connect variant.

Why are the changes needed?

Aligning timestamps to fixed-size buckets (15 minutes, 1 hour, 1 month, etc.) is a common time-series pattern, but today users must assemble it manually, e.g., via date_trunc for calendar-aligned buckets or unix-timestamp arithmetic for fixed-second buckets, neither of which supports arbitrary year-month intervals or a non-default origin.

A time_bucket function matches the idiom popularized by PostgreSQL / TimescaleDB and makes the operation safe, concise, and composable.

Does this PR introduce any user-facing change?

Yes, a new function time_bucket is available in SQL, Scala, and PySpark.

Example:

SELECT time_bucket(INTERVAL '15' MINUTE, TIMESTAMP '2024-01-01 11:27:00');
-- 2024-01-01 11:15:00

SELECT time_bucket(
  INTERVAL '15' MINUTE,
  TIMESTAMP '2024-01-01 11:27:00',
  TIMESTAMP '1970-01-01 00:05:00');
-- 2024-01-01 11:20:00

How was this patch tested?

  • New unit tests in DateExpressionsSuite (codegen + interpreted paths, DT and YM intervals, TIMESTAMP/TIMESTAMP_NTZ, NULL propagation, negative/zero bucket-size validation, ExpressionBuilder).
  • New unit tests in DateTimeUtilsSuite for timeBucketDTInterval / timeBucketYMInterval including boundary values, negative timestamps, and extreme-origin overflow paths.
  • New SQL golden file sql-tests/inputs/time-bucket.sql covering: DT + YM interval buckets, TIMESTAMP + TIMESTAMP_NTZ, explicit origins, DST-safe NTZ-cast pattern (America/Los_Angeles), NULL propagation, invalid inputs (non-foldable, wrong types, non-positive).
  • PySpark doctest.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@vranes vranes marked this pull request as ready for review April 24, 2026 11:34
