
[SPARK-56594][SQL] Add time_bucket scalar function#55535

Open
vranes wants to merge 1 commit into apache:master from vranes:time-bucket

Conversation


@vranes vranes commented Apr 24, 2026

What changes were proposed in this pull request?

This PR adds a new scalar SQL function time_bucket(bucket_size, ts[, origin]) that aligns a timestamp to the start of a fixed-size interval bucket. Given a bucket size (day-time or year-month interval), a timestamp, and an optional origin, it returns the start of the half-open bucket [start, start + bucket_size) containing the timestamp.

Buckets are anchored at origin (default 1970-01-01 00:00:00, interpreted as UTC for TIMESTAMP), and the grid extends infinitely in both directions. All bucketing is performed on UTC micros; the session time zone does not affect bucket alignment. For local wall-clock alignment in a DST zone, users can cast the TIMESTAMP to TIMESTAMP_NTZ.
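As a minimal sketch of the semantics above (plain Python, not Spark's actual implementation; the function name and signature are illustrative), day-time bucketing on UTC micros reduces to floor division relative to the origin:

```python
# Illustrative sketch of fixed-size bucketing on UTC microseconds.
# Not Spark's actual code; names and signatures are hypothetical.

MICROS_PER_MINUTE = 60_000_000

def time_bucket_micros(bucket: int, ts: int, origin: int = 0) -> int:
    """Start of the half-open bucket [start, start + bucket) containing ts."""
    if bucket <= 0:
        raise ValueError("bucket size must be positive")
    # Python's // floors toward negative infinity, so timestamps
    # before the origin still land at the start of their bucket.
    return origin + ((ts - origin) // bucket) * bucket

# 11:27 with a 15-minute bucket and the default epoch origin -> 11:15
assert time_bucket_micros(15 * MICROS_PER_MINUTE,
                          687 * MICROS_PER_MINUTE) == 675 * MICROS_PER_MINUTE
# Shifting the origin by 5 minutes moves the whole grid: 11:27 -> 11:20
assert time_bucket_micros(15 * MICROS_PER_MINUTE,
                          687 * MICROS_PER_MINUTE,
                          5 * MICROS_PER_MINUTE) == 680 * MICROS_PER_MINUTE
```

Because floor division rounds toward negative infinity, pre-origin timestamps fall into the correct bucket without special-casing.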

Changes:

  • New TimeBucket expression in sql/catalyst/.../expressions/datetimeExpressions.scala with an ExpressionBuilder that dispatches to two- or three-argument forms.
  • Bucketing helpers timeBucketDTInterval / timeBucketYMInterval in DateTimeUtils.scala, with overflow checks (Math.subtractExact, Math.multiplyExact) on extreme timestamps and origins.
  • Registered in FunctionRegistry.
  • Scala API: functions.time_bucket(...).
  • PySpark API: pyspark.sql.functions.time_bucket + Connect variant.

Why are the changes needed?

Aligning timestamps to fixed-size buckets (15 minutes, 1 hour, 1 month, etc.) is a common time-series pattern, but today users must assemble it manually, e.g., via date_trunc for calendar-aligned buckets or unix-timestamp arithmetic for fixed-second buckets, neither of which supports arbitrary year-month intervals or a non-default origin.

A time_bucket function matches the idiom popularized by PostgreSQL / TimescaleDB and makes the operation safe, concise, and composable.

Does this PR introduce any user-facing change?

Yes, a new function time_bucket is available in SQL, Scala, and PySpark.

Example:

SELECT time_bucket(INTERVAL '15' MINUTE, TIMESTAMP '2024-01-01 11:27:00');
-- 2024-01-01 11:15:00

SELECT time_bucket(
  INTERVAL '15' MINUTE,
  TIMESTAMP '2024-01-01 11:27:00',
  TIMESTAMP '1970-01-01 00:05:00');
-- 2024-01-01 11:20:00

How was this patch tested?

  • New unit tests in DateExpressionsSuite (codegen + interpreted paths, DT and YM intervals, TIMESTAMP/TIMESTAMP_NTZ, NULL propagation, negative/zero bucket-size validation, ExpressionBuilder).
  • New unit tests in DateTimeUtilsSuite for timeBucketDTInterval / timeBucketYMInterval including boundary values, negative timestamps, and extreme-origin overflow paths.
  • New SQL golden file sql-tests/inputs/time-bucket.sql covering: DT + YM interval buckets, TIMESTAMP + TIMESTAMP_NTZ, explicit origins, DST-safe NTZ-cast pattern (America/Los_Angeles), NULL propagation, invalid inputs (non-foldable, wrong types, non-positive).
  • PySpark doctest.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code

@vranes vranes marked this pull request as ready for review April 24, 2026 11:34
