Hudi SQL standards #13815

xushiyan · 2025-08-29T21:35:32Z

xushiyan
Aug 29, 2025
Collaborator

SQL is a primary way for people to interact with Hudi tables, therefore, a set of standard, commonly agreed semantics and behaviors should strictly followed and prominently listed out for everyone, both devs and users. I'm starting this discussion to collect points about improving the semantics and behaviors of using SQL in Hudi. I've noted down a few points to start with. With enough points, we should start an RFC for SQL standards, and then publish it on website as part of the tech spec.

Timestamp format validation

Time travel and incremental queries should conform to the same set of timestamp formats and perform strict validation before the read operation proceeds. Currently, timestamp strings are only loosely validated against yyyyMMddHHmmssSSS. A more comprehensive validation should be like:

Hudi Timeline format (highest matching precedence): yyyyMMddHHmmssSSS or yyyyMMddHHmmss.
Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
ISO 8601 format including but not limited to:
- yyyy-MM-dd'T'HH:mm:ss.SSS+00:00
- yyyy-MM-dd'T'HH:mm:ss.SSSZ
- yyyy-MM-dd'T'HH:mm:ss.SSS
- yyyy-MM-dd'T'HH:mm:ss+00:00
- yyyy-MM-dd'T'HH:mm:ssZ
- yyyy-MM-dd'T'HH:mm:ss
- yyyy-MM-dd

Insert overwrite (table) behavior needs simplification

Currently, using INSERT OVERWRITE (TABLE) to achieve insert overwrite partitions and the table follow this logic:

If hoodie.datasource.write.operation = insert_overwrite, INSERT OVERWRITE (TABLE) always overwrites relevant partitions, either dynamically based on the data, or statically based on the PARTITION() clause.
If hoodie.datasource.write.operation = insert_overwrite_table, INSERT OVERWRITE (TABLE) always overwrites the whole table, regardless of having PARTITION() clause or not.

If hoodie.datasource.write.operation is not set, it looks for hoodie.datasource.overwrite.mode:

If hoodie.datasource.overwrite.mode = DYNAMIC, INSERT OVERWRITE (TABLE) always overwrites relevant partitions, either dynamically based on the data, or statically based on the PARTITION() clause.
If hoodie.datasource.overwrite.mode = STATIC,
- INSERT OVERWRITE (TABLE) PARTITION() will overwrite the specified partition
- INSERT OVERWRITE (TABLE) (no static PARTITION clause) will always overwrite the whole table

This involves too many configs and cases, we should simplify or re-define the behaviors.

Inferring merge modes and INSERT INTO write operation

When a table is created with no preCombineField (ordering field), the INSERT INTO implicitly works as insert operation, and merge mode is commit time ordering.
When a table is created with preCombineField (ordering field), the INSERT INTO implicitly works as upsert operation, and merge mode is event time ordering.

The inferring should be prominently listed out as part of the SQL standards.

index management

While we expand the index support in metadata table, we need to set a clear guidance for users to manage indexes. For indexes depending on others, create/drop indexes should be carefully guarded. For example, when record index and secondary indexes have been created, users cannot drop record index before dropping all secondary indexes.

Index names: current record_index name is reserved for creating record index, which is not obvious to users. We should allow custom names for indexes, while behind the scene, each created index has another property indicating its type. So no reserved name, just fail when repeated name is given by user when creating an index.

What are the SQL commands that you think need clearer definition or simplification? Please add your thoughts too.

vinothchandar · 2025-09-04T02:56:01Z

vinothchandar
Sep 4, 2025
Collaborator

This involves too many configs and cases, we should simplify or re-define the behaviors.

For INSERT OVERWRITE - if we were to align with relational databases or warehouses, we would need to always truncate the table. right?

0 replies

danny0405 · 2025-09-04T03:13:58Z

danny0405
Sep 4, 2025
Collaborator

For INSERT OVERWRITE - if we were to align with relational databases or warehouses, we would need to always truncate the table. right?

Not sure if user has the needs for history data set recover or time travel in this case.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hudi SQL standards #13815

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Hudi SQL standards #13815

Uh oh!

xushiyan Aug 29, 2025 Collaborator

Timestamp format validation

Insert overwrite (table) behavior needs simplification

Inferring merge modes and INSERT INTO write operation

index management

Replies: 2 comments

Uh oh!

vinothchandar Sep 4, 2025 Collaborator

Uh oh!

danny0405 Sep 4, 2025 Collaborator

xushiyan
Aug 29, 2025
Collaborator

vinothchandar
Sep 4, 2025
Collaborator

danny0405
Sep 4, 2025
Collaborator