Hudi SQL standards #13815
xushiyan
started this conversation in
General Discussions
Hudi SQL standards
#13815
Replies: 2 comments
-
For INSERT OVERWRITE - if we were to align with relational databases or warehouses, we would need to always truncate the table. right? |
Beta Was this translation helpful? Give feedback.
0 replies
-
Not sure if user has the needs for history data set recover or time travel in this case. |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
SQL is a primary way for people to interact with Hudi tables, therefore, a set of standard, commonly agreed semantics and behaviors should strictly followed and prominently listed out for everyone, both devs and users. I'm starting this discussion to collect points about improving the semantics and behaviors of using SQL in Hudi. I've noted down a few points to start with. With enough points, we should start an RFC for SQL standards, and then publish it on website as part of the tech spec.
Timestamp format validation
Time travel and incremental queries should conform to the same set of timestamp formats and perform strict validation before the read operation proceeds. Currently, timestamp strings are only loosely validated against
yyyyMMddHHmmssSSS
. A more comprehensive validation should be like:Insert overwrite (table) behavior needs simplification
Currently, using INSERT OVERWRITE (TABLE) to achieve insert overwrite partitions and the table follow this logic:
If
hoodie.datasource.write.operation
=insert_overwrite
, INSERT OVERWRITE (TABLE) always overwrites relevant partitions, either dynamically based on the data, or statically based on the PARTITION() clause.If
hoodie.datasource.write.operation
=insert_overwrite_table
, INSERT OVERWRITE (TABLE) always overwrites the whole table, regardless of having PARTITION() clause or not.If
hoodie.datasource.write.operation
is not set, it looks forhoodie.datasource.overwrite.mode
:hoodie.datasource.overwrite.mode
= DYNAMIC, INSERT OVERWRITE (TABLE) always overwrites relevant partitions, either dynamically based on the data, or statically based on the PARTITION() clause.hoodie.datasource.overwrite.mode
= STATIC,This involves too many configs and cases, we should simplify or re-define the behaviors.
Inferring merge modes and INSERT INTO write operation
preCombineField
(ordering field), theINSERT INTO
implicitly works as insert operation, and merge mode is commit time ordering.preCombineField
(ordering field), theINSERT INTO
implicitly works as upsert operation, and merge mode is event time ordering.The inferring should be prominently listed out as part of the SQL standards.
index management
While we expand the index support in metadata table, we need to set a clear guidance for users to manage indexes. For indexes depending on others, create/drop indexes should be carefully guarded. For example, when record index and secondary indexes have been created, users cannot drop record index before dropping all secondary indexes.
Index names: current
record_index
name is reserved for creating record index, which is not obvious to users. We should allow custom names for indexes, while behind the scene, each created index has another property indicating its type. So no reserved name, just fail when repeated name is given by user when creating an index.What are the SQL commands that you think need clearer definition or simplification? Please add your thoughts too.
Beta Was this translation helpful? Give feedback.
All reactions