RFC-80: Design proposal discussions #14062
Replies: 5 comments 2 replies
-
Writer path:
Baseline (No split column groups):
Pros:
Cons:
Proposal A:
Pros:
Cons:
Proposal B:
Pros:
Cons:
-
Reader path:
Baseline (No split column groups):
Pros:
Cons:
Proposal A:
Pros:
Cons:
Proposal B:
Pros:
Cons:
-
Context on A vs B here: https://github.com/apache/hudi/pull/13924/files#diff-f952a62cde8bc5596286031141122a7aa4f15ac9e1b0b4305a9b926d22e3e853R190
-
At a high level, I feel we should give read scenarios higher priority here, since warehousing workloads tend to see more reads than writes. From that angle, A looks like the better choice. Another decision I want to confirm: do we want to introduce the column group notion for a table that has just one "column group"? Would that bring unnecessary overhead to the read and write paths? Can you provide an analysis here?
-
In summary: B may be achievable by the user already:
I think we should deeply understand the ML/AI pipeline lifecycle that will use this table/data, and its access patterns. For example, it may in fact be desirable not to cluster the unstructured data (to keep the distribution randomized), or actually preferable to cluster it (e.g. reorganize a table of HTML documents based on the URL domain).
-
I'm creating a discussion page here to break down the pros and cons of the different approaches, so we can reply in threads to help organize the conversation.
Context: https://github.com/apache/hudi/pull/13924/files#diff-f952a62cde8bc5596286031141122a7aa4f15ac9e1b0b4305a9b926d22e3e853R190
Requirements:
Goals: