RFC-80: Design proposal discussions #14062

the-other-tim-brown · 2025-10-08T19:46:52Z


I'm creating a discussion page here to break down the pros and cons of the different approaches, so we can reply in threads and keep the conversation organized.

Context: https://github.com/apache/hudi/pull/13924/files#diff-f952a62cde8bc5596286031141122a7aa4f15ac9e1b0b4305a9b926d22e3e853R190

Requirements:

All existing functionality needs to continue working. This includes but is not limited to upserts, time travel queries, incremental queries, etc.

Goals:

Updates should be efficient even in the presence of blobs, unstructured data, or larger text fields.
We should maintain our ability to filter files for efficient snapshot and incremental queries.

the-other-tim-brown · 2025-10-08T19:54:55Z

the-other-tim-brown
Oct 8, 2025
Author

Writer path:

Baseline (No split column groups):

Pros:

A given key is assigned to a single file group, so the lookup is done only once.

Cons:

If there is unstructured data or columns with larger values, each file will hold fewer keys in order to stay within the target file size.
If there is an update to a subset of the columns, the full file gets rewritten.

Proposal A:

Pros:

A given key is assigned to a single file group, so the lookup is done only once.
Updates that touch only some of the column groups can avoid parsing and rewriting the columns in the remaining groups.

Cons:

The files within a group all contain the same keys and if the size of the data per column group is not balanced, you will have unbalanced file sizes. This will result in small or large files and can later impact the read and update performance.

Proposal B:

Pros:

Updates that touch only some of the column groups can avoid parsing and rewriting the columns in the remaining groups.
The file sizes grow independently, allowing for well-sized files within each column group.

Cons:

A given key now belongs to N file groups where N is the number of column groups. This increases the cost of an upsert operation.
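To make the file-size trade-off between A and B concrete, here is a rough back-of-the-envelope sketch. The target file size, the column group names, and the per-row byte widths are all made-up numbers for illustration, not values from the RFC.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # assumed target file size

# Hypothetical average bytes per row for each column group.
BYTES_PER_ROW = {"structured": 256, "blob": 64 * 1024}


def proposal_a_file_sizes(bytes_per_row, target):
    """Proposal A: every column-group file in a file group holds the same keys.

    The key count is capped by the widest column group, so the narrower
    groups end up with proportionally smaller files.
    """
    keys = target // max(bytes_per_row.values())
    return {group: keys * width for group, width in bytes_per_row.items()}


def proposal_b_keys_per_file(bytes_per_row, target):
    """Proposal B: each column group sizes its files independently."""
    return {group: target // width for group, width in bytes_per_row.items()}


a_sizes = proposal_a_file_sizes(BYTES_PER_ROW, TARGET_FILE_BYTES)
b_keys = proposal_b_keys_per_file(BYTES_PER_ROW, TARGET_FILE_BYTES)
print(a_sizes)  # bytes per file, per column group, under A
print(b_keys)   # keys per full-size file, per column group, under B
```

With these assumed widths, A leaves the structured column group with a file that is a small fraction of the target, while B fills both files to the target but places a different number of keys in each.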

1 reply

vinothchandar Oct 9, 2025
Collaborator

This increases the cost of an upsert operation.

Assuming the record level index can now store record_key -> fileGroupForColGroupA, fileGroupForColGroupB, ..., the update costs are then the same across A or B, right? We write a log file (block) per column group affected (whether it's in the same file group or a different one).
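A minimal sketch of that idea, purely illustrative: the index layout, the file group ids, and the `plan_log_writes` helper below are hypothetical and are not Hudi's actual record level index API.

```python
from collections import defaultdict

# Hypothetical index layout: record key -> {column group -> file group id}.
record_index = {
    "key-1": {"colGroupA": "fg-a-07", "colGroupB": "fg-b-19"},  # Proposal B style
    "key-2": {"colGroupA": "fg-03", "colGroupB": "fg-03"},      # Proposal A style
}

# Hypothetical assignment of columns to column groups.
COLUMN_TO_GROUP = {"name": "colGroupA", "price": "colGroupA", "html": "colGroupB"}


def plan_log_writes(updates):
    """Bucket updated values by (file group, column group).

    Each key needs one index lookup, which returns the location for
    every column group, regardless of whether the groups share a file group.
    """
    plan = defaultdict(list)
    for key, changed in updates.items():
        locations = record_index[key]  # single lookup per key
        for column, value in changed.items():
            group = COLUMN_TO_GROUP[column]
            plan[(locations[group], group)].append((key, column, value))
    return dict(plan)


plan = plan_log_writes({
    "key-1": {"price": 10},
    "key-2": {"price": 12, "html": "<p/>"},
})
for target, records in plan.items():
    print(target, records)
```

In this sketch only the column groups that actually changed get a log block, and the number of index lookups is one per key in both layouts.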

the-other-tim-brown · 2025-10-08T20:05:20Z

the-other-tim-brown
Oct 8, 2025
Author

Reader path:

Baseline (No split column groups):

Pros:

All data is in a base file + log files and can be easily read
Can easily prune the files that must be read

Cons:

Since the total number of keys per file is potentially smaller, we will have more files to open even if only a subset of columns is used.

Proposal A:

Pros:

If we can maintain consistent ordering between the column groups, we can open multiple iterators and just iterate through them and join the values to compute the final row.
We can more easily prune files that need to be read since they are grouped by the keys.

Cons:

Potentially small files can lead to performance issues.
If the ordering of keys is not consistent between the files, we will need to do a join on the rows or buffer some of the files in memory to compute the final rows.
If event time ordering is used and the ordering field is not in the column group that is read, then we will potentially need to read the value from the other file group to properly determine the final row when merging log files.
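The difference between the ordered and unordered cases for Proposal A can be sketched as below. The `(key, values)` tuples stand in for rows read from each column group file; the function names and sample data are assumptions for illustration only.

```python
def zip_join(*column_group_iters):
    """Stitch rows together when every column group yields keys in the same order."""
    for parts in zip(*column_group_iters):
        key = parts[0][0]
        if any(k != key for k, _ in parts):
            raise ValueError("column groups are not aligned on key order")
        row = {}
        for _, values in parts:
            row.update(values)
        yield key, row


def hash_join(first, *others):
    """Fallback when key order differs: buffer the other column groups in memory."""
    buffered = [dict(it) for it in others]
    for key, values in first:
        row = dict(values)
        for lookup in buffered:
            row.update(lookup[key])
        yield key, row


group_a = [("k1", {"name": "a"}), ("k2", {"name": "b"})]
group_b_sorted = [("k1", {"html": "<p>1</p>"}), ("k2", {"html": "<p>2</p>"})]
group_b_shuffled = list(reversed(group_b_sorted))

ordered_rows = list(zip_join(group_a, group_b_sorted))
joined_rows = list(hash_join(group_a, group_b_shuffled))
print(ordered_rows == joined_rows)
```

The zip path streams with constant memory per file, whereas the fallback has to hold all but one column group in memory (or spill), which is the cost the second con above refers to.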

Proposal B:

Pros:

Well-sized files lead to better read performance for individual files.

Cons:

Since row keys are now split amongst various file groups, the rows must be computed by doing a join between the column groups.
If a filter is specified on a field in one column group, we will not be able to easily prune the candidate files from the other column groups' file groups, leading to more IO for a given query.
For incremental queries, if the commit time is only reflected in the updated column groups, then we may not be able to effectively filter out files, since we can only know when the row was updated after joining all the column groups.
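A toy illustration of the pruning concern under Proposal B, assuming min/max column stats are the only pruning mechanism. The file group ids, the `price` column, and the stats are invented for the example.

```python
# Hypothetical min/max stats for "price", which lives only in column group A.
GROUP_A_STATS = {"fg-a-1": (0, 50), "fg-a-2": (51, 100), "fg-a-3": (101, 150)}
# Column group B files hold other columns, so they carry no "price" stats.
GROUP_B_FILES = ["fg-b-1", "fg-b-2", "fg-b-3", "fg-b-4"]


def candidate_files(low, high):
    """Return the files to scan for a filter `low <= price <= high`."""
    pruned_a = [
        fg for fg, (mn, mx) in GROUP_A_STATS.items()
        if mn <= high and mx >= low  # stats range overlaps the filter range
    ]
    # With stats alone, nothing tells us which B files hold the matching keys.
    return pruned_a, list(GROUP_B_FILES)


a_files, b_files = candidate_files(60, 90)
print(a_files, b_files)
```

Here the filter narrows column group A to one file, but every file in column group B remains a candidate unless some additional key-to-file mapping is consulted.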

0 replies

vinothchandar · 2025-10-09T01:36:07Z

vinothchandar
Oct 9, 2025
Collaborator

Context on A vs B here: https://github.com/apache/hudi/pull/13924/files#diff-f952a62cde8bc5596286031141122a7aa4f15ac9e1b0b4305a9b926d22e3e853R190

0 replies

danny0405 · 2025-10-09T01:41:37Z

danny0405
Oct 9, 2025
Collaborator

From a high level, I feel we should give read scenarios higher priority here, since warehousing workloads are prone to having more reads than writes. On that basis, A looks like the better choice.

Another decision I want to confirm: do we want to bring the column group notion to tables that have just one "column group"? Would that introduce unnecessary overhead for the read and write paths? Can you lay out an analysis here?

1 reply

vinothchandar Oct 9, 2025
Collaborator

do we want to bring column group notion for table that just has one "column group"

It should be considered part of a default column group, completely hidden from users and backwards compatible. I don't like us having two code paths everywhere: no column group vs. with column group.
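Roughly what I mean by a single code path, as a sketch only. The `DEFAULT_GROUP` name, the config shape, and `resolve_column_groups` are hypothetical, not an existing Hudi config or API.

```python
DEFAULT_GROUP = "default"  # hypothetical name for the implicit column group


def resolve_column_groups(columns, configured=None):
    """Always return a column-group mapping, even for tables with no config.

    Columns not named in any configured group fall into the default group,
    so existing tables look like a table with exactly one column group.
    """
    configured = configured or {}
    assigned = {c for cols in configured.values() for c in cols}
    unknown = assigned - set(columns)
    if unknown:
        raise ValueError(f"columns not in schema: {sorted(unknown)}")
    groups = {name: list(cols) for name, cols in configured.items()}
    leftover = [c for c in columns if c not in assigned]
    if leftover:
        groups.setdefault(DEFAULT_GROUP, []).extend(leftover)
    return groups


legacy = resolve_column_groups(["id", "name", "html"])
split = resolve_column_groups(["id", "name", "html"], {"blobs": ["html"]})
print(legacy)
print(split)
```

Readers and writers would then only ever iterate over the resolved groups, and a table without any configuration is just the one-group case.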

vinothchandar · 2025-10-09T02:20:11Z

vinothchandar
Oct 9, 2025
Collaborator

In summary, B may already be achievable by the user:

By splitting the columns into two tables that share a record key.
If Hudi lands multi-table transactions (https://github.com/apache/hudi/blob/master/rfc/rfc-73/rfc-73.md), to keep the tables in sync.
With a storage-partitioned join or a similar mechanism for target engines like Spark/Ray, to adapt the plans and scan the same record without shuffles.

I think we should deeply understand the ML/AI pipeline lifecycle that will use this table/data, and its access patterns. For example, it may in fact be desirable not to cluster the unstructured data (to keep the distribution randomized), or it may actually be preferable to cluster it (e.g. reorganize a table of HTML documents by URL domain).

0 replies
