@jcamachor
Contributor

@jcamachor jcamachor commented Feb 14, 2025

feat: This PR modifies the VirtualTable message to introduce a new record_list_expression field. The record_list_expression field is an Expression that evaluates to a list of structs (LIST<STRUCT<T1, ..., Tn>>).

This improvement leverages Substrait's support for nested types and provides greater flexibility than the existing representation, i.e., it allows using a single dynamic parameter expression (#780) to represent the records in a virtual table.

@EpsilonPrime
Member

While I like this simplification, there is one thing I'm concerned about: the lack of widespread support for complex types (here, lists and structs). A simplistic Substrait consumer could still be considered compliant without that support. But if we require it, systems like DuckDB will need that support before they can even define a virtual table, which is a way of providing data without a physical table. I suppose we could implement this as an alternative, with the expectation that both methods could still be options a few years from now.

@jcamachor
Contributor Author

Thanks, @EpsilonPrime! That makes sense. Rather than deprecating the field, I've updated it to use a oneof. That change is in the latest commit, but there's still an issue since repeated isn't allowed within oneof. I'll take a look at that later today.

@jcamachor
Contributor Author

@EpsilonPrime, could you take a look at the latest iteration? I had to wrap the repeated fields in a message, but this approach isn't ideal since it breaks backward compatibility. Other possible options include: (1) introducing a new VirtualTable message, or (2) keeping the fields separate and not using oneof, shifting correctness responsibility to producers and validation responsibility to consumers. I'm not sure what the best path forward is. Please let me know your thoughts.
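
To make option (2) concrete, here is a minimal Python sketch of the check a consumer might run when the two fields are kept separate and no oneof is used. The `VirtualTable` class below is a hypothetical stand-in for the protobuf message, and the field name `expressions` for the existing repeated field is an assumption for illustration; this is not the generated Substrait API.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class VirtualTable:
    # Hypothetical stand-in for the protobuf message; field names are illustrative.
    expressions: List[Any] = field(default_factory=list)  # existing repeated field
    record_list_expression: Optional[Any] = None          # proposed new field


def validate(table: VirtualTable) -> str:
    """Return which representation is in use, or raise if both are set."""
    has_rows = len(table.expressions) > 0
    has_list = table.record_list_expression is not None
    if has_rows and has_list:
        raise ValueError(
            "expressions and record_list_expression are mutually exclusive"
        )
    if has_list:
        return "record_list_expression"
    # In this sketch an empty repeated field is treated as an empty table.
    return "expressions"
```

Without oneof, nothing in the wire format prevents a producer from setting both fields, so every consumer would need a check along these lines.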

@EpsilonPrime
Member

I'm sure this will be discussed at this week's community meeting. In the meantime two possibilities could be taken here:

  1. Forget about using a oneof and declare that both are used (in order of appearance).
  2. Since the purpose of this change is to allow dynamic parameters in literals we could just define a literal type that needs to be filled in using a dynamic parameter.

@EpsilonPrime
Member

Ah, but the goal here is to return all of the data for the table in one expression and not to allow dynamic parameters throughout. So my solutions don't work. It's either this new way or the old way. It's probably still fine without the oneof. Providing a whole table as a dynamic parameter is possible. The closest analog we have with the existing behavior would be to be able to return an entire column (except we don't have anything that specifies it is a row definition). I suppose we could continue to use the expressions table with an enum that specifies whether the expressions are values (current behavior), columns (requiring that each expression makes sense as a column), or table (requiring only one expression). Need to fiddle with that idea to make it work though (I haven't gone through all the cases).
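
As a rough illustration of the enum idea, the Python sketch below shows how a consumer might turn already-evaluated expressions into rows under each of the three modes. The enum name `Interpretation`, its members, and the `to_rows` helper are hypothetical names chosen for this example; nothing here is part of Substrait.

```python
from enum import Enum
from typing import Any, List, Sequence


class Interpretation(Enum):
    VALUES = "values"    # each expression is one row (current behavior)
    COLUMNS = "columns"  # each expression is one whole column
    TABLE = "table"      # exactly one expression holds the whole table


def to_rows(mode: Interpretation, evaluated: Sequence[Any]) -> List[tuple]:
    """Turn already-evaluated expressions into rows, depending on the mode."""
    if mode is Interpretation.VALUES:
        return [tuple(row) for row in evaluated]
    if mode is Interpretation.COLUMNS:
        lengths = {len(col) for col in evaluated}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # Transpose the columns into rows.
        return [tuple(row) for row in zip(*evaluated)]
    if mode is Interpretation.TABLE:
        if len(evaluated) != 1:
            raise ValueError("table mode requires exactly one expression")
        return [tuple(row) for row in evaluated[0]]
    raise ValueError(f"unknown mode: {mode}")
```

All three modes yield the same rows for equivalent inputs; they differ only in how the data is split across expressions.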

@jcamachor
Contributor Author

> Ah, but the goal here is to return all of the data for the table in one expression and not to allow dynamic parameters throughout.

Actually, the main idea was that a single expression can accommodate both patterns (or anything in between). That is, you can have (1) a single dynamic expression of type list, (2) a list expression of dynamic expressions of type struct, or (3) a list expression of struct expressions where each field is a dynamic expression corresponding to a column in the virtual table.
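
The three patterns can be sketched with a toy expression tree in Python. The classes `DynamicParameter`, `Literal`, `StructExpr`, and `ListExpr`, and the `ref` field, are hypothetical simplifications made up for this example and do not mirror the actual Substrait messages.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DynamicParameter:
    ref: int  # hypothetical parameter reference


@dataclass
class Literal:
    value: object


@dataclass
class StructExpr:
    fields: List["Expr"]


@dataclass
class ListExpr:
    items: List["Expr"]


Expr = Union[DynamicParameter, Literal, StructExpr, ListExpr]

# (1) The whole table is a single dynamic parameter of type list.
whole_table = DynamicParameter(ref=0)

# (2) A list expression whose items are dynamic parameters of type struct (one per row).
per_row = ListExpr([DynamicParameter(ref=0), DynamicParameter(ref=1)])

# (3) A list of struct expressions whose fields are dynamic parameters (one per cell).
per_field = ListExpr([
    StructExpr([DynamicParameter(ref=0), DynamicParameter(ref=1)]),
    StructExpr([DynamicParameter(ref=2), DynamicParameter(ref=3)]),
])


def count_parameters(expr: Expr) -> int:
    """Count the dynamic parameters appearing anywhere in an expression."""
    if isinstance(expr, DynamicParameter):
        return 1
    if isinstance(expr, StructExpr):
        return sum(count_parameters(f) for f in expr.fields)
    if isinstance(expr, ListExpr):
        return sum(count_parameters(i) for i in expr.items)
    return 0  # literals carry no parameters
```

In this model a two-row, two-column table needs one parameter under pattern (1), two under pattern (2), and four under pattern (3), and mixing literals with parameters covers the cases in between.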

The original concern about this approach was that there is a lack of widespread support for complex types (here list and structs). However, I assume the main concern is lists, since structs are already required for compatibility with the latest version of VirtualTable (given that the field with literals had already been deprecated).
I checked and DuckDB supports nested types, including lists, so that shouldn't be an issue.
For consumers that do not natively support lists, deserializing an expression of type list wouldn't be significantly different from handling repeated structs, so I'm not sure that's a strong concern. Maybe we should reconsider the original approach with a single expression? (I'm OK to proceed either way, but I just want to make sure we're not introducing unnecessary ambiguity into the standard.)

@jcamachor
Contributor Author

> I'm sure this will be discussed at this week's community meeting

I won't be able to attend tomorrow, but please pass along any questions that come up.

@jacques-n
Contributor

What do you think about just adding an Unnest relation? Then we could use a single literal expression in a virtual table and push towards unnest for the actual unrolling (rather than trying to overload virtual table to do this).
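
As a rough picture of what that unrolling could look like, here is a small Python sketch. It assumes a virtual table with one row and one column holding a list of structs, with structs modeled as tuples; the `unnest` function and its semantics are hypothetical and only illustrate the idea, since no Unnest relation is defined in this thread.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def unnest(rows: Iterable[Tuple[List[Tuple[Any, ...]]]]) -> Iterator[Tuple[Any, ...]]:
    """Emit one output row per struct in each input row's single list column."""
    for (records,) in rows:
        for record in records:
            yield tuple(record)


# A virtual table with one row whose only column is a list of structs.
virtual_table = [([(1, "a"), (2, "b")],)]
unnested = list(unnest(virtual_table))
```

Here `unnested` holds one row per struct, so the virtual table itself stays a single expression and the row expansion happens in a separate step.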
