-
Notifications
You must be signed in to change notification settings - Fork 187
feat: add names into struct type #800
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Edit: Edit: If this doesn't address your issue, I think sharing a code sample would be helpful |
|
I agree that many systems support named structs. Substrait does as well. But our goal is supporting expression operations, not mapping 1:1 with how a system might internally represent something. The internals of most optimizers are name ambivalent (at worst) or name ignorant at best. My sense is that the core cases where we may be lacking schema capabilities is dynamic based on data values. In the past I've argued that this is largely the domain of custom types and functions. ... So for 1 I'd ask that you come up with a set of specific use cases that cannot be addressed (either at all or without contortions). Ideally written up so someone doesn't go have to study some function docs for a certain system before understanding the problem. |
I'm not sure I'm entirely following the example. But, is the goal to take input like...
And then get output like...
If so, we will maybe also need a discussion about the behavior of coalesce in the presence of structs but let's table that for now. Is the basic argument here that name-based matching would recognize that the |
It's actually a bit simpler - you have input like this, ie an array of structs
Where LEFT comes from an input relation, while RIGHT comes from a literal. Now you want to call The input rel has names (either because it reads from a parquet file where all columns and inner fields are named, or because it's a VirtualTableScan with names, etc). The literal doesn't have names. However at least in DataFusion, coalesce requires the types of the inputs to match, which isn't possible (or at least I can't see how) with Substrait today w/o writing a special implementation for |
|
Edit: this was incorrect, ReadRel contains a NamedStruct so it has all of the inner names and can validate they match input data. |
NamedStruct is not a Type, it cannot be used as expression's output type or anywhere else where a Type is required.
Yes, that's what the rest of my sentence says: "but rather provide the names at input and output levels (and optionally in Hint on per plan node)" :) |
| message ReadRel { | ||
| RelCommon common = 1; | ||
| NamedStruct base_schema = 2; | ||
| Type.Struct base_schema = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this would be a breaking change, so I guess we can't do it. I'm just including it here for discussion - if we were to add names to the type itself, then we could replace (through some evolutions) the NamedStruct usages with Type.Struct.
|
I was curious how other DBs handle coalescing structs. DuckDB happily allows it, but the result combines both struct types, meaning the output (even the type of it) depends on the names of the fields: Spark-sql for 3.5.1 complains about mismatching types |
SELECT COALESCE( {'key1': 'value1', 'key2': 42}
,{'key3': 'value2', 'key4': 44});Unfortunately, my dev version of duckdb is not based on a recent version at the moment. Could you (or someone) produce it and include it here? If I find some extra time laying around I can do some environment munging to do it, I just don't have any right this moment. I think the following will produce the binary blob: CALL get_substrait("
SELECT COALESCE(
{ 'key1': 'value1', 'key2': 42}
,{ 'key3': 'value2', 'key4': 44}
);
");and if you put that in a script and redirect to a file you should get the serialized plan: and hopefully you can share the stringified version (json or just protobuf's stringification should be fine). |
|
I'd like to echo Jacques comment on distinguishing plan validation and plan representation better. As I see it, a source system needs to validate a plan in its original representation itself and before converting it to Substrait. Once it knows that the plan is valid and is sure about what the semantics are, it can translate the plan into a Substrait plan with the same semantics. At that point, it should not have to do any validation of its own types anymore. This also applies to naming: it needs to find out what names refer to before Substrait comes into the picture. When converting that plan to Substrait, It has to translate these names into positions, at which point the names cannot be necessary anymore. To concretely point out how this should be applied in some of the examples given above:
I think that this is something that DataFusion has to validate or otherwise deal with before it constructs the Substrait plan.
Again: If these aren't the same type in the source system, then that's about the semantics of that system and that system needs to verify its semantics itself. Once it has figured out whether a plan is valid in its own representation, it can translate it to the Substrait equivalent. |
My point here was that the executability-validity of a Substrait plan depended on the input data in a way that I thought was impossible to validate based on just the Substrait plan itself; however, I was wrong on this point I think - ReadRel (so also NamedTable) contains a NamedStruct so it has the input names, and the scenario I outlined there wasn't valid.
|
I don't fully understand. The names provided in |
The names are relevant for whether e.g. coalescing two columns with Type.Struct is valid. Since many (if not all?) DB systems considers structs to have named fields, if the inner field names were not included in ReadRels, one could not say if a plan doing coalesce(struct_column_1, struct_column_2) is valid or not. Given they are, it might be. (This probably depends a bit on how one defines "plan validity", so we may disagree.) |
|
If there is a coalesce operation that works on structs to take the first non-nullable item of each struct item it would work over the positions of each sub item. So {$1:1}, {$2:2} would become {$1:1,$2:2,$3:NULL} where the input and output type would be a struct with three fields defined. |
What is the exact semantic of such a function? The one defined in this repo does not recurse into the struct fields, i.e., only considers if the struct as whole is |
@ingomueller-net For my case, exactly that. I'm not talking about recursing into the struct fields. (The simplest example of my case is #800 (comment)). The problem is this:
Which is an issue for two reasons:
|
|
OK, that's helpful clarification. To summarize: You have a query like I think that this is a question for the source system and should be done there, before Substrait gets into the picture. I imagine the flow should be roughly like this:
Is there any reason why you can't do it like this in your case? |
So the case you outline is the other example I had, #800 (comment), where both columns come from the input data. That case is actually fine, since Substrait ReadRel contains a NamedStruct i.e. it has inner names for both columns. The problematic case is |
I think my comment applies to both because the checking of the struct names has to be done before the plan gets converted to Substrait. In other words, when you do that validation, you should not have to rely on Substrait to provide you the names of the structs. For your specific example, I think what you should do is:
|
That's fine, the source system handles this at analysis phase (or something like that, I assume).
So the answer here is to strip the names from the input in the consumer system, so that everything then matches Substrait's namelessness. I think that would work, but it feels quite hacky, and is quite unfortunate for anyone trying to read the plans in the consumer system. Why is that a better solution than adding an ability to include the names in places where they'd be useful? |
Awesome! If we agree here, we made a huge step forward 😄
I can only speculate but my first reflex would be to say that if structs had names, we would suddenly have to take them into account, for example, for type checking. They are also longer and thus go against the goal of being concise, which plays some role for a serialization format. |
|
I've thought about this more and I think we've all reached an understanding that I don't see a problem with a "named struct" as a literal, in which case adding a The alternate to adding a |
I still don't agree, but it's clear I'm in the minority here 😅
It'd actually need to be both literal and type, I think, for the change to make sense. For my example that sparked this issue, the case needs an empty array literal, which in Substrait is
VirtualTables already contain the names through the ReadRel's they're part of, that's fine. Doesn't do anything to help literals and types though.
I don't understand why having two types would be better than adding the names (it can be optional for all I care!) to the existing type 😅 |
It's not, I think you misread my comment.
VirtualTables can contain literals.
I already said that adding Edit: an empty list seems fine to me too. If it has no structs then those structs have no names? |
|
There is a lot of activity here and I'm traveling so haven't been able to give it fair attention (and will continue to be traveling for another week or so). My main feeling is that @Blizzara feels like it is too hard to map his plans to substrait plans. I think we should always start by trying to fix this at the language level as opposed to changing the specification. I could see an argument to add a expression level field that is a hint for output names so you can name intermediate expressions (given a+b+c, be able to name the a+b part of it). I also think we should better clarify that the hint here should be the same traversal order for nested structs as RelRoot. (It's implied but not stated explicitily.) With that and better language level apis, I think @Blizzara's frustration would be addressed. |
|
Trying to catch up. Would it be fair to say the crux of the issue is: If a function or relation requires struct type names to determine the output type then Substrait's current form is insufficient. I think this is the problem we are encountering with |
I could definitely see that |
|
There was an ask for a more concrete example - so here's a Spark query that fails to roundtrip in current substrait-spark, and fails to execute in DataFusion. The Substrait plan created is as follows: The error in DF shows the issue with inner field names, "inner_struct_field" vs "c0": |
|
I assume should construct a virtual table like so:
That seems to work as expected. And then The second argument to coalesce is the empty list literal: The second argument to DF's The direct change to solve this issue would be to add If we look at the relevant
What really makes this scenario an unfortunate combination of things is:
A "hacky" solution is to add a Further, ScalarFunction.output_type would need to accommodate named structs, meaning that a solution must either be part of the type system or the output type should be expanded to literals (which sort of feels like 1.5x "hackyness" and not quite 2x "hackyness" to me). Changing |
That's very useful -- thanks a lot! I think the problem is neither in the Substrait spec nor in the translation from Spark SQL to Substrait; the plan you provide looks like what I would expect. What I think is going wrong is the translation from the Substrait spec to the internal plan format of DataFusion. It seems like in that conversion, the positional structs are converted to named structs, and they are converted incorrectly: of two structs that should have the same type, one has the field name |
|
@ingomueller-net so you think that an empty named_struct should always reflect the other input? I suppose that seems consistent given that the expectation seems to be that the producer would need to do the inverse to encode an empty named_struct in the way that @Blizzara is describing. Specifically, if the producer would encode the I don't think this helps with the general use case, though? when the argument to coalesce isn't an empty array, it seems like there's no valid way to include the struct field names. I'd be interested to see what substrait-spark attempts to produce though. |
In most DB systems, structs are or at least can be named, i.e. the Struct type contains (name, type) pairs for the inner fields.
Substrait has decided not to name the inner fields in the types, but rather provide the names at input and output levels (and optionally in Hint on per plan node). While this approach works in general, it makes for some really complicated code (like in substrait-spark and datafusion). This complexity is annoying and makes it harder for systems to adopt Substrait.
However, some cases are tricky to deal with even with more complexity for any system that does consider the names inside the Struct types, like DataFusion. For example, what we just hit in Spark -> DF:
F.coalesce(array_of_struct_column, F.array()). Assuming that thearray_of_struct_column(or at least the struct it contains) comes from the input data, it does have names. However theF.array()when translated into Substrait only knows its type is an "array of struct with field types X,Y,Z..", but it doesn't not know the names. Now DF'scoalescewould need to combine two different structs, but it won't. In theory it of course is possible by looking at the field indices instead of the names, but that's not how you'd normally want things to work - I'd argue it's reasonable for the engine thatF.coalesce(F.named_struct("A", Int), F.named_struct("B", Int))should fail.It's probably possible to work around by using a custom coalesce operator for Substrait plans (which is sad and prone for errors), or by removing the names from the inputs, so that the autogenerated inner names would match (which makes the plans quite unreadable since now all names are lost), or possibly by doing some magic in the Substrait consumer to look at neighbouring expressions' types or something and rename fields based on that. But I'd argue at least our life would be so much simpler if Substrait could contain the field names as part of the Struct type.
Do you see other recommended solutions? Are there counter-arguments to including the names?