Conversation

@Blizzara
Contributor

In most DB systems, structs are or at least can be named, i.e. the Struct type contains (name, type) pairs for the inner fields.

Substrait has decided not to name the inner fields in the types, but rather provide the names at input and output levels (and optionally in Hint on per plan node). While this approach works in general, it makes for some really complicated code (like in substrait-spark and datafusion). This complexity is annoying and makes it harder for systems to adopt Substrait.

However, some cases are tricky to deal with even with more complexity for any system that does consider the names inside the Struct types, like DataFusion. For example, what we just hit in Spark -> DF: F.coalesce(array_of_struct_column, F.array()). Assuming that the array_of_struct_column (or at least the struct it contains) comes from the input data, it does have names. However, the F.array(), when translated into Substrait, only knows its type is an "array of struct with field types X,Y,Z..", but it does not know the names. Now DF's coalesce would need to combine two different structs, but it won't. In theory it of course is possible by looking at the field indices instead of the names, but that's not how you'd normally want things to work - I'd argue it's reasonable for the engine that F.coalesce(F.named_struct("A", Int), F.named_struct("B", Int)) should fail.

It's probably possible to work around this by using a custom coalesce operator for Substrait plans (which is sad and prone to errors), or by removing the names from the inputs, so that the autogenerated inner names would match (which makes the plans quite unreadable since now all names are lost), or possibly by doing some magic in the Substrait consumer to look at neighbouring expressions' types or something and rename fields based on that. But I'd argue at least our life would be so much simpler if Substrait could contain the field names as part of the Struct type.

Do you see other recommended solutions? Are there counter-arguments to including the names?

@drin
Member

drin commented Apr 11, 2025

Edit:
wait, there's a NamedStruct: https://github.com/substrait-io/substrait/blob/main/proto/substrait/type.proto#L248-L273. Where are you using Struct directly?

Edit:
Re-reading this, I think "Substrait has decided not to name the inner fields in the types..." isn't accurate, since lines 258 and 259 provide an example of named inner fields (c and d): https://github.com/substrait-io/substrait/blob/main/proto/substrait/type.proto#L258-L259

If this doesn't address your issue, I think sharing a code sample would be helpful

@jacques-n
Contributor

I agree that many systems support named structs. Substrait does as well. But our goal is supporting expression operations, not mapping 1:1 with how a system might internally represent something. The internals of most optimizers are name ambivalent (at worst) or name ignorant (at best). My sense is that the core cases where we may be lacking schema capabilities are those where the schema is dynamic, based on data values. In the past I've argued that this is largely the domain of custom types and functions.

...
This may be a fundamental problem that requires a fundamental change like what is proposed here but we need to go through the process of (1) agreeing there is a problem and (2) designing a solution. My gut is this is a giant hammer for a set of edge case problems (and gaps in language libraries/binding).

So for 1 I'd ask that you come up with a set of specific use cases that cannot be addressed (either at all or without contortions). Ideally written up so that someone doesn't have to go study some function docs for a certain system before understanding the problem.

@westonpace
Member

Now DF's coalesce would need to combine two different structs, but it won't. In theory it of course is possible by looking at the field indices instead of the names, but that's not how you'd normally want things to work

I'm not sure I'm entirely following the example. But, is the goal to take input like...

LEFT                 RIGHT
{ x: 7, y: 3 }       { y: 1, x: 1 }
{ x: NULL, y: 5 }    { y: 1, x: 1 }
NULL                 { y: 1, x: 1 }
{ x: 2, y: NULL }    { y: 1, x: 1 }

And then get output like...

OUTPUT
{ x: 7, y: 3 }
{ x: 1, y: 5 }
{ x: 1, y: 1 }
{ x: 2, y: 1 }

If so, we will maybe also need a discussion about the behavior of coalesce in the presence of structs but let's table that for now.

Is the basic argument here that name-based matching would recognize that the x/y and y/x are the same fields? Meanwhile index-based matching would not?
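
For readers trying to picture the two matching strategies in the question above, here is a minimal sketch in Python. It assumes structs are modelled as plain dicts and NULL as None; the function names are hypothetical and do not come from Substrait or DataFusion. It only illustrates a field-wise coalesce, not the coalesce that Substrait defines.

```python
from typing import Optional

Row = Optional[dict]  # a struct value modelled as a dict, NULL modelled as None


def coalesce_fields_by_name(left: Row, right: Row) -> Row:
    """Field-wise coalesce that pairs fields up by their names."""
    if left is None:
        return right
    if right is None:
        return left
    return {
        name: value if value is not None else right.get(name)
        for name, value in left.items()
    }


def coalesce_fields_by_position(left: Row, right: Row) -> Row:
    """Field-wise coalesce that pairs fields up by their index, ignoring names."""
    if left is None:
        return right
    if right is None:
        return left
    right_values = list(right.values())
    return {
        name: value if value is not None else right_values[index]
        for index, (name, value) in enumerate(left.items())
    }


# RIGHT lists its fields in the opposite order (y first, then x).
left = {"x": None, "y": 5}
right = {"y": 10, "x": 20}

print(coalesce_fields_by_name(left, right))      # {'x': 20, 'y': 5}
print(coalesce_fields_by_position(left, right))  # {'x': 10, 'y': 5}
```

With name-based matching the missing x is filled from RIGHT's x; with index-based matching it is filled from whatever RIGHT holds in the first slot.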

@Blizzara
Contributor Author

@westonpace

I'm not sure I'm entirely following the example. But, is the goal to take input like...

It's actually a bit simpler - you have input like this, ie an array of structs

LEFT (input rel)     RIGHT (literal)
[{ x: 7, y: 3 }]     []
NULL                 []

Where LEFT comes from an input relation, while RIGHT comes from a literal. Now you want to call coalesce(LEFT, RIGHT) to replace the null values with the empty lists.

The input rel has names (either because it reads from a parquet file where all columns and inner fields are named, or because it's a VirtualTableScan with names, etc). The literal doesn't have names. However at least in DataFusion, coalesce requires the types of the inputs to match, which isn't possible (or at least I can't see how) with Substrait today w/o writing a special implementation for coalesce() which deals with this case by harmonizing the struct types.
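
A minimal sketch of the mismatch, assuming a toy type model in Python. The StructType, ListType and strip_names names are hypothetical and are not Substrait or DataFusion APIs. A struct type with names=None stands in for Substrait's nameless Type.Struct.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class StructType:
    types: Tuple["Type", ...]
    # None models a struct type that carries no field names.
    names: Optional[Tuple[str, ...]] = None


@dataclass(frozen=True)
class ListType:
    element: "Type"


Type = Union[str, StructType, ListType]


def strip_names(t: Type) -> Type:
    """Recursively drop field names so only positions and field types remain."""
    if isinstance(t, StructType):
        return StructType(tuple(strip_names(inner) for inner in t.types))
    if isinstance(t, ListType):
        return ListType(strip_names(t.element))
    return t


# LEFT comes from the input relation, so its inner fields are named.
left = ListType(StructType(("i64", "i64"), names=("x", "y")))
# RIGHT is the empty-list literal: only the field types are known.
right = ListType(StructType(("i64", "i64")))

print(left == right)                            # False
print(strip_names(left) == strip_names(right))  # True
```

A consumer that compares types strictly sees two different types; the two only agree once the names are dropped from the input side.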

@Blizzara
Contributor Author

Blizzara commented Apr 14, 2025

An additional argument is that currently the Substrait type system (and thus e.g. the Substrait validator) is not able to distinguish between two Struct types, even though most systems would probably say they are different. Given an input rel (parquet files, for example):

LEFT            RIGHT
{x: 1, y: 2}    {a: 3, b: 4}

Substrait would happily consider these the same type (NamedTable doesn't contain column names so there's nothing in the Substrait plan saying this is wrong), and allow e.g. coalesce(LEFT, RIGHT). But in real life, the types are not the same, and they're not compatible and cannot be coalesced together.

(Personally I don't care that much about validating Substrait plans, but I think this may still be worth considering.)

Edit: this was incorrect, ReadRel contains a NamedStruct so it has all of the inner names and can validate they match input data.

@Blizzara
Contributor Author

Edit: wait, there's a NamedStruct: https://github.com/substrait-io/substrait/blob/main/proto/substrait/type.proto#L248-L273. Where are you using Struct directly?

NamedStruct is not a Type, it cannot be used as expression's output type or anywhere else where a Type is required.

Edit: Re-reading this, I think "Substrait has decided not to name the inner fields in the types..." isn't accurate, since lines 258 and 259 provide an example of named inner fields (c and d): https://github.com/substrait-io/substrait/blob/main/proto/substrait/type.proto#L258-L259

If this doesn't address your issue, I think sharing a code sample would be helpful

Yes, that's what the rest of my sentence says: "but rather provide the names at input and output levels (and optionally in Hint on per plan node)" :)

message ReadRel {
  RelCommon common = 1;
- NamedStruct base_schema = 2;
+ Type.Struct base_schema = 2;
Contributor Author

this would be a breaking change, so I guess we can't do it. I'm just including it here for discussion - if we were to add names to the type itself, then we could replace (through some evolutions) the NamedStruct usages with Type.Struct.

@Blizzara
Contributor Author

I was curious how other DBs handle coalescing structs.

DuckDB happily allows it, but the result combines both struct types, meaning the output (even the type of it) depends on the names of the fields:

duckdb> SELECT COALESCE({'key1': 'value1', 'key2': 42}, {'key3': 'value2', 'key4': 44});

│ COALESCE(main.struct_pack(key1 := 'value1', key2 := 42), main.struct_pack(key3 := 'value2', key4 := 44)) │
│ {key1: value1, key2: 42, key4: , key3: }                                                                 │

Spark-sql for 3.5.1 complains about mismatching types

spark-sql (default)> SELECT COALESCE(named_struct('key1', 'value1', 'key2', 42), named_struct('key3', 'value2', 'key4', 44));
[DATATYPE_MISMATCH.DATA_DIFF_TYPES] Cannot resolve "coalesce(named_struct(key1, value1, key2, 42), named_struct(key3, value2, key4, 44))" due to data type mismatch: Input to `coalesce` should all be the same type, but it's ("STRUCT<key1: STRING, key2: INT>" or "STRUCT<key3: STRING, key4: INT>").; line 1 pos 7;
'Project [unresolvedalias(coalesce(named_struct(key1, value1, key2, 42), named_struct(key3, value2, key4, 44)), None)]
+- OneRowRelation

@drin
Member

drin commented Apr 14, 2025

SELECT COALESCE( {'key1': 'value1', 'key2': 42}
                ,{'key3': 'value2', 'key4': 44});

Unfortunately, my dev version of duckdb is not based on a recent version at the moment. Could you (or someone) produce it and include it here? If I find some extra time laying around I can do some environment munging to do it, I just don't have any right this moment.

I think the following will produce the binary blob:

CALL get_substrait("
  SELECT COALESCE(
     { 'key1': 'value1', 'key2': 42}
    ,{ 'key3': 'value2', 'key4': 44}
  );
");

and if you put that in a script and redirect to a file you should get the serialized plan:

duckdb < create-substrait.sql > test.substrait

and hopefully you can share the stringified version (json or just protobuf's stringification should be fine).

@ingomueller-net
Contributor

I'd like to echo Jacques' comment on distinguishing plan validation and plan representation better. As I see it, a source system needs to validate a plan in its original representation itself, before converting it to Substrait. Once it knows that the plan is valid and is sure about what the semantics are, it can translate the plan into a Substrait plan with the same semantics. At that point, it should not have to do any validation of its own types anymore. This also applies to naming: it needs to find out what names refer to before Substrait comes into the picture. When converting that plan to Substrait, it has to translate these names into positions, at which point the names cannot be necessary anymore.

To concretely point out how this should be applied in some of the examples given above:

The input rel has names (either because it reads from a parquet file where all columns and inner fields are named, or because it's a VirtualTableScan with names, etc). The literal doesn't have names. However at least in DataFusion, coalesce requires the types of the inputs to match, which isn't possible (or at least I can't see how) with Substrait today w/o writing a special implementation for coalesce() which deals with this case by harmonizing the struct types.

I think that this is something that DataFusion has to validate or otherwise deal with before it constructs the Substrait plan.

Substrait would happily consider these the same type (NamedTable doesn't contain column names so there's nothing in the Substrait plan saying this is wrong), and allow e.g. coalesce(LEFT, RIGHT). But in real life, the types are not the same, and they're not compatible and cannot be coalesced together.

Again: If these aren't the same type in the source system, then that's about the semantics of that system and that system needs to verify its semantics itself. Once it has figured out whether a plan is valid in its own representation, it can translate it to the Substrait equivalent.

@Blizzara
Contributor Author

Substrait would happily consider these the same type (NamedTable doesn't contain column names so there's nothing in the Substrait plan saying this is wrong), and allow e.g. coalesce(LEFT, RIGHT). But in real life, the types are not the same, and they're not compatible and cannot be coalesced together.

Again: If these aren't the same type in the source system, then that's about the semantics of that system and that system needs to verify its semantics itself. Once it has figured out whether a plan is valid in its own representation, it can translate it to the Substrait equivalent.

My point here was that the executability-validity of a Substrait plan depended on the input data in a way that I thought was impossible to validate based on just the Substrait plan itself; however, I was wrong on this point I think - ReadRel (so also NamedTable) contains a NamedStruct so it has the input names, and the scenario I outlined there wasn't valid.

The input rel has names (either because it reads from a parquet file where all columns and inner fields are named, or because it's a VirtualTableScan with names, etc). The literal doesn't have names. However at least in DataFusion, coalesce requires the types of the inputs to match, which isn't possible (or at least I can't see how) with Substrait today w/o writing a special implementation for coalesce() which deals with this case by harmonizing the struct types.

I think that this is something that DataFusion has to validate or otherwise deal with before it constructs the Substrait plan.

It's not about constructing the Substrait plan, that part is easy. All of this is about consuming that plan.

@ingomueller-net
Contributor

My point here was that the executability-validity of a Substrait plan depended on the input data in a way that I thought was impossible to validate based on just the Substrait plan itself; however, I was wrong on this point I think - ReadRel (so also NamedTable) contains a NamedStruct so it has the input names, and the scenario I outlined there wasn't valid.

I don't fully understand. The names provided in ReadRel should have no relevance outside that rel: they tell the ReadRel which fields it should produce for names that are defined outside of Substrait plan, say, in the catalog of the database the query should run against. These names are then not visible or necessary for rels that consume the result of that ReadRel. Don't you agree?

@Blizzara
Contributor Author

My point here was that the executability-validity of a Substrait plan depended on the input data in a way that I thought was impossible to validate based on just the Substrait plan itself; however, I was wrong on this point I think - ReadRel (so also NamedTable) contains a NamedStruct so it has the input names, and the scenario I outlined there wasn't valid.

I don't fully understand. The names provided in ReadRel should have no relevance outside that rel: they tell the ReadRel which fields it should produce for names that are defined outside of Substrait plan, say, in the catalog of the database the query should run against. These names are then not visible or necessary for rels that consume the result of that ReadRel. Don't you agree?

The names are relevant for whether e.g. coalescing two columns with Type.Struct is valid. Since many (if not all?) DB systems consider structs to have named fields, if the inner field names were not included in ReadRels, one could not say if a plan doing coalesce(struct_column_1, struct_column_2) is valid or not. Given they are, it might be. (This probably depends a bit on how one defines "plan validity", so we may disagree.)

@EpsilonPrime
Member

If there is a coalesce operation that works on structs to take the first non-nullable item of each struct item it would work over the positions of each sub item. So {$1:1}, {$2:2} would become {$1:1,$2:2,$3:NULL} where the input and output type would be a struct with three fields defined.

@ingomueller-net
Contributor

The names are relevant for whether e.g. coalescing two columns with Type.Struct is valid. Since many (if not all?) DB systems consider structs to have named fields, if the inner field names were not included in ReadRels, one could not say if a plan doing coalesce(struct_column_1, struct_column_2) is valid or not. Given they are, it might be. (This probably depends a bit on how one defines "plan validity", so we may disagree.)

What is the exact semantic of such a function? The one defined in this repo does not recurse into the struct fields, i.e., only considers if the struct as a whole is NULL or not (and probably expects structs of the same type), so that function does not need to know names.
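
As a rough illustration of the non-recursing semantics described above, here is a minimal Python sketch. It assumes NULL is modelled as None and struct values as dicts; it is not the Substrait function definition itself, only a model of "first argument that is not NULL as a whole".

```python
def coalesce(*values):
    """Return the first argument that is not NULL as a whole; never looks inside structs."""
    for value in values:
        if value is not None:
            return value
    return None


# A NULL list is replaced by the empty-list literal.
print(coalesce(None, []))  # []

# A struct with a NULL field is still a non-NULL struct, so it is returned unchanged.
print(coalesce({"x": None, "y": 5}, {"x": 1, "y": 1}))  # {'x': None, 'y': 5}
```

Nothing in this function reads a field name, which is the sense in which it does not need to know names.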

@Blizzara
Contributor Author

The names are relevant for whether e.g. coalescing two columns with Type.Struct is valid. Since many (if not all?) DB systems consider structs to have named fields, if the inner field names were not included in ReadRels, one could not say if a plan doing coalesce(struct_column_1, struct_column_2) is valid or not. Given they are, it might be. (This probably depends a bit on how one defines "plan validity", so we may disagree.)

What is the exact semantic of such a function? The one defined in this repo does not recurse into the struct fields, i.e., only considers if the struct as a whole is NULL or not (and probably expects structs of the same type), so that function does not need to know names.

@ingomueller-net For my case, exactly that. I'm not talking about recursing into the struct fields. (The simplest example of my case is #800 (comment)). The problem is this:

and probably expects structs of the same type

Which is an issue for two reasons:

  1. According to Substrait currently, a struct {"a": int64} is in effect the same type as a struct {"b": int64}, but that is not the case for any of the systems I'm working with (if you know of a system where that is the case, I'm interested to hear!).
  2. Substrait has no way of specifying the named type of a literal struct (in actual Literal for not-null, or in Type for null, and empty array etc). Thus for a case where a struct coming from input data is being fed into e.g. a coalesce together with a literal struct, it's impossible in Substrait to have the types be the same. (Again, see feat: add names into struct type #800 (comment))

@ingomueller-net
Contributor

OK, that's helpful clarification. To summarize: You have a query like SELECT coalesce(a, b) FROM t and need to find out if a and b have the same type given that they are both struct types, correct? In particular, two struct types are only equivalent if their field types and names match, correct?

I think that this is a question for the source system and should be done there, before Substrait gets into the picture. I imagine the flow should be roughly like this:

  1. Parse SELECT coalesce(a, b) FROM t into the plan representation of the source system.
  2. Validate everything that needs to be validated, resolve names, etc.
    • This includes finding out the types of a and b and verifying whether they follow the rules of coalesce of the source system.
    • Abort the process if the types do not match.
  3. Convert the internal plan into a Substrait plan.
    • This includes mapping the source system's coalesce to Substrait's coalesce (assuming that they have indeed the same semantics -- after the plan is valid).
    • This includes converting types from the source system to equivalent Substrait types. In particular, named structs from the source system are mapped to positional structs in Substrait -- if we get here, the names of the structs are guaranteed to match.

Is there any reason why you can't do it like this in your case?
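
The flow above can be sketched roughly as follows. This is a minimal Python sketch under assumptions: source-system types are modelled as nested tuples, and the names TypeMismatch, to_positional and translate_coalesce are hypothetical helpers, not part of any Substrait library.

```python
class TypeMismatch(Exception):
    """Raised when the source system's own coalesce rules are violated."""


def to_positional(t):
    """Map a source type to a nameless one.

    Scalars are strings, lists are ("list", element_type), and named structs
    are ("struct", ((field_name, field_type), ...)).
    """
    if isinstance(t, str):
        return t
    kind, payload = t
    if kind == "list":
        return ("list", to_positional(payload))
    if kind == "struct":
        # Keep only the field types, in order; the names are dropped here.
        return ("struct", tuple(to_positional(ft) for _name, ft in payload))
    raise ValueError(f"unknown type kind: {kind}")


def translate_coalesce(arg_types):
    """Step 2: validate with names. Step 3: convert to positional types."""
    first = arg_types[0]
    for other in arg_types[1:]:
        if other != first:
            raise TypeMismatch(f"{other} does not match {first}")
    return {"function": "coalesce", "output_type": to_positional(first)}


a = ("struct", (("x", "i64"), ("y", "i64")))
b = ("struct", (("x", "i64"), ("y", "i64")))
print(translate_coalesce([a, b]))
# {'function': 'coalesce', 'output_type': ('struct', ('i64', 'i64'))}
```

All name checking happens before to_positional runs, so the resulting plan fragment carries no names.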

@Blizzara
Contributor Author

OK, that's helpful clarification. To summarize: You have a query like SELECT coalesce(a, b) FROM t and need to find out if a and b have the same type given that they are both struct types, correct? In particular, two struct types are only equivalent if their field types and names match, correct?

Is there any reason why you can't do it like this in your case?

So the case you outline is the other example I had, #800 (comment), where both columns come from the input data. That case is actually fine, since Substrait ReadRel contains a NamedStruct i.e. it has inner names for both columns.

The problematic case is SELECT coalesce(a, array()) where a is of type Array[Struct[..]]. In this case Substrait has no way of providing the empty array literal in a way that its type would match a's type in the destination system.

@ingomueller-net
Contributor

The problematic case is SELECT coalesce(a, array()) where a is of type Array[Struct[..]]. In this case Substrait has no way of providing the empty array literal in a way that its type would match a's type in the destination system.

I think my comment applies to both because the checking of the struct names has to be done before the plan gets converted to Substrait. In other words, when you do that validation, you should not have to rely on Substrait to provide you the names of the structs.

For your specific example, I think what you should do is:

  1. Find the type of a in your source system. Say it is [{ x: int, y: int }].
  2. Find the type of array() in your source system.
    • I don't know how it will find out in this case: I guess it will understand that the two arguments of coalesce must have the same type and, thus, take the type of the first argument. Maybe array() is actually a simplified example and the full syntax would be something like array<struct<x int, y int>>().
  3. If the types do not match, abort the translation.
  4. Convert the array type of named structs [{ x: int, y: int }] from the source system to Substrait's array type of positional struct [{int, int}] (pseudo-code) -- no checking of names needed here or later on.

@Blizzara
Contributor Author

For your specific example, I think what you should do is:

  1. Find the type of a in your source system. Say it is [{ x: int, y: int }].
  2. Find the type of array() in your source system.

That's fine, the source system handles this at analysis phase (or something like that, I assume).

  3. If the types do not match, abort the translation.
  4. Convert the array type of named structs [{ x: int, y: int }] from the source system to Substrait's array type of positional struct [{int, int}] (pseudo-code) -- no checking of names needed here or later on.

So the answer here is to strip the names from the input in the consumer system, so that everything then matches Substrait's namelessness. I think that would work, but it feels quite hacky, and is quite unfortunate for anyone trying to read the plans in the consumer system.

Why is that a better solution than adding an ability to include the names in places where they'd be useful?

@ingomueller-net
Contributor

So the answer here is to strip the names from the input in the consumer system, so that everything then matches Substrait's namelessness. I think that would work, but it feels quite hacky, and is quite unfortunate for anyone trying to read the plans in the consumer system.

Awesome! If we agree here, we made a huge step forward 😄

Why is that a better solution than adding an ability to include the names in places where they'd be useful?

I can only speculate but my first reflex would be to say that if structs had names, we would suddenly have to take them into account, for example, for type checking. They are also longer and thus go against the goal of being concise, which plays some role for a serialization format.

@drin
Member

drin commented Apr 17, 2025

I've thought about this more and I think we've all reached an understanding that Type.Struct and ReadRel do not need to be changed (or at least the proposed changes don't feel compelling). Specifically, moving names into Type.Struct from NamedStruct is unnecessary.

I don't see a problem with a "named struct" as a literal, in which case adding a names field to Literal.Struct seems reasonable. However, named structs could also be a ReadRel with a NamedStruct representing the field names and the data as being a VirtualTable. If the named structs are actually stored with the engine, this approach seems much better than embedding data directly into the substrait plan. However, a Literal.Struct with names makes sense if the named struct has only a few tuples (as in all of the DuckDB struct examples).

The alternative to adding a names field to Literal.Struct is to add a Literal.NamedStruct message, but I think that would only be better if there is a stricter requirement on the presence of names. I can imagine functions where the names won't matter even if provided, so I think a new message is unnecessary.

@Blizzara
Contributor Author

I've thought about this more and I think we've all reached an understanding that Type.Struct and ReadRel do not need to be changed (or at least the proposed changes don't feel compelling).

I still don't agree, but it's clear I'm in the minority here 😅

Specifically, moving names into Type.Struct from NamedStruct is unnecessary.

Sure, it was an example: if names were added into Type.Struct, then that could also be changed, making the Substrait plan much more natural (and easier) to handle for at least Spark and DataFusion, the two engines I've worked a lot with.

I don't see a problem with a "named struct" as a literal, in which case adding a names field to Literal.Struct seems reasonable.

It'd actually need to be both literal and type, I think, for the change to make sense. For my example that sparked this issue, the case needs an empty array literal, which in Substrait is Literal.EmptyArray(type=Type.Struct), meaning there's no Literal.Struct at play.

However, named structs could also be a ReadRel with a NamedStruct representing the field names and the data as being a VirtualTable. If the named structs are actually stored with the engine, this approach seems much better than embedding data directly into the substrait plan. However, a Literal.Struct with names makes sense if the named struct has only a few tuples (as in all of the DuckDB struct examples).

VirtualTables already contain the names through the ReadRel's they're part of, that's fine. Doesn't do anything to help literals and types though.

The alternative to adding a names field to Literal.Struct is to add a Literal.NamedStruct message, but I think that would only be better if there is a stricter requirement on the presence of names. I can imagine functions where the names won't matter even if provided, so I think a new message is unnecessary.

I don't understand why having two types would be better than adding the names (it can be optional for all I care!) to the existing type 😅

@drin
Member

drin commented Apr 17, 2025

I don't understand why having two types would be better than adding the names

It's not, I think you misread my comment.

VirtualTables... Doesn't do anything to help literals and types though.

VirtualTables can contain literals. VirtualTable -> Expression.Nested.Struct -> Literal.

For my example that sparked this issue, the case needs an empty array literal...

I already said that adding names to Literal.Struct seems good to me. Then you can have a Literal.List which contains Literal.Structs each with names. Does that not work?

Edit: an empty list seems fine to me too. If it has no structs then those structs have no names?
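
To make the empty-list case concrete, here is a minimal Python sketch. It assumes a hypothetical names-on-Literal.Struct design in which each struct literal is modelled as a dict, and element_field_names is an illustrative helper rather than an existing API. It shows what a consumer could recover from the elements of a list literal alone.

```python
def element_field_names(list_literal):
    """Return the field names shared by the struct elements of a list literal.

    Returns None when the list has no elements, because then there is no
    struct literal to read names from.
    """
    if not list_literal:
        return None
    names = tuple(list_literal[0].keys())
    for struct in list_literal[1:]:
        if tuple(struct.keys()) != names:
            raise ValueError("struct elements disagree on field names")
    return names


print(element_field_names([{"x": 7, "y": 3}]))  # ('x', 'y')
print(element_field_names([]))                  # None
```

For the empty list, any names would have to come from the list's declared element type rather than from its elements.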

@jacques-n
Contributor

There is a lot of activity here and I'm traveling so haven't been able to give it fair attention (and will continue to be traveling for another week or so).

My main feeling is that @Blizzara feels like it is too hard to map his plans to substrait plans. I think we should always start by trying to fix this at the language level as opposed to changing the specification.

I could see an argument to add an expression-level field that is a hint for output names so you can name intermediate expressions (given a+b+c, be able to name the a+b part of it). I also think we should better clarify that the hint here should be the same traversal order for nested structs as RelRoot. (It's implied but not stated explicitly.)

With that and better language level apis, I think @Blizzara's frustration would be addressed.
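
For reference, the traversal order in question can be sketched like this. It is a minimal Python sketch that assumes names are listed depth-first, with each nested struct's field names following the name of the field that contains it; the tuple-based schema encoding and the flatten_names helper are hypothetical.

```python
def flatten_names(fields):
    """Flatten (name, type) pairs into one depth-first list of names.

    Types are strings for scalars, ("list", element_type) for lists, and
    ("struct", [(name, type), ...]) for structs.
    """
    names = []

    def visit_type(t):
        if isinstance(t, str):
            return  # scalars contribute no inner names
        kind, payload = t
        if kind == "list":
            visit_type(payload)
        elif kind == "struct":
            for inner_name, inner_type in payload:
                names.append(inner_name)
                visit_type(inner_type)

    for name, t in fields:
        names.append(name)
        visit_type(t)
    return names


schema = [
    ("array_struct_col", ("list", ("struct", [("inner_struct_field", "string")]))),
]
print(flatten_names(schema))  # ['array_struct_col', 'inner_struct_field']
```

An output-name hint on an expression would presumably be read back in the same order to re-attach names to nested fields.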

@westonpace
Member

Trying to catch up. Would it be fair to say the crux of the issue is:

If a function or relation requires struct type names to determine the output type then Substrait's current form is insufficient.

I think this is the problem we are encountering with coalesce up above. The return type of the function depends on the field names of the input structs.

@drin
Member

drin commented Apr 17, 2025

The return type of the function depends on the field names of the input structs.

I could definitely see that Expression.output_type is not capable of propagating names of struct fields, so that is a case where likely Type.Struct would need the additional field or we would want to change Expression.output_type from a Type to a oneof { Type, NamedStruct }.

@Blizzara
Contributor Author

Blizzara commented Apr 18, 2025

There was an ask for a more concrete example - so here's a Spark query that fails to roundtrip in current substrait-spark, and fails to execute in DataFusion.

  test("coalesce") {
    assertSqlSubstraitRelRoundTrip(
      "select coalesce(array_struct_col, array()) from (values (array(named_struct('inner_struct_field', 'a'))), (null) as table(array_struct_col))"
    )
  }

The Substrait plan created is as follows:

{
  "extensionUris": [{
    "extensionUriAnchor": 1,
    "uri": "/functions_comparison.yaml"
  }],
  "extensions": [{
    "extensionFunction": {
      "extensionUriReference": 1,
      "functionAnchor": 0,
      "name": "coalesce:any"
    }
  }],
  "relations": [{
    "root": {
      "input": {
        "project": {
          "common": {
            "emit": {
              "outputMapping": [1]
            },
            "hint": {
              "alias": "",
              "outputNames": ["coalesce(array_struct_col, array())", "inner_struct_field"],
              "savedComputations": [],
              "loadedComputations": []
            }
          },
          "input": {
            "read": {
              "common": {
                "direct": {
                }
              },
              "baseSchema": {
                "names": ["array_struct_col", "inner_struct_field"],
                "struct": {
                  "types": [{
                    "list": {
                      "type": {
                        "struct": {
                          "types": [{
                            "string": {
                              "typeVariationReference": 0,
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }],
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      "typeVariationReference": 0,
                      "nullability": "NULLABILITY_NULLABLE"
                    }
                  }],
                  "typeVariationReference": 0,
                  "nullability": "NULLABILITY_REQUIRED"
                }
              },
              "virtualTable": {
                "values": [{
                  "fields": [{
                    "list": {
                      "values": [{
                        "struct": {
                          "fields": [{
                            "string": "a",
                            "nullable": false,
                            "typeVariationReference": 0
                          }]
                        },
                        "nullable": false,
                        "typeVariationReference": 0
                      }]
                    },
                    "nullable": false,
                    "typeVariationReference": 0
                  }]
                }, {
                  "fields": [{
                    "null": {
                      "list": {
                        "type": {
                          "struct": {
                            "types": [{
                              "string": {
                                "typeVariationReference": 0,
                                "nullability": "NULLABILITY_REQUIRED"
                              }
                            }],
                            "typeVariationReference": 0,
                            "nullability": "NULLABILITY_REQUIRED"
                          }
                        },
                        "typeVariationReference": 0,
                        "nullability": "NULLABILITY_NULLABLE"
                      }
                    },
                    "nullable": false,
                    "typeVariationReference": 0
                  }]
                }],
                "expressions": []
              }
            }
          },
          "expressions": [{
            "scalarFunction": {
              "functionReference": 0,
              "args": [],
              "outputType": {
                "list": {
                  "type": {
                    "struct": {
                      "types": [{
                        "string": {
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      }],
                      "typeVariationReference": 0,
                      "nullability": "NULLABILITY_REQUIRED"
                    }
                  },
                  "typeVariationReference": 0,
                  "nullability": "NULLABILITY_REQUIRED"
                }
              },
              "arguments": [{
                "value": {
                  "selection": {
                    "directReference": {
                      "structField": {
                        "field": 0
                      }
                    },
                    "rootReference": {
                    }
                  }
                }
              }, {
                "value": {
                  "literal": {
                    "emptyList": {
                      "type": {
                        "struct": {
                          "types": [{
                            "string": {
                              "typeVariationReference": 0,
                              "nullability": "NULLABILITY_REQUIRED"
                            }
                          }],
                          "typeVariationReference": 0,
                          "nullability": "NULLABILITY_REQUIRED"
                        }
                      },
                      "typeVariationReference": 0,
                      "nullability": "NULLABILITY_REQUIRED"
                    },
                    "nullable": false,
                    "typeVariationReference": 0
                  }
                }
              }],
              "options": []
            }
          }]
        }
      },
      "names": ["coalesce(array_struct_col, array())", "inner_struct_field"]
    }
  }],
  "expectedTypeUrls": [],
  "parameterBindings": []
}
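
For readers following the plan above: the only place the name "inner_struct_field" appears is the flat "names" list of baseSchema, which is matched against the positional struct types in depth-first order. A minimal Python sketch of that pairing follows; the helper apply_names and the simplified dict shapes are illustrative assumptions, not part of any Substrait library.

```python
from typing import Iterator


def apply_names(typ: dict, names: Iterator[str]) -> dict:
    """Attach names to a positional type, consuming `names` depth-first.

    `typ` mimics the JSON shape in the plan (hypothetical, simplified).
    """
    if "struct" in typ:
        fields = []
        for child in typ["struct"]["types"]:
            name = next(names)  # the field's own name comes first...
            fields.append((name, apply_names(child, names)))  # ...then nested names
        return {"struct": fields}
    if "list" in typ:
        return {"list": apply_names(typ["list"]["type"], names)}
    return {"scalar": next(iter(typ))}  # e.g. "string"


# Simplified copy of the baseSchema from the plan (nullability etc. omitted).
base_schema = {
    "names": ["array_struct_col", "inner_struct_field"],
    "struct": {
        "types": [{"list": {"type": {"struct": {"types": [{"string": {}}]}}}}]
    },
}

named = apply_names({"struct": base_schema["struct"]}, iter(base_schema["names"]))
```

The emptyList literal passed to coalesce has the same positional type but no accompanying names list, so a consumer has nothing to feed into a step like this for that argument.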

The error in DF shows the issue with inner field names, "inner_struct_field" vs "c0":

No function matches the given name and argument types 'coalesce(
    List(Field { name: \"item\", data_type: Struct([Field { name: \"inner_struct_field\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }),
    List(Field { name: \"item\", data_type: Struct([Field { name: \"c0\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(UserDefined)")
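
The mismatch in the error can be reproduced with a toy model of name-sensitive types. This is only a sketch: the classes Utf8, Struct, and ListOf and the helper strip_names are hypothetical stand-ins, not DataFusion or Arrow APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utf8:
    pass


@dataclass(frozen=True)
class Struct:
    fields: tuple  # tuple of (name, type) pairs


@dataclass(frozen=True)
class ListOf:
    item: object


def strip_names(t):
    """Reduce a type to its positional shape, discarding struct field names."""
    if isinstance(t, Struct):
        return tuple(strip_names(field_type) for _, field_type in t.fields)
    if isinstance(t, ListOf):
        return ("list", strip_names(t.item))
    return type(t).__name__


# First coalesce argument: names come from the input schema.
from_input = ListOf(Struct((("inner_struct_field", Utf8()),)))
# Second argument: the consumer had to invent a name for the unnamed field.
from_literal = ListOf(Struct((("c0", Utf8()),)))

name_sensitive_match = from_input == from_literal
positional_match = strip_names(from_input) == strip_names(from_literal)
```

Here name_sensitive_match is False while positional_match is True, which mirrors the situation above: the two arguments agree on field positions and types but not on field names.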

@drin
Member

drin commented Apr 18, 2025

I assume

values
   (
     array(
       named_struct('inner_struct_field', 'a')
     )
   )
  ,(null)
as table(array_struct_col)

should construct a virtual table like so:

Row ID   array_struct_col
1        ['a']
2        null

That seems to work as expected.

And then the first argument of coalesce(array_struct_col, array()) seems to correctly reference the "array_struct_col" field (field index 0):

...
"arguments": [
  {
    "value": {
      "selection": {
        "directReference": { "structField": { "field": 0 } },
        "rootReference": {}
      }
    }
  }
...

The second argument to coalesce is the empty list literal:

...
, {
     "value": {
       "literal": {
         "emptyList": {
           "type": {
             "struct": {
               "types": [  { "string": { ... } }  ]
               ,...
             }
           }
           ,...
         }
         ,...
       }
     }
  }
]
...

For this second argument to coalesce, DF assigns the generated field name "c0" to the unnamed struct field.

The direct change to solve this issue would be to add names to Type.Struct to allow producers to be explicit with the struct field names.

If we look at the relevant Literal fields, then we see a few interesting things:

  • empty literals can have types (e.g. Type.List empty_list and Type null)
  • non-empty literals don't have types (e.g. Struct struct and List list)

What really makes this scenario an unfortunate combination of things is:

  • the desired literal is an empty list of structs
  • there is no NamedStruct type, so a NamedStruct can't be used for a field of Type
  • even adding a NamedStruct empty_struct field doesn't provide a mechanism of having a List[NamedStruct]

A "hacky" solution is to add a repeated string names field to Literal.Struct and for the producer to use Literal.List[Literal.Struct] but without actually populating values into it.

Further, ScalarFunction.output_type would need to accommodate named structs, meaning that a solution must either be part of the type system or the output type should be expanded to literals (which sort of feels like 1.5x "hackyness" and not quite 2x "hackyness" to me).

Changing Type.Struct trickles quite far, and makes the existence of NamedStruct weird to me. But aside from my personal dislike, both can technically exist and it may not be much of an issue. Creating a Type.NamedStruct seems equally weird to me.

@ingomueller-net
Contributor

There was an ask for a more concrete example - so here's a Spark query that fails to roundtrip in current substrait-spark, and fails to execute in DataFusion.

  test("coalesce") {
    assertSqlSubstraitRelRoundTrip(
      "select coalesce(array_struct_col, array()) from (values (array(named_struct('inner_struct_field', 'a'))), (null) as table(array_struct_col))"
    )
  }

The Substrait plan created is as follows:

...

The error in DF shows the issue with inner field names, "inner_struct_field" vs "c0":

No function matches the given name and argument types 'coalesce(
    List(Field { name: \"item\", data_type: Struct([Field { name: \"inner_struct_field\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }),
    List(Field { name: \"item\", data_type: Struct([Field { name: \"c0\", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
)'. You might need to add explicit type casts.\n\tCandidate functions:\n\tcoalesce(UserDefined)")

That's very useful -- thanks a lot!

I think the problem is neither in the Substrait spec nor in the translation from Spark SQL to Substrait; the plan you provide looks like what I would expect.

What I think is going wrong is the translation from the Substrait plan to the internal plan format of DataFusion. It seems that in that conversion, the positional structs are converted to named structs, and they are converted incorrectly: of two structs that should have the same type, one has the field name inner_struct_field and the other has the field name c0. I think the latter is incorrect. My guess is that the conversion just picks a generic field name if the field is otherwise unnamed, and that is what leads to an incorrect DF (!) plan. What the conversion should do instead, I think, is realize that the struct type from the literal, where no field name is available, is the same struct type as that of the other argument of coalesce. As per the rules of DF, it therefore needs to have the same field names, and the conversion should fetch them from that other argument.
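
A rough sketch of the kind of name propagation I have in mind, in Python for brevity. The function borrow_names and the tuple-based type encoding are hypothetical and only illustrate the idea; this is not how the DataFusion consumer is actually written.

```python
def borrow_names(unnamed, reference):
    """Return `unnamed` with struct field names copied from `reference`.

    Types are encoded (hypothetically) as:
      ("struct", [(name, type), ...]), ("list", type), or a scalar name string.
    Raises TypeError if the two types differ in positional shape.
    """
    if isinstance(unnamed, str) or isinstance(reference, str):
        if unnamed != reference:
            raise TypeError(f"type mismatch: {unnamed!r} vs {reference!r}")
        return unnamed
    kind_u, body_u = unnamed
    kind_r, body_r = reference
    if kind_u != kind_r:
        raise TypeError(f"type mismatch: {kind_u!r} vs {kind_r!r}")
    if kind_u == "list":
        return ("list", borrow_names(body_u, body_r))
    # Struct: match fields by position and take the name from the reference.
    if len(body_u) != len(body_r):
        raise TypeError("struct field count mismatch")
    return (
        "struct",
        [
            (ref_name, borrow_names(u_type, r_type))
            for (_, u_type), (ref_name, r_type) in zip(body_u, body_r)
        ],
    )


column_type = ("list", ("struct", [("inner_struct_field", "utf8")]))
literal_type = ("list", ("struct", [("c0", "utf8")]))  # generic placeholder name

fixed_literal_type = borrow_names(literal_type, column_type)
```

With this, fixed_literal_type equals column_type, so a name-sensitive coalesce would see two identical argument types. Whether siblings are always the right place to fetch names from is a separate question.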

@drin
Member

drin commented Apr 23, 2025

@ingomueller-net so you think that an empty named_struct should always reflect the other input? I suppose that seems consistent given that the expectation seems to be that the producer would need to do the inverse to encode an empty named_struct in the way that @Blizzara is describing.

Specifically, if the producer would encode the array() in coalesce(array_struct_col, array()) as having the name inner_struct_field, then the consumer can infer the opposite and resolve the typed empty_list to the same type as what it's being coalesced with.

I don't think this helps with the general use case, though. When the argument to coalesce isn't an empty array, it seems like there's no valid way to include the struct field names. I'd be interested to see what substrait-spark attempts to produce, though.
