- 
                Notifications
    
You must be signed in to change notification settings  - Fork 92
 
feat(isthmus): add dynamic function conversion for Substrait<->Calcite #457
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2f8f79a    to
    ad11957      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started taking a look at this today, but didn't have time to finish.
I'm going to restate what I think your doing with these changes to make sure that we're on the same page.
To start, you're generating SqlOperators dynamically from the function definitions in SimpleExtensions. You then use these to generate a SqlOperatorTable that covers functions that aren't already in the SubstraitOperatorTable.
This let's you do things like parse SQL queries with functions that are defined only in the SimpleExtension, and aren't part of the SubstraitOperatorTable.
With these definitions, you can also add additional mappings when converting from Substrait to Calcite, and back, for functions that aren't explicitly mapped in the static SCALAR_SIGS mapping.
Meta Comment
I find the usage of "custom" for this concept somewhat confusing. These aren't really custom functions because they are part of the core Substrait spec. A better word for this might be "dynamic". In my head, you're effectively generating SqlOperators dynamically based on the Substrait definitions, and then using them as fall-backs when explicit mappings.
| super(SubstraitOperatorTable.INSTANCE, catalogReader, catalogReader.getTypeFactory(), CONFIG); | ||
| } | ||
| 
               | 
          ||
| public SubstraitSqlValidator(SqlOperatorTable opTable, Prepare.CatalogReader catalogReader) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: It's slightly nicer API-wise if we re-order the arguments so that the constructors are:
public SubstraitSqlValidator(Prepare.CatalogReader catalogReader)
public SubstraitSqlValidator(Prepare.CatalogReader catalogReader, SqlOperatorTable opTable)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should call this SimpleExtensionToSqlOperator to more accurately capture what the conversion is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| super(FEATURES_DEFAULT); | ||
| } | ||
| 
               | 
          ||
| public SubstraitToSql(List<String> yamlFunctionFiles) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be better to keep this API simple and just make it:
SubstraitToSql(SimpleExtension.ExtensionCollection extensions)Users can then handle loading and merging extensions any way they see fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
| SimpleExtension.ExtensionCollection customExtensionCollection = | ||
| ExtensionUtils.getCustomExtensions(extensions); | ||
| List<SqlOperator> generatedCustomOperators = | ||
| YamlToSqlOperator.from(customExtensionCollection, this.factory); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we actually need to distinguish between the SqlOperators defined explicitly in the SubstraitOperatorTable and those that you are generating here?
I think we can just generate mappings for every function found in the extension collection, and then chain the operator tables like:
SqlOperatorTables.chain(
  SubstraitOperatorTable.INSTANCE,
  SqlOperatorTables.of(generatedCustomOperators)
);we automatically give precedence to the non-custom ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that precedence should be given the to default (non-custom) operators. What if users want to override them?
If I understood correctly, you are suggesting to generate SqlOperators for all extensions, the default ones and the dynamic:
List<SqlOperator> generatedCustomOperators =
        SimpleExtensionToSqlOperator.from(dynamicExtensionCollection, this.factory);I see two issues here:
- duplicated work (default operators already exist in SubstraitOperatorTable.INSTANCE
 - it requires the fully fledged TypeExpressionEvaluator that will probably be needed at some point anyway, but I guess this should be done in a different PR
 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that precedence should be given the to default (non-custom) operators.
I would argue that it's preferrable to bind to the built-ins when possible, because internally in both the RelBuilder and optimizer rules Calcite looks for specific operator instances when performing optimization (e.g. comparison simplification, aggregate rewrititing, etc). If we use dynamic operators instead of built-in ones, Calcite can no longer perform these optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed the order so that built-in operators have precedence.
| this(SimpleExtension.loadDefaults(), features); | ||
| } | ||
| 
               | 
          ||
| public SqlToSubstrait(SimpleExtension.ExtensionCollection extensions, FeatureBoard features) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should just be
public SqlToSubstrait(SqlOperatorTable operatorTable, FeatureBoard features)to give users direct control over the operator table they are using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this constructor we need to call the constructor of the base class, SqlConverterBase that requires an instance of SimpleExtension.ExtensionCollection as an argument. Otherwise it uses SimpleExtension.loadDefaults() which will cause an inconsistency between the operatorTable and the extensionCollection fields, as the former will contain operators that don't have corresponding extensions in the latter.
          
 Yes, you've correctly summarized the changes. 
 I renamed all "custom" to "dynamic" in the changes, however in my mind User Defined Functions is always something custom :)  | 
    
2356755    to
    b232a95      
    Compare
  
    | TypeConverter typeConverter) { | ||
| List<SimpleExtension.ValueArgument> requiredArgs = | ||
| function.args().stream() | ||
| .filter(SimpleExtension.Argument::required) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is meaningless because it's always true if you look at the implementations. I'm not even sure what require is used for actually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The required method is used here: 
| return args().stream().filter(Argument::required).collect(Collectors.toList()); | 
| List<SimpleExtension.ValueArgument> requiredArgs = | ||
| function.args().stream() | ||
| .filter(SimpleExtension.Argument::required) | ||
| .filter(t -> t instanceof SimpleExtension.ValueArgument) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's valid to ignore Enum arguments when generating functions.
For example in extract
    impls:
      - args:
          - name: component
            options: [ YEAR, ISO_YEAR, US_YEAR, HOUR, MINUTE, SECOND,
                       MILLISECOND, MICROSECOND, SUBSECOND, PICOSECOND, UNIX_TIME, TIMEZONE_OFFSET ]
            description: The part of the value to extract.
          - name: x
            value: timestamp_tz
          - name: timezone
            description: Timezone string from IANA tzdb.
            value: string
        return: i64
the enumeration argument for component is part of the function signature. It might make sense to ignore function with enum arguments to start with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the code to handle EnumArgument, however I'm not sure if currently sql with functions using it can be translated to substrait, at least I couldn't manage to do it. Comments in this function indirectly mention that.
| TypeExpressionEvaluator.evaluateExpression( | ||
| returnExpression, function.args(), substraitArgTypes); | ||
| 
               | 
          ||
| return typeConverter.toCalcite(typeFactory, resolvedSubstraitType); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're not using the nullability of the function to determine if the output is nullable or not. We should probably account for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added handling of nullability.
| SqlKind.OTHER_FUNCTION, | ||
| returnTypeInference, | ||
| null, | ||
| OperandTypes.family(argFamilies), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this works fine for MIRROR and DECLARED_OUTPUT nullabilities, but will potentially overmatch on DISCRETE which puts constraints on the nullability of input arguments when matching functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added handling of all three types of nullability.
f2527e1    to
    e7a3d0a      
    Compare
  
    # Conflicts: # isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java # Conflicts: # isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java # isthmus/src/test/java/io/substrait/isthmus/OptimizerIntegrationTest.java # isthmus/src/test/java/io/substrait/isthmus/PlanTestBase.java
# Conflicts: # isthmus/src/main/java/io/substrait/isthmus/SqlToSubstrait.java
This PR introduces support for dynamic scalar User-Defined Functions (UDFs) for round-trip conversion between Substrait plans and Calcite RelNodes.
It establishes the Substrait YAML extension files as the single source of truth (SSoT) for all function signatures. The primary addition is a new utility, YamlToSqlOperator, which acts as a bridge between the Substrait and Calcite worlds by dynamically generating Calcite SqlOperator objects from the parsed YAML definitions.
The implementation is optimized to only generate these operators for functions that are truly dynamic —that is, functions present in the provided extensions but not already defined in the default SubstraitOperatorTable.
Key Changes:
New YamlToSqlOperator Utility: A type-safe converter that translates SimpleExtension.ExtensionCollection objects into a List.
Enhanced SqlToSubstrait (SQL → Substrait): The SqlToSubstrait constructor is updated to:
Identify dynamic UDFs by filtering the loaded extensions against the known operators in SubstraitOperatorTable.
Use YamlToSqlOperator to generate operators for these dynamic UDFs.
Inject the new operators into Calcite's validator via a сhained SqlOperatorTable.
Enhanced SubstraitToCalcite (Substrait → SQL): The SubstraitRelNodeConverter is similarly enhanced to:
Also identify dynamic UDFs from the provided SimpleExtension.ExtensionCollection.
Dynamically create FunctionMappings.Sig objects for these UDFs, enabling the ScalarFunctionConverter to correctly map them back to their SqlOperator during plan conversion.
Architectural Refactoring: The EXTENSION_COLLECTION in SqlConverterBase has been changed from static final to an instance field to allow for the dynamic configuration required by this feature.