Open
Changes from 1 commit
87 changes: 75 additions & 12 deletions site/docs/extensions/index.md
@@ -132,15 +132,78 @@ The `any[\d]` types (i.e. `any1`, `any2`, ..., `any9`) impose an additional rest

## Advanced Extensions

Less common extensions can be defined using customization support at the serialization level. This includes the following kinds of extensions:

| Extension Type | Description |
| ------------------------------------ | ------------------------------------------------------------ |
| Relation Modification (semantic) | Extensions to an existing relation that will alter the semantics of that relation. These kinds of extensions require that any plan consumer understand the extension to be able to manipulate or execute that operator. Ignoring these extensions will result in an incorrect interpretation of the plan. An example extension might be creating a customized version of Aggregate that can optionally apply a filter before aggregating the data. <br /><br />Note: Semantic-changing extensions shouldn't change the core characteristics of the underlying relation. For example, they should *not* change the default direct output field ordering, change the number of fields output or change the behavior of physical property characteristics. If one needs to change one of these behaviors, one should define a new relation as described below. |
Contributor

This old content looks good to me (at least the guideline parts). Perhaps we could reorganize it into each section and call it out as notes.

| Relation Modification (optimization) | Extensions to an existing relation that can improve the efficiency of a plan consumer but don't fundamentally change the behavior of the operation. An example might be an estimated amount of memory the relation is expected to use or a particular algorithmic pattern that is perceived to be optimal. |
| New Relations | Creates an entirely new kind of relation. This is the most flexible way to extend Substrait but also makes the Substrait plan the least interoperable. In most cases it is better to use a semantic-changing relation as opposed to a new relation, as it means existing code patterns can easily be extended to work with the additional properties. |
| New Read Types | Defines a new subcategory of read that can be used in a ReadRel. One goal of Substrait is to provide a fairly extensive set of read patterns within the project as opposed to requiring people to define new types externally. As such, we suggest that you first talk with the Substrait community to determine whether your read type can be incorporated directly in the core specification. |
| New Write Types | Similar to a read type but for writes. As with reads, interested extenders should first discuss new write types with the community before using the extension mechanisms. |
| Plan Extensions | Semantic and/or optimization based additions at the plan level. |

Because extension mechanisms are different for each serialization format, please refer to the corresponding serialization sections to understand how these extensions are defined in more detail.
Advanced extensions provide a way to embed custom functionality that goes beyond the standard YAML-based simple extensions. Unlike simple extensions, advanced extensions use the Protocol Buffers `google.protobuf.Any` type to embed arbitrary extension data directly into Substrait messages.

### How Advanced Extensions Work

Advanced extensions come in three main forms, each discussed below:

1. Embedded extensions: These use the `AdvancedExtension` message for adding custom data to existing Substrait elements
Contributor

I'm not sure or don't remember whether element is commonly used across the doc. Now that I check again, advanced extensions are generously sprinkled over the proto file... :) That said, it would be clearer to mention the scope of the embedded extension like adding custom data to enclosing Substrait message (or element).

The text above mentions protobuf and calls out `Any`, so I'm wondering whether we can stick to "Substrait message" rather than "element".

Contributor Author

Makes sense, message makes more sense than element for consistency!

2. Custom relation types: For defining entirely new relational operations
3. Custom read/write types: For defining new ways to read from or write to data sources

### Embedded Extensions via `AdvancedExtension`

The simplest form of advanced extension uses the `AdvancedExtension` message, which carries two kinds of extension data:

```proto
message AdvancedExtension {
  // Optimizations are helpful information that don't influence semantics.
  // May be ignored by a consumer.
  repeated google.protobuf.Any optimization = 1;

  // Enhancements alter semantics. Cannot be ignored by a consumer.
  google.protobuf.Any enhancement = 2;
}
```

#### Optimizations

- Provide hints to improve performance but don't change the meaning of operations
- Can be safely ignored by consumers that don't understand them
- Examples: memory usage hints, preferred algorithms, caching strategies
Contributor

nit: example goes last?

- Multiple optimizations can be attached to a single element

#### Enhancements

- Modify the semantic behavior of operations
- Must be understood by consumers or the plan cannot be executed correctly
- Examples: custom aggregation logic, specialized join conditions, new relation types
Contributor

relational operators? or just custom operators?

Contributor Author

Yes, I can clarify a bit here!

- Only one enhancement per element
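
The difference between the two fields can be sketched from a consumer's point of view. This is a minimal illustration in plain Python, not Substrait's API: `AnyPayload` is a hypothetical stand-in for `google.protobuf.Any`, the `AdvancedExtension` dataclass only mirrors the shape of the message shown above, and the type URLs and `accept_extension` helper are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass(frozen=True)
class AnyPayload:
    """Hypothetical stand-in for google.protobuf.Any (type URL plus raw bytes)."""
    type_url: str
    value: bytes = b""


@dataclass
class AdvancedExtension:
    """Mirrors the message shape: many optimizations, at most one enhancement."""
    optimization: List[AnyPayload] = field(default_factory=list)
    enhancement: Optional[AnyPayload] = None


class UnsupportedEnhancementError(Exception):
    """Raised when a consumer meets an enhancement it does not understand."""


def accept_extension(ext: AdvancedExtension, known: Set[str]) -> List[AnyPayload]:
    """Return the optimizations this consumer understands.

    Unknown optimizations are dropped silently. An unknown enhancement
    changes semantics the consumer cannot honor, so it raises instead.
    """
    if ext.enhancement is not None and ext.enhancement.type_url not in known:
        raise UnsupportedEnhancementError(ext.enhancement.type_url)
    return [opt for opt in ext.optimization if opt.type_url in known]


# Hypothetical type URLs, used only for illustration.
known = {"type.example.com/acme.MemoryHint"}
ext = AdvancedExtension(
    optimization=[
        AnyPayload("type.example.com/acme.MemoryHint"),
        AnyPayload("type.example.com/acme.CacheHint"),  # unknown, safely ignored
    ]
)
usable = accept_extension(ext, known)
```

In this sketch the unknown `CacheHint` optimization is simply filtered out, whereas an `AdvancedExtension` whose `enhancement` has an unrecognized type URL would cause `accept_extension` to raise.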

#### Where AdvancedExtension Messages Can Be Used

The `AdvancedExtension` message can be attached to various parts of a Substrait plan:

| Location | Usage |
| --------------------------------- | ------------------------------------------- |
| **`Plan`** | Global extensions affecting the entire plan |
| **`RelCommon`** | Extensions for any relational operator |
| **Relations** (e.g. `ProjectRel`) | Extensions for a specific relation type |
| **Hints** | Extensions within optimization hints |
| **`ReadRel.NamedTable`** | Add custom metadata to named table references |
Contributor

nit: remove Add to be consistent with the rest of Usages? (i.e., either all starts with verbs or nouns)

Contributor

looks like LocalFiles also has advanced extension.

| **`WriteRel.NamedObjectWrite`** | Add custom metadata to write targets |
| **`DdlRel.NamedObjectWrite`** | Add custom metadata to DDL targets |
Contributor

noticed some of the advanced extension tags are not consistent in these Ddl related messages. Just note to self.


### Custom Relations

The second form of advanced extensions provides entirely new relational operations via dedicated extension relation types. These allow you to define custom relations while maintaining proper integration with the type system:

| Relation Type | Description | Examples |
| ---------------------- | ----------------------------------------------- | -------- |
| **`ExtensionLeafRel`** | Custom relations with no inputs | Custom table sources |
| **`ExtensionSingleRel`** | Custom relations with one input | Custom transforms |
| **`ExtensionMultiRel`** | Custom relations with multiple inputs | Custom joins |

These extension relations are first-class relation types in Substrait and can be used anywhere a standard relation would be used.
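
As a rough illustration of how the three relation types in the table divide up, the sketch below picks a wrapper based on how many input relations a custom operator has. It is plain Python with hypothetical names (`ExtensionRel`, `wrap_custom_relation`, the type URL); it models the custom payload as a type URL string and does not use the generated Substrait classes.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ExtensionRel:
    """Hypothetical model of an extension relation and its custom payload."""
    kind: str               # which extension relation type wraps the operator
    inputs: List[str]       # names standing in for child relations
    detail_type_url: str    # identifies the custom payload (an assumption here)


def wrap_custom_relation(inputs: Sequence[str], detail_type_url: str) -> ExtensionRel:
    """Choose the extension relation type from the number of inputs."""
    if len(inputs) == 0:
        kind = "ExtensionLeafRel"      # no inputs, e.g. a custom table source
    elif len(inputs) == 1:
        kind = "ExtensionSingleRel"    # one input, e.g. a custom transform
    else:
        kind = "ExtensionMultiRel"     # multiple inputs, e.g. a custom join
    return ExtensionRel(kind, list(inputs), detail_type_url)


# Hypothetical custom join over two child relations.
join = wrap_custom_relation(["orders", "customers"], "type.example.com/acme.FuzzyJoin")
```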

### Custom Read and Write Types

The third form of advanced extensions allows you to define extension data sources and destinations:

Comment on lines +216 to +217
Contributor

This section could also carry a note, like the sections above, about checking with the community before going down the extension path. If the scenario turns out to be common enough, it may go directly into the specification.

| Extension Type | Description | Examples |
| ------------------ | ------------------------------------------- | ---------------------------------------------------- |
| **`ReadRel.ExtensionTable`** | Define entirely new table source types | APIs, specialized formats |
| **`WriteRel.ExtensionObject`** | Define entirely new write destination types | APIs, specialized formats |
| **`DdlRel.ExtensionObject`** | Define entirely new DDL destination types | Catalogs, schema registries |
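
One way a consumer might support extension tables is to keep a registry keyed by the payload's type URL. The following is a sketch under assumptions: `ExtensionTableRegistry`, the reader signature, and the type URL are all hypothetical, and the payload is treated as opaque bytes rather than a real `google.protobuf.Any`.

```python
from typing import Callable, Dict, List

# A reader turns the opaque extension payload into rows (hypothetical shape).
TableReader = Callable[[bytes], List[dict]]


class ExtensionTableRegistry:
    """Maps a payload type URL to the code that knows how to read it."""

    def __init__(self) -> None:
        self._readers: Dict[str, TableReader] = {}

    def register(self, type_url: str, reader: TableReader) -> None:
        self._readers[type_url] = reader

    def read(self, type_url: str, payload: bytes) -> List[dict]:
        reader = self._readers.get(type_url)
        if reader is None:
            # An unknown table source cannot be read, so fail loudly.
            raise LookupError(f"no reader registered for {type_url}")
        return reader(payload)


registry = ExtensionTableRegistry()
registry.register(
    "type.example.com/acme.TextLinesTable",  # hypothetical type URL
    lambda payload: [{"line": line} for line in payload.decode("utf-8").splitlines()],
)
rows = registry.read("type.example.com/acme.TextLinesTable", b"a,b\n1,2")
```

A write-side registry for extension write or DDL targets could follow the same pattern, with writers in place of readers.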
