-
Couldn't load subscription status.
- Fork 115
[PROPOSAL]: Support indexes on nested arrays of struct #372
Description
Problem Statement
Hyperspace does NOT support indexing over columns/fields of type array of struct.
Background and Motivation
There are use cases where queries like the ones bellow can greatly benefit performance wise from Hyperspace's indexing.
df1.filter(array_contains(df1("nested.cnt"), 11)).show
df1.join(df2, array_contains(df1("nested.cnt"), df2("rank")))
Proposed Solution
This design is based on the work of proposal #347 and #365.
Given the following dataset
root
|-- id: string (nullable = true)
|-- v: integer (nullable = true)
|-- nested: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- cnt: integer (nullable = true)
+---+---+--------------------+
|id |v |nested |
+---+---+--------------------+
|id2|100|[[b1, 1]] |
|id1|100|[[a1, 1], [a2, 2]] |
|id3|110|[[c1, 11], [c2, 12]]|
+---+---+--------------------+
Anyone should be able to create an index with:
hs.createIndex(
nestedDataset,
IndexConfig(
"idx_arr_nested",
indexedColumns = Seq("nested.cnt"),
includedColumns = Seq("id")))
Alternatives
N/A
Known/Potential Compatibility Issues
N/A
Design
Adding the support for nested fields impacts the following areas:
- Validate nested column names
- Modify the create index action
- Modify the filter and rule index functions
Creating the index
As seen above the index may be created with already existing Hyperspace APIs.
It will NOT be possible to support multiple fields from a nested array due to the way Spark is working right now. This limitation is because arrays need to be exploded to create the proper index and Spark allows only one generator allowed per select clause.
Under the hood the index may be created with something like this:
df1.select(explode(df1("nested.cnt"), df1("id")).as("nested__cnt"))
The resulted data frame will be:
root
|-- nested__cnt: integer (nullable = true)
|-- id: string (nullable = true)
+-----------+---+
|nested__cnt| id|
+-----------+---+
| 1|id2|
| 1|id1|
| 2|id1|
| 11|id3|
| 12|id3|
+-----------+---+
It is important to understand that the name of the field of the index column is a non-nested column and due to parquet quirkiness on using . (dot) in the field name, it has to be properly renamed and at query time projected as it was. This will be based on the renaming pattern already implemented in #365.
Search query
Given the following search/filter query
df1.filter(array_contains(df1("nested.cnt"), 11)).select("id")
More to come...
Join Queries
TDB
Implementation
- First it will have to add the support of creating indexes over arrays.
- Then support search queries + hybrid scans
- Then support join queries + hybrid scans
Performance Implications (if applicable)
N/A