Skip to content
This repository was archived by the owner on Jun 14, 2024. It is now read-only.
This repository was archived by the owner on Jun 14, 2024. It is now read-only.

[PROPOSAL]: Support indexes on nested arrays of struct #372

@andrei-ionescu

Description

@andrei-ionescu

Problem Statement

Hyperspace does NOT support indexing over columns/fields of type array of struct.

Background and Motivation

There are use cases where queries like the ones bellow can greatly benefit performance wise from Hyperspace's indexing.

df1.filter(array_contains(df1("nested.cnt"), 11)).show
df1.join(df2, array_contains(df1("nested.cnt"), df2("rank")))

Proposed Solution

This design is based on the work of proposal #347 and #365.

Given the following dataset

root
 |-- id: string (nullable = true)
 |-- v: integer (nullable = true)
 |-- nested: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- cnt: integer (nullable = true)
+---+---+--------------------+
|id |v  |nested              |
+---+---+--------------------+
|id2|100|[[b1, 1]]           |
|id1|100|[[a1, 1], [a2, 2]]  |
|id3|110|[[c1, 11], [c2, 12]]|
+---+---+--------------------+

Anyone should be able to create an index with:

hs.createIndex(
  nestedDataset, 
  IndexConfig(
    "idx_arr_nested", 
    indexedColumns = Seq("nested.cnt"), 
    includedColumns = Seq("id")))

Alternatives

N/A

Known/Potential Compatibility Issues

N/A

Design

Adding the support for nested fields impacts the following areas:

  • Validate nested column names
  • Modify the create index action
  • Modify the filter and rule index functions

Creating the index

As seen above the index may be created with already existing Hyperspace APIs.

⚠️ Important ⚠️

It will NOT be possible to support multiple fields from a nested array due to the way Spark is working right now. This limitation is because arrays need to be exploded to create the proper index and Spark allows only one generator allowed per select clause.

Under the hood the index may be created with something like this:

df1.select(explode(df1("nested.cnt"), df1("id")).as("nested__cnt"))

The resulted data frame will be:

root
 |-- nested__cnt: integer (nullable = true)
 |-- id: string (nullable = true)
+-----------+---+
|nested__cnt| id|
+-----------+---+
|          1|id2|
|          1|id1|
|          2|id1|
|         11|id3|
|         12|id3|
+-----------+---+

It is important to understand that the name of the field of the index column is a non-nested column and due to parquet quirkiness on using . (dot) in the field name, it has to be properly renamed and at query time projected as it was. This will be based on the renaming pattern already implemented in #365.

Search query

Given the following search/filter query

df1.filter(array_contains(df1("nested.cnt"), 11)).select("id")

More to come...

Join Queries

TDB

Implementation

  1. First it will have to add the support of creating indexes over arrays.
  2. Then support search queries + hybrid scans
  3. Then support join queries + hybrid scans

Performance Implications (if applicable)

N/A

Metadata

Metadata

Assignees

No one assigned

    Labels

    proposalThis is the default tag for a newly created design proposaluntriagedThis is the default tag for a newly created issue

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions