[PROPOSAL]: Support indexes on nested arrays of struct

## Problem Statement

Hyperspace does NOT support indexing over columns/fields of type array of struct.

## Background and Motivation

There are use cases where queries like the ones bellow can greatly benefit performance wise from Hyperspace's indexing.

```
df1.filter(array_contains(df1("nested.cnt"), 11)).show
```

```
df1.join(df2, array_contains(df1("nested.cnt"), df2("rank")))
```

## Proposed Solution

This design is based on the work of proposal #347 and #365.

Given the following dataset

```
root
 |-- id: string (nullable = true)
 |-- v: integer (nullable = true)
 |-- nested: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- cnt: integer (nullable = true)
```

```
+---+---+--------------------+
|id |v  |nested              |
+---+---+--------------------+
|id2|100|[[b1, 1]]           |
|id1|100|[[a1, 1], [a2, 2]]  |
|id3|110|[[c1, 11], [c2, 12]]|
+---+---+--------------------+
```

Anyone should be able to create an index with:

```
hs.createIndex(
  nestedDataset, 
  IndexConfig(
    "idx_arr_nested", 
    indexedColumns = Seq("nested.cnt"), 
    includedColumns = Seq("id")))
```

## Alternatives
N/A

## Known/Potential Compatibility Issues
N/A

## Design


Adding the support for nested fields impacts the following areas:

- Validate nested column names
- Modify the create index action
- Modify the filter and rule index functions

## Creating the index

As seen above the index may be created with already existing Hyperspace APIs.

:warning: ***Important*** :warning: 

*It will NOT be possible to support multiple fields from a nested array due to the way Spark is working right now. This limitation is because arrays need to be exploded to create the proper index and Spark allows `only one generator allowed per select clause`.*

Under the hood the index may be created with something like this:

```
df1.select(explode(df1("nested.cnt"), df1("id")).as("nested__cnt"))
```

The resulted data frame will be:

```
root
 |-- nested__cnt: integer (nullable = true)
 |-- id: string (nullable = true)
```

```
+-----------+---+
|nested__cnt| id|
+-----------+---+
|          1|id2|
|          1|id1|
|          2|id1|
|         11|id3|
|         12|id3|
+-----------+---+
```

It is important to understand that the name of the field of the index column is a non-nested column and due to parquet quirkiness on using `.` (dot) in the field name, it has to be properly renamed and at query time projected as it was. This will be based on the renaming pattern already implemented in #365.

### Search query

Given the following search/filter query

```
df1.filter(array_contains(df1("nested.cnt"), 11)).select("id")
```

More to come...

### Join Queries

TDB

## Implementation

1. First it will have to add the support of creating indexes over arrays.
2. Then support search queries + hybrid scans
3. Then support join queries + hybrid scans

## Performance Implications (if applicable)
N/A


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[PROPOSAL]: Support indexes on nested arrays of struct #372

Problem Statement

Background and Motivation

Proposed Solution

Alternatives

Known/Potential Compatibility Issues

Design

Creating the index

Search query

Join Queries

Implementation

Performance Implications (if applicable)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[PROPOSAL]: Support indexes on nested arrays of struct #372

Description

Problem Statement

Background and Motivation

Proposed Solution

Alternatives

Known/Potential Compatibility Issues

Design

Creating the index

Search query

Join Queries

Implementation

Performance Implications (if applicable)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions