Merged
42 changes: 24 additions & 18 deletions website/docs/features/data-acceleration/hash-index.md
@@ -22,7 +22,7 @@ The hash index is an optional, high-performance indexing feature for Arrow-accel

## Configuration

To use the hash index, explicitly enable it and specify a primary key:
Hash indexing activates automatically on Arrow-accelerated datasets when a `primary_key` or [secondary index](./indexes) is configured. No additional parameter is required.

```yaml
datasets:
@@ -31,10 +31,14 @@ datasets:
    acceleration:
      engine: arrow
      primary_key: order_id
      params:
        hash_index: enabled
```

The hash index activates whenever:

- `engine` is `arrow` or `partitioned_arrow`,
- `acceleration.enabled` is `true`,
- and either `indexes` is set, or `primary_key` is set with a non-`caching` `refresh_mode`.
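
The activation rules above can be read as a single predicate. The following is a minimal Python sketch of that reading, not the runtime's implementation. The function name `hash_index_active` and the dict-shaped `acceleration` argument are hypothetical. Treating a missing `enabled` key as true is an assumption, made because the example configuration omits it.

```python
def hash_index_active(acceleration: dict) -> bool:
    """Return True if the documented activation rules are satisfied.

    Illustrative only: the dict keys mirror the YAML fields, and a missing
    `enabled` key is assumed to mean true.
    """
    # Rule 1: only the Arrow-based engines use the hash index.
    if acceleration.get("engine") not in ("arrow", "partitioned_arrow"):
        return False
    # Rule 2: acceleration itself must be enabled.
    if not acceleration.get("enabled", True):
        return False
    # Rule 3a: any configured secondary index activates hash indexing.
    if acceleration.get("indexes"):
        return True
    # Rule 3b: a primary key activates it unless refresh_mode is `caching`.
    has_primary_key = bool(acceleration.get("primary_key"))
    return has_primary_key and acceleration.get("refresh_mode") != "caching"
```

Under this reading, a dataset with `primary_key` and `refresh_mode: caching` is not indexed unless `indexes` is also set.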

### Secondary Indexes

Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the [`indexes`](./indexes) field in the acceleration configuration:
@@ -50,8 +54,6 @@ datasets:
        email: unique
        status: enabled
        '(region, category)': unique
      params:
        hash_index: enabled
```

Index types:
@@ -67,11 +69,14 @@ Only single-column `unique` secondary indexes currently accelerate queries. Non-

### Configuration Options

| Parameter | Type | Required | Default | Description |
| ------------- | -------------------- | --------------------------- | ---------- | -------------------------------------------- |
| `hash_index` | `enabled`/`disabled` | No | `disabled` | Enable hash indexing |
| `primary_key` | string or list | Yes (if hash_index enabled) | None | Column(s) for the primary key index |
| `indexes` | YAML map | No | None | Secondary indexes (see [indexes](./indexes)) |
| Parameter | Type | Required | Default | Description |
| ------------- | -------------- | ---------------------------------------- | ------- | -------------------------------------------- |
| `primary_key` | string or list | Yes (unless `indexes` is set) | None | Column(s) for the primary key index |
| `indexes` | YAML map | No | None | Secondary indexes (see [indexes](./indexes)) |

:::note `hash_index` parameter is ignored
The legacy `hash_index: enabled` parameter is accepted but no longer activates indexing on its own. When set, the runtime logs a warning and falls back to the automatic rules above. Remove `hash_index` from `params` to clear the warning.
:::
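
For configurations that are generated or validated programmatically, the cleanup described in the note can be sketched as below. This is an illustration under assumptions: `strip_legacy_hash_index` is a hypothetical helper, and it operates on an already-parsed `acceleration` mapping rather than on any Spice API.

```python
import copy


def strip_legacy_hash_index(acceleration: dict) -> dict:
    """Return a copy of the acceleration config without `params.hash_index`.

    Hypothetical helper: the input is a plain dict mirroring the YAML fields.
    """
    cleaned = copy.deepcopy(acceleration)  # leave the caller's dict untouched
    params = cleaned.get("params")
    if isinstance(params, dict):
        params.pop("hash_index", None)  # no error if the key is absent
        if not params:
            # Drop an empty `params` block so the output stays tidy.
            del cleaned["params"]
    return cleaned
```

Other keys under `params` are preserved; only the legacy key is removed.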

## Supported Data Types

@@ -239,22 +244,23 @@ Uses XXH3_64 with a fixed seed (`0x5370_6963_6541_4920` = "SpiceAI ") for:

**Solution**: This is expected behavior for small datasets. The full scan is faster than index overhead.

### Warning: "Add 'hash_index: enabled' to use primary_key for fast lookups"
### Warning: "The hash_index acceleration parameter is ignored for Arrow acceleration"

**Cause**: `primary_key` is specified but `hash_index` is not enabled.
**Cause**: `hash_index: enabled` is set in `params` but no longer activates indexing on its own.

**Solution**: Add `hash_index: enabled` to `params`:
**Solution**: Remove `hash_index` from `params`. Hash indexing activates automatically when `primary_key` or `indexes` is configured on an Arrow-accelerated dataset (see [Configuration](#configuration)).

```yaml
params:
  hash_index: enabled
```
### Hash index not active despite `primary_key` being set

**Cause**: `refresh_mode: caching` disables hash indexing even when `primary_key` is set; the caching path uses its own lookup strategy.

**Solution**: Use a non-caching `refresh_mode` (e.g. `full`, `append`, `changes`) for datasets that need point-lookup acceleration via the hash index.

### High Memory Usage

**Cause**: Index consumes ~17 bytes per row.

**Solution**:

- Disable hash_index for datasets where point lookups are rare
- Remove `primary_key` (and any `indexes`) for datasets where point lookups are rare, so that hash indexing is no longer applied
- Consider using a different acceleration engine for very large datasets
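
To judge whether the index footprint matters for a given dataset, the approximate 17 bytes per row quoted above can be turned into a quick estimate. The sketch below is back-of-the-envelope arithmetic only. The constant and function names are hypothetical, and the actual overhead may differ from the approximate figure.

```python
BYTES_PER_ROW = 17  # approximate per-row figure quoted in this section


def estimated_index_bytes(row_count: int) -> int:
    """Estimate hash index memory as row_count times the per-row figure."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    return row_count * BYTES_PER_ROW
```

For example, `estimated_index_bytes(10_000_000)` gives 170,000,000 bytes, or about 170 MB for a 10-million-row dataset.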