Commit ce0fdf7

fix: Update hash index docs for auto-enable behavior (#1730)
1 parent 033cc95 commit ce0fdf7

1 file changed

Lines changed: 24 additions & 18 deletions

website/docs/features/data-acceleration/hash-index.md

@@ -22,7 +22,7 @@ The hash index is an optional, high-performance indexing feature for Arrow-accel
 
 ## Configuration
 
-To use the hash index, explicitly enable it and specify a primary key:
+Hash indexing activates automatically on Arrow-accelerated datasets when a `primary_key` or [secondary index](./indexes) is configured. No additional parameter is required.
 
 ```yaml
 datasets:
@@ -31,10 +31,14 @@ datasets:
     acceleration:
       engine: arrow
       primary_key: order_id
-      params:
-        hash_index: enabled
 ```
 
+The hash index activates whenever:
+
+- `engine` is `arrow` or `partitioned_arrow`,
+- `acceleration.enabled` is `true`,
+- and either `indexes` is set, or `primary_key` is set with a non-`caching` `refresh_mode`.
+
 ### Secondary Indexes
 
 Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the [`indexes`](./indexes) field in the acceleration configuration:
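
The activation rules added in the hunk above can be illustrated with a minimal dataset entry. This is a sketch and not part of the commit: the `from` and `name` values are placeholders, and `refresh_mode: full` is written out only to make the non-`caching` condition visible.

```yaml
datasets:
  - from: postgres:orders   # placeholder source, not taken from the commit
    name: orders            # placeholder dataset name
    acceleration:
      enabled: true         # rule: acceleration.enabled is true
      engine: arrow         # rule: engine is arrow or partitioned_arrow
      refresh_mode: full    # rule: non-caching refresh_mode (assumed value)
      primary_key: order_id # rule: primary_key is set, so no hash_index param is needed
```

Under the rules as stated in the diff, replacing `primary_key` with an `indexes` map would also activate the hash index, regardless of `refresh_mode`.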
@@ -50,8 +54,6 @@ datasets:
         email: unique
         status: enabled
         '(region, category)': unique
-      params:
-        hash_index: enabled
 ```
 
 Index types:
@@ -67,11 +69,14 @@ Only single-column `unique` secondary indexes currently accelerate queries. Non-
 
 ### Configuration Options
 
-| Parameter | Type | Required | Default | Description |
-| ------------- | -------------------- | --------------------------- | ---------- | -------------------------------------------- |
-| `hash_index` | `enabled`/`disabled` | No | `disabled` | Enable hash indexing |
-| `primary_key` | string or list | Yes (if hash_index enabled) | None | Column(s) for the primary key index |
-| `indexes` | YAML map | No | None | Secondary indexes (see [indexes](./indexes)) |
+| Parameter | Type | Required | Default | Description |
+| ------------- | -------------- | ---------------------------------------- | ------- | -------------------------------------------- |
+| `primary_key` | string or list | Yes (unless `indexes` is set) | None | Column(s) for the primary key index |
+| `indexes` | YAML map | No | None | Secondary indexes (see [indexes](./indexes)) |
+
+:::note `hash_index` parameter is ignored
+The legacy `hash_index: enabled` parameter is accepted but no longer activates indexing on its own. When set, the runtime logs a warning and falls back to the automatic rules above. Remove `hash_index` from `params` to clear the warning.
+:::
 
 ## Supported Data Types
 
@@ -239,22 +244,23 @@ Uses XXH3_64 with a fixed seed (`0x5370_6963_6541_4920` = "SpiceAI ") for:
 
 **Solution**: This is expected behavior for small datasets. The full scan is faster than index overhead.
 
-### Warning: "Add 'hash_index: enabled' to use primary_key for fast lookups"
+### Warning: "The hash_index acceleration parameter is ignored for Arrow acceleration"
 
-**Cause**: `primary_key` is specified but `hash_index` is not enabled.
+**Cause**: `hash_index: enabled` is set in `params` but no longer activates indexing on its own.
 
-**Solution**: Add `hash_index: enabled` to `params`:
+**Solution**: Remove `hash_index` from `params`. Hash indexing activates automatically when `primary_key` or `indexes` is configured on an Arrow-accelerated dataset (see [Configuration](#configuration)).
 
-```yaml
-params:
-  hash_index: enabled
-```
+### Hash index not active despite `primary_key` being set
+
+**Cause**: `refresh_mode: caching` disables hash indexing even when `primary_key` is set; the caching path uses its own lookup strategy.
+
+**Solution**: Use a non-caching `refresh_mode` (e.g. `full`, `append`, `changes`) for datasets that need point-lookup acceleration via the hash index.
 
 ### High Memory Usage
 
 **Cause**: Index consumes ~17 bytes per row.
 
 **Solution**:
 
-- Disable hash_index for datasets where point lookups are rare
+- Remove `primary_key` for datasets where point lookups are rare (hash indexing stops being applied)
 - Consider using a different acceleration engine for very large datasets
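
For the new "Hash index not active despite `primary_key` being set" entry in the hunk above, the fix might look like the following. This is a sketch and not part of the commit: the `from` and `name` values are placeholders, and `full` is just one of the non-caching modes the diff lists.

```yaml
datasets:
  - from: postgres:orders   # placeholder source, not taken from the commit
    name: orders            # placeholder dataset name
    acceleration:
      enabled: true
      engine: arrow
      primary_key: order_id
      # refresh_mode: caching  # per the updated docs, no hash index is built in this mode
      refresh_mode: full       # a non-caching mode, so the primary_key is hash-indexed
```

The same entry also shows the state after the `hash_index` warning is resolved: there is no `params.hash_index` key, and indexing relies on the automatic rules alone.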
