website/docs/features/data-acceleration/hash-index.md (24 additions & 18 deletions)
@@ -22,7 +22,7 @@ The hash index is an optional, high-performance indexing feature for Arrow-accel
 
 ## Configuration
 
-To use the hash index, explicitly enable it and specify a primary key:
+Hash indexing activates automatically on Arrow-accelerated datasets when a `primary_key` or [secondary index](./indexes) is configured. No additional parameter is required.
 
 ```yaml
 datasets:
datasets:
@@ -31,10 +31,14 @@ datasets:
     acceleration:
       engine: arrow
       primary_key: order_id
-      params:
-        hash_index: enabled
 ```
 
+The hash index activates whenever:
+
+- `engine` is `arrow` or `partitioned_arrow`,
+- `acceleration.enabled` is `true`,
+- and either `indexes` is set, or `primary_key` is set with a non-`caching` `refresh_mode`.
+
 ### Secondary Indexes
 
 Secondary indexes can be added on non-primary-key columns to accelerate equality lookups on those columns. Define them using the [`indexes`](./indexes) field in the acceleration configuration:
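
The activation rules added in this hunk can be read as a single predicate. The following is a minimal Python sketch of that reading, not the runtime's implementation: the function name `hash_index_active` and the dict layout are hypothetical, and a missing `enabled` key is treated as not enabled, which may differ from the runtime's default.

```python
ARROW_ENGINES = {"arrow", "partitioned_arrow"}


def hash_index_active(acceleration: dict) -> bool:
    """Model the documented activation rules (hypothetical helper)."""
    # Rule 1: the engine must be one of the Arrow-based engines.
    if acceleration.get("engine") not in ARROW_ENGINES:
        return False
    # Rule 2: acceleration must be enabled.
    # Assumption: a missing `enabled` key counts as disabled in this sketch.
    if acceleration.get("enabled") is not True:
        return False
    # Rule 3a: any configured secondary index activates the hash index.
    if acceleration.get("indexes"):
        return True
    # Rule 3b: a primary key activates it unless refresh_mode is "caching".
    has_primary_key = bool(acceleration.get("primary_key"))
    return has_primary_key and acceleration.get("refresh_mode") != "caching"
```

Under this reading, a dataset with `primary_key` and `refresh_mode: caching` gets no hash index unless `indexes` is also set.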
@@ -50,8 +54,6 @@ datasets:
         email: unique
         status: enabled
         '(region, category)': unique
-      params:
-        hash_index: enabled
 ```
 
 Index types:
Index types:
@@ -67,11 +69,14 @@ Only single-column `unique` secondary indexes currently accelerate queries. Non-
| `primary_key` | string or list | Yes (unless `indexes` is set) | None | Column(s) for the primary key index |
+| `indexes` | YAML map | No | None | Secondary indexes (see [indexes](./indexes)) |
+
+:::note `hash_index` parameter is ignored
+The legacy `hash_index: enabled` parameter is accepted but no longer activates indexing on its own. When set, the runtime logs a warning and falls back to the automatic rules above. Remove `hash_index` from `params` to clear the warning.
+:::
 
 ## Supported Data Types
 
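
The note added in the hunk above says the legacy `hash_index` parameter is ignored and should be removed from `params`. As a rough illustration of that cleanup, here is a small Python sketch that operates on an acceleration config already loaded into a dict. Both helper names are hypothetical and are not part of any Spice tooling.

```python
def legacy_hash_index_set(acceleration: dict) -> bool:
    """Return True when the ignored `hash_index` key is present in `params`."""
    params = acceleration.get("params") or {}
    return "hash_index" in params


def strip_legacy_hash_index(acceleration: dict) -> dict:
    """Return a copy of the config without `params.hash_index`.

    The input dict is left unchanged. An empty `params` map is dropped.
    """
    cleaned = dict(acceleration)
    params = {
        key: value
        for key, value in (acceleration.get("params") or {}).items()
        if key != "hash_index"
    }
    if params:
        cleaned["params"] = params
    else:
        cleaned.pop("params", None)
    return cleaned
```

Other entries in `params` are preserved, so the helper only removes the key that triggers the warning.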
@@ -239,22 +244,23 @@ Uses XXH3_64 with a fixed seed (`0x5370_6963_6541_4920` = "SpiceAI ") for:
 
 **Solution**: This is expected behavior for small datasets. The full scan is faster than index overhead.
 
-### Warning: "Add 'hash_index: enabled' to use primary_key for fast lookups"
+### Warning: "The hash_index acceleration parameter is ignored for Arrow acceleration"
 
-**Cause**: `primary_key` is specified but `hash_index` is not enabled.
+**Cause**: `hash_index: enabled` is set in `params` but no longer activates indexing on its own.
 
-**Solution**: Add `hash_index: enabled` to `params`:
+**Solution**: Remove `hash_index` from `params`. Hash indexing activates automatically when `primary_key` or `indexes` is configured on an Arrow-accelerated dataset (see [Configuration](#configuration)).
 
-```yaml
-params:
-  hash_index: enabled
-```
+### Hash index not active despite `primary_key` being set
+
+**Cause**: `refresh_mode: caching` disables hash indexing even when `primary_key` is set; the caching path uses its own lookup strategy.
+
+**Solution**: Use a non-caching `refresh_mode` (e.g. `full`, `append`, `changes`) for datasets that need point-lookup acceleration via the hash index.
 
 ### High Memory Usage
 
 **Cause**: Index consumes ~17 bytes per row.
 
 **Solution**:
 
-- Disable hash_index for datasets where point lookups are rare
+- Remove `primary_key` for datasets where point lookups are rare (hash indexing stops being applied)
 - Consider using a different acceleration engine for very large datasets
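
The "~17 bytes per row" figure in the High Memory Usage entry gives a quick way to size the index before deciding whether to keep `primary_key`. The sketch below is back-of-the-envelope arithmetic only. The function name is hypothetical, and the estimate assumes the per-row cost is constant and covers the index alone, not the accelerated data.

```python
BYTES_PER_ROW = 17  # approximate per-row cost quoted in the troubleshooting entry


def estimate_index_bytes(row_count: int) -> int:
    """Estimate hash index memory as row_count * ~17 bytes (rough sketch)."""
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    return row_count * BYTES_PER_ROW
```

For example, `estimate_index_bytes(100_000_000)` returns 1,700,000,000, or roughly 1.7 GB for a 100-million-row dataset.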