Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions website/docs/features/data-acceleration/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,8 @@ Locally accelerated datasets can also have [primary key constraints](data-accele

[Acceleration snapshots](data-acceleration/snapshots) (preview) help file-mode accelerations become ready in seconds by bootstrapping from managed snapshots stored in object storage such as Amazon S3.

For larger datasets, [partitioning](data-acceleration/partitioning) splits the acceleration into smaller physical units (Hive-style files, per-partition tables, or in-memory tables) keyed by an expression. Queries that filter on the partitioning column read only the relevant partitions, dramatically reducing scan size.

## Example Use Case

Consider a high-volume e-trading frontend application backed by an AWS RDS database containing a table of trades. To retrieve all trades over the last 24 hours, the application would need to query the remote database and transfer the data over the network. By accelerating the trades table locally using the [AWS RDS Data Connector](https://github.com/spiceai/cookbook/tree/trunk/mysql/rds-aurora#readme), the data is brought to the application, saving round trip time and data transfer time.
Expand Down
346 changes: 327 additions & 19 deletions website/docs/features/data-acceleration/partitioning.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,11 @@
---
title: 'Partitioning'
sidebar_label: 'Partitioning'
description: 'Partitioning for accelerated datasets'
description: 'Partition accelerated datasets to make filtered queries faster by reading only the relevant partitions.'
sidebar_position: 4
---

Accelerations can be partitioned using an arbitrary expression to group rows together into separate files.
This enables Spice to avoid reading unnecessary partitions, making particular queries faster.
To partition your accelerations, add the `partition_by` acceleration parameter:
Partitioning splits an accelerated dataset into multiple physical units (files, tables, or in-memory tables) keyed by an expression evaluated per row. Queries that filter on the partitioning expression — or on a column it references — only read the partitions that can match, dramatically reducing the data scanned.

```yaml
datasets:
Expand All @@ -23,26 +21,336 @@ datasets:
- bucket(50, PULocationID)
```

This example uses a `bucket` user-defined function (UDF) to hash the `PULocationID` column and put each row into one of 50 partition files.
This config writes 50 separate DuckDB files, each containing the rows whose `PULocationID` hashes to the same bucket. A query like `WHERE PULocationID = 132` only opens the single bucket file that could contain `132`.

This enables partition pruning for queries that filter on the column referenced in the `partition_by` expression:
## How partitioning works

```sql
SELECT * FROM taxi_trips WHERE PULocationID IN (1, 2, 3, 4, 5);
1. **At refresh time**, Spice evaluates each `partition_by` expression for every row and routes the row to a partition keyed by the expression's value.
2. **At query time**, Spice rewrites filters that reference the partition column or expression into a partition selection, and only reads partitions that could satisfy the filter ([partition pruning](#partition-pruning)).
3. **Composite partitioning** (Arrow and Cayenne) layers multiple expressions hierarchically — e.g. `year` then `month` — to combine pruning across dimensions.

Partitioning is most useful when:

- Queries reliably filter on a small subset of values (`region = 'EU'`, `tenant_id IN (...)`, `created_at >= today() - INTERVAL '7 days'`).
- A single column has high cardinality and partitioning by it directly would create too many tiny files — `bucket(N, col)` collapses to `N` partitions.
- The dataset grows large enough that scanning the whole acceleration on every query is expensive.

## Configuration

### `partition_by`

Lives directly under `acceleration:` (not under `acceleration.params:`). It's a list of expressions; each entry is either a plain string or a single-entry `{ name: expression }` mapping:

```yaml
acceleration:
enabled: true
engine: arrow
partition_by:
# Anonymous expression — auto-named "expr0", "expr1", …
- "YEAR(created_at)"
# Named expression
- month: "MONTH(created_at)"
```

Multi-entry mappings (`- year: "…", month: "…"` on one list item) are rejected at load time.

### `partition_mode` (DuckDB only)

Under `acceleration.params.partition_mode`. Selects how DuckDB physically lays out partitions:

| Value | Layout | Default |
| -------- | ---------------------------------------------------------------------------------------------------- | ------- |
| `files` | Each partition is its own DuckDB file in a Hive-style directory (`column=value/…`). | ✓ |
| `tables` | Single DuckDB file with one table per partition (discovered via `information_schema.tables`). | |

```yaml
acceleration:
enabled: true
engine: duckdb
mode: file
partition_by:
- bucket(50, PULocationID)
params:
partition_mode: tables # default is `files`
```

### Supported engines

| Engine | Required `mode:` | Multi-expression | Layout |
| --------- | ---------------- | ---------------- | ----------------------------------------------- |
| `arrow` | (memory; default)| Yes | One Arrow `MemTable` per partition value. |
| `duckdb` | `file` | No (single only) | Hive-style files (`partition_mode: files`) or per-partition tables (`partition_mode: tables`). |
| `cayenne` | `file` | Yes | One Vortex table per partition; catalog in a SQLite metadata file. |

`sqlite`, `postgres`, and `turso` accelerators do **not** support `partition_by`.

## Partition transforms

`partition_by` accepts any DataFusion-compatible scalar SQL expression that returns a String, integer, Boolean, or Timestamp and references exactly one column from the dataset. The most common partition transforms are:

### `bucket(num_buckets, column)`

Hashes `column` into `num_buckets` deterministic buckets.

- `num_buckets` must be a positive integer literal `≤ 1,000,000`.
- The return type matches `num_buckets` — `bucket(50, …)` (Int64 literal) returns `Int64`; `bucket(50::int32, …)` returns `Int32`.
- Same input always maps to the same bucket for a given `num_buckets` (uses ahash with a fixed seed).
- Use for: high-cardinality columns where direct partitioning would create too many partitions (`user_id`, `account_id`, `device_id`).

```yaml
partition_by:
- bucket(100, user_id)
```

This will result in a scan plan that only reads from the partitions that contain the values from the `IN` list.
See the [`bucket` reference](../../reference/sql/scalar_functions#bucket) for the full SQL signature.

### `truncate(width, value)`

Truncates to the next-lower multiple of `width`. Iceberg's truncate transform.

- `width` must be a positive `Int64` literal.
- `value` may be any signed/unsigned integer, `Decimal128`/`Decimal256`, `Utf8` (string), or `Binary`. For strings/binary, returns the first `width` units.
- Returns the same type as `value`.
- Use for: floor-bucketing wide numeric ranges (`truncate(1000, amount)`), or grouping strings by prefix (`truncate(2, country_code)`).

```yaml
partition_by:
- truncate(1000, amount)
```

See the [`truncate` reference](../../reference/sql/scalar_functions#truncate) for examples.

### `date_part(unit, column)` and `date_trunc(unit, column)`

:::warning[Limitations]
Built-in DataFusion datetime functions. Useful for time-based partitioning at year, month, day, or hour granularity.

- Partitioning is currently limited to `engine: duckdb` or `engine: cayenne` with `mode: file`.
- `partition_by` currently only supports 1 expression for partitioning.
- Expression must reference exactly one column from the dataset.
- Expression must produce a scalar value.
- Expression cannot contain a subquery.
- Partition pruning is limited to specific filter expressions such as:
- `WHERE foo = bar`
- `WHERE foo IN (bar, baz, ...)`
- `WHERE foo NOT IN (bar, baz, ...)`
- `date_part('year', col)` returns the integer year (e.g. `2026`).
- `date_trunc('day', col)` returns the timestamp truncated to the start of the day.

```yaml
partition_by:
- date_part('year', l_shipdate)
```

```yaml
partition_by:
- day: "date_trunc('day', created_at)"
```

`YEAR(col)`, `MONTH(col)`, `DAY(col)`, etc. are aliases of `date_part(...)` and work identically.

:::note Filter pruning for `date_part()` is not yet implemented
A `date_part('year', l_shipdate)` partition still produces correctly-distributed partitions, but a filter like `WHERE l_shipdate >= '2026-01-01'` does not currently translate to a partition pruning. Queries return correct results — they just scan more partitions than necessary. If your filter is on the bare partition expression (`WHERE date_part('year', l_shipdate) = 2026`), the equality form does prune.
:::

### Modulo (`column % N`)

A plain modulo expression also produces stable, partition-prunable buckets:

```yaml
partition_by:
- "id % 16"
```

Range filters on the base column (`id BETWEEN 0 AND 1000`) are pruned for `%` partitions.

### Plain column reference

A bare column name partitions one partition per distinct value — useful for low-cardinality columns:

```yaml
partition_by:
- region
```

Pruning supports equality, `IN`, `NOT IN`, and range filters on the column.

## Composite partitioning (Arrow + Cayenne)

Arrow and Cayenne accelerations accept multiple `partition_by` expressions. Spice partitions hierarchically — first by the leftmost expression, then by the next, and so on.

```yaml
acceleration:
enabled: true
engine: cayenne
mode: file
partition_by:
- year: "date_part('year', created_at)"
- month: "date_part('month', created_at)"
- region: region
```

A query with `WHERE region = 'EU' AND date_part('year', created_at) = 2026` prunes on both axes.

DuckDB partitioned acceleration rejects multiple expressions with `PartitionByRequired` (single-expression only).

## Partition pruning

Spice attempts to translate query-time filters into a partition selection. The matrix below summarizes which filter shapes prune which partition transforms:

| Filter shape | Plain column / `truncate` / `date_trunc` / `% N` | `bucket(N, col)` |
| ------------------------------------ | ------------------------------------------------ | --------------------------------------------- |
| `col = X` | ✓ | ✓ (filter substituted into bucket expression) |
| `col != X` | ✓ | ✗ (no pruning) |
| `col IN (a, b, …)` | ✓ | ✓ |
| `col NOT IN (a, b, …)` | ✓ | ✗ (no pruning) |
| `col < X` / `<= X` / `> X` / `>= X` | ✓ | ✓ only for **bounded** ranges (both lower and upper) |
| `col BETWEEN a AND b` | ✓ | ✓ (when range expands to ≤ a few thousand Int32 values) |
| `col = a OR col = b OR …` | ✓ | ✓ |
| `partition_expr = X` (e.g. `bucket(50, c) = 7`) | ✓ | ✓ |
| `expr1_filter AND expr2_filter` (composite partitions) | ✓ | ✓ |

Pruning notes:

- **Filter on the base column is substituted into the partition expression.** `WHERE user_id = 42` against a `bucket(100, user_id)` partition evaluates `bucket(100, 42)` once and reads only that partition.
- **Bucket inequality pruning** enumerates candidate values within a bounded `Int32` range (capped at `MAX_BUCKET_ENUMERATION_I32` candidate values). Open-ended ranges (`col < X` with no lower bound) and non-`Int32` types fall back to no pruning.
- **`date_part()` filter pruning is not yet implemented** for time-range filters. Equality on the partition expression still prunes; range filters on the base column do not.
- **Filters that don't fully resolve at the partition layer** are passed through to the data layer — pruning is best-effort and never returns wrong rows.

## Engine-specific behavior

### Arrow (`engine: arrow`)

In-memory MemTable per partition value. Partitions are rebuilt on every refresh. The simplest engine to start with for moderate datasets that fit in RAM. `hash_index` and `sort_columns` are propagated per partition.

### DuckDB files mode (`engine: duckdb`, `partition_mode: files`)

Each partition value becomes its own DuckDB file under a Hive-style directory layout (e.g. `accelerator_dir/dataset/year=2026/data.duckdb`). Useful when you want OS-level visibility of partitions and per-partition I/O.

Requires `mode: file`. Single-expression only.

### DuckDB tables mode (`engine: duckdb`, `partition_mode: tables`)

A single DuckDB file with one table per partition value. Cheaper to manage operationally (one file to back up, one connection pool to size), but loses per-partition file-level isolation.

Requires `mode: file`. Single-expression only.

### Cayenne (`engine: cayenne`)

Each partition is a separate Cayenne (Vortex) table; the partition catalog is tracked in a SQLite metadata file. Cayenne supports composite partitioning natively and is the right pick for very large datasets where Arrow would not fit in memory.

Requires `mode: file`.

## Changing `partition_by` after refresh

Once partitions exist on disk, changing `partition_by` is rejected:

> The `partition_by` expressions are different from the expressions used to create the existing partition files. Revert the `partition_by` expressions, delete the partition files, or change the location the partition files are stored to create new partitions.

To re-partition:

1. Stop the runtime.
2. Delete the partitioned acceleration directory or point `accelerator_dir` at a fresh location.
3. Update `partition_by`.
4. Restart — Spice will rebuild from the source.

There is no automatic in-place re-partitioning.

## Validation rules

A `partition_by` expression is rejected at startup if any of the following hold:

- Result type is not `String`, an integer (signed or unsigned), `Boolean`, or `Timestamp`.
- Expression references zero or multiple dataset columns.
- Expression contains a subquery, `OUTER REFERENCE`, `UNNEST`, window function, aggregate function, `EXISTS`, `GROUPING SET`, or `PLACEHOLDER`.
- Expression aliases the column (`col AS partition_key`).

For DuckDB engines, additionally:

- `mode:` must be `file`.
- Only one `partition_by` expression is allowed.

## Examples

### High-cardinality hash partitioning

Partition events by user, collapsing millions of users into 100 stable buckets:

```yaml
datasets:
- from: s3://my-bucket/events/
name: events
params:
file_format: parquet
acceleration:
enabled: true
engine: duckdb
mode: file
partition_by:
- bucket(100, user_id)
```

```sql
-- Reads the single bucket that hashes user_id 42:
SELECT COUNT(*) FROM events WHERE user_id = 42;

-- Reads at most 4 bucket files:
SELECT * FROM events WHERE user_id IN (1, 2, 3, 4);
```

### Time-based partitioning by year

```yaml
acceleration:
enabled: true
engine: duckdb
mode: file
partition_by:
- date_part('year', l_shipdate)
```

```sql
-- Equality on the partition expression prunes:
SELECT * FROM lineitem WHERE date_part('year', l_shipdate) = 2026;
```

### Composite year/month partitioning (Cayenne)

```yaml
acceleration:
enabled: true
engine: cayenne
mode: file
partition_by:
- year: "date_part('year', created_at)"
- month: "date_part('month', created_at)"
```

### Truncate-based numeric partitioning

```yaml
acceleration:
enabled: true
engine: arrow
partition_by:
- "truncate(1000, order_total_cents)"
```

```sql
-- Range pruning works because truncate is monotonic:
SELECT * FROM orders WHERE order_total_cents BETWEEN 5000 AND 9999;
```

### Plain column partitioning by region

```yaml
acceleration:
enabled: true
engine: arrow
partition_by:
- region
```

```sql
-- One partition read:
SELECT * FROM events WHERE region = 'EU';
```

## Related

- [`bucket` SQL reference](../../reference/sql/scalar_functions#bucket)
- [`truncate` SQL reference](../../reference/sql/scalar_functions#truncate)
- [`date_part` SQL reference](../../reference/sql/scalar_functions#date_part)
- [`date_trunc` SQL reference](../../reference/sql/scalar_functions#date_trunc)
- [Data refresh modes](./data-refresh) — `time_partition_column` for time-pruned refreshes
- [Sharded deployment](../../deployment/architectures/sharded) — partitioning across multiple Spice instances rather than within one acceleration
Loading
Loading