|
| 1 | +--- |
| 2 | +title: 'Distributed Query' |
| 3 | +sidebar_label: 'Distributed Query' |
| 4 | +description: 'Learn how to run Spice in distributed mode for larger scale queries.' |
| 5 | +sidebar_position: 4 |
| 6 | +pagination_prev: null |
| 7 | +pagination_next: null |
| 8 | +--- |
| 9 | + |
| 10 | +Learn how to configure and run Spice in distributed mode to handle larger scale queries across multiple nodes. |
| 11 | + |
| 12 | +:::info Preview |
| 13 | +Multi-node distributed query execution based on Apache Ballista is available as a preview feature in Spice `v1.9.0`. |
| 14 | +::: |
| 15 | + |
| 16 | +## Overview |
| 17 | + |
| 18 | +Spice integrates [Apache Ballista](https://github.com/apache/datafusion-ballista) to schedule and coordinate distributed queries across multiple executor nodes. This integration enables distributed execution when running large queries over partitioned data lake formats such as Parquet, Delta Lake, or Iceberg. |
| 19 | + |
| 20 | +## Architecture |
| 21 | + |
| 22 | +A distributed Spice cluster consists of two components: |
| 23 | + |
| 24 | +- **Scheduler** – Plans distributed queries and manages the work queue for the executor fleet. Single instance per cluster. |
| 25 | +- **Executors** – One or more nodes responsible for executing physical query plans. |
| 26 | + |
| 27 | +The scheduler holds the cluster-wide configuration for a Spicepod, while executors connect to the scheduler to receive work. |
| 28 | + |
| 29 | +## Getting Started |
| 30 | + |
| 31 | +Cluster deployment typically starts with a scheduler instance, followed by one or more executors that register with the scheduler. |
| 32 | + |
| 33 | +### Start the Scheduler |
| 34 | + |
| 35 | +The scheduler is the only `spiced` process that needs to be configured (i.e. have a `spicepod.yaml` in the current dir). Override the Flight bind address when it must be reachable outside of `localhost`: |
| 36 | + |
| 37 | +```bash |
| 38 | +# Start scheduler |
| 39 | +spiced --cluster-mode scheduler --flight 0.0.0.0:50051 |
| 40 | +``` |
| 41 | + |
| 42 | +### Start Executors |
| 43 | + |
| 44 | +Executors need the scheduler's Flight URI to register and pull work. The executors do not require a `spicepod.yaml` to be present, it will fetch the configuration from the coordinator. Each executor automatically selects a free port if the default is unavailable: |
| 45 | + |
| 46 | +```bash |
| 47 | +# Start executor |
| 48 | +spiced --cluster-mode executor --scheduler-url spiced://localhost:50051 |
| 49 | +``` |
| 50 | + |
| 51 | +## Query Execution |
| 52 | + |
| 53 | +Queries run against the scheduler endpoint. The `EXPLAIN` output confirms that distributed planning is active—Spice includes a `distributed_plan` section showing how stages are split across executors: |
| 54 | + |
| 55 | +```sql |
| 56 | +EXPLAIN SELECT count(id) FROM my_dataset; |
| 57 | +``` |
| 58 | + |
| 59 | +:::warning[Limitations] |
| 60 | + |
| 61 | +- Accelerated datasets are not yet supported; distributed query currently targets partitioned data lake sources. |
| 62 | +- As a preview feature, clusters may encounter stability or performance issues. |
| 63 | +- Accelerator support is planned for future releases; follow release notes for updates. |
| 64 | + |
| 65 | +::: |
0 commit comments