diff --git a/README.md b/README.md index 49a80608..f3d4ca06 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,10 @@ Welcome to the Spice.ai OSS Cookbook—a comprehensive collection of recipes for ### Core scenarios - [Federated SQL Query](./federation/README.md) - Query data from S3, PostgreSQL, and Dremio in a single query. +- [Cayenne Data Accelerator](./cayenne/README.md) +- [Async Queries](./async-queries/README.md) - Submit long-running SQL queries and retrieve results asynchronously. +- [Hybrid-Search](./search/README.md) - Combine keyword and vector search for improved retrieval. +- [AI SQL Function](./ai/README.md) - Use the `ai()` SQL function to invoke LLMs directly in SQL queries for text generation, sentiment analysis, and data enrichment. ### Sample Applications @@ -18,44 +22,45 @@ Welcome to the Spice.ai OSS Cookbook—a comprehensive collection of recipes for ### Models & AI - Connect data to hosted or local AI models -- [AI SQL Function](./ai/README.md) - Use the `ai()` SQL function to invoke LLMs directly in SQL queries for text generation, sentiment analysis, and data enrichment. -- [Azure OpenAI Models](./azure_openai/README.md) -- [Generative Visualizations](./generative-visualisations/README.md) - Generate SQL queries and Chart.js visualizations from natural language using AI. -- [Running Llama3 Locally](./llama/README.md) - Use the Llama family of models locally from HuggingFace using Spice. +- [AI SQL Function](./ai/README.md) - Invoke LLMs directly in SQL queries for text generation and data enrichment. +- [Azure OpenAI Models](./azure_openai/README.md) - Use Azure OpenAI for search and chat. +- [Generative Visualizations](./generative-visualisations/README.md) - Generate SQL queries and visualizations from natural language. +- [Running Llama3 Locally](./llama/README.md) - Run Llama models locally from HuggingFace. - [OpenAI Models](./models/openai/README.md) - Use OpenAI LLM and embedding models. 
- [OpenAI SDK](./openai_sdk/README.md) - Use the OpenAI SDK to connect to models hosted on Spice. - [LLM Memory](./llm-memory/README.md) - Persistent memory for language models. -- [Text to SQL (Tools)](./text-to-sql/README.md) -- [Nvidia NIM on Kubernetes](./nvidia-nim/kubernetes/README.md) - Deploy Nvidia NIM infrastructure, on Kubernetes, with GPUs connected to Spice. -- [Nvidia NIM on AWS EC2](./nvidia-nim/ec2/README.md) - Deploy Nvidia NIM on AWS GPU-optimized EC2 instances connected to Spice. -- [Searching GitHub Files](./search_github_files/README.md) - Search GitHub files with embeddings and vector similarity search. +- [Text to SQL (Tools)](./text-to-sql/README.md) - Query data with natural language. +- [Nvidia NIM on Kubernetes](./nvidia-nim/kubernetes/README.md) - Deploy Nvidia NIM on Kubernetes with GPUs. +- [Nvidia NIM on AWS EC2](./nvidia-nim/ec2/README.md) - Deploy Nvidia NIM on AWS GPU-optimized EC2 instances. +- [Searching GitHub Files](./search_github_files/README.md) - Search GitHub files with embeddings and vector search. - [xAI Models](./models/xai/README.md) - Use xAI models such as Grok. - [DeepSeek Model](./deepseek/README.md) - Use DeepSeek model through Spice. - [Filesystem Hosted Model](./models/filesystem/README.md) - Use models hosted directly on filesystems. -- [Web Search Tools using Perplexity](./websearch/README.md) - Provide LLMs with web search access for more informed answers. +- [Web Search Tools using Perplexity](./websearch/README.md) - Give LLMs web search access via Perplexity. - [Language Model Evaluations](./evals/README.md) - Use Spice to evaluate language models. -- [LLM as a Judge](./llm-judge/README.md) - Define LLM judge models to evaluate the performance of other language models. +- [LLM as a Judge](./llm-judge/README.md) - Define LLM judge models to evaluate other models. 
- [OpenAI Responses API](./openai-responses-api/README.md) - Use OpenAI's Responses API with Spice +- [Model Context Protocol (MCP)](./mcp/README.md) - Connect to MCP servers and use MCP tools with Spice. ### Data Acceleration - Materializing & accelerating data locally with Data Accelerators -- [Cayenne Data Accelerator](./cayenne/README.md) -- [DuckDB Data Accelerator](./duckdb/accelerator/README.md) -- [Hashed Partitioning with DuckDB](./hashed_partitioning/README.md) -- [PostgreSQL Data Accelerator](./postgres/accelerator/README.md) -- [SQLite Data Accelerator](./sqlite/accelerator/README.md) -- [Database Snapshots](./acceleration/snapshots/README.md) - Bootstrap DuckDB accelerations from object storage to skip cold starts. -- [Apache Arrow Data Accelerator](./arrow/README.md) -- [Accelerated Views](./views/README.md) +- [Cayenne Data Accelerator](./cayenne/README.md) - Accelerate data using Cayenne. +- [DuckDB Data Accelerator](./duckdb/accelerator/README.md) - Accelerate data using DuckDB. +- [Hashed Partitioning with DuckDB](./hashed_partitioning/README.md) - Prune data with hashed partitioning on categorical columns. +- [PostgreSQL Data Accelerator](./postgres/accelerator/README.md) - Materialize data into an attached PostgreSQL instance. +- [SQLite Data Accelerator](./sqlite/accelerator/README.md) - Accelerate data using SQLite. +- [Database Snapshots](./acceleration/snapshots/README.md) - Bootstrap accelerations from object storage to skip cold starts. +- [Apache Arrow Data Accelerator](./arrow/README.md) - Accelerate data using in-memory Arrow. +- [Accelerated Views](./views/README.md) - Pre-calculate and materialize derived data for faster queries. - [Dataset Partitioning](./acceleration/partitioning/README.md) - Partition accelerated datasets to improve query performance. ### Consuming and visualizing data with clients - [Sales BI (Apache Superset)](./sales-bi/README.md) - Visualize data in Spice with Apache Superset. 
- [Grafana Datasource](./grafana-datasource/README.md) - Add Spice as a Grafana datasource. -- [Python ADBC Client](./clients/adbc/README.md) - Query Spice using ADBC and Parameterized Queries with Python. -- [Java JDBC Client](./clients/java/README.md) - Query Spice using JDBC and Parameterized Queries with Java. -- [Scala JDBC Client](./clients/scala/README.md) - Query Spice using JDBC and Parameterized Queries with Scala. +- [Python ADBC Client](./clients/adbc/README.md) - Query Spice using ADBC with Python. +- [Java JDBC Client](./clients/java/README.md) - Query Spice using JDBC with Java. +- [Scala JDBC Client](./clients/scala/README.md) - Query Spice using JDBC with Scala. ### Connecting to Data Sources with Data Connectors @@ -65,49 +70,54 @@ Welcome to the Spice.ai OSS Cookbook—a comprehensive collection of recipes for - [MySQL Data Connector](./mysql/connector/README.md) - [AWS RDS Aurora (MySQL Compatible)](./mysql/rds-aurora/README.md) - [PlanetScale](./mysql/planetscale/README.md) -- [Clickhouse Data Connector](./clickhouse/README.md) +- [Clickhouse Data Connector](./clickhouse/README.md) - Connect to ClickHouse as a data source. - [Databricks Connector](./databricks/README.md) - Delta Lake and Spark Connect. - [Delta Lake Connector](./delta-lake/README.md) - Query data from Delta Lake tables. -- [Debezium Change Data Capture (CDC) Data Connector from Postgres](./cdc-debezium/README.md) - Stream changes from a Postgres database to Spice. - - [Debezium CDC SASL/SCRAM Authentication from MySQL](./cdc-debezium/sasl-scram/README.md) - Stream changes from a MySQL database to Spice using SASL/SCRAM authentication. -- [Dremio Data Connector](./dremio/README.md) +- [Debezium CDC Data Connector](./cdc-debezium/README.md) - Stream changes from Postgres to Spice. + - [Debezium CDC SASL/SCRAM from MySQL](./cdc-debezium/sasl-scram/README.md) - Stream changes from MySQL using SASL/SCRAM. 
+- [DynamoDB Data Connector](./dynamodb/README.md) - Query data from an AWS-hosted DynamoDB table. + - [DynamoDB Streams](./dynamodb/streams/README.md) - Stream real-time changes from DynamoDB tables. +- [Dremio Data Connector](./dremio/README.md) - Connect to a Dremio instance. - [DuckDB Data Connector](./duckdb/connector/README.md) - Use a DuckDB database with sample TPCH data. - [File Data Connector](./file/README.md) - Query data from local files. - [FTP Data Connector](./ftp/README.md) - Query data from an FTP server. -- [Glue Data Connector](./glue/README.md) -- [GitHub Data Connector](./github/README.md) -- [GraphQL Data Connector](./graphql/README.md) +- [Glue Data Connector](./glue/README.md) - Query tables in an AWS Glue Data Catalog. +- [GitHub Data Connector](./github/README.md) - Query GitHub repository data. +- [GraphQL Data Connector](./graphql/README.md) - Connect to GraphQL endpoints. - [HTTP Data Connector](./http/README.md) - Query data from HTTP(s) endpoints like REST APIs. -- [MSSQL (Microsoft SQL Server) Data Connector](./mssql/README.md) -- [ODBC Data Connector](./odbc/README.md) +- [MongoDB Data Connector](./mongodb/connector/README.md) - Connect to MongoDB as a data source. +- [MSSQL (Microsoft SQL Server) Data Connector](./mssql/README.md) - Query across multiple SQL Server instances. +- [ODBC Data Connector](./odbc/README.md) - Connect to databases via ODBC. - [Amazon Redshift](./redshift/README.md) - Read and write TPC-H data with Amazon Redshift. -- [Oracle Data Connector](./oracle/README.md) -- [S3 Data Connector](./s3/README.md) +- [Oracle Data Connector](./oracle/README.md) - Connect to and accelerate data from Oracle. +- [S3 Data Connector](./s3/README.md) - Query data from an S3 bucket. - [ScyllaDB Data Connector](./scylladb/README.md) - Query data from ScyllaDB clusters using federated SQL. 
-- [SharePoint/OneDrive for Business Data Connector](./sharepoint/README.md) +- [SharePoint/OneDrive for Business Data Connector](./sharepoint/README.md) - Query documents in SharePoint. - [SMB Data Connector](./smb/README.md) - Query data files from SMB/CIFS network shares. -- [Snowflake Data Connector](./snowflake/README.md) -- [Spice.ai Cloud Platform Data Connector](./spiceai/README.md) -- [Apache Spark Data Connector](./spark/README.md) -- [Apache Kafka Data Connector](./kafka/README.md) -- [IMAP Data Connector](./imap/README.md) +- [Snowflake Data Connector](./snowflake/README.md) - Access a Snowflake database. +- [Spice.ai Cloud Platform Data Connector](./spiceai/README.md) - Connect to Spice.ai Cloud Platform datasets. +- [Apache Spark Data Connector](./spark/README.md) - Read data from an Apache Spark instance. +- [Apache Kafka Data Connector](./kafka/README.md) - Stream data from Kafka with federated queries. +- [IMAP Data Connector](./imap/README.md) - Connect to an IMAP email server. - [Connecting to an Outlook mailbox](./imap/outlook.md) ### Connecting to Data Sources with Catalog Connectors -- [Spice.ai Cloud Platform Catalog Connector](./catalogs/spiceai/README.md) -- [Databricks Unity Catalog Connector](./catalogs/databricks/README.md) -- [Unity Catalog Connector](./catalogs/unity_catalog/README.md) -- [Iceberg Catalog Connector](./catalogs/iceberg/README.md) -- [Glue Catalog Connector](./catalogs/glue/README.md) +- [Spice.ai Cloud Platform Catalog Connector](./catalogs/spiceai/README.md) - Query datasets in Spice.ai Cloud Platform. +- [Databricks Unity Catalog Connector](./catalogs/databricks/README.md) - Query Databricks Unity Catalog tables. +- [Unity Catalog Connector](./catalogs/unity_catalog/README.md) - Query an open-source Unity Catalog instance. +- [Iceberg Catalog Connector](./catalogs/iceberg/README.md) - Query and write to Iceberg tables. 
+- [Iceberg Hadoop Catalog Connector](./catalogs/iceberg-hadoop/README.md) - Connect to Hadoop catalogs on S3-compatible storage. +- [Glue Catalog Connector](./catalogs/glue/README.md) - Query tables in an AWS Glue Data Catalog. ### Using Vector Engines -- [Amazon S3 Vectors](./vectors/s3-vectors/README.md) - Use Amazon S3 as a vector engine for embeddings and similarity search. +- [Amazon S3 Vectors](./vectors/s3-vectors/README.md) - Use S3 as a vector engine for embeddings and similarity search. ## Search - [Hybrid-Search](./search/README.md) - Combine keyword and vector search for improved retrieval. +- [Full-Text Search](./full-text-search/README.md) - Retrieve records matching keywords using BM25 scoring. ### Deployment and Installation @@ -118,24 +128,24 @@ Welcome to the Spice.ai OSS Cookbook—a comprehensive collection of recipes for ### Performance -- [TPC-H Benchmarking](./tpc-h/README.md) -- [SQL Results Caching](./caching/sql_results/README.md) -- [Caching Accelerator](./caching/accelerator/README.md) - Intelligent HTTP response caching with Stale-While-Revalidate (SWR) support. -- [Indexes on Accelerated Data](./acceleration/indexes/README.md) +- [TPC-H Benchmarking](./tpc-h/README.md) - Run TPC-H benchmark queries. +- [SQL Results Caching](./caching/sql_results/README.md) - Cache query results in memory for faster repeated queries. +- [Caching Accelerator](./caching/accelerator/README.md) - HTTP response caching with SWR support. +- [Indexes on Accelerated Data](./acceleration/indexes/README.md) - Create indexes to improve query performance. ### Acceleration Data Configuration -- [Data Retention Policy](./retention/README.md) -- [Refresh Data Window](./refresh-data-window/README.md) -- [Advanced Data Refresh](./acceleration/data-refresh/README.md) -- [Data Quality with Constraints](./acceleration/constraints/README.md) +- [Data Retention Policy](./retention/README.md) - Evict data older than a specified duration. 
+- [Refresh Data Window](./refresh-data-window/README.md) - Filter data refresh to only recent data. +- [Advanced Data Refresh](./acceleration/data-refresh/README.md) - Configure and tune data refresh for accelerated datasets. +- [Data Quality with Constraints](./acceleration/constraints/README.md) - Enforce data quality constraints on accelerated datasets. ## Client SDKs - Recipes for querying data from Spice with language-specific SDKs - [Rust SDK](client-sdk/spice-rs-sdk-sample/README.md) - [Python SDK](client-sdk/spicepy-sdk-sample/README.md) - [Go SDK](client-sdk/gospice-sdk-sample/README.md) -- [JavaScript SDK (Node.js)](client-sdk/spice.js-sdk-sample/README.md) - Query NYC taxi trips data using the [`@spiceai/spice`](https://www.npmjs.com/package/@spiceai/spice) npm package. +- [JavaScript SDK (Node.js)](client-sdk/spice.js-sdk-sample/README.md) - Query data using the `@spiceai/spice` npm package. - [Java SDK](client-sdk/spice-java-sdk-sample/README.md) ### Security @@ -145,5 +155,6 @@ Welcome to the Spice.ai OSS Cookbook—a comprehensive collection of recipes for ### Advanced Topics -- [Local dataset replication](./localpod/README.md) - Link datasets in a parent/child relationship within the current Spicepod -- [Distributed Query](./distributed/README.md) - Run queries distributed across multiple nodes for maximum performance across large datasets +- [Local dataset replication](./localpod/README.md) - Link datasets in a parent/child relationship. +- [Distributed Query](./distributed/README.md) - Run queries distributed across multiple nodes. +- [JSON Strings](./json_strings/README.md) - Work with JSON strings using JSON functions. diff --git a/async-queries/README.md b/async-queries/README.md new file mode 100644 index 00000000..3ca0ef7f --- /dev/null +++ b/async-queries/README.md @@ -0,0 +1,327 @@ +# Async Queries + +> **Note:** Async queries require Spice v2.0 or later. 
+ +This recipe demonstrates how to use the async queries API to submit long-running SQL queries and retrieve results asynchronously. It shows how to: + +- Submit queries via the HTTP API and CLI +- Poll for query completion +- Retrieve paginated results +- Cancel running queries +- Use the interactive `spice query` REPL + +Async queries build on top of [distributed query](../distributed/README.md) mode and require cluster mode with a scheduler and at least one executor. + +## Prerequisites + +- [Spice CLI](https://docs.spiceai.org/getting-started) installed (v2.0+) + +## Getting Started + +### Step 1: Prepare Working Directory + +Clone the cookbook repository and navigate to the `async-queries` directory. + +```bash +git clone https://github.com/spiceai/cookbook.git +cd cookbook/async-queries +``` + +### Step 2: Generate Development mTLS Certificates + +Generate mTLS certificates for the scheduler and executor: + +```bash +spice cluster tls init +spice cluster tls add scheduler1 +spice cluster tls add executor1 +``` + +### Step 3: Start the Spice Scheduler + +Start the scheduler with cluster mode and the `scheduler.state_location` configured in the `spicepod.yaml`: + +```bash +~/.spice/bin/spiced --role scheduler \ + --node-bind-address 127.0.0.1:50052 \ + --node-advertise-address 127.0.0.1 \ + --http 127.0.0.1:8090 \ + --flight 127.0.0.1:50051 \ + --node-mtls-ca-certificate-file ~/.spice/pki/ca.crt \ + --node-mtls-certificate-file ~/.spice/pki/scheduler1.crt \ + --node-mtls-key-file ~/.spice/pki/scheduler1.key +``` + +The scheduler starts and registers the `data` dataset: + +```console +2026-03-02T12:00:00.000000Z INFO spiced: Starting runtime +2026-03-02T12:00:01.000000Z INFO runtime::cluster: Starting Ballista scheduler on 0.0.0.0:50052 +2026-03-02T12:00:01.000000Z INFO runtime::init::dataset: Dataset data initializing... 
+2026-03-02T12:00:01.000000Z INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051 +2026-03-02T12:00:01.000000Z INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090 +2026-03-02T12:00:03.000000Z INFO runtime::init::dataset: Dataset data registered (s3://spiceai-public-datasets/hive_partitioned_data/), results cache enabled. +2026-03-02T12:00:03.000000Z INFO runtime: All components are loaded. Spice runtime is ready! +``` + +### Step 4: Start the Spice Executor + +In a new terminal, start the executor: + +```bash +~/.spice/bin/spiced --role executor \ + --http 127.0.0.1:9090 \ + --scheduler-address 127.0.0.1:50052 \ + --node-mtls-ca-certificate-file ~/.spice/pki/ca.crt \ + --node-mtls-certificate-file ~/.spice/pki/executor1.crt \ + --node-mtls-key-file ~/.spice/pki/executor1.key \ + --node-bind-address 127.0.0.1:50062 \ + --node-advertise-address 127.0.0.1 +``` + +```console +2026-03-02T12:01:00.000000Z INFO spiced: Starting runtime +2026-03-02T12:01:01.000000Z INFO ballista_executor::execution_loop: Starting poll work loop with scheduler +2026-03-02T12:01:01.000000Z INFO runtime: All components are loaded. Spice runtime is ready! +``` + +### Step 5: Submit an Async Query via HTTP + +In a new terminal, submit a query using `curl`: + +```bash +curl -s http://127.0.0.1:8090/v1/queries \ + -H "Content-Type: application/json" \ + -d '{"sql": "SELECT * FROM data LIMIT 100"}' | jq . +``` + +The API returns immediately with a query ID and status URLs: + +```json +{ + "query_id": "01ABC-DEF-456-7890AB", + "status": "PENDING", + "error": null, + "status_url": "/v1/queries/01ABC-DEF-456-7890AB/status", + "results_url": "/v1/queries/01ABC-DEF-456-7890AB/results" +} +``` + +### Step 6: Poll for Completion + +Use the `status_url` to poll until the query completes: + +```bash +curl -s http://127.0.0.1:8090/v1/queries/01ABC-DEF-456-7890AB/status | jq . 
+``` + +While still running: + +```json +{ + "status": "RUNNING", + "error": null +} +``` + +Once completed: + +```json +{ + "status": "SUCCEEDED", + "error": null +} +``` + +### Step 7: Retrieve Results + +Fetch the first chunk of results: + +```bash +curl -s http://127.0.0.1:8090/v1/queries/01ABC-DEF-456-7890AB/results | jq . +``` + +```json +{ + "chunk_index": 0, + "row_offset": 0, + "row_count": 100, + "next_chunk_index": null, + "next_chunk_url": null, + "data_array": [ + { "id": 30, "value": "value_0" }, + { "id": 31, "value": "value_1" } + ] +} +``` + +For queries with more than 10,000 rows, follow the `next_chunk_url` to paginate through results: + +```bash +curl -s http://127.0.0.1:8090/v1/queries/01ABC-DEF-456-7890AB/results/chunks/1 | jq . +``` + +## Using the CLI + +The `spice query` command provides a convenient CLI wrapper around the async queries API. + +### Submit and Wait + +```bash +spice query "SELECT * FROM data LIMIT 10;" +``` + +```console +Submitted query: 01ABC-DEF-456-7890AB (PENDING) +Waiting for completion... (Ctrl+C to stop waiting) +✓ SUCCEEDED (3.2s) ++----+---------+ +| id | value | ++----+---------+ +| 30 | value_0 | +| 31 | value_1 | +| 32 | value_2 | +| 33 | value_3 | +| 34 | value_4 | +| 35 | value_5 | +| 36 | value_6 | +| 37 | value_7 | +| 38 | value_8 | +| 39 | value_9 | ++----+---------+ + +Time: 3.20000000 seconds. 10 rows. 
+``` + +### Submit Without Waiting + +```bash +spice query "SELECT * FROM data;" --no-wait +``` + +```console +Submitted query: 01ABC-DEF-456-7890AB (PENDING) +Check status with: spice query status 01ABC-DEF-456-7890AB +Get results with: spice query results 01ABC-DEF-456-7890AB +``` + +### List Running Queries + +```bash +spice query list --status running +``` + +```console +QUERY ID STATUS CREATED SQL PREVIEW +01ABC-DEF-456-7890AB RUNNING 2026-03-02T12:00:00+00:00 SELECT * FROM data + +Total: 1 queries +``` + +### Cancel a Query + +```bash +spice query cancel 01ABC-DEF-456-7890AB +``` + +```console +Query 01ABC-DEF-456-7890AB cancelled (status: CANCELLED) +``` + +## Interactive REPL + +Start the REPL for a session-based workflow: + +```bash +spice query +``` + +```console +Welcome to the Spice.ai async query REPL. +Type SQL to submit a query, or .help for commands. + +query> SELECT COUNT(*) FROM data; +Submitted query: 01ABC-DEF-456-7890AB (PENDING) +Press Ctrl+C to stop waiting (query continues in background) +✓ SUCCEEDED (2.8s) ++----------+ +| count(*) | ++----------+ +| 100 | ++----------+ + +Time: 2.80000000 seconds. 1 rows. + +query> .list +QUERY ID STATUS SUBMITTED SQL +01ABC-DEF-456-7890AB SUCCEEDED 3s ago SELECT COUNT(*) FROM data; + +query> .exit +``` + +### REPL Commands + +| Command | Description | +| --------------- | -------------------------------------- | +| `.list` | List tracked queries from this session | +| `.status ` | Show query status | +| `.results ` | Fetch and display results | +| `.wait ` | Resume waiting for a query | +| `.cancel ` | Cancel a running query | +| `.help` | Show all commands | +| `.exit` | Exit the REPL | + +Partial query IDs are supported — `01ABC` resolves to the full ID if it uniquely matches one tracked query. 
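The partial-ID rule can be pictured as a unique-prefix lookup over the IDs tracked in the current session. The sketch below is illustrative only: `resolve_query_id` is a hypothetical helper, not the REPL's actual implementation, and it assumes matching is a plain "starts with" comparison.

```python
from typing import Sequence


def resolve_query_id(prefix: str, tracked_ids: Sequence[str]) -> str:
    """Return the single tracked ID that starts with `prefix`.

    Hypothetical helper: raises ValueError when the prefix matches
    no tracked query or more than one.
    """
    matches = [qid for qid in tracked_ids if qid.startswith(prefix)]
    if not matches:
        raise ValueError(f"no tracked query matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(
            f"{prefix!r} is ambiguous: {len(matches)} tracked queries match"
        )
    return matches[0]
```

Under this assumption, a prefix shared by two tracked queries is rejected rather than resolved to the first match, so a longer prefix is needed to disambiguate.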
+ +## Advanced: Parameterized Queries + +Submit queries with bind parameters to safely include dynamic values: + +```bash +curl -s http://127.0.0.1:8090/v1/queries \ + -H "Content-Type: application/json" \ + -d '{ + "sql": "SELECT * FROM data WHERE id > $1 LIMIT $2", + "parameters": [50, 10] + }' | jq . +``` + +## Advanced: Timeouts and Size Limits + +Set a per-query timeout (the query is automatically cancelled on expiry): + +```bash +curl -s http://127.0.0.1:8090/v1/queries \ + -H "Content-Type: application/json" \ + -d '{ + "sql": "SELECT * FROM data", + "timeout_seconds": 30 + }' | jq . +``` + +Set a maximum result size (the query fails if results exceed it): + +```bash +curl -s http://127.0.0.1:8090/v1/queries \ + -H "Content-Type: application/json" \ + -d '{ + "sql": "SELECT * FROM data", + "maximum_size": 10485760 + }' | jq . +``` + +## How It Works + +1. **Submit**: A `POST /v1/queries` request creates a job in the scheduler's object store (configured via `scheduler.state_location`) and spawns a background task. +2. **Execute**: The background task submits the query to the Ballista distributed scheduler, which distributes execution across connected executors. +3. **Stream**: As result batches arrive from the executors, they are written to the object store in chunks of 10,000 rows. +4. **Complete**: The job is marked as `SUCCEEDED` with result metadata (schema, row count, chunk count). +5. **Retrieve**: Clients fetch results by chunk index. Results are available for 12 hours after completion. +6. **Cleanup**: Expired job results are periodically cleaned up from the object store. 
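The submit, poll, and retrieve steps above can be sketched as a small Python client using only the standard library. This is a minimal sketch, not an official SDK: `http_json` and `run_async_query` are hypothetical names, the base URL is the scheduler HTTP address used in this recipe, and the code assumes that `next_chunk_url` is a path relative to the same host (like `status_url`) and that any terminal status other than `SUCCEEDED` should be treated as a failure.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Optional

BASE_URL = "http://127.0.0.1:8090"  # scheduler HTTP address from this recipe


def http_json(path: str, body: Optional[dict] = None) -> dict:
    """GET `path`, or POST `body` as JSON when a body is given."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def run_async_query(
    sql: str,
    fetch: Callable[..., dict] = http_json,
    poll_interval: float = 1.0,
) -> Iterator[dict]:
    """Submit `sql`, poll until it finishes, then yield rows chunk by chunk."""
    # Submit: the API returns immediately with status and results URLs.
    job = fetch("/v1/queries", {"sql": sql})
    status = job["status"]

    # Poll: keep checking while the query is still pending or running.
    while status in ("PENDING", "RUNNING"):
        time.sleep(poll_interval)
        status = fetch(job["status_url"])["status"]

    # Assumption: anything other than SUCCEEDED here is a failure.
    if status != "SUCCEEDED":
        raise RuntimeError(f"query {job['query_id']} ended with status {status}")

    # Retrieve: follow next_chunk_url until it is null.
    url = job["results_url"]
    while url is not None:
        chunk = fetch(url)
        yield from chunk["data_array"]
        url = chunk["next_chunk_url"]
```

With the scheduler and executor from this recipe running, `for row in run_async_query("SELECT * FROM data LIMIT 100"): print(row)` would print each result row. The `fetch` parameter exists so the HTTP layer can be swapped out, for example with a stub in tests.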
+ +## Learn More + +- [Distributed Query — Async Queries API](https://spiceai.org/docs/features/distributed-query#async-queries-api) — Full HTTP and Flight API reference +- [Distributed Query](https://spiceai.org/docs/features/distributed-query) — Distributed multi-node SQL execution overview +- [Distributed Query Recipe](../distributed/README.md) — Setting up a distributed Spice cluster +- [`spice query` CLI](https://spiceai.org/docs/features/distributed-query#cli) — CLI command and REPL documentation diff --git a/async-queries/spicepod.yaml b/async-queries/spicepod.yaml new file mode 100644 index 00000000..1889b0f4 --- /dev/null +++ b/async-queries/spicepod.yaml @@ -0,0 +1,13 @@ +version: v1 +kind: Spicepod +name: async-queries + +runtime: + scheduler: + state_location: file://.data/scheduler-state + +datasets: + - from: s3://spiceai-public-datasets/hive_partitioned_data/ + name: data + params: + file_format: parquet