Commit e89360f

Merge pull request #1091 from spiceai/release/1.5.2
v1.5.2 Release Docs
2 parents b560153 + 24350d7 commit e89360f

44 files changed

Lines changed: 1542 additions & 386 deletions

website/babel.config.js

Lines changed: 0 additions & 3 deletions
This file was deleted.

website/blog/2025/amazon-s3-vectors-with-spice.mdx

Lines changed: 2 additions & 1 deletion
@@ -286,6 +286,7 @@ LIMIT 5;
 This is particularly useful for read-heavy query workloads where hitting the main database adds latency. By storing the most commonly needed fields along with the vector, Spice’s vector search behaves like an **index-only query** (similar to covering indexes in relational databases). You trade a bit of extra storage in S3 (duplicating some fields, but still managed by Spice) for faster queries that bypass the heavier join.

 This extends to `WHERE` conditions on non-filterable columns, or filter predicate unsupported by S3 vectors. Spice's execution engine can apply these filters, still avoiding any expensive JOIN on the underlying table.
+
 ```sql
 SELECT review_id, rating, customer_id, body, score
 FROM vector_query_results
@@ -302,7 +303,7 @@ Many real-world search applications go beyond a single-vector similarity lookup.

 - **Multiple vector fields or multi-modal search:** You might have vectors for different aspects of data (e.g. an e-commerce product could have embeddings for both its description and the product's image. Or a document has both a title and body that should be searchable individually and together) that you may want to search across and combine results. Spice lets you do vector search on multiple columns easily, and you can weight the importance of each. For instance, you might boost matches in the title higher than matches in the body.

-- **Vector and full-text search:** Similar to vector search, columns can have text indexes [defined](https://spiceai.org/docs/features/search/full-text) that enable full-text BM25 search. Text search can then be performed in SQL with a similar `text_search` [UDTF](https://spiceai.org/docs/features/search/full-text#sql-udtf). The `/v1/search` HTTP API will perform a **hybrid search** across both full-text and vector indexes, merging results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. The example below demonstrates how RRF can be implemented in SQL by combining ranks.
+- **Vector and full-text search:** Similar to vector search, columns can have text indexes [defined](https://spiceai.org/docs/features/search/full-text) that enable full-text BM25 search. Text search can then be performed in SQL with a similar `text_search` [UDTF](https://spiceai.org/docs/features/search/full-text#searching-with-sql). The `/v1/search` HTTP API will perform a **hybrid search** across both full-text and vector indexes, merging results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. The example below demonstrates how RRF can be implemented in SQL by combining ranks.

 - **Hybrid vector + keyword search:** Sometimes you want to ensure certain keywords are present while also using semantic similarity. Spice supports **hybrid search** natively – its default `/v1/search` HTTP API actually performs both full-text BM25 search and vector search, then merges results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. In Spice’s SQL, you can also call text_search(dataset, query) for traditional full-text search, and combine it with vector_search results. The example below demonstrates how RRF can be implemented in SQL by combining ranks.
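
Note on the hunk above: the bullets describe merging full-text and vector results with Reciprocal Rank Fusion (RRF). The following is a minimal Python sketch of rank fusion, for illustration only. The function name `rrf_merge`, the sample ids, and the constant `k = 60` (a commonly used default) are assumptions, not taken from Spice's implementation.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of ids.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit. k = 60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids from a keyword search and a vector search.
text_hits = ["r1", "r2", "r3"]
vector_hits = ["r3", "r1", "r4"]
fused = rrf_merge([text_hits, vector_hits])
```

Ids that rank well in both lists (`r1`, `r3`) end up ahead of ids found by only one search, which is the balancing effect the bullets describe.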

website/docs/components/catalogs/iceberg.md

Lines changed: 39 additions & 20 deletions
@@ -20,7 +20,7 @@ catalogs:
   - from: iceberg:https://iceberg-catalog-host.com/v1/namespaces/my_catalog
     name: ice # tables from this catalog will be available in the "ice" catalog in Spice
     include:
-      - "*.my_table_name" # include only the "my_table_name" tables
+      - '*.my_table_name' # include only the "my_table_name" tables
     params:
       iceberg_token: ${secrets:iceberg_token} # Optional. Bearer token value to use for Authorization header.
       iceberg_oauth2_credential: ${secrets:client_id}:${secrets:client_secret} # Optional. Credential to use for OAuth2 client credential flow when initializing the catalog. Separated by a colon as <client_id>:<client_secret>.
@@ -90,25 +90,44 @@ Use the `include` field to specify which tables to include from the catalog. The

 ## `params`

-The following parameters are supported for configuring the connection to the Iceberg catalog/tables:
-
-| Parameter Name | Definition |
-|---------------|------------|
-| `iceberg_token` | Bearer token value to use for Authorization header. |
-| `iceberg_oauth2_credential` | Credential to use for OAuth2 client credential flow when initializing the catalog. Format: `<client_id>:<client_secret>` |
-| `iceberg_oauth2_scope` | Scope to use for OAuth2 client credential flow when initializing the catalog. Default: `catalog` |
-| `iceberg_oauth2_server_url` | URL of the OAuth2 server tokens endpoint for the client credential flow. |
-| `iceberg_s3_endpoint` | S3-compatible endpoint where the Iceberg tables are stored. |
-| `iceberg_s3_region` | Region of the S3-compatible endpoint. |
-| `iceberg_s3_access_key_id` | Access key ID for the S3-compatible endpoint. |
-| `iceberg_s3_secret_access_key` | Secret access key for the S3-compatible endpoint. |
-| `iceberg_s3_session_token` | Session token for the S3-compatible endpoint. |
-| `iceberg_s3_role_arn` | ARN of the IAM role to assume when accessing the S3-compatible endpoint. |
-| `iceberg_s3_role_session_name` | Session name to use when assuming the IAM role. |
-| `iceberg_s3_connect_timeout` | Connection timeout in seconds for the S3-compatible endpoint. Default: `60` |
-| `iceberg_sigv4_enabled` | Enable SigV4 (AWS Signature Version 4) authentication when connecting to the catalog. Automatically enabled if the URL in `from` is an AWS Glue catalog. Default: `false` |
-| `iceberg_signing_region` | Region to use for SigV4 authentication. Extracted from the URL in `from` if not specified. |
-| `iceberg_signing_name` | Service name to use for SigV4 authentication. Default: `glue`. |
+The following parameters are supported for configuring the connection to the Iceberg catalog, file, or S3 storage:
+
+| Parameter Name | Description |
+| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `iceberg_token` | Bearer token value to use for Authorization header. |
+| `iceberg_oauth2_credential` | Credential to use for OAuth2 client credential flow when initializing the catalog. Separated by a colon as `<client_id>:<client_secret>`. |
+| `iceberg_oauth2_token_url` | The URL to use for OAuth2 token endpoint. |
+| `iceberg_oauth2_scope` | The scope to use for OAuth2 token endpoint (default: `catalog`). |
+| `iceberg_oauth2_server_url` | URL of the OAuth2 server tokens endpoint. |
+| `iceberg_sigv4_enabled` | Enable SigV4 authentication for the catalog (for connecting to AWS Glue). |
+| `iceberg_signing_region` | The region to use when signing the request for SigV4. Defaults to the region in the catalog URL if available. |
+| `iceberg_signing_name` | The name to use when signing the request for SigV4. Default: `glue`. |
+| `iceberg_s3_endpoint` | Configure an alternative endpoint for the S3 service. This can be any S3-compatible object storage service (e.g., Minio, R2). |
+| `iceberg_s3_access_key_id` | The AWS access key ID to use for S3 storage. |
+| `iceberg_s3_secret_access_key` | The AWS secret access key to use for S3 storage. |
+| `iceberg_s3_session_token` | Configure the static session token used for S3 storage. |
+| `iceberg_s3_region` | The AWS S3 region to use. |
+| `iceberg_s3_role_session_name` | An optional identifier for the assumed role session for auditing purposes. |
+| `iceberg_s3_role_arn` | The Amazon Resource Name (ARN) of the role to assume. If provided instead of iceberg_s3_access_key_id and iceberg_s3_secret_access_key, temporary credentials will be fetched by assuming this role. |
+| `iceberg_s3_connect_timeout` | Configure socket connection timeout, in seconds (default: `60`). |
+
+The Iceberg Catalog Connector supports both REST Catalog and Hadoop Catalog endpoints. Hadoop Catalog endpoints use `file://`, `s3://`, or `s3a://` URLs to specify the warehouse path for the catalog.
+
+Example using Hadoop Catalog with a local warehouse:
+
+```yaml
+catalogs:
+  - from: iceberg:file:///tmp/hadoop_warehouse/
+    name: local_hadoop
+```
+
+Example using Hadoop Catalog with S3:
+
+```yaml
+catalogs:
+  - from: iceberg:s3a://my-bucket/hadoop_warehouse/
+    name: s3_hadoop
+```

 ## Cookbook

website/docs/components/data-connectors/github.md

Lines changed: 30 additions & 2 deletions
@@ -2,7 +2,7 @@
 title: 'GitHub Data Connector'
 sidebar_label: 'GitHub Data Connector'
 description: 'GitHub Data Connector Documentation'
-tags: ['data-connector', 'github', 'sql', 'api', 'integration']
+tags: ['data-connectors', 'github', 'sql', 'api', 'integration']
 ---

 The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying `github` as the selector in the `from` value for the dataset.
@@ -24,7 +24,6 @@ The `from` field specifies the GitHub resource to query. It supports the followi
 | `github:github.com/<owner>/<repo>/stargazers` | Query stargazers from a repository |
 | `github:github.com/<organization>/members` | Query members from an organization |

-
 ### `name`

 The dataset name. This will be used as the table name within Spice. The dataset name cannot be a [reserved keyword](/docs/reference/spicepod/keywords.md).
@@ -76,6 +75,35 @@ With GitHub App Installation authentication, the connector's functionality depen
 | `owner` | Required. Specifies the owner of the GitHub repository. |
 | `repo` | Required. Specifies the name of the GitHub repository. |

+## Advanced Configuration
+
+When using multiple GitHub datasets sharing the same GitHub token or GitHub app credentials, it is possible to exceed GitHub's primary and secondary rate limits. To mitigate this, use the `github_max_concurrent_connections` runtime parameter. This connections limit applies per GitHub token and per GitHub app installation, following GitHub's rate limit policy.
+
+Example Configuration:
+
+```yaml
+# ... other configuration ...
+runtime:
+  params:
+    github_max_concurrent_connections: 5 # Defaults to 10
+
+datasets:
+  - from: github:github.com/spiceai/spiceai/files/v0.17.2-beta
+    name: spiceai.files
+    params:
+      github_token: ${secrets:GITHUB_TOKEN}
+      include: '**/*.txt'
+    acceleration:
+      enabled: true
+  - from: github:github.com/<owner>/<repo>/issues
+    name: spiceai.issues
+    params:
+      github_token: ${secrets:GITHUB_TOKEN}
+    acceleration:
+      enabled: true
+# ... other configuration ...
+```
+
 ## Filter Push Down

 GitHub queries support a `github_query_mode` parameter, which can be set to either `auto` or `search` for the following types:
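
Note on the "Advanced Configuration" addition above: the new text says the connection limit applies per GitHub token and per GitHub app installation. One way to picture a per-credential cap is a semaphore keyed by credential, sketched below in Python. This is an illustrative assumption about the general technique, not Spice's implementation; `PerCredentialLimiter`, `fake_request`, and the `"token-a"` key are hypothetical names.

```python
import asyncio


class PerCredentialLimiter:
    """Caps in-flight requests separately for each credential key."""

    def __init__(self, max_concurrent=10):
        self._max = max_concurrent
        self._semaphores = {}

    def _semaphore_for(self, key):
        # Lazily create one semaphore per credential (token or app installation).
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._max)
        return self._semaphores[key]

    async def run(self, key, request_fn):
        # Waits here once `max_concurrent` requests for this key are in flight.
        async with self._semaphore_for(key):
            return await request_fn()


async def demo():
    limiter = PerCredentialLimiter(max_concurrent=5)
    in_flight = 0
    peak = 0

    async def fake_request():
        # Stand-in for an API call; records how many calls overlap.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)
        in_flight -= 1

    await asyncio.gather(*(limiter.run("token-a", fake_request) for _ in range(20)))
    return peak


peak = asyncio.run(demo())
```

With the cap set to 5, `peak` never exceeds 5 even though 20 requests are queued. Two datasets that share a token would share one semaphore, while datasets with different tokens would be limited independently.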
