website/blog/2025/amazon-s3-vectors-with-spice.mdx (2 additions & 1 deletion)
@@ -286,6 +286,7 @@ LIMIT 5;
This is particularly useful for read-heavy query workloads where hitting the main database adds latency. By storing the most commonly needed fields along with the vector, Spice’s vector search behaves like an **index-only query** (similar to covering indexes in relational databases). You trade a bit of extra storage in S3 (duplicating some fields, but still managed by Spice) for faster queries that bypass the heavier join.
This extends to `WHERE` conditions on non-filterable columns and to filter predicates that S3 Vectors does not support. Spice's execution engine can apply these filters itself, still avoiding an expensive JOIN on the underlying table.
@@ -302,7 +303,7 @@ Many real-world search applications go beyond a single-vector similarity lookup.
- **Multiple vector fields or multi-modal search:** You might have vectors for different aspects of your data that you want to search across and combine. For example, an e-commerce product could have embeddings for both its description and its image, or a document could have a title and a body that should be searchable individually and together. Spice lets you run vector search on multiple columns and weight the importance of each. For instance, you might boost matches in the title above matches in the body.
-- **Vector and full-text search:** Similar to vector search, columns can have text indexes [defined](https://spiceai.org/docs/features/search/full-text) that enable full-text BM25 search. Text search can then be performed in SQL with a similar `text_search` [UDTF](https://spiceai.org/docs/features/search/full-text#sql-udtf). The `/v1/search` HTTP API will perform a **hybrid search** across both full-text and vector indexes, merging results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. The example below demonstrates how RRF can be implemented in SQL by combining ranks.
+- **Vector and full-text search:** Similar to vector search, columns can have text indexes [defined](https://spiceai.org/docs/features/search/full-text) that enable full-text BM25 search. Text search can then be performed in SQL with a similar `text_search` [UDTF](https://spiceai.org/docs/features/search/full-text#searching-with-sql). The `/v1/search` HTTP API will perform a **hybrid search** across both full-text and vector indexes, merging results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. The example below demonstrates how RRF can be implemented in SQL by combining ranks.
- **Hybrid vector + keyword search:** Sometimes you want to ensure certain keywords are present while also using semantic similarity. Spice supports **hybrid search** natively – its default `/v1/search` HTTP API actually performs both full-text BM25 search and vector search, then merges results using Reciprocal Rank Fusion (RRF). This means you get a balanced result set that accounts for direct keyword matches as well as semantic similarity. In Spice’s SQL, you can also call text_search(dataset, query) for traditional full-text search, and combine it with vector_search results. The example below demonstrates how RRF can be implemented in SQL by combining ranks.
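As a language-neutral illustration of the fusion step described above, here is a minimal Python sketch of Reciprocal Rank Fusion. It is not Spice's implementation: the function name `reciprocal_rank_fusion`, the sample document ids, and the smoothing constant `k = 60` (a commonly used default) are all assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids into a single list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. `k` is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector search and a full-text search.
vector_hits = ["doc_a", "doc_b", "doc_c"]
text_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)
```

Documents that appear in both lists accumulate two contributions, so they rise above documents that rank well in only one list.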
name: ice # tables from this catalog will be available in the "ice" catalog in Spice
include:
-- "*.my_table_name" # include only the "my_table_name" tables
+- '*.my_table_name' # include only the "my_table_name" tables
params:
iceberg_token: ${secrets:iceberg_token} # Optional. Bearer token value to use for Authorization header.
iceberg_oauth2_credential: ${secrets:client_id}:${secrets:client_secret} # Optional. Credential to use for OAuth2 client credential flow when initializing the catalog. Separated by a colon as <client_id>:<client_secret>.
@@ -90,25 +90,44 @@ Use the `include` field to specify which tables to include from the catalog. The
## `params`
-The following parameters are supported for configuring the connection to the Iceberg catalog/tables:
-
-| Parameter Name | Definition |
-|---------------|------------|
-| `iceberg_token` | Bearer token value to use for Authorization header. |
-| `iceberg_oauth2_credential` | Credential to use for OAuth2 client credential flow when initializing the catalog. Format: `<client_id>:<client_secret>`|
-| `iceberg_oauth2_scope` | Scope to use for OAuth2 client credential flow when initializing the catalog. Default: `catalog`|
-| `iceberg_oauth2_server_url` | URL of the OAuth2 server tokens endpoint for the client credential flow. |
-| `iceberg_s3_endpoint` | S3-compatible endpoint where the Iceberg tables are stored. |
-| `iceberg_s3_region` | Region of the S3-compatible endpoint. |
-| `iceberg_s3_access_key_id` | Access key ID for the S3-compatible endpoint. |
-| `iceberg_s3_secret_access_key` | Secret access key for the S3-compatible endpoint. |
-| `iceberg_s3_session_token` | Session token for the S3-compatible endpoint. |
-| `iceberg_s3_role_arn` | ARN of the IAM role to assume when accessing the S3-compatible endpoint. |
-| `iceberg_s3_role_session_name` | Session name to use when assuming the IAM role. |
-| `iceberg_s3_connect_timeout` | Connection timeout in seconds for the S3-compatible endpoint. Default: `60`|
-| `iceberg_sigv4_enabled` | Enable SigV4 (AWS Signature Version 4) authentication when connecting to the catalog. Automatically enabled if the URL in `from` is an AWS Glue catalog. Default: `false`|
-| `iceberg_signing_region` | Region to use for SigV4 authentication. Extracted from the URL in `from` if not specified. |
-| `iceberg_signing_name` | Service name to use for SigV4 authentication. Default: `glue`. |
+The following parameters are supported for configuring the connection to the Iceberg catalog, file, or S3 storage:
+| `iceberg_token` | Bearer token value to use for Authorization header. |
+| `iceberg_oauth2_credential` | Credential to use for OAuth2 client credential flow when initializing the catalog. Separated by a colon as `<client_id>:<client_secret>`. |
+| `iceberg_oauth2_token_url` | The URL to use for OAuth2 token endpoint. |
+| `iceberg_oauth2_scope` | The scope to use for OAuth2 token endpoint (default: `catalog`). |
+| `iceberg_oauth2_server_url` | URL of the OAuth2 server tokens endpoint. |
+| `iceberg_sigv4_enabled` | Enable SigV4 authentication for the catalog (for connecting to AWS Glue). |
+| `iceberg_signing_region` | The region to use when signing the request for SigV4. Defaults to the region in the catalog URL if available. |
+| `iceberg_signing_name` | The name to use when signing the request for SigV4. Default: `glue`. |
+| `iceberg_s3_endpoint` | Configure an alternative endpoint for the S3 service. This can be any S3-compatible object storage service (e.g., Minio, R2). |
+| `iceberg_s3_access_key_id` | The AWS access key ID to use for S3 storage. |
+| `iceberg_s3_secret_access_key` | The AWS secret access key to use for S3 storage. |
+| `iceberg_s3_session_token` | Configure the static session token used for S3 storage. |
+| `iceberg_s3_region` | The AWS S3 region to use. |
+| `iceberg_s3_role_session_name` | An optional identifier for the assumed role session for auditing purposes. |
+| `iceberg_s3_role_arn` | The Amazon Resource Name (ARN) of the role to assume. If provided instead of iceberg_s3_access_key_id and iceberg_s3_secret_access_key, temporary credentials will be fetched by assuming this role. |
+The Iceberg Catalog Connector supports both REST Catalog and Hadoop Catalog endpoints. Hadoop Catalog endpoints use `file://`, `s3://`, or `s3a://` URLs to specify the warehouse path for the catalog.
+
+Example using Hadoop Catalog with a local warehouse:
The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying `github` as the selector in the `from` value for the dataset.
@@ -24,7 +24,6 @@ The `from` field specifies the GitHub resource to query. It supports the followi
|`github:github.com/<owner>/<repo>/stargazers`| Query stargazers from a repository |
|`github:github.com/<organization>/members`| Query members from an organization |

-
### `name`
The dataset name. This will be used as the table name within Spice. The dataset name cannot be a [reserved keyword](/docs/reference/spicepod/keywords.md).
@@ -76,6 +75,35 @@ With GitHub App Installation authentication, the connector's functionality depen
|`owner`| Required. Specifies the owner of the GitHub repository. |
|`repo`| Required. Specifies the name of the GitHub repository. |

+## Advanced Configuration
+
+When using multiple GitHub datasets sharing the same GitHub token or GitHub app credentials, it is possible to exceed GitHub's primary and secondary rate limits. To mitigate this, use the `github_max_concurrent_connections` runtime parameter. This connections limit applies per GitHub token and per GitHub app installation, following GitHub's rate limit policy.
+
+Example Configuration:
+
+```yaml
+# ... other configuration ...
+runtime:
+  params:
+    github_max_concurrent_connections: 5 # Defaults to 10