website/docs/components/data-accelerators/duckdb.md
1 addition & 0 deletions
@@ -42,6 +42,7 @@ DuckDB acceleration supports the following optional parameters under `accelerati
42
42
- `on_refresh_recompute_statistics` (string, default: `enabled`): Triggers automatic `ANALYZE` execution after data refreshes. This keeps DuckDB optimizer statistics up-to-date for efficient query plans and performance. Set to `disabled` to turn automatic statistics recomputation off. See [DuckDB ANALYZE statement documentation](https://duckdb.org/docs/stable/sql/statements/analyze).
43
43
- `partition_mode` (string, default: `files`): Controls how partitioned data is stored. Can only be used with `partition_by`. Set to `tables` to store partitions as separate tables within a single DuckDB database, which improves resource usage by sharing a single connection pool across all partitions. The default `files` mode creates a separate database file per partition, each with its own connection pool, and generally gives faster query performance.
44
44
- `duckdb_partitioned_write_flush_threshold` (integer, default: `122880`): The number of rows buffered per partition before flushing data to acceleration storage. Only applicable when using `partition_mode: tables`. Using a larger value can improve write performance but requires more memory.
45
+
- `optimizer_duckdb_aggregate_pushdown` (string, default: `disabled`): Enables aggregate pushdown optimization to execute supported aggregate queries directly in DuckDB. Set to `enabled` to push down aggregations for improved query performance on supported functions like `count`, `sum`, `avg`, `min`, and `max`. Requires `query_federation` to be `disabled`.
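As a rough illustration, the new parameter might be set alongside the other DuckDB options as sketched here. The dataset source and name are placeholders, and nesting the options under `acceleration.params` is an assumption based on the wording of this section. The `query_federation` setting that must be `disabled` is not shown because this section does not say where it is configured.

```yaml
datasets:
  - from: dynamodb:my_table # placeholder source
    name: orders # placeholder dataset name
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      params: # assumed location for DuckDB acceleration parameters
        # Push supported aggregates (count, sum, avg, min, max) down to DuckDB
        optimizer_duckdb_aggregate_pushdown: enabled
        # Default value, shown only for clarity
        on_refresh_recompute_statistics: enabled
```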
45
46
46
47
Refer to the [datasets configuration reference](/docs/reference/spicepod/datasets.md#acceleration) for additional supported fields.
website/docs/components/data-connectors/dynamodb.md
105 additions & 0 deletions
@@ -6,6 +6,7 @@ tags:
6
6
- data-connectors
7
7
- dynamodb
8
8
- nosql
9
+
- component-metrics
9
10
---
10
11
11
12
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
The DynamoDB Data Connector integrates with [DynamoDB Streams](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) to enable real-time streaming of table changes. This feature supports both initial table bootstrapping and continuous change data capture (CDC), allowing Spice to automatically detect and stream inserts, updates, and deletes from DynamoDB tables.

:::warning

Streaming with the DynamoDB connector **requires** [acceleration](/docs/components/data-accelerators/index.md) with `refresh_mode: changes` and a defined `on_conflict` configuration.

:::

### Basic Configuration

To enable streaming from DynamoDB, enable acceleration and set `refresh_mode` to `changes` in your dataset configuration.

You must also configure the `on_conflict` parameter to specify how the connector handles updates to existing records. The keys defined in `on_conflict` must match your DynamoDB table's partition key and range key (if your table has one).

```yaml
datasets:
  - from: dynamodb:my_table
    name: orders_stream
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert
```

### Configuration Parameters

#### Dataset Parameters

- **`ready_lag`** - Defines the maximum lag threshold before the dataset is reported as "Ready". Once the stream lag falls below this value, queries can be executed against the dataset. By default, the dataset reports ready immediately after bootstrap completes.

- **`scan_interval`** - Controls how frequently the DynamoDB stream is polled for new records. Lower values deliver updates closer to real time but increase API calls. Higher values reduce API usage but may add latency.

#### Acceleration Parameters

- **`on_conflict`** - Specifies the conflict resolution strategy when streamed changes match existing records. The keys in the tuple should correspond to your DynamoDB table's partition key and range key (if applicable). The `upsert` action inserts new records or updates existing ones based on these key columns.

  **Examples:**
  - Single partition key: `id: upsert`
  - Partition key + range key: `(partition_key, sort_key): upsert`

- **`snapshots_trigger_threshold`** - Determines how frequently snapshots are created during streaming. A value of `5` means a snapshot is created every 5 batch updates. Snapshots enable faster recovery and better query performance but consume additional storage.
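
For a table that has only a partition key, the conflict key reduces to a single column. A minimal sketch, assuming a hypothetical table `my_table` whose partition key is `id`, and assuming `snapshots_trigger_threshold` is nested under `acceleration.params`:

```yaml
datasets:
  - from: dynamodb:my_table # hypothetical table
    name: orders_stream
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        id: upsert # single partition key, no range key
      params:
        snapshots_trigger_threshold: 5 # snapshot every 5 batch updates
```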

### Metrics

The following [Component Metrics](../../features/observability/component_metrics.md) are provided for monitoring streaming performance and health:

| Metric | Type | Description |
| --- | --- | --- |
| `shards_active` | Gauge | Current number of active shards in the stream |
| `records_consumed_total` | Counter | Total number of records consumed from the stream |
| `lag_ms` | Gauge | Current lag in milliseconds between stream watermark and the current time |
| `errors_transient_total` | Counter | Total number of transient errors encountered while polling from the stream |

These metrics are not enabled by default. Enable them by listing them under the dataset's `metrics` parameter:

```yaml
datasets:
  - from: dynamodb:my_table
    name: events
    metrics:
      - name: shards_active
      - name: lag_ms
```

You can find an example dashboard for DynamoDB Streams in [monitoring/grafana-dashboard.json](https://github.com/spiceai/spiceai/blob/trunk/monitoring/grafana-dashboard.json).

## Advanced Configuration

Production workloads that need fine-grained control over streaming behavior and performance can combine the dataset, acceleration, and metrics parameters:

```yaml
datasets:
  - from: dynamodb:my_table
    name: orders_stream
    params:
      ready_lag: 1s # Dataset reports as Ready when lag is below 1 second
      scan_interval: 100ms # Poll for new stream records every 100 milliseconds
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: changes
      on_conflict:
        (id, version): upsert
      params:
        snapshots_trigger_threshold: 5 # Create snapshot every 5 batch updates
    metrics:
      - name: shards_active
        enabled: true
      - name: records_consumed_total
        enabled: true
      - name: lag_ms
        enabled: true
      - name: errors_transient_total
        enabled: true
```

## Cookbooks

- A cookbook recipe to configure DynamoDB as a data connector in Spice. [DynamoDB Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb#readme)
- A cookbook recipe to configure DynamoDB Streams as a data connector in Spice. [DynamoDB Streams Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb/streams#readme)
| `max_size` | Yes | `128MiB` | Maximum cache size. Defaults to `128MiB`. |
52
-
| `eviction_policy` | Yes | `lru` | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value. |
53
-
| `item_ttl` | Yes | `1s` | Cache entry expiration duration (Time to Live). Defaults to 1 second. |
54
-
| `hashing_algorithm` | Yes | `siphash` | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `siphash`. Supports `siphash` or `ahash`. |
48
+
| Parameter name | Optional | Default | Description |
| `max_size` | Yes | `128MiB` | Maximum cache size. Defaults to `128MiB`. |
52
+
| `eviction_policy` | Yes | `lru` | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value. |
53
+
| `item_ttl` | Yes | `1s` | Cache entry expiration duration (Time to Live). Defaults to 1 second. |
54
+
| `hashing_algorithm` | Yes | `xxh3` | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `xxh3`. Supports `xxh3`, `ahash`, `siphash`, `blake3`, `xxh32`, `xxh64`, or `xxh128`. |
55
55
56
56
## `caching.sql_results` Parameters
57
57
58
58
In addition to the common caching parameters, `sql_results` also supports additional parameters:
59
59
60
-
| Parameter name | Optional | Default | Description |
| `cache_key_type` | Yes | `plan` | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string. |
63
-
| `encoding` | Yes | `none` | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`. |
60
+
| Parameter name | Optional | Default | Description |
| `cache_key_type` | Yes | `plan` | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string. |
63
+
| `encoding` | Yes | `none` | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`. |
64
64
| `stale_while_revalidate_ttl` | Yes | `0s` | Duration to serve stale cache entries while revalidating in the background. When set to a non-zero value, expired cache entries continue to be served while a background refresh occurs. Defaults to `0s` (disabled). |
65
65
66
66
### Choosing a `cache_key_type`
@@ -74,10 +74,13 @@ Use `sql` for the lowest latency with identical queries that do not include dyna
74
74
75
75
The hashing algorithm determines how cache keys are hashed before being stored, which affects both lookup speed and protection against potential denial-of-service (DoS) attacks.
76
76
77
-
- **`siphash` (Default):** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
77
+
- **`xxh3` (Default):** Uses the [XXH3](https://cyan4973.github.io/xxHash/) algorithm for hashing the cache keys. XXH3 is a fast, non-cryptographic hash algorithm that provides high performance and good distribution. It is suitable for scenarios where speed is critical and cryptographic security is not required.
78
+
- **`siphash`:** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
78
79
- **`ahash`:** Uses the [AHash](https://github.com/tkaitchuck/ahash) algorithm for hashing the cache keys. The AHash algorithm is a [high quality](https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md#Quality) hashing algorithm, and has claimed resistance against hashing DoS attacks. AHash has higher performance than SipHash1-3, especially when used with `cache_key_type: plan`.
80
+
- **`blake3`:** Uses the [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) cryptographic hash function. BLAKE3 is a fast, parallelizable hash function that provides cryptographic security while maintaining high performance. It is suitable for scenarios requiring both speed and cryptographic guarantees.
81
+
- **`xxh32`, `xxh64`, `xxh128`:** Variants of the XXH hashing algorithm with different output sizes. These algorithms offer a balance between speed and collision resistance, with larger hash sizes providing better collision resistance at the cost of performance.
79
82
80
-
Consider using `ahash` if maximum performance is most important, or where hashing DoS attacks are unlikely or a low risk. More information on the security mechanisms of AHash are available [in the AHash documentation](https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks).
83
+
Use `xxh3` (the default) for its superior speed in most scenarios. Use `ahash`, `xxh64` or `xxh128` for reduced collision probability when caching a large number of queries. Use `blake3` when cryptographic security is required. Use `siphash` when protection against hash flooding attacks is a priority.
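
A sketch of selecting a non-default hashing algorithm for the SQL results cache. The parameter names and values come from the tables in this section; placing `caching` under a top-level `runtime` key and the `enabled` flag are assumptions that this section does not show.

```yaml
runtime: # assumed parent key
  caching:
    sql_results:
      enabled: true # assumed flag
      max_size: 128MiB
      item_ttl: 1s
      # Prioritize protection against hash flooding over raw speed
      hashing_algorithm: siphash
```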
81
84
82
85
### Choosing an `encoding`
83
86
@@ -192,6 +195,16 @@ With this configuration:
192
195
193
196
This approach is particularly useful for queries that take significant time to execute, providing a better user experience by reducing perceived latency while keeping data reasonably fresh.
194
197
198
+
:::warning[Conflict with Caching Accelerator SWR]
199
+
When using a dataset with `refresh_mode: caching`, you cannot configure both the results cache's `stale_while_revalidate_ttl` and the caching accelerator's `caching_stale_while_revalidate_ttl` for the same dataset. These parameters control similar behavior at different layers.
@@ -125,6 +130,12 @@ Appending modified files is only supported for datasets that support setting the
125
130
126
131
Datasets configured with acceleration `refresh_mode: changes` require a [Change Data Capture (CDC)](/docs/features/cdc/index.md) supported data connector. Initial CDC support in Spice is provided by the [Debezium data connector](/docs/components/data-connectors/debezium.md).
127
132
133
+
### Caching
134
+
135
+
The `caching` refresh mode is designed for HTTP-based datasets where request metadata acts as cache keys. This mode is particularly useful for API responses that return multiple rows for a single request, such as search results or dynamic content endpoints.
136
+
137
+
See [Caching Mode](/docs/features/data-acceleration/refresh-modes/caching.md) for detailed documentation and examples.
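
A minimal sketch of what enabling this mode might look like. The HTTP endpoint and dataset name are hypothetical, and any mode-specific parameters are omitted, so treat the linked Caching Mode page as the authoritative reference.

```yaml
datasets:
  - from: https://api.example.com/search # hypothetical HTTP endpoint
    name: search_results # hypothetical dataset name
    acceleration:
      enabled: true
      refresh_mode: caching
```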
138
+
128
139
## Ready State
129
140
130
141
| | |
@@ -425,7 +436,7 @@ datasets:
425
436
acceleration:
426
437
enabled: true
427
438
refresh_mode: full
428
-
refresh_cron: "0 12 * * 1-5"
439
+
refresh_cron: '0 12 * * 1-5'
429
440
```
430
441
431
442
This configuration will refresh `taxi_trips` data at midday every weekday. For more information about cron schedules, see the [cron schedule reference](/docs/reference/cron.md).