spiceai
diff --git a/website/docs/components/data-accelerators/duckdb.md b/website/docs/components/data-accelerators/duckdb.md
diff --git a/website/docs/components/data-connectors/dynamodb.md b/website/docs/components/data-connectors/dynamodb.md
diff --git a/website/docs/features/caching/index.md b/website/docs/features/caching/index.md
diff --git a/website/docs/features/data-acceleration/data-refresh.md b/website/docs/features/data-acceleration/data-refresh.md
@@ -42,6 +42,7 @@ DuckDB acceleration supports the following optional parameters under `accelerati
 - `on_refresh_recompute_statistics` (string, default: `enabled`): Triggers automatic `ANALYZE` execution after data refreshes. This keeps DuckDB optimizer statistics up-to-date for efficient query plans and performance. Set to `disabled` to turn automatic statistics recomputation off. See [DuckDB ANALYZE statement documentation](https://duckdb.org/docs/stable/sql/statements/analyze).
 - `partition_mode` (string, default: `files`): Controls how partitioned data is stored. Can only be used with `partition_by`. Set to `tables` to store partitions as separate tables within a single DuckDB database, improving resource usage through single shared connection pool for all partitions. Default `files` mode creates separate database files per partition with individual connection pools and generally faster query performance.
 - `duckdb_partitioned_write_flush_threshold` (integer, default: `122880`): The number of rows buffered per partition before flushing data to acceleration storage. Only applicable when using `partition_mode: tables`. Using a larger value can improve write performance but requires more memory.
+- `optimizer_duckdb_aggregate_pushdown` (string, default: `disabled`): Enables aggregate pushdown optimization to execute supported aggregate queries directly in DuckDB. Set to `enabled` to push down aggregations for improved query performance on supported functions like `count`, `sum`, `avg`, `min`, and `max`. Requires `query_federation` to be `disabled`.
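+
+A minimal sketch of enabling aggregate pushdown, assuming the parameter sits under `acceleration.params` like the other DuckDB parameters in this list. The `postgres:orders` source and the `orders` dataset name are hypothetical, and `query_federation` must also be disabled as noted above (that setting is not shown here):
+
+```yaml
+datasets:
+  - from: postgres:orders # hypothetical source, for illustration only
+    name: orders
+    acceleration:
+      enabled: true
+      engine: duckdb
+      params:
+        # Assumed location: alongside the other DuckDB acceleration parameters
+        optimizer_duckdb_aggregate_pushdown: enabled
+```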
 
 Refer to the [datasets configuration reference](/docs/reference/spicepod/datasets.md#acceleration) for additional supported fields.
 
 
@@ -6,6 +6,7 @@ tags:
   - data-connectors
   - dynamodb
   - nosql
+  - component-metrics
 ---
 
 Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. This connector enables using DynamoDB tables as data sources for federated SQL queries in Spice.
@@ -428,3 +429,107 @@ describe users;
 ...
 +----------------+------------------+-------------+
 ```
+
+## Streams
+
+The DynamoDB Data Connector integrates with [DynamoDB Streams](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html) to enable real-time streaming of table changes. This feature supports both initial table bootstrapping and continuous change data capture (CDC), allowing Spice to automatically detect and stream inserts, updates, and deletes from DynamoDB tables.
+
+:::warning
+
+Using DynamoDB Streams **requires** [acceleration](/docs/components/data-accelerators/index.md) with `refresh_mode: changes` and a defined `on_conflict` configuration.
+
+:::
+
+### Basic Configuration
+
+To enable streaming from DynamoDB, enable acceleration and set the `refresh_mode` to `changes` in your dataset configuration.
+
+You also need to configure the `on_conflict` parameter to specify how the connector should handle updates to existing records. The keys defined in `on_conflict` must match your DynamoDB table's partition key and range key (if your table has one).
+```yaml
+datasets:
+  - from: dynamodb:my_table
+    name: orders_stream
+    acceleration:
+      enabled: true
+      engine: duckdb
+      mode: file
+      refresh_mode: changes
+      on_conflict:
+        (id, version): upsert
+```
+
+### Configuration Parameters
+
+#### Dataset Parameters
+
+- **`ready_lag`** - Defines the maximum lag threshold before the dataset is reported as "Ready". Once the stream lag falls below this value, queries can be executed against the dataset. Default behavior reports ready immediately after bootstrap completes.
+
+- **`scan_interval`** - Controls the polling frequency for checking new records in the DynamoDB stream. Lower values provide more real-time updates but increase API calls. Higher values reduce API usage but may introduce additional latency.
+
+#### Acceleration Parameters
+
+- **`on_conflict`** - Specifies the conflict resolution strategy when streaming changes that match existing records. The keys in the tuple should correspond to your DynamoDB table's partition key and range key (if applicable). The `upsert` action will insert new records or update existing ones based on these key columns.
+
+  **Examples:**
+   - Single partition key: `id: upsert`
+   - Partition key + range key: `(partition_key, sort_key): upsert`
+
+- **`snapshots_trigger_threshold`** - Determines how frequently snapshots are created during streaming. A value of `5` means a snapshot is created every 5 batch updates. Snapshots enable faster recovery and better query performance but consume additional storage.
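+
+For a table that has only a partition key, `on_conflict` takes the single-column form from the examples above. A minimal sketch, assuming a hypothetical table `my_table` whose partition key is `id`:
+
+```yaml
+datasets:
+  - from: dynamodb:my_table
+    name: orders_stream
+    acceleration:
+      enabled: true
+      engine: duckdb
+      mode: file
+      refresh_mode: changes
+      on_conflict:
+        id: upsert # `id` is assumed to be the table's partition key
+```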
+
+### Metrics
+
+The following [Component Metrics](../../features/observability/component_metrics.md) are provided for monitoring streaming performance and health:
+
+| Metric                   | Type    | Description                                                                 |
+|--------------------------|---------|-----------------------------------------------------------------------------|
+| `shards_active`          | Gauge   | Current number of active shards in the stream                               |
+| `records_consumed_total` | Counter | Total number of records consumed from the stream                            |
+| `lag_ms`                 | Gauge   | Current lag in milliseconds between stream watermark and the current time   |
+| `errors_transient_total` | Counter | Total number of transient errors encountered while polling from the stream  |
+
+These metrics are not enabled by default. Enable them by setting the `metrics` parameter on the dataset:
+```yaml
+datasets:
+  - from: dynamodb:my_table
+    name: orders_stream
+    metrics:
+      - name: shards_active
+        enabled: true
+      - name: lag_ms
+        enabled: true
+```
+
+You can find an example dashboard for DynamoDB Streams in [monitoring/grafana-dashboard.json](https://github.com/spiceai/spiceai/blob/trunk/monitoring/grafana-dashboard.json).
+
+### Advanced Configuration
+
+Production workloads that need fine-grained control over streaming behavior and performance can combine the dataset parameters, acceleration parameters, and metrics described above:
+```yaml
+datasets:
+  - from: dynamodb:my_table
+    name: orders_stream
+    params:
+      ready_lag: 1s # Dataset reports as Ready when lag is below 1 second
+      scan_interval: 100ms # Poll for new stream records every 100 milliseconds
+    acceleration:
+      enabled: true
+      engine: duckdb
+      mode: file
+      refresh_mode: changes
+      on_conflict:
+        (id, version): upsert
+      params:
+        snapshots_trigger_threshold: 5 # Create snapshot every 5 batch updates
+    metrics:
+      - name: shards_active
+        enabled: true
+      - name: records_consumed_total
+        enabled: true
+      - name: lag_ms
+        enabled: true
+      - name: errors_transient_total
+        enabled: true
+```
+
+## Cookbooks
+
+- A cookbook recipe to configure DynamoDB as a data connector in Spice. [DynamoDB Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb#readme)
+- A cookbook recipe to configure DynamoDB Streams as a data connector in Spice. [DynamoDB Streams Data Connector](https://github.com/spiceai/cookbook/tree/trunk/dynamodb/streams#readme)
@@ -45,22 +45,22 @@ runtime:
 
 Every cache type (`sql_results`, `search_results`, `embeddings`) supports the following parameters:
 
-| Parameter name               | Optional | Default   | Description                                                                                                                                                                                                           |
-| ---------------------------- | -------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `enabled`                    | Yes      | `true`    | Defaults to `true`.                                                                                                                                                                                                   |
-| `max_size`                   | Yes      | `128MiB`  | Maximum cache size. Defaults to `128MiB`.                                                                                                                                                                             |
-| `eviction_policy`            | Yes      | `lru`     | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value.                                                                                           |
-| `item_ttl`                   | Yes      | `1s`      | Cache entry expiration duration (Time to Live). Defaults to 1 second.                                                                                                                                                 |
-| `hashing_algorithm`          | Yes      | `siphash` | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `siphash`. Supports `siphash` or `ahash`.                                                                        |
+| Parameter name      | Optional | Default  | Description                                                                                                                                                                                |
+| ------------------- | -------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `enabled`           | Yes      | `true`   | Defaults to `true`.                                                                                                                                                                        |
+| `max_size`          | Yes      | `128MiB` | Maximum cache size. Defaults to `128MiB`.                                                                                                                                                  |
+| `eviction_policy`   | Yes      | `lru`    | Cache replacement policy when the cache reaches `max_size`. Defaults to `lru`, which is currently the only supported value.                                                                |
+| `item_ttl`          | Yes      | `1s`     | Cache entry expiration duration (Time to Live). Defaults to 1 second.                                                                                                                      |
+| `hashing_algorithm` | Yes      | `xxh3`   | Selects which hashing algorithm is used to hash the cache keys when storing the results. Defaults to `xxh3`. Supports `xxh3`, `ahash`, `siphash`, `blake3`, `xxh32`, `xxh64`, or `xxh128`. |
 
 ## `caching.sql_results` Parameters
 
 In addition to the common caching parameters, `sql_results` also supports additional parameters:
 
-| Parameter name   | Optional | Default | Description                                                                                                                                   |
-| ---------------- | -------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
-| `cache_key_type`             | Yes      | `plan`  | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string. |
-| `encoding`                   | Yes      | `none`  | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`.                                                      |
+| Parameter name               | Optional | Default | Description                                                                                                                                                                                                           |
+| ---------------------------- | -------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `cache_key_type`             | Yes      | `plan`  | Determines how cache keys are generated. Defaults to `plan`. `plan` uses the query's logical plan, while `sql` uses the raw SQL query string.                                                                         |
+| `encoding`                   | Yes      | `none`  | Compression algorithm for cached results. Defaults to `none`. Supports `none` or `zstd`.                                                                                                                              |
 | `stale_while_revalidate_ttl` | Yes      | `0s`    | Duration to serve stale cache entries while revalidating in the background. When set to a non-zero value, expired cache entries continue to be served while a background refresh occurs. Defaults to `0s` (disabled). |
 
 ### Choosing a `cache_key_type`
@@ -74,10 +74,13 @@ Use `sql` for the lowest latency with identical queries that do not include dyna
 
 The hashing algorithm determines how cache keys are hashed before being stored, impacting both lookup speed and protection against potential DOS attacks.
 
-- **`siphash` (Default):** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
+- **`xxh3` (Default):** Uses the [XXH3](https://cyan4973.github.io/xxHash/) algorithm for hashing the cache keys. XXH3 is a fast, non-cryptographic hash algorithm that provides high performance and good distribution. It is suitable for scenarios where speed is critical and cryptographic security is not required.
+- **`siphash`:** Uses the SipHash1-3 algorithm for hashing the cache keys, the [default hashing algorithm of Rust](https://github.com/rust-lang/rust/commit/db1b1919baba8be48d997d9f70a6a5df7e31612a). This hashing algorithm is a secure algorithm that implements verified protections against ["hash flooding"](https://v8.dev/blog/hash-flooding) denial of service (DoS) attacks. Reasonably performant, and provides a high level of security.
 - **`ahash`:** Uses the [AHash](https://github.com/tkaitchuck/ahash) algorithm for hashing the cache keys. The AHash algorithm is a [high quality](https://github.com/tkaitchuck/aHash/blob/master/compare/readme.md#Quality) hashing algorithm, and has claimed resistance against hashing DoS attacks. AHash has higher performance than SipHash1-3, especially when used with `cache_key_type: plan`.
+- **`blake3`:** Uses the [BLAKE3](https://github.com/BLAKE3-team/BLAKE3) cryptographic hash function. BLAKE3 is a fast, parallelizable hash function that provides cryptographic security while maintaining high performance. It is suitable for scenarios requiring both speed and cryptographic guarantees.
+- **`xxh32`, `xxh64`, `xxh128`:** Variants of the XXH hashing algorithm with different output sizes. These algorithms offer a balance between speed and collision resistance, with larger hash sizes providing better collision resistance at the cost of performance.
 
-Consider using `ahash` if maximum performance is most important, or where hashing DoS attacks are unlikely or a low risk. More information on the security mechanisms of AHash are available [in the AHash documentation](https://github.com/tkaitchuck/aHash/wiki/How-aHash-is-resists-DOS-attacks).
+Use `xxh3` (the default) for its superior speed in most scenarios. Use `ahash`, `xxh64` or `xxh128` for reduced collision probability when caching a large number of queries. Use `blake3` when cryptographic security is required. Use `siphash` when protection against hash flooding attacks is a priority.
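+
+As a sketch of overriding the default, the following selects `siphash` for the SQL results cache. The nesting under `runtime.caching.sql_results` follows the parameter paths used elsewhere on this page. The choice of `siphash` is illustrative only:
+
+```yaml
+runtime:
+  caching:
+    sql_results:
+      enabled: true
+      # Trade some hashing speed for hash-flooding protection
+      hashing_algorithm: siphash
+```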
 
 ### Choosing an `encoding`
 
@@ -192,6 +195,16 @@ With this configuration:
 
 This approach is particularly useful for queries that take significant time to execute, providing a better user experience by reducing perceived latency while keeping data reasonably fresh.
 
+:::warning[Conflict with Caching Accelerator SWR]
+When using a dataset with `refresh_mode: caching`, you cannot configure both the results cache's `stale_while_revalidate_ttl` and the caching accelerator's `caching_stale_while_revalidate_ttl` for the same dataset. These parameters control similar behavior at different layers.
+
+Choose one approach:
+
+- **Results cache SWR**: Configure `runtime.caching.sql_results.stale_while_revalidate_ttl` for SQL query results caching
+- **Caching accelerator SWR**: Configure `acceleration.params.caching_stale_while_revalidate_ttl` for [HTTP-based dataset caching](/docs/features/data-acceleration/refresh-modes/caching)
+
+:::
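+
+A minimal sketch of the second option, assuming a hypothetical HTTP source URL, dataset name, and TTL value. Only the accelerator-level parameter is set, and `runtime.caching.sql_results.stale_while_revalidate_ttl` is left at its default of `0s` (disabled):
+
+```yaml
+datasets:
+  - from: https://api.example.com/search # hypothetical HTTP source
+    name: search_results # hypothetical dataset name
+    acceleration:
+      enabled: true
+      refresh_mode: caching
+      params:
+        # Illustrative value; do not also set the results cache SWR for this dataset
+        caching_stale_while_revalidate_ttl: 30s
+```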
+
 ### HTTP/Flight API
 
 The following endpoints support the standard HTTP [`Cache-Control` header](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control):
 
@@ -21,13 +21,18 @@ Acceleration data can be refreshed (updated) by:
 
 ## Refresh Modes
 
-Spice supports three modes to refresh/update local data from a connected data source. `full` is the default mode.
+Spice supports four modes to refresh/update local data from a connected data source. `full` is the default mode.
 
 | Mode      | Description                                          | Example                                                          |
 | --------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
 | `full`    | Replace/overwrite the entire dataset on each refresh | A table of users                                                 |
 | `append`  | Append/add data to the dataset on each refresh       | Append-only, immutable datasets, such as time-series or log data |
 | `changes` | Apply incremental changes                            | Customer order lifecycle table                                   |
+| `caching` | Read-through caching for SQL queries                 | API search results or dynamic content endpoints                  |
+
+Learn more about each mode:
+
+- [Caching Mode](/docs/features/data-acceleration/refresh-modes/caching.md)
 
 Example:
 
@@ -125,6 +130,12 @@ Appending modified files is only supported for datasets that support setting the
 
 Datasets configured with acceleration `refresh_mode: changes` requires a [Change Data Capture (CDC)](/docs/features/cdc/index.md) supported data connector. Initial CDC support in Spice is supported by the [Debezium data connector](/docs/components/data-connectors/debezium.md).
 
+### Caching
+
+The `caching` refresh mode is designed for HTTP-based datasets where request metadata acts as cache keys. This mode is particularly useful for API responses that return multiple rows for a single request, such as search results or dynamic content endpoints.
+
+See [Caching Mode](/docs/features/data-acceleration/refresh-modes/caching.md) for detailed documentation and examples.
+
 ## Ready State
 
 |                             |           |
@@ -425,7 +436,7 @@ datasets:
     acceleration:
       enabled: true
       refresh_mode: full
-      refresh_cron: "0 12 * * 1-5"
+      refresh_cron: '0 12 * * 1-5'
 ```
 
 This configuration will refresh `taxi_trips` data at midday every weekday. For more information about cron schedules, see the [cron schedule reference](/docs/reference/cron.md).