docs for rc5

Jeadie · Jeadie · commit 6612ff01acfd · 2026-05-18T12:17:08.000+10:00
diff --git a/website/docs/components/data-accelerators/arrow/index.md b/website/docs/components/data-accelerators/arrow/index.md
@@ -53,6 +53,33 @@ datasets:
 
 See [Hash Index](../../features/data-acceleration/hash-index) for configuration details, supported data types, and performance characteristics.
 
+## Native Upserts with Primary Key Matching
+
+Spice supports efficient upsert (update-or-insert) operations on Arrow-accelerated tables using primary key matching. When a dataset is accelerated with Arrow and a `primary_key` is specified, an incoming row whose primary key matches an existing record updates that record; a row with a new primary key is inserted.
+
+### Example Upsert Configuration
+
+```yaml
+datasets:
+  - from: s3://bucket/orders.parquet
+    name: orders
+    acceleration:
+      engine: arrow
+      primary_key: order_id
+```
+
+- When data is inserted or loaded, a row whose `order_id` matches an existing record updates that record in place.
+- If the `order_id` is new, a new record is inserted.
+
+This enables efficient update-or-insert semantics for in-memory datasets, ideal for CDC, streaming, and real-time analytics workloads.
+
+### Notes
+- Upsert support requires a defined `primary_key`.
+- Upserts are performed in-memory and are not persisted after runtime shutdown.
+- For persistent upserts, use a persistent accelerator (e.g., DuckDB, Cayenne).
+
+---
+
 ## Limitations
 
 - The In-Memory Arrow Data Accelerator does not support persistent storage. Data is stored in-memory and will be lost when the Spice runtime is stopped.
diff --git a/website/docs/components/data-accelerators/cayenne/index.md b/website/docs/components/data-accelerators/cayenne/index.md
@@ -46,6 +46,22 @@ For optimal performance, store Cayenne data files on NVMe storage. NVMe provides
 
 Use [S3 Express One Zone](#aws-s3-express-one-zone-storage) when persistence of accelerations across restarts is required. S3 Express One Zone adds network latency compared to local NVMe but provides durability. Sharing accelerated data across multiple Spice instances is planned for a future release.
 
+## Advanced Internals
+
+### Sequence-based Upserts and Deletes
+Cayenne uses Iceberg-style sequence numbers to enable upsert and delete semantics. Each row is tagged with a sequence number, allowing efficient handling of row-level changes without rewriting entire files. Deletes are tracked as tombstones, and upserts are resolved at query time.
+
+### Metadata Management
+Cayenne maintains in-process metadata for fast query planning. Metadata includes file listings, statistics, and sequence maps. This enables fast discovery and pruning of data files during query execution.
+
+### Persistent Acceleration
+Cayenne stores acceleration data on NVMe or S3 Express One Zone. Acceleration state is durable across restarts. Sharing accelerated data across multiple Spice instances is planned for a future release.
+
+### Vortex Format
+Cayenne leverages the Vortex columnar format for zero-copy Arrow compatibility, fast random access, and extensible encoding.
+
+---
+
 ## Configuration
 
 To use Spice Cayenne as the data accelerator, specify `cayenne` as the `engine` for acceleration. Spice Cayenne supports `mode: file`, `mode: file_create`, and `mode: file_update` and stores data on disk.
diff --git a/website/docs/components/data-connectors/mongodb.md b/website/docs/components/data-connectors/mongodb.md
@@ -27,6 +27,41 @@ datasets:
 
 ## Configuration
 
+### Real-Time Change Data Capture (CDC) with MongoDB Change Streams
+
+Spice supports real-time Change Data Capture (CDC) from MongoDB using native [MongoDB Change Streams](https://www.mongodb.com/docs/manual/changeStreams/). This enables streaming inserts, updates, and deletes from your MongoDB collections directly into Spice accelerators, without requiring Debezium or Kafka.
+
+#### Enabling CDC with `refresh_mode: changes`
+
+To enable real-time CDC, set `refresh_mode: changes` in your dataset configuration:
+
+```yaml
+datasets:
+  - from: mongodb:my_collection
+    name: my_collection
+    params:
+      host: my-cluster.mongodb.net
+      db: mydb
+    acceleration:
+      enabled: true
+      engine: duckdb
+      refresh_mode: changes
+```
+
+- `refresh_mode: changes` tells Spice to use MongoDB Change Streams for this dataset.
+- Spice connects directly to MongoDB, so neither Debezium nor Kafka is required.
+- Changes are streamed in real time into the configured accelerator (e.g., DuckDB, Arrow).
+
+#### Use Cases
+- Real-time analytics on operational data
+- Low-latency dashboards and event-driven pipelines
+
+#### Notes
+- Requires MongoDB 4.0+ and a replica set or sharded cluster.
+- Ensure your MongoDB user has the `find` and `changeStream` privilege actions on the collection.
+
+---
+
 ### `from`
 
 The `from` field takes the form `mongodb:{table_name}` where `table_name` is the table identifer in the MongoDB server to read from.
diff --git a/website/docs/features/cdc/index.md b/website/docs/features/cdc/index.md
@@ -29,6 +29,20 @@ It is recommended to use CDC-accelerated datasets with persistent data accelerat
 
 :::
 
+## Kafka CDC Offset Persistence
+
+Spice persists Kafka CDC offsets in sidecar tables, enabling durable and resumable CDC streams. When consuming from Kafka topics, Spice records the last committed offset for each partition in a dedicated sidecar table. On restart or failover, Spice resumes from that offset, so committed events are neither lost nor processed twice.
+
+### Benefits
+- Durable CDC: Survives restarts and failover without replaying the entire topic.
+- Fast recovery: Resumes from the last processed event, not the earliest available.
+- No external offset store required.
+
+### Example
+No special configuration is required. Offset persistence is automatic for all Kafka CDC datasets.
+
+---
+
 ## Supported Data Connectors
 
 Enabling CDC by setting `refresh_mode: changes` in the acceleration settings requires support from the data connector to provide a stream of row-level changes.
diff --git a/website/docs/features/observability/index.md b/website/docs/features/observability/index.md
@@ -22,6 +22,20 @@ Spice provides monitoring and observability through three mechanisms:
 - [New Relic](../monitoring/new-relic)
 - [Zipkin](../monitoring/zipkin)
 
+## HTTP Rate-Control Persistence
+
+Spice persists HTTP rate-control (rate-limiting) state in object storage, ensuring that per-endpoint throttle counters survive restarts and are consistent across replicas. This enables reliable rate-limiting for all HTTP endpoints, including `/metrics`, even in distributed or containerized deployments.
+
+### Key Features
+- Persistent rate-limiting: Throttle state is saved to object storage and restored on restart.
+- Consistent across replicas: All instances share the same rate-limit state.
+- Independent `/metrics` limit: The `/metrics` endpoint is rate-limited separately so that scraping does not affect query serving.
+
+### Usage
+No special configuration is required. Rate-control persistence is enabled by default when object storage is configured for the runtime.
+
+---
+
 ## Prometheus Metrics Endpoint
 
 Spice exposes a Prometheus-compatible metrics endpoint that monitoring systems can scrape. The endpoint serves metrics in the [Prometheus exposition format](https://prometheus.io/docs/instrumenting/exposition_formats/), which is supported by most enterprise monitoring platforms including Datadog, New Relic, Chronosphere, Grafana Cloud, and others.