---
title: 'Append Refresh Mode'
sidebar_label: 'Append'
description: 'Incrementally append new rows to an accelerated dataset.'
sidebar_position: 2
pagination_prev: null
pagination_next: null
---

The `append` refresh mode incrementally adds new rows to the acceleration on each refresh. It is designed for append-only or immutable datasets such as time-series, event, and log data.

Use `append` when:

- New rows are continuously added to the source and existing rows are not modified or deleted.
- A monotonic time or sequence column is available to identify new rows.
- The full dataset is too large to refresh in `full` mode on each interval.

## Configuration

`append` mode requires a [`time_column`](../../../reference/spicepod/datasets#time_column) to identify new rows. On each refresh, only source rows whose `time_column` value is greater than `max(time_column)` in the acceleration are fetched and appended.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
```

## Late-Arriving Data

To account for clock skew or late-arriving rows, configure an overlap window with [`acceleration.refresh_append_overlap`](../../../reference/spicepod/datasets#accelerationrefresh_append_overlap). Rows within the overlap are re-read on each refresh.

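As a rough sketch of how the overlap might be configured, the fragment below extends the Configuration example with `refresh_append_overlap`. The `5m` value is an assumption chosen for illustration, not a recommendation; check the linked reference for the accepted duration format and pick a window that covers the lateness expected from the source.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
      # Assumed value for illustration: re-read rows from the 5 minutes
      # before max(created_at) in the acceleration on each refresh.
      refresh_append_overlap: 5m
```

Because rows inside the overlap are re-read on every refresh, a wider window tolerates more lateness but fetches more rows per refresh.
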
## Partition Pruning with `time_partition_column`

Datasets partitioned by a less-granular time column (day, month, year) can specify [`time_partition_column`](../../../reference/spicepod/datasets#time_partition_column) in addition to `time_column` for efficient partition pruning at the source.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: created_at
    time_format: iso8601
    time_partition_column: created_at_day
    time_partition_format: date
```

## Append Only Modified Files

For object-store sources, set `time_column` or `time_partition_column` to the special value `last_modified` to append only newly created or updated files. Spice uses file metadata to determine which files are new, which reduces scan time for large datasets.

```yaml
datasets:
  - from: s3://my_bucket/my_dataset
    name: accelerated_dataset
    time_column: last_modified
    params:
      file_format: parquet
    acceleration:
      refresh_mode: append
      refresh_check_interval: 10m
```

If `last_modified` exists as a column in the data, the column value takes precedence over file metadata.

This is supported for connectors that accept the [file format parameter](../../../reference/file_format), such as `s3://`, `abfs://`, and `file://`.

## Readiness with Snapshots

Append-mode accelerations that define a `time_column` wait to report ready until the first append refresh completes after [snapshot bootstrap](../snapshots). This keeps the dataset out of rotation until the freshest data is available while still benefiting from snapshot-assisted startup.

## Combining with Upserts

Pair `refresh_mode: append` with a `primary_key` and `on_conflict: upsert` to handle source rows that are occasionally updated. See [End-to-End Incremental Ingestion Example](../data-refresh#end-to-end-incremental-ingestion-example).

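A minimal sketch of such a pairing follows. The `id` key column, the `updated_at` time column, the `duckdb` engine, and the per-column mapping form of `on_conflict` are all assumptions made for illustration; confirm the exact syntax and engine support in the dataset reference and the linked end-to-end example.

```yaml
datasets:
  - from: databricks:my_dataset
    name: accelerated_dataset
    time_column: updated_at # hypothetical column that advances when a row changes
    acceleration:
      enabled: true
      engine: duckdb # assumption: an engine that supports constraints
      refresh_mode: append
      refresh_check_interval: 10m
      primary_key: id # hypothetical key column
      on_conflict:
        id: upsert # assumption: per-column mapping form
```

Since append refreshes select rows by `time_column`, an updated source row is fetched again only if its `time_column` value moves past the current maximum in the acceleration or falls inside a configured overlap window.
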
## Related Topics

- [Refresh Interval](../data-refresh#refresh-interval)
- [Refresh on Startup](../data-refresh#refresh-on-startup)
- [Refresh Retries](../data-refresh#refresh-retries)
- [Retention Policy](../data-refresh#retention-policy)
- [Refresh Data Window](../data-refresh#refresh-data-window)