Commit c3aae08

Improvements to clarity of data refresh (#387)
1 parent 3696a80 commit c3aae08

1 file changed

Lines changed: 24 additions & 17 deletions

spiceaidocs/docs/components/data-accelerators/data-refresh.md

@@ -11,13 +11,13 @@ pagination_next: null
1111

1212
Spice supports three modes to refresh/update local data from a connected data source. `full` is the default mode.
1313

14-
| Mode | Description | Example |
15-
| --------- | ---------------------------------------------------- | -------------------------------------------------- |
16-
| `full` | Replace/overwrite the entire dataset on each refresh | A table of users |
17-
| `append` | Append/add data to the dataset on each refresh | Append-only datasets, like time-series or log data |
18-
| `changes` | Apply incremental changes | Customer order lifecycle table |
14+
| Mode | Description | Example |
15+
| --------- | ---------------------------------------------------- | ---------------------------------------------------------------- |
16+
| `full` | Replace/overwrite the entire dataset on each refresh | A table of users |
17+
| `append` | Append/add data to the dataset on each refresh | Append-only, immutable datasets, such as time-series or log data |
18+
| `changes` | Apply incremental changes | Customer order lifecycle table |
1919

20-
E.g.
20+
Example:
2121

2222
```yaml
2323
datasets:
@@ -28,7 +28,9 @@ datasets:
2828
refresh_check_interval: 10m
2929
```
3030
31-
If the dataset definition includes a `time_column` and the refresh mode is `append`, data will be refreshed for data where the `time_column` value in the remote source is greater-than (gt) the `max(time_column)` value in the local acceleration.
31+
### Append
32+
33+
If the dataset definition includes a `time_column` and the refresh mode is `append`, the refresh is incremental: only rows whose `time_column` value in the remote source is greater than (gt) the `max(time_column)` value in the local acceleration are fetched.
3234

3335
E.g.
3436

@@ -42,7 +44,9 @@ datasets:
4244
refresh_check_interval: 10m
4345
```
4446

45-
## Changes
47+
When using `mode: append`, if late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/reference/spicepod/datasets#accelerationrefresh_append_overlap).
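
As a minimal sketch of the overlap setting, the fragment below adds `refresh_append_overlap` to an `append` dataset. The source name `databricks:my_dataset`, the `created_at` column, and the `5m` value are illustrative assumptions, not taken from this commit; check the linked reference for the accepted duration format.

```yaml
datasets:
  - from: databricks:my_dataset # hypothetical source, for illustration only
    name: accelerated_dataset
    time_column: created_at # hypothetical timestamp column
    acceleration:
      enabled: true
      refresh_mode: append
      refresh_check_interval: 10m
      # Assumed value: allow for rows arriving up to 5 minutes late
      # or with skewed clocks on each append refresh.
      refresh_append_overlap: 5m
```

A larger overlap tolerates later-arriving rows at the cost of re-reading more of the source on every refresh.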
48+
49+
### Changes (CDC)
4650

4751
Datasets configured with acceleration `refresh_mode: changes` require a [Change Data Capture (CDC)](/features/cdc/index.md) supported data connector. Initial CDC support in Spice is provided by the [Debezium data connector](/components/data-connectors/debezium.md).
4852

@@ -57,7 +61,7 @@ Typically only a working subset of an entire dataset is used in an application o
5761

5862
Specify filters for data accelerated from the connected source using arbitrary SQL. Supported for `full` and `append` refresh modes.
5963

60-
Filters will be pushed down to the remote source, and only the requested data will be transferred over the network.
64+
Filters will be pushed down to the remote source when possible, so only the requested data will be transferred over the network.
6165

6266
Example:
6367

@@ -73,7 +77,7 @@ datasets:
7377
SELECT * FROM accelerated_dataset WHERE city = 'Seattle'
7478
```
7579

76-
The `refresh_sql` parameter can be updated at runtime on-demand using `PATCH /v1/datasets/:name/acceleration`. This change is temporary and will revert at the next runtime restart.
80+
The `refresh_sql` parameter can be updated at runtime on-demand using `PATCH /v1/datasets/:name/acceleration`. This change is temporary and will revert to the `spicepod.yml` definition at the next runtime restart.
7781

7882
Example:
7983

@@ -90,20 +94,20 @@ For the complete reference, view the `refresh_sql` section of [datasets](/refere
9094

9195
:::warning[Limitations]
9296

93-
- The refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
97+
- Refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
9498
- Selecting a subset of columns isn't supported - the refresh SQL needs to start with `SELECT * FROM {name}`.
95-
- Queries for data that have been filtered out will not fall back to querying against the federated table.
99+
- Queries for data that have been filtered out will not fall back to querying the federated table.
96100
- Refresh SQL modifications made via API are temporary and will revert after a runtime restart.
97101

98102
:::
99103

100104
### Refresh Data Window
101105

102-
Filters data from the federated source outside than the specified window. The only supported window is a lookback starting from `now() - refresh_data_window` to `now()`. This flag is only supported for datasets configured with a `full` refresh mode (the default).
106+
Filters data from the federated source that falls outside the specified time window. The only supported window is a lookback period starting from `now() - refresh_data_window` to `now()`. This flag is supported for datasets configured with the default `full` refresh mode.
103107

104-
Used in combination with the [`time_column`](/reference/spicepod/datasets.md#time_column) to identify the column that contains the timestamps to filter on. The [`time_format`](/reference/spicepod/datasets.md#time_format) column (optional) can be used to instruct the Spice runtime how to interpret the timestamps in the `time_column`.
108+
This filter works with the `time_column` to identify the column containing timestamps for filtering. Optionally, the `time_format` can be specified to instruct the Spice runtime on how to interpret timestamps in the `time_column`.
105109

106-
Can also be combined with `refresh_sql` to further filter the data based on the temporal dimension.
110+
It can also be used alongside `refresh_sql` to apply additional filtering based on time-related criteria.
107111

108112
Example:
109113

@@ -243,9 +247,11 @@ Retention policies apply to `full` and `append` refresh modes (not `changes`).
243247

244248
The policy is set using the [`acceleration.retention_check_enabled`](/reference/spicepod/datasets#accelerationretention_check_enabled), [`acceleration.retention_period`](/reference/spicepod/datasets#accelerationretention_period) and [`acceleration.retention_check_interval`](/reference/spicepod/datasets#accelerationretention_check_interval) parameters, along with the [`time_column`](/reference/spicepod/datasets#time_column) and [`time_format`](/reference/spicepod/datasets#time_format) dataset parameters.
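
A retention policy built from these parameters might look like the sketch below. The parameter names come from the sentence above; the source name, the `created_at` column, and the `30d` and `1h` values are assumptions chosen for illustration, and `time_format` is omitted because its accepted values are not shown here.

```yaml
datasets:
  - from: databricks:my_dataset # hypothetical source, for illustration only
    name: accelerated_dataset
    time_column: created_at # hypothetical timestamp column
    acceleration:
      enabled: true
      refresh_mode: append
      retention_check_enabled: true
      retention_period: 30d # assumed value
      retention_check_interval: 1h # assumed value
```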
245249

246-
247250
## Refresh Jitter
248-
Accelerated datasets can be configured to add a random jitter to the refresh interval. This can be useful to avoid a thundering herd problem where multiple datasets are refreshed at the same time. The jitter is added or subtracted from the refresh interval and is between 0 and `refresh_jitter_max`.
251+
252+
Accelerated datasets can include a random jitter in the refresh interval to prevent the [Thundering herd problem](https://en.wikipedia.org/wiki/Thundering_herd_problem), where multiple datasets refresh simultaneously. The jitter, ranging from `0` to `refresh_jitter_max`, is randomly added or subtracted from the refresh interval.
253+
254+
Refresh jitter also applies to the first dataset load, so when multiple similarly configured Spice instances restart at once, each one loads with a jitter of `0` to `refresh_jitter_max`.
249255

250256
Example:
251257

@@ -262,5 +268,6 @@ datasets:
262268
In this example, the refresh interval will be between 9s and 11s.
263269

264270
Refresh jitter can be configured using the following parameters:
271+
265272
- [`refresh_jitter_enabled`](/reference/spicepod/datasets#accelerationrefresh_jitter_enabled)
266273
- [`refresh_jitter_max`](/reference/spicepod/datasets#accelerationrefresh_jitter_max)
