[RFC] Filter catalog by ts_event instead of ts_init
When the catalog is queried with a start / end time range, filtering is currently done on `ts_init`. This can cause the following undesirable situation (it applies to all data types):
Steps:
- Request data for the last week and write it to the catalog (`update_catalog=True`).
- The data client fetches the data today, so `ts_init` is stamped as today.
- Re-run the same request for the last week.
- The catalog correctly detects that data exists (based on filenames using `ts_event`).
- However, when loading the data, it is filtered by start / end using `ts_init`.
- Since `ts_init` is today (outside the requested range), no data is returned, even though the data actually exists.
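The steps above can be reproduced with a minimal, standard-library sketch. The `Record` class and `filter_by` helper are hypothetical stand-ins for illustration only, not part of the catalog API; timestamps are assumed to be UNIX nanoseconds, and "today" is arbitrarily placed at day 10.

```python
from dataclasses import dataclass

NANOS_PER_DAY = 86_400 * 1_000_000_000


@dataclass(frozen=True)
class Record:
    # Hypothetical stand-in for any catalog data type.
    ts_event: int  # when the event occurred (UNIX nanoseconds)
    ts_init: int   # when the object was initialized locally (UNIX nanoseconds)


def filter_by(records, field, start, end):
    """Keep records whose `field` timestamp lies in [start, end] (inclusive)."""
    return [r for r in records if start <= getattr(r, field) <= end]


# Assumed scenario: "today" is day 10 and the request covers days 3..9.
today = 10 * NANOS_PER_DAY
start, end = 3 * NANOS_PER_DAY, 9 * NANOS_PER_DAY

# One record per day of the last week, all fetched (and so initialized) today.
stored = [Record(ts_event=d * NANOS_PER_DAY, ts_init=today) for d in range(3, 10)]

by_init = filter_by(stored, "ts_init", start, end)    # filter on ts_init
by_event = filter_by(stored, "ts_event", start, end)  # filter on ts_event

print(len(by_init))   # 0: ts_init (today) is outside the requested range
print(len(by_event))  # 7: every record's ts_event is inside the range
```

Filtering the same stored records on `ts_event` returns the full week, while filtering on `ts_init` returns nothing.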
This is less of a problem when importing external data (running it through a wrangler) and writing it to the catalog, since `ts_init` can then be set equal to `ts_event`. But when data is requested from a live data client, `ts_init` is (rightly) stamped with the current time for all the data, to preserve latency information, and with `update_catalog=True` that value is written to the catalog.
Is there a reason the catalog filters on `ts_init` rather than `ts_event`? I can't currently think of one, and it causes the problem described above.
`filters.append(pds.field("ts_init") >= used_start.value)`