[RFC] Filter catalog by ts_event instead of ts_init #3491

@dxwil

When the catalog is queried with a start / end time range, the filtering is currently done on ts_init. This leads to the following undesirable situation, which applies to all data types:

Steps:

  • Request data for the last week and write it to the catalog (update_catalog=True).
  • The data client fetches the data today, so ts_init is stamped as today.
  • Re-run the same request for the last week.
  • The catalog correctly detects that data exists (based on filenames using ts_event).
  • However, when loading the data, it is filtered by start / end using ts_init.
  • Since ts_init is today (outside the requested range), no data is returned, even though the data actually exists.
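
The steps above can be reproduced with a small self-contained sketch. Everything here is hypothetical and only mimics the behaviour described: `ToyCatalog`, `Bar`, `covers`, and `query` are illustrative names, not the real catalog API, and the existence check is simplified to a scan over ts_event instead of filenames.

```python
from dataclasses import dataclass

DAY_NS = 86_400 * 1_000_000_000  # one day in nanoseconds


@dataclass(frozen=True)
class Bar:
    ts_event: int  # when the event occurred (ns)
    ts_init: int   # when the object was initialized locally (ns)


class ToyCatalog:
    """Hypothetical in-memory stand-in for the catalog (not the real API)."""

    def __init__(self):
        self.rows = []

    def write(self, rows):
        self.rows.extend(rows)

    def covers(self, start, end):
        # Existence check keyed on ts_event, like the filename-based check.
        return any(start <= r.ts_event <= end for r in self.rows)

    def query(self, start, end, field="ts_init"):
        # Load step: keep rows whose `field` lies inside [start, end].
        return [r for r in self.rows if start <= getattr(r, field) <= end]


# "Today" is day 10; the data is fetched at midday, so ts_init is day 10.5.
now = 10 * DAY_NS + DAY_NS // 2
start, end = 3 * DAY_NS, 10 * DAY_NS  # the last week, ending at midnight today

catalog = ToyCatalog()
catalog.write([Bar(ts_event=d * DAY_NS, ts_init=now) for d in range(3, 10)])

exists = catalog.covers(start, end)                         # data is detected
by_init = catalog.query(start, end)                         # current behaviour
by_event = catalog.query(start, end, field="ts_event")      # proposed behaviour

print(exists, len(by_init), len(by_event))
```

In this sketch the existence check succeeds, the ts_init filter returns nothing, and the ts_event filter returns all seven rows.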

This is less of a problem when importing external data (running it through a wrangler) and writing it to the catalog, because ts_init can then be set equal to ts_event. But when data is requested from a live data client, ts_init is (rightly) stamped with the current time for all of the data, to preserve latency information, and with update_catalog=True that value is written to the catalog.

Is there a reason the catalog filters on ts_init rather than ts_event? I can't currently think of one, and it causes the problem described above.

filters.append(pds.field("ts_init") >= used_start.value)

Labels: RFC (A request for comment)