LatestAt queries need a mechanism to provide the context of the origin time #9819

Open
@jleibs

Description

Context

LatestAt queries are very powerful, but they currently materialize data somewhat magically, without providing a mechanism to actually know where that data came from. This can be important for use-cases like:

  • Doing additional queries at that specific time-point once you know "when" something happened.
    • A common pattern is for a user to log "events" when interesting things happen. It's easy to write a query that finds the last instance of an event before a particular time. But once you find the event you have no way of knowing when the event actually happened.
  • Understanding how old data is relative to the query time
  • Implementing interpolation functions

Example

This is probably best articulated with an example.

Consider the table:

+----------+------------+-----------+
| log_tick | /my_scalar | /my_color |
+----------+------------+-----------+
|    1     |   1.2      |   RED     |
|    3     |   3.4      |    -      |
|    5     |   2.7      |   BLUE    |
|    7     |   4.1      |    -      |
+----------+------------+-----------+

When using fill_latest_at(), it can be helpful to be able to answer the question: "From which time was the latest_at data filled?"

It should be possible to do something like the following:

# Proposed API: the `with_origin_time` argument does not exist yet.
(
    dataset.dataframe_query_view(index="log_tick", contents="/**")
    .using_index_values([4, 10])
    .fill_latest_at(with_origin_time=True)
    .df()
    .select(
        "log_tick",
        "/my_scalar",
        "/my_scalar:log_tick",
        "/my_color",
        "/my_color:log_tick",
    )
)

Alternatively, these columns could be materialized only on demand, if there were a way to express this as a UDF:

# Proposed API: `udf.origin_time` does not exist yet.
(
    dataset.dataframe_query_view(index="log_tick", contents="/**")
    .using_index_values([4, 10])
    .fill_latest_at()
    .df()
    .select(
        "log_tick",
        "/my_scalar",
        udf.origin_time("/my_scalar"),
        "/my_color",
        udf.origin_time("/my_color"),
    )
)

The expected output would be:

+----------+------------+---------------------+-----------+---------------------+
| log_tick | /my_scalar | /my_scalar:log_tick | /my_color | /my_color:log_tick  |
+----------+------------+---------------------+-----------+---------------------+
|    4     |   3.4      |          3          |   RED     |          1          |
|   10     |   4.1      |          7          |   BLUE    |          5          |
+----------+------------+---------------------+-----------+---------------------+
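To make the expected semantics concrete, the following sketch emulates the requested behavior in plain Python on the example table. It is not the proposed implementation: `latest_at_with_origin` and `query` are hypothetical helpers, and the `<column>:log_tick` naming simply mirrors the column names used in the expected output above.

```python
from bisect import bisect_right

# Sparse log from the example table: one sorted (log_tick, value) list per column.
LOG = {
    "/my_scalar": [(1, 1.2), (3, 3.4), (5, 2.7), (7, 4.1)],
    "/my_color": [(1, "RED"), (5, "BLUE")],
}


def latest_at_with_origin(samples, query_tick):
    """Return (value, origin_tick) of the last sample at or before query_tick."""
    ticks = [tick for tick, _ in samples]
    i = bisect_right(ticks, query_tick)
    if i == 0:
        # Nothing logged yet at this query time.
        return None, None
    origin_tick, value = samples[i - 1]
    return value, origin_tick


def query(log, index_values):
    """Build one row per query tick, with a value and an origin column per entity."""
    rows = []
    for query_tick in index_values:
        row = {"log_tick": query_tick}
        for column, samples in log.items():
            value, origin_tick = latest_at_with_origin(samples, query_tick)
            row[column] = value
            row[f"{column}:log_tick"] = origin_tick
        rows.append(row)
    return rows


rows = query(LOG, [4, 10])
for row in rows:
    print(row)
```

Each row carries both the filled value and the tick it was filled from, so the row for `log_tick` 4 reports `/my_scalar` as 3.4 with origin 3 and `/my_color` as RED with origin 1, matching the table.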

Metadata

Labels

enhancement (New feature or request), feat-dataframe-api (Everything related to the dataframe API)
