Skip to content

storage endpoint to order assets by materialization timestamp #29173

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

jamiedemaria
Copy link
Contributor

@jamiedemaria jamiedemaria commented Apr 10, 2025

Summary & Motivation

From the FE we want to be able to order the assets in the Asset Catalog by the time they were most recently materialized/observed/failed to materialize.

Existing query pattern in the FE:

  • fetch and cache 10000 asset nodes
  • display whichever assets are in the selected catalog view, fetching more asset nodes if needed

Rather than modifying the query that fetches all assets from the DB to return in time-sorted order (which would require updating cursor logic, etc) this PR adds an endpoint that takes a list of asset keys and returns that list of asset keys in sorted order. This way we don't have to refetch the asset nodes when a user switches from time-ordered to alphabetical sorting.

New query pattern would be:

  • fetch and cache 10000 asset nodes
  • if doing time-sorted order, query to get the asset keys in time-sorted order
  • display the assets, fetching more asset nodes if needed

The endpoint also has an option to sort all asset nodes to cover the case when no catalog view is selected.

Sorting caveat:

  • right now the latest_materialization_timestamp doesn't get updated on FAILED_TO_MATERIALIZE events, which means the asset would be sorted by the corresponding PLANNED event. We'll likely want to change this in the future so that the failure events are used in the sorting, but are fine with this inconsistency for the internal observe release

internal pr https://github.com/dagster-io/internal/pull/14946

How I Tested These Changes

Changelog

Insert changelog entry or delete this section.

Copy link
Contributor Author

@jamiedemaria jamiedemaria changed the title poc asset ordering [poc] asset ordering by timestamp Apr 10, 2025
@jamiedemaria jamiedemaria changed the title [poc] asset ordering by timestamp storage endpoint to order assets by materialization timestamp Apr 14, 2025
@jamiedemaria jamiedemaria marked this pull request as ready for review April 14, 2025 16:28
@jamiedemaria jamiedemaria requested review from prha and gibsondan April 14, 2025 16:29
Copy link
Member

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you already have the 10k nodes on the frontend, i'm confused why you would need to re-order them on the backend? Could the sorting be done on the frontend instead using information gained from the original query?

Copy link
Member

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i might be missing something. But it seems like if this information is already available on the asset record, we could get it the first time it was queried and keep it ready if the user reorders things

If we are going to order by on the backend, I think we will want an index for that, even if we are passing in the list of keys.

Copy link
Member

@gibsondan gibsondan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a path forward here is to start including the last_materialization_timestamp information on the query that returns the original asset nodes? so that it is immediately available to use for sorting on the frontend?

@jamiedemaria
Copy link
Contributor Author

Chatted with @salazarm about having the timestamp on the AssetNode and then doing sorting on the FE and he mentioned that it would involve updating a run watching mechanism to start watching in progress and unstarted runs so that we know which asset nodes to refetch (to get updated timestamps) and that that could also be slow. i'll let marco weigh in though because i'm not sure i explained that totally right

@gibsondan
Copy link
Member

gibsondan commented Apr 14, 2025

i think even in that world it would still make more sense to add this timestamp as a field on AssetNode that gets refetched periodically using the existing resolvers, rather than have a custom endpoint for re-ordering a known set of existing asset keys


query = (
db_select([AssetKeyTable.c.asset_key])
.order_by(order_by)
Copy link
Member

@gibsondan gibsondan Apr 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the key thing is that doing this order by on the backend/in postgres only makes sense if postgres knows more about how to order it or paginate it efficiently than the frontend or caller does, which isn't the case here since we don't have an index

@salazarm
Copy link
Contributor

salazarm commented Apr 14, 2025

Yeah so we have a mechanism to watch in progress runs for runs that are launched by the user, but it requires also fetching inProgressRunIds. But yeah I agree with Daniel this doesn't need to be a new resolver. But also GrapheneAssetNode doesn't work for external assets to I think we probably want this on GrapheneAsset

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants