Data lineage channel factories #6003

Draft
wants to merge 4 commits into
base: master

jorgee (Contributor) commented Apr 24, 2025

Initial fromLineage factory implementation

netlify bot commented Apr 24, 2025

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit cdd9e89
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/680be16db7ea9d0008b38953

bentsherman changed the title from "Lineage factory channel factory" to "Data lineage channel factories" on Apr 25, 2025
bentsherman (Member) commented Apr 25, 2025

I think we can have a nice unification of the CLI and programmatic API here

Viewing a single LID:

  • CLI: view lid://<hash>[/<path>] -> use jq to inspect further
  • API: channel.fromLineage('lid://<hash>[/<path>]') -> use json-path param to inspect further

Querying a collection of LIDs:

  • CLI: find <name=value> <name=value> ...
  • API: channel.queryLineage(foo: 'bar', baz: 'qux') -> returns queue channel of items, use json-path param to inspect further

This analysis suggests that the json-path functionality should be provided as a standalone function so that it can be used in different ways. You can add it to Nextflow.groovy. Actually you could just add it as an extra param to both APIs if that would be easier.

It might also make sense to refactor channel.fromLineage as a function instead of a channel factory, e.g. lineage(lid), because as a channel factory it will always return a value channel which is a needless constraint. The queryLineage factory makes sense because it can emit results while it's querying.

Finally, I think that find / queryLineage should really just use key-value pairs, rather than the URI query syntax which is unnecessary and error-prone
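
To make the key-value idea concrete, here is a minimal plain-Groovy sketch of matching metadata records against a map of criteria, with dotted keys such as `annotations.value` walked through nested maps. The helpers `navigate` and `matches` and the sample records are hypothetical illustrations, not part of this PR or the Nextflow API, and the record layout is an assumption.

```groovy
// Hypothetical helper: walk a dotted path such as 'annotations.value'
// through nested maps; returns null if any step is missing.
def navigate(Map record, String path) {
    path.tokenize('.').inject(record) { node, key ->
        node instanceof Map ? node[key] : null
    }
}

// Hypothetical helper: a record matches when every key-value pair agrees.
boolean matches(Map record, Map query) {
    query.every { key, expected -> navigate(record, key as String) == expected }
}

// Made-up records standing in for lineage metadata (assumed layout)
def records = [
    [type: 'FileOutput', path: 'a.txt', annotations: [value: 'test']],
    [type: 'FileOutput', path: 'b.txt', annotations: [value: 'prod']],
    [type: 'TaskRun',    name: 'foo'],
]

def hits = records.findAll { matches(it, [type: 'FileOutput', 'annotations.value': 'test']) }
assert hits*.path == ['a.txt']
```

Because `navigate` is a standalone function in this sketch, the same path lookup could in principle back both a single-LID view and a collection query, which is the reuse the comment above argues for.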

jorgee (Contributor, Author) commented Apr 25, 2025

I have pushed what I showed today.

  • Files published with publishDir are included in outputs.
  • fromLineage channel factory: it should return the same as the CLI view command.
    Usage: Channel.fromLineage("lid://xxxx")
  • queryLineage channel factory: it returns a channel with the LinPaths matching the query string. Almost the same as the CLI find command.
    Usage: Channel.queryLineage("type=FileOutput&annotations.value=test")

future.exceptionally(this.&handlerException)
}

static DataflowWriteChannel queryLineage(String queryString) {
Member

Let's just use a Map here instead of query string. I would apply the same change to the find command as well. There is no need to add the extra complexity of URL encoding
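
The encoding concern can be illustrated with a small plain-Groovy sketch. `parseQuery` below is a hypothetical parser written only for this example (it is not the code in this PR); it shows how a value containing `&` or `=` is silently split unless the caller URL-encodes it, whereas a Map needs no escaping.

```groovy
import java.net.URLDecoder
import java.net.URLEncoder

// Hypothetical parser for a query string such as
// 'type=FileOutput&annotations.value=test' (assumes every pair contains '=')
Map<String, String> parseQuery(String queryString) {
    queryString.split('&').collectEntries { pair ->
        def (key, value) = pair.split('=', 2)
        [(URLDecoder.decode(key, 'UTF-8')): URLDecoder.decode(value, 'UTF-8')]
    }
}

assert parseQuery('type=FileOutput&annotations.value=test') ==
        [type: 'FileOutput', 'annotations.value': 'test']

// A raw value containing '&' is split into two unrelated criteria
assert parseQuery('label=a&b=c') == [label: 'a', b: 'c']

// The caller has to remember to encode the value first
def encoded = 'label=' + URLEncoder.encode('a&b=c', 'UTF-8')
assert parseQuery(encoded) == [label: 'a&b=c']

// With a Map, the same criterion is written directly
def query = [label: 'a&b=c']
assert query.label == 'a&b=c'
```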

@@ -381,7 +381,7 @@ class Session implements ISession {
 this.dag = new DAG()

 // -- init output dir
-this.outputDir = FileHelper.toCanonicalPath(config.outputDir ?: 'results')
+this.outputDir = FileHelper.toCanonicalPath(config.outputDir ?: config.navigate('params.outdir') ?: 'results')
Member

Let's keep this PR focused on the channel factories, we can address publishDir support separately

return filePattern instanceof QueryablePath && (filePattern as QueryablePath).hasQuery()
}

private boolean applyQueryablePath0(QueryablePath path) {
Member

We don't need to add query support to fromPath because it will already be supported through queryLineage. Besides I would like to get away from using URI query params everywhere.

The queryLineage factory should return a channel of metadata objects, so you can chain it with a map operator to extract the actual files
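
As a rough illustration of that chaining, the plain-Groovy sketch below uses a list of maps as a stand-in for the channel of metadata objects and `collect` as a stand-in for the `map` operator. The function `extractFiles` and the field names (`type`, `path`, `size`) are assumptions made for this example; the actual lineage schema and the final `queryLineage` signature are not settled in this thread.

```groovy
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical stand-in for `queryLineage(...).map { ... }`:
// turn each metadata object into the file path it describes.
List<Path> extractFiles(List<Map> results) {
    results.collect { meta -> Paths.get(meta.path as String) }
}

// Made-up metadata objects (assumed field names, not the real schema)
def results = [
    [type: 'FileOutput', path: '/results/a.txt', size: 12],
    [type: 'FileOutput', path: '/results/b.txt', size: 34],
]

def files = extractFiles(results)
assert files*.fileName*.toString() == ['a.txt', 'b.txt']
```

Emitting metadata objects rather than bare paths keeps fields such as annotations available to downstream operators, and extracting the file remains a one-line mapping step.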
