
scaffold_and_refine_multitaxa usability improvements: try obtaining e-mail address via Terra env introspection; allow taxid_to_ref_accessions_tsv to originate from an http[s] URL #591


Open
wants to merge 27 commits into master

Conversation

tomkinsc (Member)

This makes a couple of small usability-related changes to the scaffold_and_refine_multitaxa workflow in response to observed friction points:

  • the workflow will attempt to obtain the e-mail address of the active user by introspection of the execution environment if (and only if) running on Terra, while also allowing an overriding input. This e-mail address is sent with requests to download reference genomes from NCBI. The tasks_ncbi.wdl::download_annotations task should ideally also be updated at some point to accept an NCBI API key as an alternative to an e-mail address.
  • the workflow will accept an input URL for taxid_to_ref_accessions_tsv from an http[s] source; the (newly-added) download_from_url task will accept a path specified using gs:// or http[s], downloading the latter and passing the former through unchanged. This was added with public http[s]-accessible databases in mind, such as the reference genome list maintained in the repo broadinstitute/viral-references.
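The introspection mechanism is not spelled out in the summary above; the following is a hypothetical minimal sketch only, assuming the e-mail of the Terra pet service account (readable from the GCE instance metadata server, which Terra task VMs expose) is an acceptable stand-in for the user's address. The task name, output name, and docker image are invented for illustration:

```wdl
version 1.0

# Hypothetical sketch, not necessarily this PR's implementation: on Terra,
# each task VM runs as the user's pet service account, whose e-mail address
# can be read from the GCE instance metadata server.
task get_email_via_terra_introspection {
  command <<<
    curl -s -H "Metadata-Flavor: Google" \
      "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/email" \
      > EMAIL_ADDRESS || printf "" > EMAIL_ADDRESS
  >>>
  output {
    # empty string if introspection failed (e.g. not running on GCP/Terra)
    String email_address = read_string("EMAIL_ADDRESS")
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}
```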

Added:

  • download_from_url task: only downloads http[s] URLs, passing non-http[s] input URLs through directly to the output for consumption downstream in tasks that can localize Files addressable via the "non-http protocols" (i.e. gs://, drs://, etc.). The task does this simply by checking the URL prefix/protocol; ideally we would decide whether to download based on introspection of the execution engine and its localization capabilities or configuration. After calling download_from_url, downstream tasks can consume http[s] (or gs://, etc.) paths by selecting whichever output of download_from_url is defined:
`select_first([download_from_url.downloaded_response_file, download_from_url.passthrough_url])`
  • a new workflow, download_file: calls the download_from_url task. This may be useful in isolation, but it primarily allows the functionality described above to be abstracted into a sub-workflow, with the result accessed via a single output, download_file.file_path.
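The prefix check described above can be sketched roughly as follows; structure and names are illustrative, not the exact WDL in this PR:

```wdl
version 1.0

# Illustrative sketch of the decision logic described above.
task download_from_url {
  input {
    String url_to_download
  }
  command <<<
    if [[ "~{url_to_download}" == http://* || "~{url_to_download}" == https://* ]]; then
      # http[s]: download the file within the task
      mkdir -p download_subdir
      wget -P download_subdir "~{url_to_download}"
      echo "true" > WAS_HTTP_DOWNLOAD
    else
      # gs://, drs://, etc.: pass through for the engine to localize downstream
      echo "false" > WAS_HTTP_DOWNLOAD
    fi
  >>>
  output {
    Boolean was_http_download = read_boolean("WAS_HTTP_DOWNLOAD")
    # the real task exposes optional outputs (downloaded_response_file and
    # passthrough_url); typing those optionals portably across engines is
    # the subtle part, per the dxWDL error discussed in the commit messages
    Array[File] downloaded_files = glob("download_subdir/*")
    String passthrough_url = url_to_download
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}
```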

tomkinsc added 27 commits March 21, 2025 14:47
…e by introspection if possible

This changes the `scaffold_and_refine_multitaxa` workflow so that, rather than having `emailAddress` as a required input, the e-mail address of the active user is obtained by introspection of the execution environment if (and only if) running on Terra.
…tp[s] input url to output (i.e. gs://, drs://, etc.) for direct consumption downstream

This changes the task `download_from_url` to only download http[s] URLs; non-http[s] input URLs will be passed through directly to the output for direct consumption downstream in tasks that can localize such protocols (i.e. gs://, drs://, etc.). The task does this simply by checking the URL prefix/protocol; ideally we would decide whether to download based on introspection of the executor and its localization capabilities and configuration. After calling `download_from_url`, downstream tasks can then consume http[s] (or gs://, etc.) paths by selecting whichever output of `download_from_url` is defined:
`select_first([download_from_url.downloaded_response_file, download_from_url.passthrough_url])`
This was added with public http[s]-accessible databases in mind, such as the reference genome list from `broadinstitute/viral-references` used for `scaffold_and_refine_multitaxa`.
This also adds a new workflow, `download_file`, to call the task separately from invocation in other workflows.
… to allow the workflow to consume `taxid_to_ref_accessions_tsv` input specified from either a `gs://` or `http[s]` source

`scaffold_and_refine_multitaxa` workflow: use the `download_file` sub-workflow to allow the workflow to consume its `taxid_to_ref_accessions_tsv` input from a path specified using `gs://` *or* `http[s]`.
debugging download_from_url delocalization on Terra
in an attempt to resolve the "Failed to predict files needed to de-localize from 'read_string'" error occurring *before* task execution
include an optional `File?` in the `flatten()` call to appease dxWDL and prevent the error:
```
Failed to process task definition 'download_from_url' (reason 1 of 1): Failed to process expression 'if read_boolean("WAS_HTTP_DOWNLOAD") then select_first(flatten([glob((download_subdir_local + "/*")), [""]])) else nullStrPlaceholder' (reason 1 of 1): Invalid parameter 'Flatten(ArrayLiteral(Vector(Glob(Add(IdentifierLookup(download_subdir_local),StringLiteral(/*))), ArrayLiteral(Vector(StringLiteral())))))'. Expected an array of optional values (eg 'Array[X?]') but got 'Array[String]')
```
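The workaround can be sketched as follows; this is a hedged reconstruction from the quoted expression, not the PR's exact diff. `nullFilePlaceholder` is an assumed `File?` declaration introduced here for illustration:

```wdl
# Assumed declaration: an optional that is never assigned evaluates as undefined
File? nullFilePlaceholder

# Including the optional placeholder element coerces the flattened array to
# Array[File?] rather than Array[File]/Array[String], satisfying dxWDL's
# "Expected an array of optional values (eg 'Array[X?]')" check.
File? downloaded_response_file = if read_boolean("WAS_HTTP_DOWNLOAD")
    then select_first(flatten([glob(download_subdir_local + "/*"), [nullFilePlaceholder]]))
    else nullFilePlaceholder
```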