Structured parameters and outputs #226
base: develop
1. Modified workflow_params description to indicate it's required unless workflow_unified_params is provided
2. Added workflow_unified_params field with the hybrid structure we discussed
3. Enhanced file metadata support with optional fields for size, checksum, secondary files, format, and modification time
4. Added comprehensive validation constraints for different parameter types
5. Validated the schema - no OpenAPI validation issues detected

Key Features of the Implementation:
- Version field for format evolution (default: 1.0)
- Rich file metadata (size, checksum, secondary_files, format, last_modified)
- Comprehensive validation constraints (min/max, length, pattern, enum, array limits)
- Type-safe parameter definitions with clear enums
- Backward compatibility - existing workflow_params still works
- Precedence handling - workflow_unified_params takes precedence when provided
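To illustrate the precedence rule described above, here is a minimal sketch of how a server might choose between the two parameter fields. The helper name `select_params` and the plain-dict view of the run request are assumptions made for illustration; they are not defined by the spec.

```python
from typing import Any, Dict


def select_params(run_request: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the parameter block a server would act on (hypothetical helper)."""
    unified = run_request.get("workflow_unified_params")
    if unified is not None:
        # workflow_unified_params takes precedence when provided.
        return {"source": "workflow_unified_params", "params": unified["parameters"]}
    legacy = run_request.get("workflow_params")
    if legacy is None:
        # workflow_params is required unless workflow_unified_params is given.
        raise ValueError("either workflow_params or workflow_unified_params is required")
    return {"source": "workflow_params", "params": legacy}
```

A request carrying both fields would therefore be served from `workflow_unified_params`, while existing clients that only send `workflow_params` keep working.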
Key Improvements Made:

1. Dual Content Type Support
   - application/json (Recommended): Uses the proper RunRequest model object
   - multipart/form-data (Legacy): Maintains backward compatibility for file uploads
2. Proper Model Usage
   - JSON requests now use $ref: '#/components/schemas/RunRequest'
   - Leverages all the rich typing and validation from the RunRequest schema
   - Supports both workflow_params and workflow_unified_params
3. Enhanced Documentation
   - Clear guidance on when to use each content type
   - Explains file handling differences between formats
   - Documents the new unified parameter format
   - Security considerations for file uploads
4. Better Developer Experience
   - OpenAPI tooling can generate proper client code for JSON requests
   - Type safety with structured objects instead of string parsing
   - Validation happens automatically with the model schema
   - Consistency across the API

Usage Examples:

Preferred JSON format:

    POST /runs
    Content-Type: application/json

    {
      "workflow_type": "CWL",
      "workflow_type_version": "v1.0",
      "workflow_url": "https://example.com/workflow.cwl",
      "workflow_unified_params": {
        "version": "1.0",
        "parameters": {
          "input_file": {
            "type": "File",
            "value": "gs://bucket/input.fastq",
            "file_metadata": {
              "size": 1073741824,
              "checksum": "sha256:abc123..."
            }
          }
        }
      }
    }

Legacy multipart format (when file uploads needed):

    POST /runs
    Content-Type: multipart/form-data

    workflow_type: CWL
    workflow_unified_params: {"version":"1.0","parameters":{...}}
    workflow_attachment: [binary file data]
1. Updated RunLog schema - Added structured_outputs field alongside the existing outputs
2. Added WorkflowOutputs schema - Main container for structured outputs with version and metadata
3. Added OutputObject schema - Flexible output type supporting Files, Directories, Arrays, and primitives
4. Added documentation tags - Both schemas appear in the Models section of the API docs

Key Features Implemented:

WorkflowOutputs Schema:
- Version field for format evolution
- Named outputs with rich metadata
- Workflow-level metadata (execution ID, timing, resource usage)
- Provenance tracking (engine, version, status)

OutputObject Schema:
- Type system - File, Directory, Array, String, Integer, Float, Boolean
- File metadata - location, size, checksum, format, basename
- Provenance - source task, command, creation time
- Secondary files - Associated files like indexes
- Array support - Collections of outputs
- Content embedding - Small file contents can be included

Backward Compatibility:
- Existing outputs field remains unchanged (marked as "legacy format")
- structured_outputs is optional - implementations can provide either or both
- No breaking changes to existing API consumers
…ondary workflow URLs beyond the primary workflow
Pull Request Overview
This pull request introduces structured parameters and outputs to the Workflow Execution Service (WES) API specification while maintaining backward compatibility. The update enhances workflow type and engine support, introduces universal parameterization, and provides structured output collection.
Key changes:
- Added support for additional workflow types (Nextflow, Snakemake) beyond CWL and WDL
- Introduced `workflow_unified_params` for universal parameter format across workflow engines
- Added `structured_outputs` to provide rich metadata and type-safe output collection
- Enhanced service-info endpoint with workflow engine information and improved documentation
Co-authored-by: Copilot <[email protected]>
…low-execution-service-schemas into feature/issue-176-wes-params
Co-authored-by: Copilot <[email protected]>
Just dropping this in from @suecharo / @inutano - Sapporo extended WES in a few ways - https://petstore.swagger.io/?url=https://raw.githubusercontent.com/sapporo-wes/sapporo-service/main/sapporo-wes-spec-2.0.0.yml One thing that is interesting in their extensions is adding workflow_attachments_obj, which makes it, I think, somewhat easier to specify the destination filename of attachments. Perhaps it would be useful to go even further and specify them as File or Directory types.
description: Whether this parameter is optional
default:
description: Default value if parameter not provided (type depends on 'type' field)
constraints:
I'm wondering about the use of uploading validation constraints when submitting a workflow run.
I totally agree that setting up validation constraints in a WES makes sense if the WES is installing a new workflow. However, this implementation has some disadvantages:
- Consistency: Defining such constraints when requesting a workflow run allows a client to define them differently for different runs, and allows different clients (users) to use different constraints.
- Redundancy: The clients would probably have to provide these constraints every time they submit a run, and they would always be the same constraints, right?
- Competencies: Finally, the user of the WES instance is not necessarily the same person (or at the same expert level) as the administrator -- or whoever is responsible for installing workflows. The current implementation puts the burden of defining these constraints on the clients, although it would be better placed on some kind of administrator. This is particularly relevant with human data and tight security constraints that may involve workflow auditing, and it is also problematic if there are many clients.

These problems may hint that we are not modelling the domain in sufficient detail.
Some thoughts in this direction
Logically, to me the validation constraints belong with the workflow, not with the workflow run.
Think of a use case like this:
- Some client application first requests a new workflow. This could be a data user, but also a WES administrator or somebody involved in the governance of the instance. Think of security auditing etc. The workflow installation route could have separate authentication.
- The workflow is installed. Depending on the workflow, this may involve many steps, such as downloading containers or running integration tests to verify the installation. (Ideally, of course, only some containers would be pulled.)
- The clients use the workflow in the usual way.

We currently don't have a separate endpoint for requesting the installation of a new workflow, e.g., for downloading from a TRS. I'm also not saying we need it -- there is value in keeping the API simple. I just wanted to make the point that we could have such an endpoint and that it would have a valid use case. An installation of a workflow could be considered a "resource" in the REST sense.
I agree that it seems redundant to pass the parameter type for every workflow run. The parameter types do not change, so would it make sense to define them with the workflow, or even with the engine?
I guess that if you are talking about workflow parameters (e.g., bwa-mem parameters), then with the workflow.
If you are talking about workflow engine parameters (e.g., number of cores of the control process or whether to rerun (and complement) a previous run), then with the engine.
In both cases, the problem is analogous. Ideally, a similar solution would be found. I think something like EDAM can be helpful in both cases.
IMO constraints are typically properties of a workflow in the case of a WES. Putting their definition on the end user feels needless.
If the user is setting these constraints themselves, why not just modify the inputs to match their constraints before submitting the workflow? Where I have heard the request for constraints come from is workflow authors who want to define things like ranges of values, or enums.
Constraints would probably be better off living alongside TRS or directly within the workflow itself (if the language supported it).
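To make the alternative discussed in this thread concrete, the following is a minimal sketch of constraints that are stored once per installed workflow and checked by the server at submission time. The registry `WORKFLOW_CONSTRAINTS`, the workflow id, the parameter names, and the constraint keys (`min`, `max`, `pattern`, `enum`) are all hypothetical; the keys loosely follow the constraint kinds listed in the PR notes, not actual field names from the schema.

```python
import re
from typing import Any, Dict, List

# Hypothetical registry: constraints are defined once per installed workflow,
# not sent by the client with every run.
WORKFLOW_CONSTRAINTS: Dict[str, Dict[str, Dict[str, Any]]] = {
    "example-workflow": {
        "threads": {"min": 1, "max": 64},
        "sample_id": {"pattern": r"[A-Za-z0-9_]+"},
        "mode": {"enum": ["fast", "sensitive"]},
    }
}


def validate(workflow_id: str, params: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable violations (empty if all values pass)."""
    errors = []
    for name, rules in WORKFLOW_CONSTRAINTS.get(workflow_id, {}).items():
        if name not in params:
            continue  # Presence checks are out of scope for this sketch.
        value = params[name]
        if "min" in rules and value < rules["min"]:
            errors.append(f"{name}: {value} is below minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{name}: {value} is above maximum {rules['max']}")
        if "pattern" in rules and not re.fullmatch(rules["pattern"], str(value)):
            errors.append(f"{name}: {value!r} does not match {rules['pattern']}")
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name}: {value!r} is not one of {rules['enum']}")
    return errors
```

With this split, a run request only carries values, and every client of the same workflow is checked against the same rules.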
REQUIRED

    - The workflow CWL or WDL document. When `workflow_attachments` is used to attach files, the `workflow_url` may be a relative path to one of the attachments.
    + The primary workflow document. When `workflow_attachments` is used to attach files, the `workflow_url` may be a relative path to one of the attachments that is the primary workflow to be executed.
The description does not limit the use case at the moment ("may be relative"), but it also does not reflect the full flexibility of the approach, which includes at least the following:
- files uploaded as attachments (e.g., `file:main.nf`)
- TRS URIs to files downloaded from a TRS server (e.g., `trs://server:port/path/to/package`)
- files in a globally shared (at least for the user group) central workflow installation (e.g., `file:/workflows/AwesomeWorkflow/main.nf`).

You might want to cover more of these use cases here -- also because this YAML is the main documentation, and a more inclusive explanation may help implementers understand the standard better.
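As a rough illustration of how an implementation could tell these `workflow_url` forms apart, here is a small sketch based on the URL scheme. The function name and the returned labels (`attachment`, `trs`, `installed`, `remote`) are made up for this example, and the rule that a relative path means "attachment" while an absolute one means "server-side installation" is an assumption drawn from the examples in this comment.

```python
from urllib.parse import urlparse


def classify_workflow_url(workflow_url: str) -> str:
    """Roughly classify a workflow_url by its form (labels are illustrative)."""
    parsed = urlparse(workflow_url)
    if parsed.scheme == "trs":
        return "trs"
    if parsed.scheme in ("", "file"):
        # Absolute paths point at a server-side installation,
        # relative paths at an uploaded attachment.
        return "installed" if parsed.path.startswith("/") else "attachment"
    return "remote"  # any other scheme, e.g. https://
```

A description that names each of these cases explicitly would let implementers map them to distinct resolution strategies instead of guessing from the "may be a relative path" wording.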
type: array
items:
type: string
description: An array of one or more acceptable workflow engines. Since a server may support multiple engines and version, the recommendation is to encode the workflow_engine_version array as `<workflow_engine>_<version>` where `workflow_engine` values match this array for clarity.
Suggested change:

    - description: An array of one or more acceptable workflow engines. Since a server may support multiple engines and version, the recommendation is to encode the workflow_engine_version array as `<workflow_engine>_<version>` where `workflow_engine` values match this array for clarity.
    + description: An array of one or more acceptable workflow engines. If the server supports multiple engine versions, encode the versions in `workflow_engine_versions`.

- Better not to repeat the documentation of `workflow_engine_version` here, because it distracts from the meaning of `workflow_engine` and is harder to maintain (DRY principle).
- I still think that the reference to `workflow_engine_version` is useful.
description: Named workflow outputs with structured metadata
additionalProperties:
$ref: '#/components/schemas/OutputObject'
OutputObject:
Do I understand correctly that this is either a single value or an array of `OutputObject`? So this would be a recursive definition, right? This looks quite powerful. Nice!
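If the definition is indeed recursive, a client would typically consume it with a recursive walk. The sketch below collects the locations of all files and directories in a nested output. It assumes that the discriminator field is called `class` and that array elements are nested `OutputObject`s stored under `value`; neither detail is confirmed by the hunk shown here.

```python
from typing import Any, Dict, Iterator


def iter_locations(output: Dict[str, Any]) -> Iterator[str]:
    """Yield the location of every File/Directory in a (possibly nested) output."""
    kind = output.get("class")
    if kind in ("File", "Directory"):
        yield output["location"]
    elif kind == "Array":
        # Assumption: array elements are nested OutputObjects stored under "value".
        for item in output.get("value", []):
            yield from iter_locations(item)
    # Primitive classes (String, Integer, Float, Boolean) carry no location.
```

The same pattern would extend to `secondary_files` if those turn out to be `OutputObject`s as well.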
description: File integrity hash in format 'algorithm:hash' (e.g., 'sha256:abc123...')
format:
type: string
description: MIME type or format identifier (e.g., EDAM ontology reference)
Maybe it would be useful to directly fix, or at least suggest, certain aspects of the `format`. Maybe the best solution is to at least suggest something like URIs for terms, e.g. MIME types (https://www.iana.org/assignments/media-types/media-types.xhtml), and to allow multiple terms separated by commas -- or to make this field a `List[String]` field directly.
In the extreme, you could define this as being of type `List[URI]`.
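To show what the two encodings would mean for a consumer, here is a small sketch that normalizes either a comma-separated string or a list into a list of terms. The helper name `normalize_format` is hypothetical, and the example values are plain MIME types chosen for illustration.

```python
from typing import List, Union


def normalize_format(value: Union[str, List[str], None]) -> List[str]:
    """Return the format terms as a list, whichever encoding was used."""
    if value is None:
        return []
    if isinstance(value, str):
        # Comma-separated string variant.
        return [term.strip() for term in value.split(",") if term.strip()]
    return list(value)
```

Declaring the field as a list in the schema would make this kind of client-side splitting unnecessary, which is one argument for the `List[String]` variant.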
description: Output value (type depends on class field)
location:
type: string
description: Absolute path or URL to the output file/directory (for File/Directory class)
Why absolute path? Why a path at all? Should this not better be a URI? Consider a WES that makes final files available via S3 buckets.
I guess this should NOT include the basename? Or should it?
description:
type: string
description: Human-readable description of the output
secondary_files:
I think the WES implementer or administrator will often not be able to decide what is primary or secondary. There may be many workflows, and a WES instance may even promiscuously allow running arbitrary workflows downloaded from a TRS, right? Therefore, to be useful, this field will usually rely on the output of the workflow.
Therefore, I would question whether this adds much value to the API. At most it will be used to pass through information that is available anyway in some workflow output file. But then the client/data user can access the workflow result files and obtain the information from there.
BTW: There may even be cases where the workflow implementer's idea of what constitutes a primary or secondary file does not match what the user thinks.
description: JSON-encoded universal workflow parameters (see RunRequest for how to encode)
workflow_type:
type: string
description: Workflow descriptor type must be "CWL", "WDL", "Nextflow", or "Snakemake" currently (or another alternative supported by this WES instance, see service-info)
TRS (from 2.0.1 on) is using:
- CWL
- WDL
- NFL
- Galaxy
- SMK
We should stick to that as well. See discussion #173
Probably should be defined as an enum lower down
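For illustration, this is roughly what an enum over the TRS-style values listed earlier in this thread looks like from a consumer's point of view. The class and function names are hypothetical, and the alias table mapping the long names used in this PR onto the short forms is an assumption, not something either spec defines.

```python
from enum import Enum


class WorkflowType(str, Enum):
    """Descriptor types as listed earlier in this thread."""
    CWL = "CWL"
    WDL = "WDL"
    NFL = "NFL"
    GALAXY = "Galaxy"
    SMK = "SMK"


# Assumption: the long names proposed in this PR map onto the short forms like this.
ALIASES = {"Nextflow": WorkflowType.NFL, "Snakemake": WorkflowType.SMK}


def parse_workflow_type(raw: str) -> WorkflowType:
    """Accept a TRS-style value or one of the long names used in this PR."""
    if raw in ALIASES:
        return ALIASES[raw]
    return WorkflowType(raw)  # raises ValueError for unknown types
```

A closed enum conflicts with the "or another alternative supported by this WES instance" wording in the description, so that sentence would need to be revisited if the enum is adopted.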
ServiceInfo:
title: ServiceInfo
allOf:
- $ref: 'https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/v1.0.0/service-info.yaml#/components/schemas/Service'
Not a big fan of the fact that a part of `service_info` is stored somewhere else. Error prone.
Unified parameter format that can be converted to workflow-language-specific format.
If provided, takes precedence over workflow_params. WES implementations should
convert these to the appropriate native format for the specified workflow_type.
Please don’t get me wrong: I’m not opposed to using `workflow_unified_params`. However, its impact is limited to a relatively small subset of parameters, primarily those related to compute resources (e.g., `cores`, `memory`). Other parameters, such as `repeat`, require special handling since their behavior varies depending on the engine.
As @vinjana suggested in our discussion, we might consider introducing a parameter property like `wes_version` to ensure backward compatibility for these unified parameters.
Is `workflow_unified_params` not related to `workflow_params` rather than `workflow_engine_params`? At least the comments suggest that `workflow_params` and `workflow_unified_params` belong together.
But I understand the confusion, because for workflow engine parameters, too, one may think of standardizing them in a similar way, with ontology terms, etc.
workflow_engine_parameters:
type: object
additionalProperties:
type: string
description is missing.
JSON-encoded engine-specific parameters
description: JSON-encoded engine-specific parameters (see RunRequest for how to encode)
workflow_url:
type: string
description: The workflow document. When workflow_attachment is used to attach files, the workflow_url may be a relative path to one of the attachments.
Suggested change:

    - description: The workflow document. When workflow_attachment is used to attach files, the workflow_url may be a relative path to one of the attachments.
    + description: The path to the workflow document. When workflow_attachment is used to attach files, the workflow_url may be a relative path to one of the attachments.
description: ''
required: false
description: >-
Files to be staged for workflow execution. You set the filename/path using the Content-Disposition header in the multipart form submission. For example 'Content-Disposition: form-data; name="workflow_attachment"; filename="workflows/helper.wdl"'. The files are staged to a temporary directory and can be referenced by relative path in the workflow_url.
Looking in the RFC for filenames, it does seem like it is strongly encouraged to NOT include folder structures in the `filename` param.
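Whichever way the RFC question is settled, a server that accepts paths in the `filename` parameter would need to check them before staging. A minimal sketch follows, assuming POSIX-style separators only; the function name is hypothetical and backslash-separated paths are deliberately not handled.

```python
from pathlib import PurePosixPath


def is_safe_attachment_path(filename: str) -> bool:
    """True if filename is a relative path that stays inside the staging directory."""
    path = PurePosixPath(filename)
    if filename == "" or path.is_absolute():
        return False
    # Reject any attempt to climb out of the staging directory.
    return ".." not in path.parts
```

With this check, `workflows/helper.wdl` from the description above would be accepted, while absolute paths and paths containing `..` would be rejected.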
Unified parameter format that can be converted to workflow-language-specific format.
If provided, takes precedence over workflow_params. WES implementations should
convert these to the appropriate native format for the specified workflow_type.
properties:
These should probably be moved into a separate object: WorkflowUnifiedParams
Overview
This pull request updates the Workflow Execution Service (WES) OpenAPI specification to enhance functionality in several key ways without creating breaking changes. Key changes include cleanup of our documentation, making workflow types and engines consistent in our requests/responses, creating a universal workflow parameterization passing structure (vs. relying on workflow engine specific formats), and creating a universal output format to collect outputs in a structured way regardless of workflow engine. Claude Code was used to generate some of these changes.
eLwazi-hosted GA4GH Hackathon
The eLwazi hosted GA4GH hackathon 7/28-8/1 is working on this issue given the need by various groups attending the session. For more info, see the agenda.
Built Documentation
The human-readable documentation: https://ga4gh.github.io/workflow-execution-service-schemas/preview/feature/issue-176-wes-params/docs/index.html
More detailed description
Issues/questions for discussion
Specification Updates generated by Copilot:
Version and Logo Updates:
- `1.1.0` to `1.2.0`.

Workflow Types and Engines:
- (`Nextflow`, `Snakemake`) in addition to `CWL` and `WDL`.
- `workflow_engine` and `workflow_engine_versions` properties to specify supported workflow engines and their versions. [1] [2]

Parameterization Enhancements:
- `workflow_unified_params` for universal parameter format, enabling generic parameterization across workflow types.
- `workflow_params` and `workflow_unified_params` fields, clarifying their use cases.

Documentation Improvements:

Endpoint Descriptions:
- `RunWorkflow` endpoint, detailing supported content types (`application/json`, `multipart/form-data`) and file handling mechanisms.
- `GetServiceInfo` endpoint description to include workflow engines and additional service information.

Tags and Groups:
- (`workflowoutputs_model`, `outputobject_model`) and grouped them under `x-tagGroups` for better organization. [1] [2]

Schema Updates:

New Properties:
- `workflow_type` and `workflow_engine` objects to the schema for defining supported types and engines. [1] [2]

Clarifications:
These updates make the WES API more versatile, user-friendly, and aligned with emerging standards in workflow execution.