Data Access Credentials #227
Conversation
1. Modified `workflow_params` description to indicate it's required unless `workflow_unified_params` is provided
2. Added `workflow_unified_params` field with the hybrid structure we discussed
3. Enhanced file metadata support with optional fields for size, checksum, secondary files, format, and modification time
4. Added comprehensive validation constraints for different parameter types
5. Validated the schema - no OpenAPI validation issues detected

Key features of the implementation:
- Version field for format evolution (default: 1.0)
- Rich file metadata (size, checksum, secondary_files, format, last_modified)
- Comprehensive validation constraints (min/max, length, pattern, enum, array limits)
- Type-safe parameter definitions with clear enums
- Backward compatibility - existing `workflow_params` still works
- Precedence handling - `workflow_unified_params` takes precedence when provided
Key improvements made:

1. Dual content type support
   - `application/json` (recommended): uses the proper RunRequest model object
   - `multipart/form-data` (legacy): maintains backward compatibility for file uploads
2. Proper model usage
   - JSON requests now use `$ref: '#/components/schemas/RunRequest'`
   - Leverages all the rich typing and validation from the RunRequest schema
   - Supports both `workflow_params` and `workflow_unified_params`
3. Enhanced documentation
   - Clear guidance on when to use each content type
   - Explains file handling differences between formats
   - Documents the new unified parameter format
   - Security considerations for file uploads
4. Better developer experience
   - OpenAPI tooling can generate proper client code for JSON requests
   - Type safety with structured objects instead of string parsing
   - Validation happens automatically with the model schema
   - Consistency across the API

Usage examples:

Preferred JSON format:

```
POST /runs
Content-Type: application/json

{
  "workflow_type": "CWL",
  "workflow_type_version": "v1.0",
  "workflow_url": "https://example.com/workflow.cwl",
  "workflow_unified_params": {
    "version": "1.0",
    "parameters": {
      "input_file": {
        "type": "File",
        "value": "gs://bucket/input.fastq",
        "file_metadata": {
          "size": 1073741824,
          "checksum": "sha256:abc123..."
        }
      }
    }
  }
}
```

Legacy multipart format (when file uploads are needed):

```
POST /runs
Content-Type: multipart/form-data

workflow_type: CWL
workflow_unified_params: {"version":"1.0","parameters":{...}}
workflow_attachment: [binary file data]
```
1. Updated RunLog schema - added `structured_outputs` field alongside the existing `outputs`
2. Added WorkflowOutputs schema - main container for structured outputs with version and metadata
3. Added OutputObject schema - flexible output type supporting Files, Directories, Arrays, and primitives
4. Added documentation tags - both schemas appear in the Models section of the API docs

Key features implemented:

WorkflowOutputs schema:
- Version field for format evolution
- Named outputs with rich metadata
- Workflow-level metadata (execution ID, timing, resource usage)
- Provenance tracking (engine, version, status)

OutputObject schema:
- Type system - File, Directory, Array, String, Integer, Float, Boolean
- File metadata - location, size, checksum, format, basename
- Provenance - source task, command, creation time
- Secondary files - associated files like indexes
- Array support - collections of outputs
- Content embedding - small file contents can be included

Backward compatibility:
- Existing `outputs` field remains unchanged (marked as "legacy format")
- `structured_outputs` is optional - implementations can provide either or both
- No breaking changes to existing API consumers
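To make this concrete, a GetRunLog response fragment using the new field might look like the sketch below. The property names inside the WorkflowOutputs/OutputObject structures are illustrative, based on the feature list above rather than the schema text itself:

```
{
  "structured_outputs": {
    "version": "1.0",
    "outputs": {
      "aligned_bam": {
        "type": "File",
        "location": "gs://bucket/results/sample1.bam",
        "basename": "sample1.bam",
        "size": 53687091200,
        "checksum": "sha256:def456...",
        "secondary_files": [
          { "type": "File", "location": "gs://bucket/results/sample1.bam.bai" }
        ]
      },
      "mean_coverage": { "type": "Float", "value": 31.7 }
    },
    "metadata": { "engine": "cromwell", "engine_version": "86", "status": "COMPLETE" }
  }
}
```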
Pull Request Overview
This PR enhances the Workflow Execution Service (WES) OpenAPI specification to version 1.2.0, adding GA4GH Passport authentication support and data access credential management capabilities. The changes maintain backward compatibility while enabling workflow engines to securely access input data and provide credentials for output access.
- Adds POST endpoints with GA4GH Passport authentication support alongside existing bearer token authentication
- Introduces structured credential management for input and output data access via DRS and other services
- Implements unified parameter format for workflow language-agnostic parameter passing
- Expands workflow engine support beyond CWL/WDL to include Nextflow and Snakemake
Comments suppressed due to low confidence (1)
openapi/workflow_execution_service.openapi.yaml:7
- The updated logo URL path may not exist. The original path used 'ga4gh-theme' while the new path uses 'ga4gh/dist/assets/svg/logos/'. Verify this URL is accessible and correct.
url: 'https://www.ga4gh.org/wp-content/themes/ga4gh/dist/assets/svg/logos/logo-full-color.svg'
```yaml
x-in: body
bearerFormat: JWT
description:
  A valid GA4GH Passport must be passed in the body of an HTTP POST request as a passports[] array.
```
Copilot AI (Jul 31, 2025)
The 'x-in: body' extension in PassportAuth security scheme is non-standard OpenAPI. Standard security schemes don't support body-based authentication. This could cause issues with code generation tools and API documentation renderers.
Suggested change:

```diff
-x-in: body
 bearerFormat: JWT
 description:
-  A valid GA4GH Passport must be passed in the body of an HTTP POST request as a passports[] array.
+  A valid GA4GH Passport must be included in the request body as a passports[] array for relevant endpoints.
```
```yaml
    OPTIONAL
    The workflow run parameterizations (JSON encoded), including input and output file locations.
    Either `workflow_params` or `workflow_unified_params` must be provided; `workflow_unified_params` takes precedence.
workflow_unified_params:
  type: object
  description: >-
    OPTIONAL
    The workflow run parameterizations (JSON encoded), including input and output file locations.
    Unified parameter format that can be converted to a workflow-language-specific format.
    If provided, takes precedence over workflow_params. WES implementations should
    convert these to the appropriate native format for the specified workflow_type.
```
Copilot AI (Jul 31, 2025)
The workflow_params field description states it's 'OPTIONAL' but then says 'Either workflow_params or workflow_unified_params must be provided'. This creates confusion about whether the field is truly optional. Consider clarifying that one of the two parameter fields is required.
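As an illustration of the conversion behavior described in `workflow_unified_params` above, a unified payload such as the following (a sketch with placeholder values):

```
{
  "version": "1.0",
  "parameters": {
    "input_file": { "type": "File", "value": "gs://bucket/input.fastq" },
    "threads": { "type": "Integer", "value": 8 }
  }
}
```

might be translated by a WES implementation into a native CWL job object before invoking the engine (the exact mapping is implementation-defined):

```
{
  "input_file": { "class": "File", "path": "gs://bucket/input.fastq" },
  "threads": 8
}
```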
```yaml
type: array
items:
  type: string
description: an array of one or more acceptable types for the `workflow_type`, must be "CWL", "WDL", "Nextflow", or "Snakemake" currently (or another alternative supported by this WES instance, see service-info)
```
Copilot AI (Jul 31, 2025)
There's inconsistent capitalization of 'Snakemake': in line 13 it's written as 'Snakemake', but in line 1380 and elsewhere it appears as 'SnakeMake'. The correct spelling is 'Snakemake' (not 'SnakeMake').
```yaml
required: false
description: >-
  Files to be staged for workflow execution. You set the filename/path using the Content-Disposition header in the multipart form submission. For example 'Content-Disposition: form-data; name="workflow_attachment"; filename="workflows/helper.wdl"'. The files are staged to a temporary directory and can be referenced by relative path in the workflow_url.
required:
```
Copilot AI (Jul 31, 2025)
The 'required' array for multipart/form-data schema only includes workflow_type, workflow_type_version, and workflow_url, but doesn't include the passport field when using passport authentication. This could allow requests without proper authentication to be considered valid.
Suggested change:

```diff
 required:
+  - passport
```
Thanks a lot for this effort, @briandoconnor! But I need to be honest here: I always had a strong dislike for the pattern of passing Passports in the request body.

Apart from the security concerns which you already mentioned (sooner or later, someone will end up logging the request - or, in this case, also the response - bodies), I think using POST for what are semantically GET requests breaks REST semantics. I guess there are cases where breaking REST semantics (and implications following from that, e.g., with respect to caching and idempotency) is an acceptable tradeoff. But in the case of passing around Passports, it just seems like a bad design decision on top of a bad design decision to me: ever since the April Connect in Switzerland, it was abundantly discussed and, IIRC, predominantly agreed that a generic mechanism by which assertions/visas could be provided back to a clearing house upon presentation of a small Passport would constitute a cleaner design for Passports themselves, as well as for any APIs consuming them. Have sentiments changed since? If so, do you know what's the basis for the change of heart?

Besides, I think if credentials need to be passed around in the cloud, the state of the art seems to be the use of secret stores. In particular, HashiCorp Vault's API is open sourced and is a de facto industry standard. Wouldn't it be better to explore an approach like that, which could be reused for all such cases (data access, compute access, tool access credentials in DRS, TRS, WES, TES etc.), rather than complicating schemas and duplicating endpoints across a whole bunch of API products individually?
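For reference, the secret-store pattern mentioned here might look like the following sketch against Vault's KV v2 HTTP API; the host, mount path, and secret names are hypothetical:

```
POST /v1/secret/data/wes/runs/run-12345
Host: vault.example.org
X-Vault-Token: <vault_token>
Content-Type: application/json

{ "data": { "drs_token": "<short-lived-data-access-token>" } }
```

```
GET /v1/secret/data/wes/runs/run-12345
Host: vault.example.org
X-Vault-Token: <vault_token>
```

The WES server would read the credential back from the store when needed, rather than receiving it in (and risking it being logged from) a request or response body.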
Sorry for interfering, but I don't think it is a wise decision to adopt a "de facto industry standard" that is owned by a company for an open standard like the GA4GH standards are. There are numerous cases where such companies arbitrarily changed their license (Akka, Anaconda, and you can probably find more) and suddenly users had to adapt. A standard should be grounded on a stronger foundation than the goodwill of a company. Sure, in both mentioned examples there were workarounds (Akka -> Pekko, Anaconda -> Mamba & Co.), but that does not mean they were trivial (the Anaconda licensing issue forced us to do lots of tests to ensure our workflows were backwards compatible). While such a change may be trivial for a standard -- just write "use the new fork of the old tool" -- the situation may be very different for the implementers and the operating institutions.

The HashiCorp Vault license says, in effect, that production use is allowed only so long as it does not compete with HashiCorp's offerings. This sounds vague and arbitrary. If HashiCorp decides to make their money in a different way that overlaps with the use of their tool in GA4GH, we have to change the standard. This is the current license, so I interpret that as meaning that even for the current Vault version a normal fork, like with Akka -> Pekko, would not even be a legally solid option. I'm not a lawyer, sure, so I could be completely wrong.
You are not interfering, @vinjana - every voice is important! I am not generally disagreeing with your important point, but I would like to clarify two points:
I would also like to note that there were previous discussions within the GA4GH community around potentially recommending existing standards in favor of drafting new ones, in order to avoid this situation. Personally, I think that makes a lot of sense, especially in areas that are not intrinsically related to life sciences, genomics, healthcare etc. AFAIK, no official decision has been made or policy drafted to define rules around recommending external standards, e.g., with respect to the licensing terms, governance, revising recommendations etc. In this regard, the Vault API may be as good an example as we might get (at least in the Cloud WS) as a driver for such a discussion. FWIW, I wholeheartedly agree with you that we should probably not recommend an API spec or other standard that is not published under an OSI-approved license.
I can understand the challenge that this PR is trying to solve; however, I fall in line with @uniqueg's opinion here. I think breaking REST semantics is essentially a bad idea that will make the API harder to implement, but also much less intuitive for the end user. In particular, I think making endpoints polymorphic makes writing clients in certain languages harder, and also makes machine interoperability harder IMO. If we really want to support POST for these kinds of operations, then we may want to propose a totally separate RPC-based API where these types of semantics are more accepted.
Additionally, if Passports are simply not usable within the context of a RESTful API service, it feels like we should be rethinking Passports (similar to what @uniqueg mentioned as well), instead of propagating a bad design paradigm. Internally we use token exchange extensively, which dramatically reduces the size of the tokens that we pass around. Additionally, I know clouds have essentially the same thing; they typically call it something like "Workload Identity Federation". These are not vendor-specific solutions but largely use token exchange under the hood.

I would highly urge us to evaluate standards that are being developed by the broader software industry as they become available. Passport was initially created before things like token exchange were adopted, but maybe we should revisit (not to use necessarily, just for inspiration).
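For concreteness, the token-exchange pattern referenced here is standardized as OAuth 2.0 Token Exchange (RFC 8693); a sketch of such a request, with a hypothetical authorization server:

```
POST /oauth/token
Host: auth.example.org
Content-Type: application/x-www-form-urlencoded

grant_type=urn:ietf:params:oauth:grant-type:token-exchange
&subject_token=<large_passport_or_upstream_token>
&subject_token_type=urn:ietf:params:oauth:token-type:jwt
&audience=https://wes.example.org
&requested_token_type=urn:ietf:params:oauth:token-type:access_token
```

The response contains a short, narrowly scoped access token for the target audience, which is what gets passed around instead of the full credential.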
Fundamentally, this PR is trying to grant the engine access to data from a federated source using a Passport or some other access credential. I think that is a hard problem, but maybe for a different reason.
Unlike TES, which is low level and can actually define how a single task interacts with underlying data (i.e., it can write the command to use the credential), WES is a high-level abstraction that in almost all cases sits in front of an engine. Any functionality we specify for WES needs to fall into one of two categories of support:
- Something the engines are willing to adopt
- Something the WES server can do itself before handing off to the engine
Data access is a hard one. I have done it in a way similar to what is proposed here, in a WES service. Essentially you end up needing to copy the data to a bucket from within your server, or have some other workflow / batch job do it. While this is technically doable (I did it), it was slow, error prone, and caused data duplication. I could not read the data in place once in the workflow.

On the flip side, we could not actually do this at the engine level, because none of the engines supported this type of resolution and we are not engine authors ourselves. Additionally, most engines (at least for WDL) do not have a way of handling secrets, which basically means passing sensitive information into the workflow itself should be avoided.

I don't think this means we cannot do it. BUT I do think we are often overlooking the key element here... the engine. A lot of WES functionality requires the feature set provided by the engines.
```yaml
structured_outputs:
  $ref: '#/components/schemas/WorkflowOutputs'
  description: Structured workflow outputs with rich metadata and type safety
output_credentials:
```
I do not think we should be embedding output credentials in the outputs; ideally we should NEVER have to return credentials, but at most use something like a signed URL.

There are a few concerns here:
- It is much less secure and exposes access credentials. I think returning DRS IDs and then having the caller resolve each individually is moderately better, but even that pattern is not ideal
- The engine may not be able to generate output credentials
- Output credentials can be difficult to produce or lock down in scope, i.e., a passport, bearer token, or API key could allow the user to access much more than JUST the outputs of the current run. A lot of DRS APIs return signed URLs, but these would be expensive to produce for a large workflow
Overview
This pull request updates the Workflow Execution Service (WES) OpenAPI specification (starting from #226) to enhance functionality in several key ways without (hopefully) creating breaking changes. Key changes include: 1) adding POST endpoints to support authenticating/authorizing with a GA4GH Passport rather than a bearer token, 2) providing credentials for a WES server to access DRS (or other) inputs, and 3) passing back credentials so the outputs of the workflow run are accessible. Claude Code was used to generate some of these changes.
Related issues
- `workflow_params_drs_headers` field to `RunWorkflow` endpoint (#131)

Related Standards
This implementation aligns with:
- GA4GH Passport and DRS data access patterns
- existing WES behavior, for backward compatibility
Related PR
The feature branch behind this PR is based on `feature/issue-176-wes-params` and not `develop`, since I was building off of the structured input/output work. See #226.

eLwazi-hosted GA4GH Hackathon
The eLwazi-hosted GA4GH hackathon (7/28-8/1) is working on this issue, given the need expressed by various groups attending the session. For more info, see the agenda.
Built Documentation
The human-readable documentation: https://ga4gh.github.io/workflow-execution-service-schemas/preview/feature/issue-18-data-access-creds/docs/index.html
Issues/questions for discussion
- Should we have a `POST /runs/list` endpoint instead of overloading `POST /runs` based on what request body schema is used?
- A "not supported" response... useful for the POST endpoints on existing paths.
Authentication/Authorization for WES Requests
I followed the same pattern here as used in DRS, making POST methods for existing endpoints that now support including a passport array in the body. These mirror the existing GET endpoints that use a bearer token.
Making Requests
Here's an example showing a bearer token request vs. the Passport-in-body request (a sketch; paths and token values are illustrative):
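Bearer token (existing pattern):

```
GET /ga4gh/wes/v1/runs
Authorization: Bearer <access_token>
```

Passport in the body (new POST pattern; the body carries the `passports[]` array described in the PassportAuth scheme):

```
POST /ga4gh/wes/v1/runs
Content-Type: application/json

{
  "passports": ["<ga4gh_passport_jwt>"]
}
```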
POST and /runs endpoint
This approach makes the `/runs` endpoint complicated. It feels overloaded since it's doing multiple things:
- `GET /runs` with a bearer token: it's a list of runs
- `POST /runs` with a bearer token (or Passport in the body) and either an `application/json` body schema of type `PassportRunRequest` or `multipart/form-data`: you get a new workflow run
- `POST /runs` with a passport in the body using an `application/json` body schema of type `RunListRequest`: then you get a list of runs

So it might be more clear if we have a `POST /runs/list` endpoint instead... so `/runs` is less overloaded.

Sending Credentials
In addition to adding support for Passports in WES requests (as an alternative to bearer tokens), this PR includes syntax extensions to allow for providing credentials to access inputs and pass back credentials so callers can access outputs.
Data Inputs
Here's how a client would structure a workflow submission:
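A sketch of such a submission follows; the `input_credentials` field name and the per-host credential shape are illustrative placeholders, not verbatim from the schema in this PR:

```
POST /ga4gh/wes/v1/runs
Content-Type: application/json

{
  "workflow_type": "WDL",
  "workflow_type_version": "1.0",
  "workflow_url": "https://example.com/align.wdl",
  "workflow_unified_params": {
    "version": "1.0",
    "parameters": {
      "input_reads": { "type": "File", "value": "drs://drs.example.org/abc123" }
    }
  },
  "input_credentials": {
    "drs.example.org": { "type": "bearer_token", "token": "<short-lived-token>" }
  }
}
```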
Data Outputs
Now GetRunLog responses can include credentials for accessing outputs:
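A sketch of such a response; `output_credentials` comes from this PR's RunLog changes, but the credential object shape shown here is illustrative:

```
{
  "run_id": "run-12345",
  "state": "COMPLETE",
  "outputs": { "aligned_bam": "gs://bucket/results/sample1.bam" },
  "output_credentials": {
    "gs://bucket/results/": {
      "type": "signed_url",
      "url": "https://storage.example.org/bucket/results/sample1.bam?sig=...",
      "expires": "2025-08-01T00:00:00Z"
    }
  }
}
```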
Security Considerations
Returned credentials allow callers to access the outputs. We need to really think about how this works for Passports, since typically these are issued through an OIDC flow and not just handed around. API keys and bearer tokens are a little easier, since it would be possible to issue temporary/limited-scope tokens.
Implementation Notes
- ... passport credentials
- ... failures.