
Conversation


@briandoconnor commented Aug 1, 2025

Overview

This pull request updates the Data Repository Service (DRS) OpenAPI specification to enhance functionality in several key ways without (hopefully) creating breaking changes. Key changes include 1) the ability to create new objects, 2) a mechanism for callers to identify write-back locations (e.g. clouds+regions, on-premise systems), and 3) a mechanism for callers to see which locations they are authorized to write to. Claude Code was used to generate some of these changes.

Related issues

Related Standards

This implementation aligns with:

eLwazi-hosted GA4GH Hackathon

The eLwazi-hosted GA4GH hackathon (7/28-8/1) is working on this issue, given the need expressed by various groups attending the session. For more info, see the agenda.

Built Documentation

The human-readable documentation: https://ga4gh.github.io/data-repository-service-schemas/preview/feature/issue-415-write-support/docs/

Issues/questions for discussion

  • do we want bulk options here as well?
  • how would a URL work for write-back to an on-premise solution? I guess the DRS server implementation would need to handle this?
  • does the upload URL support more advanced upload techniques like multi-threading?
  • need to think about a shared-filesystem option in addition to the upload URL option, for systems that might want to use that approach
  • do we need a cleaner way to say DRS write support is optional, so that valid 1.6 implementations could completely lack upload endpoints? Or do we want to rely on error codes like 501 Not Implemented?

Key Benefits

  • Multi-Location Support: Users can upload to multiple cloud regions/providers
  • Authorization Aware: Check permissions before attempting uploads
  • Efficient URL Management: Request upload URLs on-demand to avoid expiration
  • Flexible Replication: Upload to additional locations before finalizing
  • Resource Management: Quota tracking per location
  • Discovery: Service capabilities clearly advertised

Workflow Examples

Simple single-location upload (see the sketch after the steps):

  1. Check service-info → see available locations
  2. POST /objects with target_storage_location
  3. Upload to provided URL
  4. POST /objects/{id}/finalize
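A minimal sketch of this flow in Python (requests), assuming a hypothetical server URL; the endpoint paths and target_storage_location come from this PR, but the response shapes (storage_locations, upload_url) are assumptions for illustration, not the final spec:

# Sketch of the single-location upload workflow (field names are illustrative).
import requests

BASE = "https://drs.example.org/ga4gh/drs/v1"   # hypothetical DRS server
TOKEN = {"Authorization": "Bearer <token>"}

# 1. Check service-info to see available write locations (assumed response shape).
info = requests.get(f"{BASE}/service-info", headers=TOKEN).json()
location = info["storage_locations"][0]["id"]

# 2. Create the object, requesting a target storage location.
obj = requests.post(f"{BASE}/objects", headers=TOKEN, json={
    "name": "my_data.fastq",
    "size": 12345,
    "checksums": [{"checksum": "abc123...", "type": "sha-256"}],
    "target_storage_location": location,
}).json()

# 3. Upload the bytes to the URL the server handed back (assumed pre-signed PUT URL).
with open("my_data.fastq", "rb") as fh:
    requests.put(obj["upload_url"], data=fh)

# 4. Finalize the object so it becomes resolvable via GET /objects/{id}.
requests.post(f"{BASE}/objects/{obj['id']}/finalize", headers=TOKEN)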

Multi-location replication (see the sketch after the steps):

  1. Check write authorizations → see permitted locations
  2. POST /objects with primary location
  3. Upload to primary location
  4. POST /objects/{id}/upload-urls for secondary location
  5. Upload to secondary location
  6. POST /objects/{id}/finalize with both locations
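A hedged sketch of the replication flow under the same assumptions; the /write-authorizations path and the request/response fields are placeholders for whatever the spec finally settles on:

# Sketch of the multi-location replication workflow (field names are illustrative).
import requests

BASE = "https://drs.example.org/ga4gh/drs/v1"   # hypothetical DRS server
TOKEN = {"Authorization": "Bearer <token>"}

# 1. Check write authorizations to see permitted locations (assumed endpoint and shape).
auth = requests.get(f"{BASE}/write-authorizations", headers=TOKEN).json()
primary, secondary = auth["authorized_locations"][:2]

# 2-3. Create the object against the primary location and upload there.
obj = requests.post(f"{BASE}/objects", headers=TOKEN, json={
    "name": "my_data.fastq",
    "size": 12345,
    "checksums": [{"checksum": "abc123...", "type": "sha-256"}],
    "target_storage_location": primary["id"],
}).json()
with open("my_data.fastq", "rb") as fh:
    requests.put(obj["upload_url"], data=fh)

# 4-5. Request an upload URL for the secondary location and upload there too.
extra = requests.post(f"{BASE}/objects/{obj['id']}/upload-urls", headers=TOKEN,
                      json={"storage_location": secondary["id"]}).json()
with open("my_data.fastq", "rb") as fh:
    requests.put(extra["upload_url"], data=fh)

# 6. Finalize, listing both locations.
requests.post(f"{BASE}/objects/{obj['id']}/finalize", headers=TOKEN,
              json={"storage_locations": [primary["id"], secondary["id"]]})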

See Ticket for More Information

Ticket #415 has much more information including approaches implemented and what that looks like from a request/response perspective.

@briandoconnor changed the base branch from master to develop on August 1, 2025 at 12:37

briandoconnor commented Aug 11, 2025

@patmagee pointed out that the DELETE endpoint doesn't support passports!?! We need to rethink this; maybe add a POST /drs/object/id/delete endpoint?

From ChatGPT 5:

Yes — the HTTP/1.1 spec does not forbid a DELETE request from having a body (payload), but the semantics are tricky:

  • RFC 7231, Section 4.3.5 (the most relevant modern HTTP/1.1 spec) says:

    A payload within a DELETE request has no defined semantics; sending a payload body on a DELETE request might cause some existing implementations to reject the request.

  • In practice, many servers, proxies, and frameworks ignore a DELETE request body entirely, or even reject it, because historically most DELETE requests have no payload.

  • REST guidelines generally assume that the URI identifies the resource to delete and that any necessary parameters go in the path or query string, not in the body.

Key implications:

  • You can send a payload in a DELETE request from a client (e.g., using curl -X DELETE -d '{...}'), but you can’t rely on it being supported unless you know the server is explicitly designed to read and use it.

  • If you need to send additional instructions (e.g., conditional delete, batch delete), it’s often safer to:

    • Encode parameters in query strings:
      DELETE /items?type=old
    • Or use a different method (like POST to a /delete endpoint) if you must pass complex JSON.

If you want, I can give you examples of DELETE requests with a JSON body and how popular frameworks like Express, Flask, and Spring Boot handle them — since the real-world behavior is inconsistent. Would you like me to do that?
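For concreteness, a hedged sketch of what the two options could look like from a client; the /objects/{id}/delete path and the passports field are hypothetical, not part of the current spec:

# Hypothetical comparison: DELETE with a JSON body vs. a POST "delete" endpoint.
import requests

BASE = "https://drs.example.org/ga4gh/drs/v1"   # hypothetical DRS server
passports = ["eyJhbGciOi..."]                   # GA4GH passport JWTs (truncated)

# Option A: DELETE with a body. Allowed by HTTP, but proxies and frameworks may
# drop or reject the payload, so the passports might never reach the server.
requests.delete(f"{BASE}/objects/abc123", json={"passports": passports})

# Option B: POST to a delete-specific endpoint (hypothetical path). The body is
# reliably delivered, mirroring how DRS already carries passports in POST bodies.
requests.post(f"{BASE}/objects/abc123/delete", json={"passports": passports})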


grsr commented Aug 11, 2025

Hi @briandoconnor - Thanks for the PR, which has prompted me to respond with some details of a different implementation that we've been working on for a use case in the GEL context. To summarise our approach: GEL forms part of the UK NHS Genomic Medicine Service (GMS), and we are exploring DRS as a standard to share genomic files with partners in the GMS. A new use case for us is to enable genomic labs to share genomic data with GEL such that it is then available over a DRS API.

As part of our DRS implementation we have already added support for POST requests on the /objects endpoint, which simply writes a fully constructed DRS object to the database that backs our existing implementation. This solves the metadata upload problem in a straightforward way, but we also wanted to support some means of negotiating where the files themselves should be uploaded (for us this is currently a GEL-managed AWS S3 bucket) that supports multiple cloud storage suppliers and potentially on-prem systems as well. To this end we have implemented a separate /upload-request endpoint where a client POSTs a payload that looks like:

{
  "objects": [
    {
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "my FASTQ",
      "aliases": [
        "string"
      ]
    }
  ]
}

indicating that the client would like to upload a file of 12345 bytes with the supplied name and checksum somewhere (objects is an array, so multiple files can be requested at once; this is useful because the server can choose to co-locate related objects, such as a CRAM file and its index, or two FASTQ files from a paired-end sequencing run, though this would be implementation dependent).

The server responds with a payload that looks like:

{
  "objects": {
    "my_data.fastq": {
      "id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
      "self_uri": "string",
      "name": "my_data.fastq",
      "size": 12345,
      "mime_type": "text/fastq",
      "checksums": [
        {
          "checksum": "string",
          "type": "string"
        }
      ],
      "description": "string",
      "aliases": [
        "string"
      ],
      "upload_methods": [
        {
          "type": "s3",
          "access_url": {
            "url": "s3://bucket/some/prefix",
            "headers": [
              "string"
            ]
          },
          "region": "string",
          "credentials": {
            "AccessKeyId": "string",
            "SecretAccessKey": "string",
            "SessionToken": "string"
          }
        },
        {
          "type": "https",
          "access_url": {
            "url": "https://pre.signed.url/?X-Aws...",
            "headers": [
              "string"
            ]
          }
        }
      ]
    }
  }
}

The client can then select their preferred upload_method (intended as the analogue of the existing DRS access_method) from those offered by the server, and upload the data using the details supplied. For general HTTPS support we return an S3 pre-signed URL which can be POSTed to directly, but for large genomic data files we want to be able to take advantage of multi-part uploads and other optimisations already implemented in AWS tools, so we also implement an S3 upload_method where we supply time-limited AWS credentials that allow the client to use native AWS libraries etc. to upload data to S3 in the bucket and under the prefix supplied. This mechanism can naturally be extended to support additional cloud providers, and also additional protocols such as SFTP, which might be one way to upload data to on-prem file systems if they expose an SFTP interface (or indeed an S3 interface, as is increasingly common). Note that the spec for the optional credentials field on an upload_method is simply that it is a JSON dictionary, so the spec for this endpoint is completely neutral with respect to implementation-specific approaches to auth.
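To illustrate how a client might consume such a response, here is a hedged Python sketch; the server URL is hypothetical, the HTTPS branch is shown as a PUT to the pre-signed URL (whether the server issues PUT- or POST-style pre-signed URLs is implementation specific), and the S3 branch uses boto3 with the time-limited credentials:

# Sketch: pick an upload_method from the /upload-request response and upload with it.
import boto3
import requests
from urllib.parse import urlparse

def upload_with_method(method, local_path):
    if method["type"] == "s3":
        # Use the time-limited credentials with native AWS tooling so boto3 can
        # handle multi-part uploads for large files.
        creds = method["credentials"]
        s3 = boto3.client(
            "s3",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
            region_name=method.get("region"),
        )
        target = urlparse(method["access_url"]["url"])   # e.g. s3://bucket/some/prefix
        key = target.path.lstrip("/") + "/" + local_path
        s3.upload_file(local_path, target.netloc, key)
    elif method["type"] == "https":
        # Fall back to the pre-signed URL for a simple single-request upload.
        with open(local_path, "rb") as fh:
            requests.put(method["access_url"]["url"], data=fh)

# Request an upload slot for one file (payload shape as shown above; URL hypothetical).
resp = requests.post("https://drs.example.org/upload-request", json={
    "objects": [{"name": "my_data.fastq", "size": 12345,
                 "checksums": [{"checksum": "abc123...", "type": "sha-256"}]}],
}).json()
methods = resp["objects"]["my_data.fastq"]["upload_methods"]
chosen = next((m for m in methods if m["type"] == "s3"), methods[0])
upload_with_method(chosen, "my_data.fastq")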

Once the data is uploaded using the selected upload_method, the client then writes a full DRS object to the DRS server using a POST request to the /objects endpoint, and we make no further changes to the existing DRS spec. It would also be possible to use a new endpoint to POST the DRS object itself if minimal interference with existing implementations is desirable.

A feature we like about this implementation is that the /upload-request endpoint is completely separate from the rest of the DRS endpoints and so can be entirely optional. We have tried to keep the payloads as close to the eventual DRS object as possible, but this is not a hard requirement, and there is lots of scope to extend the payloads to support negotiating additional constraints, such as the object size limit that you include in your implementation.

Comments on this suggestion are very welcome, and I'm happy to share OpenAPI specs and/or example client and server code from our prototype implementation of this scheme if that would be useful. If so, please let me know whether a separate PR would make sense.

I write this comment without completely reviewing your suggested implementation just so that you're aware of how we're thinking about this, and before I get distracted again ;) I will review your PR in more detail and revert with more thoughts on how we could potentially combine these approaches if that would be useful.
