Description
In the CRDC driver project, and also in BioDataCatalyst, we have a situation where the host of the data would like to provide guidance on how to use the data and where to use it.
In other words, they would like any platform downstream of the DRS server to compute on the data in certain cloud locations, which are usually the same ones the data come from. The reasons for this request vary, from keeping egress costs down to not having the data leave its security level.
Given that in the end DRS hands out a download URL, it would be pretty difficult to enforce this. I therefore suggest we move towards an approach where the host "suggests" the preferred way to access the data, and DRS clients accessing the data honor the request to the best of their ability.
Proposal
The proposal aims to enhance the GA4GH DRS (Data Repository Service) specification by introducing a new field that provides metadata regarding the intended usage and location constraints for data objects. This additional field will allow data providers to specify their preferences and requirements for how the data should be accessed and utilized. The proposed field will offer the following options:
- Cloud Exclusive (cloud_exclusive): the data object is intended for use exclusively within a cloud environment. Users are expected to access and process the data only within cloud computing infrastructure and not outside of it; for example, the data should not be downloaded to somebody's laptop.
- Cloud Provider-Limited (cloud_provider_limited): the data object should not leave the cloud provider's ecosystem. Users are restricted from moving the data to external locations or platforms. It must remain within the boundaries of the specified cloud provider.
- Cloud Region-Limited (cloud_region_limited): the data object is restricted to a specific cloud region. Users are required to access and process the data within the designated region and are prohibited from transferring it to other geographic locations within the cloud provider's infrastructure.
By introducing this new field, data providers and administrators can communicate their data access and usage policies more effectively, ensuring that data is handled in accordance with their specific requirements. This addition not only enhances the flexibility of the DRS specification but also strengthens data governance and compliance for genomic and health-related data in cloud-based environments.
It could look like this:
{
  "id": "string",
  "name": "string",
  "self_uri": "drs://drs.example.org/314159",
  "size": 1024,
  "created_time": "2019-08-24T14:15:22Z",
  "updated_time": "2019-08-24T14:15:22Z",
  "version": "string",
  "mime_type": "application/json",
  "checksums": [
    {
      "checksum": "string",
      "type": "sha-256"
    }
  ],
  "usage_constraints": {
    "access_type": "cloud_exclusive",
    "location_constraints": {
      "cloud_provider": "AWS",
      "cloud_region": "us-west-2"
    }
  }
}
In this structure:
- usage_constraints is a section within the DRS metadata specifically dedicated to describing data usage and location constraints.
- access_type is a mandatory field that specifies how the data should be accessed and used. We can define different values, such as cloud_exclusive, cloud_provider_limited, and cloud_region_limited, to represent the intended usage constraints.
- location_constraints is an optional nested section that provides additional details, depending on the access type. For example, it includes cloud_provider to specify the preferred cloud provider and cloud_region to designate the desired cloud region.
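To make the field rules above concrete, here is a minimal Python sketch of how a server or client might sanity-check a usage_constraints section. The function name validate_usage_constraints is hypothetical, and the rule that the provider-limited and region-limited types need a cloud_provider (and cloud_region) is an assumption for illustration; the proposal only says that location_constraints is optional and depends on the access type.

```python
from typing import List

# The three access_type values named in this proposal.
ALLOWED_ACCESS_TYPES = {
    "cloud_exclusive",
    "cloud_provider_limited",
    "cloud_region_limited",
}


def validate_usage_constraints(drs_object: dict) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    errors: List[str] = []
    constraints = drs_object.get("usage_constraints")
    if constraints is None:
        # Field absent: the host has stated no preference.
        return errors

    access_type = constraints.get("access_type")
    if access_type is None:
        errors.append("access_type is mandatory")
    elif access_type not in ALLOWED_ACCESS_TYPES:
        errors.append(f"unknown access_type: {access_type!r}")

    location = constraints.get("location_constraints") or {}
    # Assumption (not stated in the proposal): the limited types are only
    # meaningful when the matching location detail is present.
    if access_type in ("cloud_provider_limited", "cloud_region_limited"):
        if not location.get("cloud_provider"):
            errors.append(f"{access_type} needs cloud_provider")
    if access_type == "cloud_region_limited" and not location.get("cloud_region"):
        errors.append("cloud_region_limited needs cloud_region")
    return errors
```

An object without usage_constraints is treated as unconstrained here, so existing DRS objects would stay valid.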
This structured metadata allows data providers to clearly communicate their data access and usage policies, ensuring that users are aware of the intended constraints. It also enables data consumers to make informed decisions about how to handle and access the data. The specific values for access_type can be defined in the DRS specification, and they should correspond to the proposed usage policy options. This structure helps promote consistency and interoperability across different implementations of the DRS specification.
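As an illustration of a client honoring the request to the best of its ability, the following Python sketch compares a usage_constraints section with the location the client is running in. The function honors_constraints and its client_provider / client_region parameters are hypothetical, and the choices to compare names case-insensitively and to decline unknown access types are assumptions, not part of the proposal.

```python
from typing import Optional


def _same(a: Optional[str], b: Optional[str]) -> bool:
    """Case-insensitive equality that is False when either side is missing."""
    return a is not None and b is not None and a.lower() == b.lower()


def honors_constraints(
    constraints: Optional[dict],
    client_provider: Optional[str],
    client_region: Optional[str],
) -> bool:
    """Return True if fetching the data from this location respects the hint.

    client_provider is None when the client is not running in any cloud.
    """
    if not constraints:
        return True  # no constraints stated by the host

    access_type = constraints.get("access_type")
    location = constraints.get("location_constraints") or {}
    in_cloud = client_provider is not None

    if access_type == "cloud_exclusive":
        # Any cloud environment is acceptable; a laptop is not.
        return in_cloud
    if access_type == "cloud_provider_limited":
        return _same(location.get("cloud_provider"), client_provider)
    if access_type == "cloud_region_limited":
        return _same(location.get("cloud_provider"), client_provider) and _same(
            location.get("cloud_region"), client_region
        )
    # Assumption: be conservative with values this client does not know.
    return False
```

Because the constraint is advisory, a client could use a False result to warn the user or pick a different compute location rather than hard-fail.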