141 changes: 137 additions & 4 deletions docs/croissant-spec-draft.md
Contributor


In my opinion, the use of sc:DefinedTerm should not be recommended. It covers only part of the DUO terms and logic, so it will be confusing for DUO adopters. On the other hand, the ODRL approach fully covers DUO, and could scale to other data use conditions, such as those of the Data Privacy Vocabulary, with the same mechanism.

Contributor Author


It serves to show how to use any simple term vocabulary in Croissant. I agree that for DUO you will most often need the ODRL approach, but this example is still useful for other vocabularies.

@@ -491,7 +491,7 @@ Other properties from [schema.org/Dataset](http://schema.org/Dataset) or its par

### Modified and Added Properties

Croissant modifies the meaning of one [schema.org](http://schema.org) property, and requires its presence:
Croissant modifies the meaning of one [schema.org](http://schema.org) property, and makes it required:

<table>
<thead>
@@ -507,7 +507,7 @@ Croissant modifies the meaning of one [schema.org](http://schema.org) property,
<a href="#fileset">FileSet</a>
</td>
<td>MANY</td>
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type FileObject or FileSet.</td>
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type <a href="#fileobject">FileObject</a> or <a href="#fileset">FileSet</a>. These are subclasses of <a href="http://schema.org/DataDownload">DataDownload</a>, so this definition is compatible with the original definition of the distribution property in schema.org.</td>
</tr>
</table>
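
The requirement on `distribution` can be checked mechanically. The following is a minimal sketch, assuming the metadata has been loaded into a Python dict and that types are spelled with the `cr:` prefix; the function name `invalid_distributions` and the sample entries are hypothetical.

```python
# Types that Croissant accepts as values of `distribution`.
# Assumption: types are written with the "cr:" prefix in the JSON-LD.
ALLOWED_TYPES = {"cr:FileObject", "cr:FileSet"}

# Hypothetical sample metadata with two valid entries and one invalid entry.
DATASET = {
    "@type": "Dataset",
    "name": "example",
    "distribution": [
        {"@type": "cr:FileObject", "@id": "archive.zip"},
        {"@type": "cr:FileSet", "@id": "image-files"},
        {"@type": "DataDownload", "@id": "legacy-download"},
    ],
}


def invalid_distributions(metadata):
    """Return the `distribution` entries that are neither FileObject nor FileSet."""
    entries = metadata.get("distribution", [])
    if isinstance(entries, dict):  # tolerate a single object instead of a list
        entries = [entries]
    bad = []
    for entry in entries:
        types = entry.get("@type", []) if isinstance(entry, dict) else []
        if isinstance(types, str):
            types = [types]
        if not ALLOWED_TYPES.intersection(types):
            bad.append(entry)
    return bad


BAD = invalid_distributions(DATASET)
print([entry["@id"] for entry in BAD])
```

A plain `DataDownload` is flagged here because the check looks only at the declared `@type` string; it does not reason about subclasses.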

@@ -533,6 +533,15 @@ The Croissant vocabulary also defines the following optional dataset-level attri
<td>A citation to the dataset itself, or a citation for a publication that describes the dataset. Ideally, citations should be expressed using the <a href="https://www.bibtex.org/">BibTeX</a> format.<br>
Note that this is different from <a href="http://schema.org/citation">schema.org/citation</a>, which is used to make a citation to another publication from this dataset.
</td>
</tr>
<tr>
<td>sdVersion</td>
<td>
<a href="http://schema.org/Number">Number</a><br>
<a href="http://schema.org/Text">Text</a>
</td>
<td>ONE</td>
<td>The version of the dataset <i>metadata</i>, which may be distinct from the version of the dataset <i>content</i>. This property is modeled after schema.org's <a href="http://schema.org/sdLicense">sdLicense</a> and <a href="http://schema.org/sdPublisher">sdPublisher</a>, and may move to schema.org in the future.</td>
</tr>
</table>

@@ -2136,9 +2145,133 @@ For example, consider a dataset where each image is labeled by a different human

In this example, the `labeled_images/label` field has an annotation `labeled_images/label/annotator`. The `equivalentProperty` "prov:wasAttributedTo" on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are pulled from the same source file (`annotations.csv`) on a row-by-row basis. The `gender` and `age` fields are mapped to their corresponding FOAF properties, `foaf:gender` and `foaf:age`, via `equivalentProperty`.

### Data Use Restrictions
### Data Use Conditions

Datasets often come with restrictions on how they can be used, particularly in sensitive domains, such as healthcare. Representing these restrictions in a machine-readable format enables automated discovery and compliance checking. For instance, a healthcare dataset might be restricted to non-commercial research use only, or require specific ethics approval.

Data use conditions can be attached to a dataset as a whole, or to part of a dataset, using [sc:usageInfo](http://schema.org/usageInfo), an existing schema.org property.

### Using DUO to Represent Data Use Conditions

The [DUO](http://purl.obolibrary.org/obo/duo.owl) ontology provides a set of terms that can be used to represent data use conditions in a machine-readable format. DUO is prevalent in the healthcare domain. Other vocabularies may be used in other verticals.

To connect with terms from an external vocabulary, Croissant uses the [sc:DefinedTerm](http://schema.org/DefinedTerm) type, which is a schema.org type designed for that purpose.

Here is an example that shows how to use the DUO term [DUO_0000042](http://purl.obolibrary.org/obo/DUO_0000042) to represent the data use condition "General Research Use":

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_"
  },
  "@type": "Dataset",
  "name": "Global Health Imagery Dataset",
  "description": "A dataset of public health imagery for research purposes.",
  "url": "https://example.org/dataset/global-health-1",
  "usageInfo": [
    {
      "@type": "DefinedTerm",
      "name": "General Research Use",
      "termCode": "DUO_0000042",
      "url": "duo:0000042"
    }
  ]
}
```
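
A consumer could read such terms back out of the metadata along the following lines. This is a minimal sketch, not a JSON-LD processor: it expands `prefix:suffix` values by plain string concatenation with the `@context` entry, and the helper names `expand_curie` and `defined_terms` are hypothetical. A conforming JSON-LD library applies its own prefix rules, which this sketch does not try to reproduce.

```python
import json

# Metadata mirroring the example above (trimmed to the relevant fields).
METADATA = json.loads("""
{
  "@context": {
    "@vocab": "https://schema.org/",
    "duo": "http://purl.obolibrary.org/obo/DUO_"
  },
  "@type": "Dataset",
  "name": "Global Health Imagery Dataset",
  "usageInfo": [
    {
      "@type": "DefinedTerm",
      "name": "General Research Use",
      "termCode": "DUO_0000042",
      "url": "duo:0000042"
    }
  ]
}
""")


def expand_curie(value, context):
    """Expand 'prefix:suffix' by concatenating the context entry and the suffix."""
    prefix, sep, suffix = value.partition(":")
    base = context.get(prefix)
    # Leave absolute IRIs such as 'https://...' and unknown prefixes untouched.
    if sep and isinstance(base, str) and not suffix.startswith("//"):
        return base + suffix
    return value


def defined_terms(metadata):
    """Return (name, termCode, expanded url) for each DefinedTerm in usageInfo."""
    entries = metadata.get("usageInfo", [])
    if isinstance(entries, dict):  # tolerate a single object instead of a list
        entries = [entries]
    context = metadata.get("@context", {})
    return [
        (e.get("name"), e.get("termCode"), expand_curie(e.get("url", ""), context))
        for e in entries
        if isinstance(e, dict) and e.get("@type") == "DefinedTerm"
    ]


TERMS = defined_terms(METADATA)
print(TERMS)
```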

### Fine-Grained Control with ODRL

To represent more complex restrictions, such as hierarchical permissions and modifiers, Croissant recommends using [ODRL](https://www.w3.org/TR/odrl-model/), a W3C standard that provides a rich framework for representing permissions and restrictions.

To use ODRL in Croissant, `sc:usageInfo` is used as a container for an `odrl:Offer`, which represents a set of permissions. Within each `odrl:permission`, `odrl:action` identifies the permitted use, and `odrl:constraint` expresses the modifiers that narrow it.

The following example shows how to combine DUO and ODRL to represent a data use policy that allows Health or Medical or Biomedical Use ([DUO_0000006](http://purl.obolibrary.org/obo/DUO_0000006)), but only for non-commercial purposes ([DUO_0000018](http://purl.obolibrary.org/obo/DUO_0000018)):

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@type": "Dataset",
  "name": "Restricted Health Data",

  "usageInfo": {
    "@type": ["CreativeWork", "odrl:Offer"],
    "name": "DUO Usage Policy",

    "odrl:permission": {
      "@type": "odrl:Permission",
      "odrl:action": {
        "@id": "duo:0000006",
        "name": "Health or Medical or Biomedical Use"
      },
      "odrl:constraint": [
        {
          "@type": "odrl:Constraint",
          "name": "Non-commercial use only",
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "duo:0000018" }
        }
      ]
    }
  }
}
```
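
To illustrate how a tool might read this structure, the sketch below walks `usageInfo` and lists each permission's action together with its constraints. It assumes the metadata is already loaded as a Python dict with the same compact keys as the example; the helpers `as_list`, `node_id`, and `summarize_policy` are hypothetical names, and the code does not evaluate the policy in any way.

```python
# Policy mirroring the example above (names and @context omitted for brevity).
DATASET = {
    "@type": "Dataset",
    "name": "Restricted Health Data",
    "usageInfo": {
        "@type": ["CreativeWork", "odrl:Offer"],
        "odrl:permission": {
            "@type": "odrl:Permission",
            "odrl:action": {"@id": "duo:0000006"},
            "odrl:constraint": [
                {
                    "@type": "odrl:Constraint",
                    "odrl:operator": {"@id": "odrl:eq"},
                    "odrl:rightOperand": {"@id": "duo:0000018"},
                }
            ],
        },
    },
}


def as_list(value):
    """JSON-LD allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(node):
    """Return the @id of a node object, or the value itself if it is not a dict."""
    return node.get("@id") if isinstance(node, dict) else node


def summarize_policy(metadata):
    """List each permission's action and (leftOperand, operator, rightOperand) triples."""
    summary = []
    for policy in as_list(metadata.get("usageInfo")):
        if not isinstance(policy, dict):
            continue
        if "odrl:Offer" not in as_list(policy.get("@type")):
            continue
        for perm in as_list(policy.get("odrl:permission")):
            constraints = [
                (
                    node_id(c.get("odrl:leftOperand")),
                    node_id(c.get("odrl:operator")),
                    node_id(c.get("odrl:rightOperand")),
                )
                for c in as_list(perm.get("odrl:constraint"))
            ]
            summary.append(
                {"action": node_id(perm.get("odrl:action")), "constraints": constraints}
            )
    return summary


SUMMARY = summarize_policy(DATASET)
print(SUMMARY)
```

The first element of each constraint triple is `None` when the constraint carries no `odrl:leftOperand`, as is the case in the example.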

### Integration with Domain-Specific Ontologies

In the health domain, it is often necessary to specify that a dataset can only be used for research on a specific disease. DUO recommends using the [MONDO](https://mondo.monarchinitiative.org/) ontology to specify disease-specific restrictions.

The example below shows how to use MONDO in combination with DUO and ODRL to specify that a dataset can only be used for research on Alzheimer's disease ([MONDO_0005070](http://purl.obolibrary.org/obo/MONDO_0005070)).

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_",
    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@type": "Dataset",
  "name": "Restricted Health Data",

  "usageInfo": {
    "@type": ["CreativeWork", "odrl:Offer"],
    "name": "DUO Usage Policy",

    "odrl:permission": {
      "@type": "odrl:Permission",
      "odrl:action": {
        "@id": "duo:0000007",
        "name": "Disease specific research"
      },
      "odrl:constraint": [
        {
          "@type": "odrl:Constraint",
          "name": "Non-commercial use only",
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "duo:0000018" }
        },
        {
          "@type": "odrl:Constraint",
          "odrl:leftOperand": { "@id": "duo:0000010" },
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "mondo:0005070" }
        }
      ]
    }
  }
}
```

TODO: Add guidance on representing data use restrictions.
This approach can be extended to other domain-specific ontologies.
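
As an illustration of the automated compliance checking mentioned at the start of this section, the sketch below extracts the disease terms that a policy like the one above restricts research to, and tests a requested disease against them. It is one possible reading of the constraints, not an ODRL evaluator: it assumes a single `odrl:permission`, treats `duo:0000010` as the operand that marks a disease restriction because that is how the example uses it, and expands prefixes by simple concatenation. The names `expand`, `allowed_diseases`, and `permits_disease` are hypothetical.

```python
# Policy mirroring the disease-specific example above (labels omitted for brevity).
POLICY = {
    "@context": {
        "duo": "http://purl.obolibrary.org/obo/DUO_",
        "mondo": "http://purl.obolibrary.org/obo/MONDO_",
        "odrl": "http://www.w3.org/ns/odrl/2/",
    },
    "usageInfo": {
        "@type": ["CreativeWork", "odrl:Offer"],
        "odrl:permission": {
            "@type": "odrl:Permission",
            "odrl:action": {"@id": "duo:0000007"},
            "odrl:constraint": [
                {
                    "odrl:operator": {"@id": "odrl:eq"},
                    "odrl:rightOperand": {"@id": "duo:0000018"},
                },
                {
                    "odrl:leftOperand": {"@id": "duo:0000010"},
                    "odrl:operator": {"@id": "odrl:eq"},
                    "odrl:rightOperand": {"@id": "mondo:0005070"},
                },
            ],
        },
    },
}


def expand(curie, context):
    """Expand 'prefix:suffix' by concatenation; unknown prefixes are left as-is."""
    prefix, sep, suffix = curie.partition(":")
    base = context.get(prefix)
    return base + suffix if sep and isinstance(base, str) else curie


def allowed_diseases(metadata, disease_operand="duo:0000010"):
    """Collect the full IRIs of diseases named by 'eq' constraints on disease_operand."""
    context = metadata.get("@context", {})
    permission = metadata["usageInfo"]["odrl:permission"]
    found = set()
    for constraint in permission.get("odrl:constraint", []):
        left = constraint.get("odrl:leftOperand", {}).get("@id")
        operator = constraint.get("odrl:operator", {}).get("@id")
        if left == disease_operand and operator == "odrl:eq":
            found.add(expand(constraint["odrl:rightOperand"]["@id"], context))
    return found


def permits_disease(metadata, disease_iri):
    """True if the requested disease IRI is among those the policy names."""
    return disease_iri in allowed_diseases(metadata)


ALLOWED = allowed_diseases(POLICY)
print(ALLOWED)
```

The check is an exact match on IRIs. A real tool would likely also accept subclasses of the named disease, which requires loading the ontology itself.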

## Appendix 1: JSON-LD context
