158 changes: 148 additions & 10 deletions docs/croissant-spec-draft.md
Contributor

In my opinion, the use of sc:DefinedTerm should not be recommended. It covers only part of the DUO terms and logic, so it will be confusing for DUO adopters. On the other hand, the ODRL approach fully covers DUO and could scale to other data use conditions, such as those of the Data Privacy Vocabulary, with the same mechanism.

Contributor Author

It serves to show how any simple term vocabulary can be used in Croissant. I agree that for DUO you will most often need the ODRL approach, but this example is still useful for other vocabularies.

@@ -491,7 +491,7 @@ Other properties from [schema.org/Dataset](http://schema.org/Dataset) or its par

### Modified and Added Properties

Croissant modifies the meaning of one [schema.org](http://schema.org) property, and requires its presence:
Croissant modifies the meaning of one [schema.org](http://schema.org) property, and makes it required:

<table>
<thead>
@@ -507,7 +507,7 @@ Croissant modifies the meaning of one [schema.org](http://schema.org) property,
<a href="#fileset">FileSet</a>
</td>
<td>MANY</td>
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type FileObject or FileSet.</td>
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type <a href="#fileobject">FileObject</a> or <a href="#fileset">FileSet</a>. These are subclasses of <a href="http://schema.org/DataDownload">DataDownload</a>, so this definition is compatible with the original definition of the distribution property in schema.org.</td>
</tr>
</table>

@@ -533,6 +533,15 @@ The Croissant vocabulary also defines the following optional dataset-level attri
<td>A citation to the dataset itself, or a citation for a publication that describes the dataset. Ideally, citations should be expressed using the <a href="https://www.bibtex.org/">bibtex</a> format.<br>
Note that this is different from <a href="http://schema.org/citation">schema.org/citation</a>, which is used to make a citation to another publication from this dataset.
</td>
</tr>
<tr>
<td>sdVersion</td>
<td>
<a href="http://schema.org/Number">Number</a><br>
<a href="http://schema.org/Text">Text</a>
</td>
<td>ONE</td>
<td>The version of the dataset <i>metadata</i>, which may be distinct from the version of the dataset <i>content</i>. This property is modeled after schema.org's <a href="http://schema.org/sdLicense">sdLicense</a> and <a href="http://schema.org/sdPublisher">sdPublisher</a>, and may move to schema.org in the future.</td>
</tr>
</table>

@@ -1148,11 +1157,10 @@ Sometimes, not all the data from the source is needed, but only a subset. The `E

Croissant supports a few simple transformations that can be applied on the source data:

- delimiter: split a string into an array using the supplied character.
- separator: split a string into an array using the supplied character.
- readLines: read the content of the file line by line.
- unArchive: extract the content of the archive. True by default for archive file types (zip, tgz, etc.).
- regex: A regular expression to parse the data.
- jsonPath: A JSON path to evaluate on the (JSON) data source.
- regex: A regular expression to parse the data, with one capture group that corresponds to the output value.
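The `separator` and `regex` transforms listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the Croissant reference implementation: the function names `apply_separator` and `apply_regex` are hypothetical, and the choice to search anywhere in the string (rather than match from the start) is an assumption the text does not settle.

```python
import re
from typing import List, Optional


def apply_separator(value: str, separator: str) -> List[str]:
    """Split a string into an array using the supplied separator."""
    return value.split(separator)


def apply_regex(value: str, pattern: str) -> Optional[str]:
    """Return the text matched by the pattern's single capture group.

    Returns None when the pattern does not match. Searching anywhere in
    the string is an assumption made for this sketch.
    """
    compiled = re.compile(pattern)
    if compiled.groups != 1:
        raise ValueError("the pattern must contain exactly one capture group")
    match = compiled.search(value)
    return match.group(1) if match else None


# Hypothetical inputs, for illustration only.
labels = apply_separator("cat,dog,bird", ",")
image_id = apply_regex("train/img_0042.jpg", r"img_(\d+)\.jpg")
```

Requiring exactly one capture group keeps the output unambiguous: the group is the value, and everything else in the pattern is only context.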

For example, to extract information from a filename using a regular expression, we can write:

@@ -2136,9 +2144,133 @@ For example, consider a dataset where each image is labeled by a different human

In this example, the `labeled_images/label` field has an annotation `labeled_images/label/annotator`. The `equivalentProperty` "prov:wasAttributedTo" on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are pulled from the same source file (`annotations.csv`) on a row-by-row basis. The `gender` and `age` fields are mapped to their corresponding FOAF properties, `foaf:gender` and `foaf:age`, via `equivalentProperty`.

### Data Use Restrictions
### Data Use Conditions

Datasets often come with restrictions on how they can be used, particularly in sensitive domains, such as healthcare. Representing these restrictions in a machine-readable format enables automated discovery and compliance checking. For instance, a healthcare dataset might be restricted to non-commercial research use only, or require specific ethics approval.

Data use conditions can be attached to a dataset as a whole, or to part of a dataset, using [sc:usageInfo](http://schema.org/usageInfo) (an existing schema.org property).

### Using DUO to Represent Data Use Conditions

The [DUO](http://purl.obolibrary.org/obo/duo.owl) ontology provides a set of terms that can be used to represent data use conditions in a machine-readable format. DUO is prevalent in the healthcare domain. Other vocabularies may be used in other verticals.

To connect with terms from an external vocabulary, Croissant uses the [sc:DefinedTerm](http://schema.org/DefinedTerm) type, which is a schema.org type designed for that purpose.

Here is an example that shows how to use the DUO term [DUO_0000042](http://purl.obolibrary.org/obo/DUO_0000042) to represent the data use condition "General Research Use":

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_"
  },
  "@type": "Dataset",
  "name": "Global Health Imagery Dataset",
  "description": "A dataset of public health imagery for research purposes.",
  "url": "https://example.org/dataset/global-health-1",
  "usageInfo": [
    {
      "@type": "DefinedTerm",
      "name": "General Research Use",
      "termCode": "DUO_0000042",
      "url": "duo:0000042"
    }
  ]
}
```
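A consumer could read such terms back with a few lines of Python. The sketch below is hedged: it treats the metadata as plain JSON (no JSON-LD expansion), the helper name `defined_terms` is hypothetical, and the sample dictionary is a trimmed copy of the example above.

```python
from typing import Any, Dict, List, Tuple

# Trimmed copy of the example above, as it would look after json.load().
DATASET: Dict[str, Any] = {
    "@type": "Dataset",
    "name": "Global Health Imagery Dataset",
    "usageInfo": [
        {
            "@type": "DefinedTerm",
            "name": "General Research Use",
            "termCode": "DUO_0000042",
        }
    ],
}


def defined_terms(dataset: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Collect (termCode, name) pairs from the DefinedTerm entries of usageInfo."""
    usage = dataset.get("usageInfo", [])
    if isinstance(usage, dict):
        # JSON-LD allows a single object where an array is also accepted.
        usage = [usage]
    return [
        (term.get("termCode"), term.get("name"))
        for term in usage
        if isinstance(term, dict) and term.get("@type") == "DefinedTerm"
    ]


terms = defined_terms(DATASET)
```

Working on the compacted form keeps the sketch dependency-free; a production tool would more likely expand the document with a JSON-LD processor first.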

### Fine-Grained Control with ODRL

To represent more complex restrictions, such as hierarchical permissions and modifiers, Croissant recommends using [ODRL](https://www.w3.org/TR/odrl-model/), a W3C standard that provides a rich framework for representing permissions and restrictions.

To use ODRL in Croissant, `sc:usageInfo` is used as a container for an `odrl:Offer`, which represents a set of permissions. `odrl:action` represents the permission, and `odrl:constraint` represents modifiers.

The following example shows how to combine DUO and ODRL to represent a data use policy that allows Health or Medical or Biomedical research ([DUO_0000006](http://purl.obolibrary.org/obo/DUO_0000006)), but only for non-commercial purposes ([DUO_0000018](http://purl.obolibrary.org/obo/DUO_0000018)):

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@type": "Dataset",
  "name": "Restricted Health Data",

  "usageInfo": {
    "@type": ["CreativeWork", "odrl:Offer"],
    "name": "DUO Usage Policy",

    "odrl:permission": {
      "@type": "odrl:Permission",
      "odrl:action": {
        "@id": "duo:0000006",
        "name": "Health or Medical or Biomedical Use"
      },
      "odrl:constraint": [
        {
          "@type": "odrl:Constraint",
          "name": "Non-commercial use only",
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "duo:0000018" }
        }
      ]
    }
  }
}
```
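To illustrate how a tool might consume this structure, here is a minimal Python sketch that summarizes each permission as an action plus the right operands of its constraints. It is an assumption-laden sketch: it reads the compacted JSON as written above instead of running a JSON-LD processor, and the helper names (`as_list`, `node_id`, `summarize_offer`) are hypothetical.

```python
from typing import Any, Dict, List

# Trimmed copy of the example above, as it would look after json.load().
DATASET: Dict[str, Any] = {
    "@type": "Dataset",
    "name": "Restricted Health Data",
    "usageInfo": {
        "@type": ["CreativeWork", "odrl:Offer"],
        "odrl:permission": {
            "@type": "odrl:Permission",
            "odrl:action": {"@id": "duo:0000006"},
            "odrl:constraint": [
                {
                    "@type": "odrl:Constraint",
                    "odrl:operator": {"@id": "odrl:eq"},
                    "odrl:rightOperand": {"@id": "duo:0000018"},
                }
            ],
        },
    },
}


def as_list(value: Any) -> List[Any]:
    """Normalize a JSON-LD value that may be absent, single, or an array."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(node: Any) -> Any:
    """Return the "@id" of a node object, or the value itself otherwise."""
    return node.get("@id") if isinstance(node, dict) else node


def summarize_offer(dataset: Dict[str, Any]) -> List[Dict[str, Any]]:
    """List each permission's action and the right operands of its constraints."""
    summaries = []
    for offer in as_list(dataset.get("usageInfo")):
        # Skip usageInfo entries that are not ODRL offers (e.g. DefinedTerm).
        if not isinstance(offer, dict) or "odrl:Offer" not in as_list(offer.get("@type")):
            continue
        for permission in as_list(offer.get("odrl:permission")):
            summaries.append({
                "action": node_id(permission.get("odrl:action")),
                "constraints": [
                    node_id(constraint.get("odrl:rightOperand"))
                    for constraint in as_list(permission.get("odrl:constraint"))
                ],
            })
    return summaries


summary = summarize_offer(DATASET)
```

The `as_list` normalization matters because the same document may carry one permission as an object or several as an array.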

### Integration with Domain-Specific Ontologies

TODO: Add guidance on representing data use restrictions.
In the health domain, it is often necessary to specify that a dataset can only be used for research on a specific disease. DUO recommends using the [MONDO](https://mondo.monarchinitiative.org/) ontology to specify disease-specific restrictions.

The example below shows how to use MONDO in combination with DUO and ODRL to specify that a dataset can only be used for research on Alzheimer's disease ([MONDO_0005070](http://purl.obolibrary.org/obo/MONDO_0005070)).

```json
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "duo": "http://purl.obolibrary.org/obo/DUO_",
    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
    "odrl": "http://www.w3.org/ns/odrl/2/"
  },
  "@type": "Dataset",
  "name": "Restricted Health Data",

  "usageInfo": {
    "@type": ["CreativeWork", "odrl:Offer"],
    "name": "DUO Usage Policy",

    "odrl:permission": {
      "@type": "odrl:Permission",
      "odrl:action": {
        "@id": "duo:0000007",
        "name": "Disease specific research"
      },
      "odrl:constraint": [
        {
          "@type": "odrl:Constraint",
          "name": "Non-commercial use only",
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "duo:0000018" }
        },
        {
          "@type": "odrl:Constraint",
          "odrl:leftOperand": { "@id": "duo:0000010"},
          "odrl:operator": { "@id": "odrl:eq" },
          "odrl:rightOperand": { "@id": "mondo:0005070" }
        }
      ]
    }
  }
}
```

This approach can be extended to other domain-specific ontologies.
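As a rough illustration of that extension point, the Python sketch below builds a constraint object of the same shape as the disease-specific one above. The helper name `ontology_constraint` is hypothetical, and any other ontology's prefix would also need to be declared in `@context`, which this sketch does not do.

```python
import json
from typing import Dict


def ontology_constraint(
    left_operand: str, right_operand: str, operator: str = "odrl:eq"
) -> Dict[str, object]:
    """Build an odrl:Constraint node from compact IRIs such as "mondo:0005070"."""
    return {
        "@type": "odrl:Constraint",
        "odrl:leftOperand": {"@id": left_operand},
        "odrl:operator": {"@id": operator},
        "odrl:rightOperand": {"@id": right_operand},
    }


# Reproduces the Alzheimer's disease constraint from the example above.
alzheimers = ontology_constraint("duo:0000010", "mondo:0005070")
serialized = json.dumps(alzheimers, indent=2)
```

Only the right operand changes from one disease to another, so generating the constraint from a term identifier avoids hand-editing nested JSON.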

## Appendix 1: JSON-LD context

@@ -2150,6 +2282,7 @@ TODO: Add guidance on representing data use restrictions.
"cr": "http://mlcommons.org/croissant/",
"rai": "http://mlcommons.org/croissant/RAI/",
"dct": "http://purl.org/dc/terms/",
"annotation": "cr:annotation",
"arrayShape": "cr:arrayShape",
"citeAs": "cr:citeAs",
"column": "cr:column",
@@ -2163,10 +2296,13 @@ TODO: Add guidance on representing data use restrictions.
"@id": "cr:dataType",
"@type": "@vocab"
},
"separator": "cr:separator",
"equivalentProperty": "cr:equivalentProperty",
"examples": {
"@id": "cr:examples",
"@type": "@json"
},
"excludes": "cr:excludes",
"extract": "cr:extract",
"field": "cr:field",
"fileProperty": "cr:fileProperty",
@@ -2180,14 +2316,16 @@ TODO: Add guidance on representing data use restrictions.
"key": "cr:key",
"md5": "cr:md5",
"parentField": "cr:parentField",
"path": "cr:path",
"recordSet": "cr:recordSet",
"references": "cr:references",
"regex": "cr:regex",
"replace": "cr:replace",
"readLines": "cr:readLines",
"sdVersion": "cr:sdVersion",
"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform"
"transform": "cr:transform",
"unArchive": "cr:unArchive",
"value": "cr:value"
}
```
54 changes: 39 additions & 15 deletions docs/croissant.ttl
@@ -8,12 +8,12 @@
croissant:FileObject a rdf:Class ;
rdfs:label "FileObject" ;
rdfs:comment "An individual file that is part of a dataset." ;
rdfs:subClassOf schema:CreativeWork .
rdfs:subClassOf schema:DataDownload .

croissant:FileSet a rdf:Class ;
rdfs:label "FileSet" ;
rdfs:comment "A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters." ;
rdfs:subClassOf schema:Intangible .
rdfs:subClassOf schema:DataDownload .

croissant:RecordSet a rdf:Class ;
rdfs:label "RecordSet" ;
@@ -84,21 +84,27 @@ croissant:citeAs a rdf:Property ;
schema:domainIncludes schema:Dataset ;
schema:rangeIncludes schema:Text .

croissant:sdVersion a rdf:Property ;
rdfs:label "sdVersion" ;
rdfs:comment "The version of the dataset metadata, which may be distinct from the version of the dataset content." ;
schema:domainIncludes schema:Dataset ;
schema:rangeIncludes schema:Number, schema:Text .

# FileObject & FileSet properties

croissant:containedIn a rdf:Property ;
rdfs:label "containedIn" ;
rdfs:comment "Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
rdfs:comment "Another FileObject, FileSet or DataSource that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
schema:domainIncludes croissant:FileObject, croissant:FileSet ;
schema:rangeIncludes croissant:FileObject, croissant:FileSet .
schema:rangeIncludes croissant:FileObject, croissant:FileSet, croissant:DataSource .

croissant:includes a rdf:Property ; # Should this be named includePattern instead?
croissant:includes a rdf:Property ;
rdfs:label "includes" ;
rdfs:comment "A glob pattern that specifies the files to include, e.g., \"*.jpg\", \"/foo/pic*.jpg\". The pattern is evaluated from the root of the containedIn contents." ;
schema:domainIncludes croissant:FileSet ;
schema:rangeIncludes schema:Text .

croissant:excludes a rdf:Property ; # Should this be named excludePattern instead?
croissant:excludes a rdf:Property ;
rdfs:label "excludes" ;
rdfs:comment "A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated." ;
schema:domainIncludes croissant:FileSet ;
@@ -130,6 +136,12 @@ croissant:examples a rdf:Property ;
schema:domainIncludes croissant:RecordSet ;
schema:rangeIncludes rdf:JSON .

croissant:annotation a rdf:Property ;
rdfs:label "annotation" ;
rdfs:comment "One or more data-level annotations that apply to the entire record or field." ;
schema:domainIncludes croissant:RecordSet, croissant:Field ;
schema:rangeIncludes croissant:Field .

croissant:source a rdf:Property ;
rdfs:label "source" ;
rdfs:comment "The data source of the field. This will generally reference a FileObject or FileSet's contents (e.g., a specific column of a table)." ;
@@ -142,6 +154,12 @@ croissant:dataType a rdf:Property ;
schema:domainIncludes croissant:RecordSet, croissant:Field ;
schema:rangeIncludes croissant:DataType .

croissant:value a rdf:Property ;
rdfs:label "value" ;
rdfs:comment "An optional constant value for the field." ;
schema:domainIncludes croissant:Field ;
schema:rangeIncludes rdf:JSON .

croissant:repeated a rdf:Property ;
rdfs:label "repeated" ;
rdfs:comment "If true, then the Field is a list of values of type dataType." ;
@@ -238,24 +256,30 @@ croissant:jsonPath a rdf:Property ;

# Transform properties

croissant:delimiter a rdf:Property ;
rdfs:label "delimiter" ;
rdfs:comment "A delimiter to use parse the data into an array." ;
croissant:separator a rdf:Property ;
rdfs:label "separator" ;
rdfs:comment "A separator used to parse the data into an array." ;
schema:domainIncludes croissant:Transform ;
schema:rangeIncludes schema:Text .

croissant:readLines a rdf:Property ;
rdfs:label "readLines" ;
rdfs:comment "Read the content of the file line by line." ;
schema:domainIncludes croissant:Transform ;
schema:rangeIncludes schema:Boolean .

croissant:unArchive a rdf:Property ;
rdfs:label "unArchive" ;
rdfs:comment "Extract the content of the archive." ;
schema:domainIncludes croissant:Transform ;
schema:rangeIncludes schema:Boolean .

croissant:regex a rdf:Property ;
rdfs:label "regex" ;
rdfs:comment "A regular expression to apply to the data." ;
schema:domainIncludes croissant:Transform ;
schema:rangeIncludes schema:Text .

croissant:jsonQuery a rdf:Property ;
rdfs:label "jsonQuery" ;
rdfs:comment "For JSON content, a query to evaluate on the data." ;
schema:domainIncludes croissant:Transform ;
schema:rangeIncludes schema:Text .

### ML-specific definitions

croissant:Split a rdf:Class ;