Commit 4a249b8

Add sdVersion property, and Data Use Conditions section to the 1.1 spec. (#983)
1 parent 8e239c0 commit 4a249b8

2 files changed

+187
-25
lines changed

docs/croissant-spec-draft.md

Lines changed: 148 additions & 10 deletions
@@ -491,7 +491,7 @@ Other properties from [schema.org/Dataset](http://schema.org/Dataset) or its par
491491

492492
### Modified and Added Properties
493493

494-
Croissant modifies the meaning of one [schema.org](http://schema.org) property, and requires its presence:
494+
Croissant modifies the meaning of one [schema.org](http://schema.org) property, and makes it required:
495495

496496
<table>
497497
<thead>
@@ -507,7 +507,7 @@ Croissant modifies the meaning of one [schema.org](http://schema.org) property,
507507
<a href="#fileset">FileSet</a>
508508
</td>
509509
<td>MANY</td>
510-
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type FileObject or FileSet.</td>
510+
<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type <a href="#fileobject">FileObject</a> or <a href="#fileset">FileSet</a>. These are subclasses of <a href="http://schema.org/DataDownload">DataDownload</a>, so this definition is compatible with the original definition of the distribution property in schema.org.</td>
511511
</tr>
512512
</table>
513513

@@ -533,6 +533,15 @@ The Croissant vocabulary also defines the following optional dataset-level attri
533533
<td>"A citation to the dataset itself, or a citation for a publication that describes the dataset. Ideally, citations should be expressed using the <a href="https://www.bibtex.org/">bibtex</a> format.<br>
534534
Note that this is different from <a href="http://schema.org/citation">schema.org/citation</a>, which is used to make a citation to another publication from this dataset.
535535
</td>
536+
</tr>
537+
<tr>
538+
<td>sdVersion</td>
539+
<td>
540+
<a href="http://schema.org/Number">Number</a><br>
541+
<a href="http://schema.org/Text">Text</a>
542+
</td>
543+
<td>ONE</td>
544+
<td>The version of the dataset <i>metadata</i>, which may be distinct from the version of the dataset <i>content</i>. This property is modeled after schema.org's <a href="http://schema.org/sdLicense">sdLicense</a> and <a href="http://schema.org/sdPublisher">sdPublisher</a>, and may move to schema.org in the future.</td>
536545
</tr>
537546
</table>
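
As a rough illustration of how a consumer might read `sdVersion` alongside the dataset's own `version`, here is a minimal Python sketch. The helper name `metadata_version` and the sample values are hypothetical and not part of the Croissant library. Because the property accepts either Number or Text, the sketch normalizes both to a string.

```python
from typing import Any, Dict, Optional


def metadata_version(metadata: Dict[str, Any]) -> Optional[str]:
    """Return the metadata version (sdVersion) as a string, if present.

    Hypothetical helper: sdVersion may be a Number or Text, so both
    forms are normalized to text for comparison or display.
    """
    raw = metadata.get("sdVersion")
    if raw is None:
        return None
    if isinstance(raw, bool) or not isinstance(raw, (int, float, str)):
        raise TypeError("sdVersion must be a number or a string")
    return str(raw)


# Illustrative values only: the content is at 2.1.0, the metadata at 1.1.
example = {"@type": "Dataset", "version": "2.1.0", "sdVersion": 1.1}
print(metadata_version(example), example["version"])
```

Keeping the two values separate lets a publisher revise the metadata description without implying that the underlying data changed.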
538547

@@ -1148,11 +1157,10 @@ Sometimes, not all the data from the source is needed, but only a subset. The `E
11481157

11491158
Croissant supports a few simple transformations that can be applied on the source data:
11501159

1151-
- delimiter: split a string into an array using the supplied character.
1160+
- separator: split a string into an array using the supplied character.
11521161
- readLines: read the content of the file line by line.
11531162
- unArchive: extract the content of the archive. True by default for archive file types (zip, tgz, etc.).
1154-
- regex: A regular expression to parse the data.
1155-
- jsonPath: A JSON path to evaluate on the (JSON) data source.
1163+
- regex: A regular expression to parse the data, with one capture group that corresponds to the output value.
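
The `separator` and `regex` transformations listed above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the reference implementation: the function names are hypothetical, and whether a real implementation searches within the string or requires a full match is not specified here (the sketch uses `re.search`).

```python
import re
from typing import List, Optional


def apply_separator(value: str, separator: str) -> List[str]:
    """Split a string into an array using the supplied separator."""
    return value.split(separator)


def apply_regex(value: str, pattern: str) -> Optional[str]:
    """Return the content of the pattern's single capture group.

    Returns None when the pattern does not match. Patterns without
    exactly one capture group are rejected, mirroring the wording
    "one capture group that corresponds to the output value".
    """
    compiled = re.compile(pattern)
    if compiled.groups != 1:
        raise ValueError("pattern must contain exactly one capture group")
    match = compiled.search(value)
    return match.group(1) if match else None


print(apply_separator("cat,dog,bird", ","))
print(apply_regex("train_0042.jpg", r"^train_(\d+)\.jpg$"))
```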
11561164

11571165
For example, to extract information from a filename using a regular expression, we can write:
11581166

@@ -2136,9 +2144,133 @@ For example, consider a dataset where each image is labeled by a different human
21362144

21372145
In this example, the `labeled_images/label` field has an annotation `labeled_images/label/annotator`. The `equivalentProperty` "prov:wasAttributedTo" on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are pulled from the same source file (`annotations.csv`) on a row-by-row basis. The `gender` and `age` fields are mapped to their corresponding FOAF properties, `foaf:gender` and `foaf:age`, via `equivalentProperty`.
21382146

2139-
### Data Use Restrictions
2147+
### Data Use Conditions
2148+
2149+
Datasets often come with restrictions on how they can be used, particularly in sensitive domains, such as healthcare. Representing these restrictions in a machine-readable format enables automated discovery and compliance checking. For instance, a healthcare dataset might be restricted to non-commercial research use only, or require specific ethics approval.
2150+
2151+
Data use conditions can be attached to a dataset as a whole, or to part of a dataset, using [sc:usageInfo](http://schema.org/usageInfo) (an existing schema.org property).
2152+
2153+
### Using DUO to Represent Data Use Conditions
2154+
2155+
The [DUO](http://purl.obolibrary.org/obo/duo.owl) ontology provides a set of terms that can be used to represent data use conditions in a machine-readable format. DUO is prevalent in the healthcare domain. Other vocabularies may be used in other verticals.
2156+
2157+
To connect with terms from an external vocabulary, Croissant uses the [sc:DefinedTerm](http://schema.org/DefinedTerm) type, which is a schema.org type designed for that purpose.
2158+
2159+
Here is an example that shows how to use the DUO term [DUO_0000042](http://purl.obolibrary.org/obo/DUO_0000042) to represent the data use condition "General Research Use":
2160+
2161+
```json
2162+
{
2163+
"@context": {
2164+
"@vocab": "https://schema.org/",
2165+
"cr": "http://mlcommons.org/croissant/",
2166+
"duo": "http://purl.obolibrary.org/obo/DUO_"
2167+
},
2168+
"@type": "Dataset",
2169+
"name": "Global Health Imagery Dataset",
2170+
"description": "A dataset of public health imagery for research purposes.",
2171+
"url": "https://example.org/dataset/global-health-1",
2172+
"usageInfo": [
2173+
{
2174+
"@type": "DefinedTerm",
2175+
"name": "General Research Use",
2176+
"termCode": "DUO_0000042",
2177+
"url": "duo:0000042"
2178+
}
2179+
]
2180+
}
2181+
```
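
To show how a consumer might resolve the compact `duo:` identifier in the example above to a full term URL, here is a minimal Python sketch. The helper names `expand_curie` and `usage_terms` are hypothetical. The sketch expands prefixes by hand from the inline `@context`; a full JSON-LD processor would only treat the `url` value as an IRI if the context types it that way, so this is an assumption rather than standard JSON-LD behavior.

```python
from typing import Any, Dict, List


def expand_curie(value: str, context: Dict[str, Any]) -> str:
    """Expand 'prefix:suffix' using string-valued prefixes in the context."""
    prefix, sep, suffix = value.partition(":")
    base = context.get(prefix)
    if sep and isinstance(base, str) and not suffix.startswith("//"):
        return base + suffix
    return value  # already absolute, or prefix unknown


def usage_terms(metadata: Dict[str, Any]) -> List[str]:
    """Collect expanded URLs of DefinedTerm entries under usageInfo."""
    context = metadata.get("@context", {})
    info = metadata.get("usageInfo", [])
    if isinstance(info, dict):
        info = [info]
    return [
        expand_curie(term["url"], context)
        for term in info
        if term.get("@type") == "DefinedTerm" and "url" in term
    ]


metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "duo": "http://purl.obolibrary.org/obo/DUO_",
    },
    "@type": "Dataset",
    "usageInfo": [
        {
            "@type": "DefinedTerm",
            "name": "General Research Use",
            "termCode": "DUO_0000042",
            "url": "duo:0000042",
        }
    ],
}
print(usage_terms(metadata))
```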
2182+
2183+
### Fine-Grained Control with ODRL
2184+
2185+
To represent more complex restrictions, such as hierarchical permissions and modifiers, Croissant recommends using [ODRL](https://www.w3.org/TR/odrl-model/), a W3C standard that provides a rich framework for representing permissions and restrictions.
2186+
2187+
To use ODRL in Croissant, `sc:usageInfo` is used as a container for an `odrl:Offer`, which represents a set of permissions. `odrl:action` represents the permission, and `odrl:constraint` represents modifiers.
2188+
2189+
The following example shows how to combine DUO and ODRL to represent a data use policy that allows Health or Medical or Biomedical Use ([DUO_0000006](http://purl.obolibrary.org/obo/DUO_0000006)), but only for non-commercial purposes ([DUO_0000018](http://purl.obolibrary.org/obo/DUO_0000018)):
2190+
2191+
```json
2192+
{
2193+
"@context": {
2194+
"@vocab": "https://schema.org/",
2195+
"cr": "http://mlcommons.org/croissant/",
2196+
"duo": "http://purl.obolibrary.org/obo/DUO_",
2197+
"odrl": "http://www.w3.org/ns/odrl/2/"
2198+
},
2199+
"@type": "Dataset",
2200+
"name": "Restricted Health Data",
2201+
2202+
"usageInfo": {
2203+
"@type": ["CreativeWork", "odrl:Offer"],
2204+
"name": "DUO Usage Policy",
2205+
2206+
"odrl:permission": {
2207+
"@type": "odrl:Permission",
2208+
"odrl:action": {
2209+
"@id": "duo:0000006",
2210+
"name": "Health or Medical or Biomedical Use"
2211+
},
2212+
"odrl:constraint": [
2213+
{
2214+
"@type": "odrl:Constraint",
2215+
"name": "Non-commercial use only",
2216+
"odrl:operator": { "@id": "odrl:eq" },
2217+
"odrl:rightOperand": { "@id": "duo:0000018" }
2218+
}
2219+
]
2220+
}
2221+
2222+
}
2223+
}
2224+
```
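
A consumer that wants to inspect such a policy could walk the `odrl:permission` structure directly. The Python sketch below is one possible approach, not part of any Croissant or ODRL library; `as_list` and `summarize_offer` are hypothetical helpers, and the sketch assumes the compact key names used in the example above rather than expanded JSON-LD.

```python
from typing import Any, Dict, List, Tuple


def as_list(value: Any) -> List[Any]:
    """JSON-LD allows a single object or an array; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def summarize_offer(usage_info: Dict[str, Any]) -> List[Tuple[str, List[str]]]:
    """Return (action id, [constraint right-operand ids]) per permission."""
    summary = []
    for permission in as_list(usage_info.get("odrl:permission")):
        action = permission["odrl:action"]["@id"]
        operands = [
            constraint["odrl:rightOperand"]["@id"]
            for constraint in as_list(permission.get("odrl:constraint"))
        ]
        summary.append((action, operands))
    return summary


usage_info = {
    "@type": ["CreativeWork", "odrl:Offer"],
    "odrl:permission": {
        "@type": "odrl:Permission",
        "odrl:action": {"@id": "duo:0000006"},
        "odrl:constraint": [
            {
                "@type": "odrl:Constraint",
                "odrl:operator": {"@id": "odrl:eq"},
                "odrl:rightOperand": {"@id": "duo:0000018"},
            }
        ],
    },
}
print(summarize_offer(usage_info))
```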
2225+
2226+
### Integration with Domain-Specific Ontologies
21402227

2141-
TODO: Add guidance on representing data use restrictions.
2228+
In the health domain, it is often necessary to specify that a dataset can only be used for research on a specific disease. DUO recommends using the [MONDO](https://mondo.monarchinitiative.org/) ontology to specify disease-specific restrictions.
2229+
2230+
The example below shows how to use MONDO in combination with DUO and ODRL to specify that a dataset can only be used for research on Alzheimer's disease ([MONDO_0005070](http://purl.obolibrary.org/obo/MONDO_0005070)).
2231+
2232+
```json
2233+
{
2234+
"@context": {
2235+
"@vocab": "https://schema.org/",
2236+
"cr": "http://mlcommons.org/croissant/",
2237+
"duo": "http://purl.obolibrary.org/obo/DUO_",
2238+
"mondo": "http://purl.obolibrary.org/obo/MONDO_",
2239+
"odrl": "http://www.w3.org/ns/odrl/2/"
2240+
},
2241+
"@type": "Dataset",
2242+
"name": "Restricted Health Data",
2243+
2244+
"usageInfo": {
2245+
"@type": ["CreativeWork", "odrl:Offer"],
2246+
"name": "DUO Usage Policy",
2247+
2248+
"odrl:permission": {
2249+
"@type": "odrl:Permission",
2250+
"odrl:action": {
2251+
"@id": "duo:0000007",
2252+
"name": "Disease specific research"
2253+
},
2254+
"odrl:constraint": [
2255+
{
2256+
"@type": "odrl:Constraint",
2257+
"name": "Non-commercial use only",
2258+
"odrl:operator": { "@id": "odrl:eq" },
2259+
"odrl:rightOperand": { "@id": "duo:0000018" }
2260+
},
2261+
{
2262+
"@type": "odrl:Constraint",
2263+
"odrl:leftOperand": { "@id": "duo:0000010"},
2264+
"odrl:operator": { "@id": "odrl:eq" },
2265+
"odrl:rightOperand": { "@id": "mondo:0005070" }
2266+
}
2267+
]
2268+
}
2269+
}
2270+
}
2271+
```
2272+
2273+
This approach can be extended to other domain-specific ontologies.
21422274

21432275
## Appendix 1: JSON-LD context
21442276

@@ -2150,6 +2282,7 @@ TODO: Add guidance on representing data use restrictions.
21502282
"cr": "http://mlcommons.org/croissant/",
21512283
"rai": "http://mlcommons.org/croissant/RAI/",
21522284
"dct": "http://purl.org/dc/terms/",
2285+
"annotation": "cr:annotation",
21532286
"arrayShape": "cr:arrayShape",
21542287
"citeAs": "cr:citeAs",
21552288
"column": "cr:column",
@@ -2163,10 +2296,13 @@ TODO: Add guidance on representing data use restrictions.
21632296
"@id": "cr:dataType",
21642297
"@type": "@vocab"
21652298
},
2299+
"separator": "cr:separator",
2300+
"equivalentProperty": "cr:equivalentProperty",
21662301
"examples": {
21672302
"@id": "cr:examples",
21682303
"@type": "@json"
21692304
},
2305+
"excludes": "cr:excludes",
21702306
"extract": "cr:extract",
21712307
"field": "cr:field",
21722308
"fileProperty": "cr:fileProperty",
@@ -2180,14 +2316,16 @@ TODO: Add guidance on representing data use restrictions.
21802316
"key": "cr:key",
21812317
"md5": "cr:md5",
21822318
"parentField": "cr:parentField",
2183-
"path": "cr:path",
21842319
"recordSet": "cr:recordSet",
21852320
"references": "cr:references",
21862321
"regex": "cr:regex",
2187-
"replace": "cr:replace",
2322+
"readLines": "cr:readLines",
2323+
"sdVersion": "cr:sdVersion",
21882324
"separator": "cr:separator",
21892325
"source": "cr:source",
21902326
"subField": "cr:subField",
2191-
"transform": "cr:transform"
2327+
"transform": "cr:transform",
2328+
"unArchive": "cr:unArchive",
2329+
"value": "cr:value"
21922330
}
21932331
```

docs/croissant.ttl

Lines changed: 39 additions & 15 deletions
@@ -8,12 +8,12 @@
88
croissant:FileObject a rdf:Class ;
99
rdfs:label "FileObject" ;
1010
rdfs:comment "An individual file that is part of a dataset." ;
11-
rdfs:subClassOf schema:CreativeWork .
11+
rdfs:subClassOf schema:DataDownload .
1212

1313
croissant:FileSet a rdf:Class ;
1414
rdfs:label "FileSet" ;
1515
rdfs:comment "A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters." ;
16-
rdfs:subClassOf schema:Intangible .
16+
rdfs:subClassOf schema:DataDownload .
1717

1818
croissant:RecordSet a rdf:Class ;
1919
rdfs:label "RecordSet" ;
@@ -84,21 +84,27 @@ croissant:citeAs a rdf:Property ;
8484
schema:domainIncludes schema:Dataset ;
8585
schema:rangeIncludes schema:Text .
8686

87+
croissant:sdVersion a rdf:Property ;
88+
rdfs:label "sdVersion" ;
89+
rdfs:comment "The version of the dataset metadata, which may be distinct from the version of the dataset content." ;
90+
schema:domainIncludes schema:Dataset ;
91+
schema:rangeIncludes schema:Number, schema:Text .
92+
8793
# FileObject & FileSet properties
8894

8995
croissant:containedIn a rdf:Property ;
9096
rdfs:label "containedIn" ;
91-
rdfs:comment "Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
97+
rdfs:comment "Another FileObject, FileSet or DataSource that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
9298
schema:domainIncludes croissant:FileObject, croissant:FileSet ;
93-
schema:rangeIncludes croissant:FileObject, croissant:FileSet .
99+
schema:rangeIncludes croissant:FileObject, croissant:FileSet, croissant:DataSource .
94100

95-
croissant:includes a rdf:Property ; # Should this be named includePattern instead?
101+
croissant:includes a rdf:Property ;
96102
rdfs:label "includes" ;
97103
rdfs:comment "A glob pattern that specifies the files to include, e.g., \".jpg\", \"/foo/pic*.jpg\". The pattern is evaluated from the root of the containedIn contents." ;
98104
schema:domainIncludes croissant:FileSet ;
99105
schema:rangeIncludes schema:Text .
100106

101-
croissant:excludes a rdf:Property ; # Should this be named excludePattern instead?
107+
croissant:excludes a rdf:Property ;
102108
rdfs:label "excludes" ;
103109
rdfs:comment "A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated." ;
104110
schema:domainIncludes croissant:FileSet ;
@@ -130,6 +136,12 @@ croissant:examples a rdf:Property ;
130136
schema:domainIncludes croissant:RecordSet ;
131137
schema:rangeIncludes rdf:JSON .
132138

139+
croissant:annotation a rdf:Property ;
140+
rdfs:label "annotation" ;
141+
rdfs:comment "One or more data-level annotations that apply to the entire record or field." ;
142+
schema:domainIncludes croissant:RecordSet, croissant:Field ;
143+
schema:rangeIncludes croissant:Field .
144+
133145
croissant:source a rdf:Property ;
134146
rdfs:label "source" ;
135147
rdfs:comment "The data source of the field. This will generally reference a FileObject or FileSet's contents (e.g., a specific column of a table)." ;
@@ -142,6 +154,12 @@ croissant:dataType a rdf:Property ;
142154
schema:domainIncludes croissant:RecordSet, croissant:Field ;
143155
schema:rangeIncludes croissant:DataType .
144156

157+
croissant:value a rdf:Property ;
158+
rdfs:label "value" ;
159+
rdfs:comment "An optional constant value for the field." ;
160+
schema:domainIncludes croissant:Field ;
161+
schema:rangeIncludes rdf:JSON .
162+
145163
croissant:repeated a rdf:Property ;
146164
rdfs:label "repeated" ;
147165
rdfs:comment "If true, then the Field is a list of values of type dataType." ;
@@ -238,24 +256,30 @@ croissant:jsonPath a rdf:Property ;
238256

239257
# Transform properties
240258

241-
croissant:delimiter a rdf:Property ;
242-
rdfs:label "delimiter" ;
243-
rdfs:comment "A delimiter to use parse the data into an array." ;
259+
croissant:separator a rdf:Property ;
260+
rdfs:label "separator" ;
261+
rdfs:comment "A separator to use to parse the data into an array." ;
244262
schema:domainIncludes croissant:Transform ;
245263
schema:rangeIncludes schema:Text .
246264

265+
croissant:readLines a rdf:Property ;
266+
rdfs:label "readLines" ;
267+
rdfs:comment "Read the content of the file line by line." ;
268+
schema:domainIncludes croissant:Transform ;
269+
schema:rangeIncludes schema:Boolean .
270+
271+
croissant:unArchive a rdf:Property ;
272+
rdfs:label "unArchive" ;
273+
rdfs:comment "Extract the content of the archive." ;
274+
schema:domainIncludes croissant:Transform ;
275+
schema:rangeIncludes schema:Boolean .
276+
247277
croissant:regex a rdf:Property ;
248278
rdfs:label "regex" ;
249279
rdfs:comment "A regular expression to apply to the data." ;
250280
schema:domainIncludes croissant:Transform ;
251281
schema:rangeIncludes schema:Text .
252282

253-
croissant:jsonQuery a rdf:Property ;
254-
rdfs:label "jsonQuery" ;
255-
rdfs:comment "For JSON content, a query to evaluate on the data." ;
256-
schema:domainIncludes croissant:Transform ;
257-
schema:rangeIncludes schema:Text .
258-
259283
### ML-specific definitions
260284

261285
croissant:Split a rdf:class ;
