@@ -491,7 +491,7 @@ Other properties from [schema.org/Dataset](http://schema.org/Dataset) or its par
 
 ### Modified and Added Properties
 
-Croissant modifies the meaning of one [schema.org](http://schema.org) property, and requires its presence:
+Croissant modifies the meaning of one [schema.org](http://schema.org) property, and makes it required:
 
 <table>
 <thead>
@@ -507,7 +507,7 @@ Croissant modifies the meaning of one [schema.org](http://schema.org) property,
 <a href="#fileset">FileSet</a>
 </td>
 <td>MANY</td>
-<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type FileObject or FileSet.</td>
+<td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type <a href="#fileobject">FileObject</a> or <a href="#fileset">FileSet</a>. These are subclasses of <a href="http://schema.org/DataDownload">DataDownload</a>, so this definition is compatible with the original definition of the distribution property in schema.org.</td>
 </tr>
 </table>
 
@@ -533,6 +533,15 @@ The Croissant vocabulary also defines the following optional dataset-level attri
 <td>"A citation to the dataset itself, or a citation for a publication that describes the dataset. Ideally, citations should be expressed using the <a href="https://www.bibtex.org/">bibtex</a> format.<br>
 Note that this is different from <a href="http://schema.org/citation">schema.org/citation</a>, which is used to make a citation to another publication from this dataset.
 </td>
+</tr>
+<tr>
+<td>sdVersion</td>
+<td>
+<a href="http://schema.org/Number">Number</a><br>
+<a href="http://schema.org/Text">Text</a>
+</td>
+<td>ONE</td>
+<td>The version of the dataset <i>metadata</i>, which may be distinct from the version of the dataset <i>content</i>. This property is modeled after schema.org's <a href="http://schema.org/sdLicense">sdLicense</a> and <a href="http://schema.org/sdPublisher">sdPublisher</a>, and may move to schema.org in the future.</td>
 </tr>
 </table>
 
@@ -1148,11 +1157,10 @@ Sometimes, not all the data from the source is needed, but only a subset. The `E
 
 Croissant supports a few simple transformations that can be applied on the source data:
 
--delimiter: split a string into an array using the supplied character.
+- separator: split a string into an array using the supplied character.
 - readLines: read the content of the file line by line.
 - unArchive: extract the content of the archive. True by default for archive file types (zip, tgz, etc.).
-- regex: A regular expression to parse the data.
-- jsonPath: A JSON path to evaluate on the (JSON) data source.
+- regex: A regular expression to parse the data, with one capture group that corresponds to the output value.
 
 For example, to extract information from a filename using a regular expression, we can write:
 
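Reviewer note on the hunk above: the renamed `separator` transform and the tightened `regex` wording (one capture group supplies the output value) can be illustrated with a short Python sketch. This is not the spec's own example and not the Croissant reference implementation. The function names `apply_separator` and `apply_regex` and the sample filename are hypothetical, and returning `None` for a non-matching input is an assumption, since the text in this diff does not say what happens in that case.

```python
import re
from typing import List, Optional


def apply_separator(value: str, separator: str) -> List[str]:
    """Split a string into an array using the supplied character."""
    return value.split(separator)


def apply_regex(value: str, pattern: str) -> Optional[str]:
    """Return the text matched by the pattern's single capture group.

    Assumption: a non-matching input yields None.
    """
    compiled = re.compile(pattern)
    # Enforce the "one capture group" wording from the updated spec text.
    if compiled.groups != 1:
        raise ValueError("pattern must contain exactly one capture group")
    match = compiled.search(value)
    return match.group(1) if match else None


print(apply_separator("cat,dog,bird", ","))
print(apply_regex("train_00042.jpg", r"^train_(\d+)\.jpg$"))
```

Rejecting patterns with zero or several capture groups is one possible reading of the new wording; a real implementation might instead ignore extra groups.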
@@ -2136,9 +2144,133 @@ For example, consider a dataset where each image is labeled by a different human
 
 In this example, the `labeled_images/label` field has an annotation `labeled_images/label/annotator`. The `equivalentProperty` "prov:wasAttributedTo" on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are pulled from the same source file (`annotations.csv`) on a row-by-row basis. The `gender` and `age` fields are mapped to their corresponding FOAF properties, `foaf:gender` and `foaf:age`, via `equivalentProperty`.
 
-### Data Use Restrictions
+### Data Use Conditions
+
+Datasets often come with restrictions on how they can be used, particularly in sensitive domains, such as healthcare. Representing these restrictions in a machine-readable format enables automated discovery and compliance checking. For instance, a healthcare dataset might be restricted to non-commercial research use only, or require specific ethics approval.
+
+Data use conditions can be attached to a dataset as a whole, or to part of a dataset, using [sc:usageInfo](http://schema.org/usageInfo) (an existing attribute of schema.org).
+
+### Using DUO to Represent Data Use Conditions
+
+The [DUO](http://purl.obolibrary.org/obo/duo.owl) ontology provides a set of terms that can be used to represent data use conditions in a machine-readable format. DUO is prevalent in the healthcare domain. Other vocabularies may be used in other verticals.
+
+To connect with terms from an external vocabulary, Croissant uses the [sc:DefinedTerm](http://schema.org/DefinedTerm) type, which is a schema.org type designed for that purpose.
+
+Here is an example that shows how to use the DUO term [DUO_0000042](http://purl.obolibrary.org/obo/DUO_0000042) to represent the data use condition "General Research Use":
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_"
+  },
+  "@type": "Dataset",
+  "name": "Global Health Imagery Dataset",
+  "description": "A dataset of public health imagery for research purposes.",
+To represent more complex restrictions, such as hierarchical permissions and modifiers, Croissant recommends using [ODRL](https://www.w3.org/TR/odrl-model/), a W3C standard that provides a rich framework for representing permissions and restrictions.
+
+To use ODRL in Croissant, `sc:usageInfo` is used as a container for an `odrl:Offer`, which represents a set of permissions. `odrl:action` represents the permission, and `odrl:constraint` represents modifiers.
+
+The following example shows how to combine DUO and ODRL to represent a data use policy that allows Health or Medical or Biomedical Use ([DUO_0000006](http://purl.obolibrary.org/obo/DUO_0000006)), but only for non-commercial purposes ([DUO_0000018](http://purl.obolibrary.org/obo/DUO_0000018)):
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_",
+    "odrl": "http://www.w3.org/ns/odrl/2/"
+  },
+  "@type": "Dataset",
+  "name": "Restricted Health Data",
+
+  "usageInfo": {
+    "@type": ["CreativeWork", "odrl:Offer"],
+    "name": "DUO Usage Policy",
+
+    "odrl:permission": {
+      "@type": "odrl:Permission",
+      "odrl:action": {
+        "@id": "duo:0000006",
+        "name": "Health or Medical or Biomedical Use"
+      },
+      "odrl:constraint": [
+        {
+          "@type": "odrl:Constraint",
+          "name": "Non-commercial use only",
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "duo:0000018" }
+        }
+      ]
+    }
+
+  }
+}
+```
+
+### Integration with Domain-Specific Ontologies
 
-TODO: Add guidance on representing data use restrictions.
+In the health domain, it is often necessary to specify that a dataset can only be used for research on a specific disease. DUO recommends using the [MONDO](https://mondo.monarchinitiative.org/) ontology to specify disease-specific restrictions.
+
+The example below shows how to use MONDO in combination with DUO and ODRL to specify that a dataset can only be used for research on Alzheimer's disease ([MONDO_0005070](http://purl.obolibrary.org/obo/MONDO_0005070)).
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_",
+    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
+    "odrl": "http://www.w3.org/ns/odrl/2/"
+  },
+  "@type": "Dataset",
+  "name": "Restricted Health Data",
+
+  "usageInfo": {
+    "@type": ["CreativeWork", "odrl:Offer"],
+    "name": "DUO Usage Policy",
+
+    "odrl:permission": {
+      "@type": "odrl:Permission",
+      "odrl:action": {
+        "@id": "duo:0000007",
+        "name": "Disease specific research"
+      },
+      "odrl:constraint": [
+        {
+          "@type": "odrl:Constraint",
+          "name": "Non-commercial use only",
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "duo:0000018" }
+        },
+        {
+          "@type": "odrl:Constraint",
+          "odrl:leftOperand": { "@id": "duo:0000010" },
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "mondo:0005070" }
+        }
+      ]
+    }
+  }
+}
+```
+
+This approach can be extended to other domain-specific ontologies.
 
 ## Appendix 1: JSON-LD context
 
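Reviewer note on the data use examples in the hunk above: a consumer has to expand compact identifiers such as `duo:0000018` with the prefixes declared in `@context` before comparing them with ontology IRIs. The Python sketch below shows that step and collects the right operands of the ODRL constraints. It is a simplified stand-in for a real JSON-LD processor, the helper names (`expand_iri`, `as_list`, `constraint_operands`) are hypothetical, and treating a non-object `usageInfo` value as carrying no constraints is an assumption.

```python
from typing import Dict, List

# Prefix mappings copied from the "@context" of the examples in this diff.
CONTEXT: Dict[str, str] = {
    "duo": "http://purl.obolibrary.org/obo/DUO_",
    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
    "odrl": "http://www.w3.org/ns/odrl/2/",
}


def expand_iri(compact: str, context: Dict[str, str]) -> str:
    """Expand 'prefix:suffix' to a full IRI; return the input if the prefix is unknown."""
    prefix, sep, suffix = compact.partition(":")
    if sep and prefix in context:
        return context[prefix] + suffix
    return compact


def as_list(value) -> list:
    """JSON-LD allows a single node or a list of nodes; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def constraint_operands(dataset: dict, context: Dict[str, str]) -> List[str]:
    """Collect the expanded right operands of every ODRL constraint under usageInfo."""
    operands: List[str] = []
    for policy in as_list(dataset.get("usageInfo")):
        if not isinstance(policy, dict):
            continue  # Assumption: plain-text or URL usageInfo carries no constraints.
        for permission in as_list(policy.get("odrl:permission")):
            for constraint in as_list(permission.get("odrl:constraint")):
                operand = constraint.get("odrl:rightOperand", {})
                if "@id" in operand:
                    operands.append(expand_iri(operand["@id"], context))
    return operands


dataset = {
    "@type": "Dataset",
    "usageInfo": {
        "@type": ["CreativeWork", "odrl:Offer"],
        "odrl:permission": {
            "@type": "odrl:Permission",
            "odrl:action": {"@id": "duo:0000007"},
            "odrl:constraint": [
                {"odrl:rightOperand": {"@id": "duo:0000018"}},
                {"odrl:rightOperand": {"@id": "mondo:0005070"}},
            ],
        },
    },
}

print(constraint_operands(dataset, CONTEXT))
```

In practice a JSON-LD library would handle expansion, including `@vocab` and nested contexts, which this sketch deliberately ignores.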
@@ -2150,6 +2282,7 @@ TODO: Add guidance on representing data use restrictions.
 "cr": "http://mlcommons.org/croissant/",
 "rai": "http://mlcommons.org/croissant/RAI/",
 "dct": "http://purl.org/dc/terms/",
+"annotation": "cr:annotation",
 "arrayShape": "cr:arrayShape",
 "citeAs": "cr:citeAs",
 "column": "cr:column",
@@ -2163,10 +2296,13 @@ TODO: Add guidance on representing data use restrictions.
   "@id": "cr:dataType",
   "@type": "@vocab"
 },
+"separator": "cr:separator",
+"equivalentProperty": "cr:equivalentProperty",
 "examples": {
   "@id": "cr:examples",
   "@type": "@json"
 },
+"excludes": "cr:excludes",
 "extract": "cr:extract",
 "field": "cr:field",
 "fileProperty": "cr:fileProperty",
@@ -2180,14 +2316,16 @@ TODO: Add guidance on representing data use restrictions.
docs/croissant.ttl (39 additions & 15 deletions)
@@ -8,12 +8,12 @@
 croissant:FileObject a rdf:Class ;
   rdfs:label "FileObject" ;
   rdfs:comment "An individual file that is part of a dataset." ;
-  rdfs:subClassOf schema:CreativeWork .
+  rdfs:subClassOf schema:DataDownload .
 
 croissant:FileSet a rdf:Class ;
   rdfs:label "FileSet" ;
   rdfs:comment "A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters." ;
-  rdfs:subClassOf schema:Intangible .
+  rdfs:subClassOf schema:DataDownload .
 
 croissant:RecordSet a rdf:Class ;
   rdfs:label "RecordSet" ;
@@ -84,21 +84,27 @@ croissant:citeAs a rdf:Property ;
   schema:domainIncludes schema:Dataset ;
   schema:rangeIncludes schema:Text .
 
+croissant:sdVersion a rdf:Property ;
+  rdfs:label "sdVersion" ;
+  rdfs:comment "The version of the dataset metadata, which may be distinct from the version of the dataset content." ;
+  schema:domainIncludes schema:Dataset ;
+  schema:rangeIncludes schema:Number, schema:Text .
+
 # FileObject & FileSet properties
 
 croissant:containedIn a rdf:Property ;
   rdfs:label "containedIn" ;
-  rdfs:comment "Another FileObjector FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
+  rdfs:comment "Another FileObject, FileSet or DataSource that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
-croissant:includes a rdf:Property ;# Should this be named includePattern instead?
+croissant:includes a rdf:Property ;
   rdfs:label "includes" ;
   rdfs:comment "A glob pattern that specifies the files to include, e.g., \".jpg\", \"/foo/pic*.jpg\". The pattern is evaluated from the root of the containedIn contents." ;
   schema:domainIncludes croissant:FileSet ;
   schema:rangeIncludes schema:Text .
 
-croissant:excludes a rdf:Property ;# Should this be named excludePattern instead?
+croissant:excludes a rdf:Property ;
   rdfs:label "excludes" ;
   rdfs:comment "A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated." ;
   schema:domainIncludes croissant:FileSet ;
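Reviewer note on `includes` and `excludes` in the hunk above: the comments state that exclude patterns are evaluated after include patterns. The Python sketch below shows that ordering. The helper `select_files` is hypothetical, and using the standard library's `fnmatch` as the glob dialect is an assumption: its `*` also matches `/`, which may differ from how a Croissant implementation interprets patterns.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_files(paths: Iterable[str], includes: List[str], excludes: List[str]) -> List[str]:
    """Keep paths matching any include pattern, then drop those matching any exclude pattern."""
    included = [p for p in paths if any(fnmatchcase(p, pat) for pat in includes)]
    return [p for p in included if not any(fnmatchcase(p, pat) for pat in excludes)]


# Paths are relative to the root of the containedIn contents.
paths = ["a.jpg", "b.png", "thumbs/c.jpg"]
print(select_files(paths, includes=["*.jpg"], excludes=["thumbs/*"]))
```

Here `thumbs/c.jpg` passes the include step and is then removed by the exclude step, so only `a.jpg` remains.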
@@ -130,6 +136,12 @@ croissant:examples a rdf:Property ;
   schema:domainIncludes croissant:RecordSet ;
   schema:rangeIncludes rdf:JSON .
 
+croissant:annotation a rdf:Property ;
+  rdfs:label "annotation" ;
+  rdfs:comment "One or more data-level annotations that apply to the entire record or field." ;