mlcommons · benjelloun · Dec 5, 2025 · Dec 2, 2025 · Dec 3, 2025 · Dec 3, 2025
@@ -491,7 +491,7 @@ Other properties from [schema.org/Dataset](http://schema.org/Dataset) or its par
 
 ### Modified and Added Properties
 
-Croissant modifies the meaning of one [schema.org](http://schema.org) property, and requires its presence:
+Croissant modifies the meaning of one [schema.org](http://schema.org) property, and makes it required:
 
 <table>
   <thead>
@@ -507,7 +507,7 @@ Croissant modifies the meaning of one [schema.org](http://schema.org) property,
       <a href="#fileset">FileSet</a>
     </td>
     <td>MANY</td>
-    <td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type FileObject or FileSet.</td>
+    <td>By contrast with <a href="http://schema.org/Dataset">schema.org/Dataset</a>, Croissant requires the distribution property to have values of type <a href="#fileobject">FileObject</a> or <a href="#fileset">FileSet</a>. These are subclasses of <a href="http://schema.org/DataDownload">DataDownload</a>, so this definition is compatible with the original definition of the distribution property in schema.org.</td>
   </tr>
 </table>
 
@@ -533,6 +533,15 @@ The Croissant vocabulary also defines the following optional dataset-level attri
     <td>"A citation to the dataset itself, or a citation for a publication that describes the dataset. Ideally, citations should be expressed using the <a href="https://www.bibtex.org/">bibtex</a> format.<br>
     Note that this is different from <a href="http://schema.org/citation">schema.org/citation</a>, which is used to make a citation to another publication from this dataset.
     </td>
+  </tr>
+  <tr>
+    <td>sdVersion</td>
+    <td>
+      <a href="http://schema.org/Number">Number</a><br>
+      <a href="http://schema.org/Text">Text</a>
+    </td>
+    <td>ONE</td>
+    <td>The version of the dataset <i>metadata</i>, which may be distinct from the version of the dataset <i>content</i>. This property is modeled after schema.org's <a href="http://schema.org/sdLicense">sdLicense</a> and <a href="http://schema.org/sdPublisher">sdPublisher</a>, and may move to schema.org in the future.</td>
   </tr>
 </table>
 
@@ -1148,11 +1157,10 @@ Sometimes, not all the data from the source is needed, but only a subset. The `E
 
 Croissant supports a few simple transformations that can be applied on the source data:
 
-- delimiter: split a string into an array using the supplied character.
+- separator: split a string into an array using the supplied character.
 - readLines: read the content of the file line by line.
 - unArchive: extract the content of the archive. True by default for archive file types (zip, tgz, etc.).
-- regex: A regular expression to parse the data.
-- jsonPath: A JSON path to evaluate on the (JSON) data source.
+- regex: A regular expression to parse the data, with one capture group that corresponds to the output value.
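
As a rough illustration of the transformations listed above, here is a minimal Python sketch. The function names (`apply_separator`, `apply_read_lines`, `apply_regex`) are hypothetical and not part of any Croissant library. The sketch also assumes the regular expression is searched anywhere in the input, which the text above does not specify.

```python
import re
from typing import List, Optional


def apply_separator(value: str, separator: str) -> List[str]:
    """separator: split a string into an array using the supplied character."""
    return value.split(separator)


def apply_read_lines(content: str) -> List[str]:
    """readLines: read the content line by line (line endings are dropped)."""
    return content.splitlines()


def apply_regex(value: str, pattern: str) -> Optional[str]:
    """regex: return the single capture group, or None when nothing matches.

    Assumption: the pattern is searched anywhere in the input string.
    """
    match = re.search(pattern, value)
    return match.group(1) if match else None


labels = apply_separator("cat,dog,bird", ",")
lines = apply_read_lines("first\nsecond\n")
index = apply_regex("train_0042.jpg", r"^train_(\d+)\.jpg$")
```

The regular expression carries exactly one capture group, so the output value is unambiguous: whatever the group captures becomes the value of the field.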
 
 For example, to extract information from a filename using a regular expression, we can write:
 
@@ -2136,9 +2144,133 @@ For example, consider a dataset where each image is labeled by a different human
 
 In this example, the `labeled_images/label` field has an annotation `labeled_images/label/annotator`. The `equivalentProperty` "prov:wasAttributedTo" on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are pulled from the same source file (`annotations.csv`) on a row-by-row basis. The `gender` and `age` fields are mapped to their corresponding FOAF properties, `foaf:gender` and `foaf:age`, via `equivalentProperty`.
 
-### Data Use Restrictions
+### Data Use Conditions
+
+Datasets often come with restrictions on how they can be used, particularly in sensitive domains, such as healthcare. Representing these restrictions in a machine-readable format enables automated discovery and compliance checking. For instance, a healthcare dataset might be restricted to non-commercial research use only, or require specific ethics approval.
+
+Data use conditions can be attached to a dataset as a whole, or to part of a dataset, using [sc:usageInfo](http://schema.org/usageInfo) (an existing schema.org property).
+
+### Using DUO to Represent Data Use Conditions
+
+The [DUO](http://purl.obolibrary.org/obo/duo.owl) ontology provides a set of terms that can be used to represent data use conditions in a machine-readable format. DUO is prevalent in the healthcare domain. Other vocabularies may be used in other verticals.
+
+To connect with terms from an external vocabulary, Croissant uses the [sc:DefinedTerm](http://schema.org/DefinedTerm) type, which is a schema.org type designed for that purpose.
+
+Here is an example that shows how to use the DUO term [DUO_0000042](http://purl.obolibrary.org/obo/DUO_0000042) to represent the data use condition "General Research Use":
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_"
+  },
+  "@type": "Dataset",
+  "name": "Global Health Imagery Dataset",
+  "description": "A dataset of public health imagery for research purposes.",
+  "url": "https://example.org/dataset/global-health-1",
+  "usageInfo": [
+    {
+      "@type": "DefinedTerm",
+      "name": "General Research Use",
+      "termCode": "DUO_0000042",
+      "url": "duo:0000042"
+    }
+  ]
+}
+```
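
To show how a consumer might read these terms back, here is a minimal Python sketch that lists the `DefinedTerm` entries under `usageInfo` and expands compact IRIs such as `duo:0000042`. The helpers `expand_compact_iri` and `usage_terms` are hypothetical names. Expansion here is plain string concatenation against the `@context` prefixes, which is an assumption: a conforming JSON-LD processor applies additional rules and should be preferred in practice.

```python
from typing import List, Tuple


def expand_compact_iri(value: str, context: dict) -> str:
    """Expand 'prefix:suffix' using string-valued prefixes from @context.

    Assumption: simple concatenation, not full JSON-LD IRI expansion.
    """
    prefix, sep, suffix = value.partition(":")
    base = context.get(prefix)
    if sep and isinstance(base, str) and not suffix.startswith("//"):
        return base + suffix
    return value  # absolute IRI or unknown prefix: leave unchanged


def usage_terms(dataset: dict) -> List[Tuple[str, str]]:
    """Return (name, expanded url) for each DefinedTerm under usageInfo."""
    context = dataset.get("@context", {})
    usage = dataset.get("usageInfo", [])
    entries = usage if isinstance(usage, list) else [usage]
    terms = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        types = entry.get("@type", [])
        types = types if isinstance(types, list) else [types]
        if "DefinedTerm" in types:
            url = expand_compact_iri(entry.get("url", ""), context)
            terms.append((entry.get("name", ""), url))
    return terms


# Trimmed copy of the example above.
metadata = {
    "@context": {
        "@vocab": "https://schema.org/",
        "duo": "http://purl.obolibrary.org/obo/DUO_",
    },
    "@type": "Dataset",
    "usageInfo": [
        {
            "@type": "DefinedTerm",
            "name": "General Research Use",
            "termCode": "DUO_0000042",
            "url": "duo:0000042",
        }
    ],
}

terms = usage_terms(metadata)
```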
+
+### Fine-Grained Control with ODRL
+
+To represent more complex restrictions, such as hierarchical permissions and modifiers, Croissant recommends using [ODRL](https://www.w3.org/TR/odrl-model/), a W3C standard that provides a rich framework for representing permissions and restrictions.
+
+To use ODRL in Croissant, `sc:usageInfo` is used as a container for an `odrl:Offer`, which represents a set of permissions. Within each `odrl:permission`, `odrl:action` identifies the permitted use, and `odrl:constraint` lists the modifiers that restrict it.
+
+The following example shows how to combine DUO and ODRL to represent a data use policy that allows Health or Medical or Biomedical research use ([DUO_0000006](http://purl.obolibrary.org/obo/DUO_0000006)), but only for non-commercial purposes ([DUO_0000018](http://purl.obolibrary.org/obo/DUO_0000018)):
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_",
+    "odrl": "http://www.w3.org/ns/odrl/2/"
+  },
+  "@type": "Dataset",
+  "name": "Restricted Health Data",
+
+  "usageInfo": {
+    "@type": ["CreativeWork", "odrl:Offer"],
+    "name": "DUO Usage Policy",
+
+    "odrl:permission": {
+      "@type": "odrl:Permission",
+      "odrl:action": {
+        "@id": "duo:0000006",
+        "name": "Health or Medical or Biomedical Use"
+      },
+      "odrl:constraint": [
+        {
+          "@type": "odrl:Constraint",
+           "name": "Non-commercial use only",
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "duo:0000018" }
+        }
+      ]
+    }
+
+  }
+}
+```
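
To illustrate how such a policy could be read programmatically, here is a minimal Python sketch that walks the `odrl:permission` entries of a `usageInfo` value and collects each action with its constraints. The helper names (`as_list`, `node_id`, `summarize_permissions`) are hypothetical. The sketch assumes the compact JSON-LD keys appear exactly as written in the example, without prior expansion by a JSON-LD processor.

```python
from typing import Any, Dict, List, Optional


def as_list(value: Any) -> List[Any]:
    """JSON-LD allows a single node or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def node_id(node: Any) -> Optional[str]:
    """Return the @id of a node reference, or the value itself if not a dict."""
    return node.get("@id") if isinstance(node, dict) else node


def summarize_permissions(usage_info: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect (leftOperand, operator, rightOperand) triples per permission."""
    summaries = []
    for permission in as_list(usage_info.get("odrl:permission")):
        constraints = [
            (
                node_id(c.get("odrl:leftOperand")),
                node_id(c.get("odrl:operator")),
                node_id(c.get("odrl:rightOperand")),
            )
            for c in as_list(permission.get("odrl:constraint"))
        ]
        summaries.append(
            {"action": node_id(permission.get("odrl:action")), "constraints": constraints}
        )
    return summaries


# The usageInfo value from the example above.
usage_info = {
    "@type": ["CreativeWork", "odrl:Offer"],
    "name": "DUO Usage Policy",
    "odrl:permission": {
        "@type": "odrl:Permission",
        "odrl:action": {"@id": "duo:0000006", "name": "Health or Medical or Biomedical Use"},
        "odrl:constraint": [
            {
                "@type": "odrl:Constraint",
                "name": "Non-commercial use only",
                "odrl:operator": {"@id": "odrl:eq"},
                "odrl:rightOperand": {"@id": "duo:0000018"},
            }
        ],
    },
}

summary = summarize_permissions(usage_info)
```

A missing operand shows up as `None` in the triple, which makes incomplete constraints easy to spot before attempting any compliance check.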
+
+### Integration with Domain-Specific Ontologies
 
-TODO: Add guidance on representing data use restrictions.
+In the health domain, it is often necessary to specify that a dataset can only be used for research on a specific disease. DUO recommends using the [MONDO](https://mondo.monarchinitiative.org/) ontology to specify disease-specific restrictions. 
+
+The example below shows how to use MONDO in combination with DUO and ODRL to specify that a dataset can only be used for research on Alzheimer's disease ([MONDO_0005070](http://purl.obolibrary.org/obo/MONDO_0005070)).
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "duo": "http://purl.obolibrary.org/obo/DUO_",
+    "mondo": "http://purl.obolibrary.org/obo/MONDO_",
+    "odrl": "http://www.w3.org/ns/odrl/2/"
+  },
+  "@type": "Dataset",
+  "name": "Restricted Health Data",
+
+  "usageInfo": {
+    "@type": ["CreativeWork", "odrl:Offer"], 
+    "name": "DUO Usage Policy",
+
+    "odrl:permission": {
+      "@type": "odrl:Permission",
+      "odrl:action": {
+        "@id": "duo:0000007",
+        "name": "Disease specific research"
+      },
+      "odrl:constraint": [
+        {
+          "@type": "odrl:Constraint",
+          "name": "Non-commercial use only",
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "duo:0000018" }
+        },
+        {
+          "@type": "odrl:Constraint",
+          "odrl:leftOperand": { "@id": "duo:0000010" },
+          "odrl:operator": { "@id": "odrl:eq" },
+          "odrl:rightOperand": { "@id": "mondo:0005070" }
+        }
+      ]
+    }
+  }
+}
+```
+
+This approach can be extended to other domain-specific ontologies.
 
 ## Appendix 1: JSON-LD context
 
@@ -2150,6 +2282,7 @@ TODO: Add guidance on representing data use restrictions.
     "cr": "http://mlcommons.org/croissant/",
     "rai": "http://mlcommons.org/croissant/RAI/",
     "dct": "http://purl.org/dc/terms/",
+    "annotation": "cr:annotation",
     "arrayShape": "cr:arrayShape",
     "citeAs": "cr:citeAs",
     "column": "cr:column",
@@ -2163,10 +2296,13 @@ TODO: Add guidance on representing data use restrictions.
       "@id": "cr:dataType",
       "@type": "@vocab"
     },
+    "separator": "cr:separator",
+    "equivalentProperty": "cr:equivalentProperty",
     "examples": {
       "@id": "cr:examples",
       "@type": "@json"
     },
+    "excludes": "cr:excludes",
     "extract": "cr:extract",
     "field": "cr:field",
     "fileProperty": "cr:fileProperty",
@@ -2180,14 +2316,16 @@ TODO: Add guidance on representing data use restrictions.
     "key": "cr:key",
     "md5": "cr:md5",
     "parentField": "cr:parentField",
-    "path": "cr:path",
     "recordSet": "cr:recordSet",
     "references": "cr:references",
     "regex": "cr:regex",
-    "replace": "cr:replace",
+    "readLines": "cr:readLines",
+    "sdVersion": "cr:sdVersion",
     "separator": "cr:separator",
     "source": "cr:source",
     "subField": "cr:subField",
-    "transform": "cr:transform"
+    "transform": "cr:transform",
+    "unArchive": "cr:unArchive",
+    "value": "cr:value"
   }
 ```
@@ -8,12 +8,12 @@
 croissant:FileObject a rdf:Class ;
   rdfs:label "FileObject" ;
   rdfs:comment "An individual file that is part of a dataset." ;
-  rdfs:subClassOf schema:CreativeWork .
+  rdfs:subClassOf schema:DataDownload .
 
 croissant:FileSet a rdf:Class ;
   rdfs:label "FileSet" ;
   rdfs:comment "A set of homogeneous files extracted from a container, optionally filtered by inclusion and/or exclusion filters." ;
-  rdfs:subClassOf schema:Intangible .
+  rdfs:subClassOf schema:DataDownload .
 
 croissant:RecordSet a rdf:Class ;
   rdfs:label "RecordSet" ;
@@ -84,21 +84,27 @@ croissant:citeAs a rdf:Property ;
   schema:domainIncludes schema:Dataset ;
   schema:rangeIncludes schema:Text .
 
+croissant:sdVersion a rdf:Property ;
+  rdfs:label "sdVersion" ;
+  rdfs:comment "The version of the dataset metadata, which may be distinct from the version of the dataset content." ;
+  schema:domainIncludes schema:Dataset ;
+  schema:rangeIncludes schema:Number, schema:Text .
+
 # FileObject & FileSet properties
 
 croissant:containedIn a rdf:Property ;
   rdfs:label "containedIn" ;
-  rdfs:comment "Another FileObject or FileSet that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
+  rdfs:comment "Another FileObject, FileSet or DataSource that this one is contained in, e.g., in the case of a file extracted from an archive. When this property is present, the contentUrl is evaluated as a relative path within the container object." ;
   schema:domainIncludes croissant:FileObject, croissant:FileSet ;
-  schema:rangeIncludes croissant:FileObject, croissant:FileSet .
+  schema:rangeIncludes croissant:FileObject, croissant:FileSet, croissant:DataSource .
 
-croissant:includes a rdf:Property ; # Should this be named includePattern instead?
+croissant:includes a rdf:Property ;
   rdfs:label "includes" ;
   rdfs:comment "A glob pattern that specifies the files to include, e.g., \".jpg\", \"/foo/pic*.jpg\". The pattern is evaluated from the root of the containedIn contents." ;
   schema:domainIncludes croissant:FileSet ;
   schema:rangeIncludes schema:Text .
 
-croissant:excludes a rdf:Property ; # Should this be named excludePattern instead?
+croissant:excludes a rdf:Property ;
   rdfs:label "excludes" ;
   rdfs:comment "A glob pattern that specifies the files to exclude. The pattern is evaluated from the root of the containedIn contents, after the includes patterns have been evaluated." ;
   schema:domainIncludes croissant:FileSet ;
@@ -130,6 +136,12 @@ croissant:examples a rdf:Property ;
   schema:domainIncludes croissant:RecordSet ;
   schema:rangeIncludes rdf:JSON .
 
+croissant:annotation a rdf:Property ;
+  rdfs:label "annotation" ;
+  rdfs:comment "One or more data-level annotations that apply to the entire record or field." ;
+  schema:domainIncludes croissant:RecordSet, croissant:Field ;
+  schema:rangeIncludes croissant:Field .
+
 croissant:source a rdf:Property ;
   rdfs:label "source" ;
   rdfs:comment "The data source of the field. This will generally reference a FileObject or FileSet's contents (e.g., a specific column of a table)." ;
@@ -142,6 +154,12 @@ croissant:dataType a rdf:Property ;
   schema:domainIncludes croissant:RecordSet, croissant:Field ;
   schema:rangeIncludes croissant:DataType .
 
+croissant:value a rdf:Property ;
+  rdfs:label "value" ;
+  rdfs:comment "An optional constant value for the field." ;
+  schema:domainIncludes croissant:Field ;
+  schema:rangeIncludes rdf:JSON .
+
 croissant:repeated a rdf:Property ;
   rdfs:label "repeated" ;
   rdfs:comment "If true, then the Field is a list of values of type dataType." ;
@@ -238,24 +256,30 @@ croissant:jsonPath a rdf:Property ;
 
 # Transform properties
 
-croissant:delimiter a rdf:Property ;
-  rdfs:label "delimiter" ;
-  rdfs:comment "A delimiter to use parse the data into an array." ;
+croissant:separator a rdf:Property ;
+  rdfs:label "separator" ;
+  rdfs:comment "A separator to use to parse the data into an array." ;
   schema:domainIncludes croissant:Transform ;
   schema:rangeIncludes schema:Text .
 
+croissant:readLines a rdf:Property ;
+  rdfs:label "readLines" ;
+  rdfs:comment "Read the content of the file line by line." ;
+  schema:domainIncludes croissant:Transform ;
+  schema:rangeIncludes schema:Boolean .
+
+croissant:unArchive a rdf:Property ;
+  rdfs:label "unArchive" ;
+  rdfs:comment "Extract the content of the archive." ;
+  schema:domainIncludes croissant:Transform ;
+  schema:rangeIncludes schema:Boolean .
+
 croissant:regex a rdf:Property ;
   rdfs:label "regex" ;
   rdfs:comment "A regular expression to apply to the data." ;
   schema:domainIncludes croissant:Transform ;
   schema:rangeIncludes schema:Text .
 
-croissant:jsonQuery a rdf:Property ;
-  rdfs:label "jsonQuery" ;
-  rdfs:comment "For JSON content, a query to evaluate on the data." ;
-  schema:domainIncludes croissant:Transform ;
-  schema:rangeIncludes schema:Text .
-
 ### ML-specific definitions
 
 croissant:Split a rdf:class ;