Add Responsible AI section with Provenance guidance (#957)

benjelloun · web-flow · commit 8edf4e6c5a06 · 2025-10-28T12:31:24.000+01:00
Draft provenance description for the 1.1 spec.
diff --git a/docs/croissant-spec-draft.md b/docs/croissant-spec-draft.md
@@ -1870,6 +1870,177 @@ Segmentation mask as an image:
 - `sc:GeoShape` describes segmentation masks as a sequence of coordinates (polygon).
 - `sc:ImageObject` describes segmentation masks as image overlays (with pixel = 0 outside of the mask and pixel = 1 inside the mask).
 
+## Responsible AI and Governance
+
+This section provides guidance on how to integrate external vocabularies with Croissant to address important Responsible AI use cases, such as provenance and data use restrictions.
+
+### Provenance Representation
+
+Tracking the provenance of a dataset is crucial for transparency, reproducibility, and responsible AI. It helps users understand where the data came from, how it was created, and how it has been modified over time. This is particularly important for datasets derived from other datasets, or those that have undergone significant transformations, such as filtering, augmentation, or annotation.
+
+Croissant recommends using the [W3C PROV Ontology (PROV-O)](https://www.w3.org/TR/prov-o/) to describe provenance. PROV-O provides a rich and standard vocabulary for describing the entities, activities, and agents involved in the lifecycle of data.
+
+To use PROV-O or other external vocabularies (like FOAF) in a Croissant dataset, first declare their namespaces in the `@context`. You can then use properties from these vocabularies on any Croissant object, such as the Dataset itself, a `FileObject`, a `RecordSet`, or a `Field`.
+
+Key PROV-O relationships include:
+
+*   `prov:wasDerivedFrom`: Indicates that an entity (e.g., the dataset or a part of it) was derived from another entity.
+*   `prov:wasGeneratedBy`: Links an entity to the activity that generated it (e.g., a data cleaning process, a web crawl).
+*   `prov:wasAttributedTo`: Links an entity to the agent responsible for it (e.g., a person, organization, or software).
+
+Provenance can be specified at multiple levels of granularity:
+
+**Dataset and Resource-level Provenance**
+
+You can describe the origin of the entire dataset. For example, the following describes ImageNet-C, a dataset derived from ImageNet by applying corruptions:
+
+```json
+{
+  "@context": {
+    "@vocab": "https://schema.org/",
+    "sc": "https://schema.org/",
+    "cr": "http://mlcommons.org/croissant/",
+    "prov": "http://www.w3.org/ns/prov#",
+    "foaf": "http://xmlns.com/foaf/0.1/"
+  },
+  "@type": "sc:Dataset",
+  "name": "ImageNet-C",
+  "description": "A variant of ImageNet with applied corruptions.",
+  "prov:wasDerivedFrom": { "@id": "urn:dataset:ImageNet" },
+  "prov:wasGeneratedBy": {
+    "@type": "prov:Activity",
+    "prov:label": "Corruption Transformation"
+  }
+  // ... other dataset properties
+}
+```
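+
+The activity node can carry more detail when it is known. As a sketch, the same derivation could also record the input that the activity consumed and the agent that ran it, using the standard PROV-O properties `prov:used` and `prov:wasAssociatedWith` and the class `prov:SoftwareAgent`. The agent name below is a hypothetical placeholder, not part of any real dataset description:
+
+```json
+{
+  "@type": "sc:Dataset",
+  "name": "ImageNet-C",
+  "prov:wasDerivedFrom": { "@id": "urn:dataset:ImageNet" },
+  "prov:wasGeneratedBy": {
+    "@type": "prov:Activity",
+    "prov:label": "Corruption Transformation",
+    "prov:used": { "@id": "urn:dataset:ImageNet" },
+    "prov:wasAssociatedWith": {
+      "@type": "prov:SoftwareAgent",
+      // Hypothetical agent name, for illustration only.
+      "prov:label": "example-corruption-pipeline"
+    }
+  }
+  // ... other dataset properties
+}
+```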
+
+Similarly, you can describe the provenance of individual resources (`FileObject` or `FileSet`). For example, to indicate that a file was downloaded from a specific URL by a crawling process:
+
+```json
+{
+  "@type": "cr:FileObject",
+  "@id": "raw_data.csv",
+  "contentUrl": "https://example.com/data.csv",
+  "prov:wasGeneratedBy": {
+    "@type": "prov:Activity",
+    "prov:label": "Web Crawl 2023-10",
+    "prov:endedAtTime": "2023-10-01T12:00:00Z"
+  },
+  "prov:wasAttributedTo": {
+    "@type": "prov:Agent",
+    "prov:label": "Common Crawl Foundation"
+  }
+}
+```
+
+**RecordSet and Field-level Provenance**
+
+Provenance can also be attached to specific `RecordSet`s or `Field`s. This is useful when different parts of the dataset have different origins, or when you want to document the creation of specific annotations.
+
+For example, you can indicate that a set of labels was generated by a specific software agent:
+
+```json
+{
+  "@type": "cr:RecordSet",
+  "@id": "images_with_labels",
+  "field": [
+    {
+      "@type": "cr:Field",
+      "@id": "images_with_labels/image"
+    },
+    {
+      "@type": "cr:Field",
+      "@id": "images_with_labels/label",
+      "dataType": "sc:Text",
+      "prov:wasAttributedTo": {
+        "@type": "prov:Agent",
+        "prov:label": "SyntheticDataGenerator-v1.2"
+      },
+      "prov:wasGeneratedBy": {
+        "@type": "prov:Activity",
+        "prov:label": "Automated Labeling Process"
+      }
+    }
+  ]
+}
+```
+
+**Data-level Provenance**
+
+For the finest level of granularity, you can attach provenance information to individual data values. This is achieved using Croissant's annotation mechanism, where an annotation field is used to hold the provenance information for another field. By setting the `equivalentProperty` of the annotation field to a PROV-O property, you can define the relationship between the data and its provenance.
+
+For example, consider a dataset where each image is labeled by a different human annotator, and we want to record who produced each label. We can combine the PROV-O and FOAF (Friend of a Friend) vocabularies to describe this: define an annotation field typed as `prov:Person` (the annotator), link it to the label field using `prov:wasAttributedTo`, and use FOAF properties to describe the person's attributes.
+
+```json
+{
+  "@type": "cr:RecordSet",
+  "@id": "labeled_images",
+  "field": [
+    {
+      "@type": "cr:Field",
+      "@id": "labeled_images/image_id"
+      // ... source definition
+    },
+    {
+      "@type": "cr:Field",
+      "@id": "labeled_images/label",
+      "dataType": ["sc:Text", "cr:Label"],
+      "source": {
+        "fileObject": { "@id": "annotations.csv" },
+        "extract": { "column": "label" }
+      },
+      "annotation": {
+        "@type": "cr:Field",
+        "@id": "labeled_images/label/annotator",
+        "description": "The annotator who created the label.",
+        "dataType": ["prov:Person", "foaf:Person"],
+        "equivalentProperty": "prov:wasAttributedTo",
+        "subField": [
+          {
+            "@type": "cr:Field",
+            "@id": "labeled_images/label/annotator/id",
+            "source": {
+              "fileObject": { "@id": "annotations.csv" },
+              "extract": { "column": "annotator_id" }
+            }
+          },
+          {
+            "@type": "cr:Field",
+            "@id": "labeled_images/label/annotator/gender",
+            "description": "Gender of the annotator.",
+            "dataType": "sc:Text",
+            "equivalentProperty": "foaf:gender",
+            "source": {
+              "fileObject": { "@id": "annotations.csv" },
+              "extract": { "column": "annotator_gender" }
+            }
+          },
+          {
+            "@type": "cr:Field",
+            "@id": "labeled_images/label/annotator/age",
+            "description": "Age of the annotator.",
+            "dataType": "sc:Integer",
+            "equivalentProperty": "foaf:age",
+            "source": {
+              "fileObject": { "@id": "annotations.csv" },
+              "extract": { "column": "annotator_age" }
+            }
+          }
+        ]
+      }
+    }
+  ]
+}
+```
+
+In this example, the `labeled_images/label` field has an annotation, `labeled_images/label/annotator`. Setting `equivalentProperty` to `prov:wasAttributedTo` on the annotation field indicates that each label is attributed to the corresponding person. The person's details (id, gender, age) are read from the same source file (`annotations.csv`) as the label, row by row. The `gender` and `age` subfields are mapped to the FOAF properties `foaf:gender` and `foaf:age` via `equivalentProperty`.
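+
+To make the row-by-row reading concrete, suppose `annotations.csv` contained a row with `label` = `cat`, `annotator_id` = `a17`, `annotator_gender` = `female`, and `annotator_age` = `34` (all hypothetical values). One way to read the statement implied for that record is sketched below. It is written as JSON-LD purely for illustration and is not a serialization that Croissant defines:
+
+```json
+{
+  // Illustrative reading of a single record; all values are hypothetical.
+  "labeled_images/label": "cat",
+  "prov:wasAttributedTo": {
+    "@type": ["prov:Person", "foaf:Person"],
+    "labeled_images/label/annotator/id": "a17",
+    "foaf:gender": "female",
+    "foaf:age": 34
+  }
+}
+```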
+
+By leveraging external vocabularies like PROV-O and FOAF, Croissant enables a standardized and machine-readable way to capture the rich history and context of ML datasets, supporting better trust and understanding.
+
+### Data Use Restrictions
+
+TODO: Add guidance on representing data use restrictions.
+
 ## Appendix 1: JSON-LD context
 
 ```json