Adding WildChat-1M, sQuad V2, and Data Provenance Initiative dataset examples implementing the provenance mechanism. #970

JoanGi · 2025-11-25T10:39:39Z

Summary

Beyond the guidelines provided in the upcoming specification for the provenance mechanism, it would be helpful to include examples of popular datasets that implement this mechanism.

These examples will simplify the process of adapting the mlcroissant library and assist adopters in integrating this feature effectively.

Changes in this PR

WildChat-1M dataset: Enhanced with provenance information.

sQuad V2 dataset: Added a minimal example demonstrating the linkage to the previous version of the sQuad dataset.

Common Pile, Data Provenance Initiative: Implemented the data-level provenance mechanism.

These additions provide practical references for implementing the provenance mechanism in other datasets.

github-actions · 2025-11-25T10:39:52Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

benjelloun

Thanks for adding these datasets!

Do you have an example with data-level lineage?

datasets/1.1/huggingface-squad_v2/metadata.json

benjelloun · 2025-11-25T10:47:40Z

datasets/1.1/huggingface-squad_v2/metadata.json

+      "usage": [
+        "squad1"
+      ],
+      "isAssociatedWith": [


prov:isAssociatedWith ?

Also, how does this one relate to the prov:isAssociatedWith property below?

Yes, that's prov:isAssociatedWith! This property relates the Agent to the Activity, while the property below relates the agent to the whole dataset. Updating it!
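For reference, PROV-O separates the two relations discussed here: `prov:wasAssociatedWith` links an Activity to an Agent, and `prov:wasAttributedTo` links an Entity (such as a whole dataset) to an Agent. The sketch below illustrates that distinction only. All identifiers are hypothetical, and it does not reproduce the contents of the squad_v2 metadata.json, which are not shown in full in this thread.

```python
import json

# Hypothetical identifiers for illustration; not taken from metadata.json.
agent = {"@type": "prov:Agent", "@id": "example_agent"}

# Activity -> Agent: who or what carried out a processing step.
activity = {
    "@type": "prov:Activity",
    "@id": "example_activity",
    "prov:wasAssociatedWith": [{"@id": agent["@id"]}],
}

# Entity (the dataset as a whole) -> Agent: who the dataset is attributed to.
dataset = {
    "@type": "prov:Entity",
    "@id": "example_dataset",
    "prov:wasAttributedTo": [{"@id": agent["@id"]}],
    "prov:wasGeneratedBy": [{"@id": activity["@id"]}],
}

print(json.dumps([agent, activity, dataset], indent=2))
```

Keeping the two properties on different node types makes it clear whether an agent is credited for one step or for the dataset as a whole.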

benjelloun · 2025-11-25T10:50:02Z

datasets/1.1/huggingface-wildchat/metadata.json

+      "@id": "journalismPIIremoval",
+      "prov:description":"Conversations flagged by Niloofar Mireshghallah and her collaborators in 'Breaking News: Case Studies of Generative AI's Use in Journalism' for containing PII or sensitive information have been removed from this version of the dataset.",
+      "prov:startedAtTime": "2024-10-17",
+      "prov:wasAssociatedWith": ["niloofar_mireshghallah"]


Should this be a URL?

or an @id reference?

It can be a URL; here it is an @id reference to a prov:Person declared below. Fixing the @id reference.
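One way to catch dangling references of this kind is to check that every `prov:wasAssociatedWith` value is either a URL or the `@id` of a declared node. The sketch below is an assumption-laden illustration: the helper `undeclared_agent_refs` is hypothetical (it is not part of mlcroissant), and it assumes the provenance nodes are available as a flat list of dicts, which may not match the actual layout of metadata.json. The sample deliberately leaves `openai_moderation_api` undeclared to show a failing case.

```python
def undeclared_agent_refs(nodes):
    """Return (activity_id, ref) pairs whose agent reference matches no declared @id."""
    declared = {node["@id"] for node in nodes if "@id" in node}
    missing = []
    for node in nodes:
        refs = node.get("prov:wasAssociatedWith", [])
        if not isinstance(refs, list):
            refs = [refs]
        for ref in refs:
            # Accept both {"@id": "..."} objects and bare strings.
            ref_id = ref["@id"] if isinstance(ref, dict) else ref
            # Absolute URLs are treated as external references and skipped.
            if ref_id.startswith(("http://", "https://")):
                continue
            if ref_id not in declared:
                missing.append((node.get("@id"), ref_id))
    return missing


# Hypothetical flat node list reusing identifiers from the snippets in this thread.
nodes = [
    {"@type": "prov:Person", "@id": "niloofar_mireshghallah"},
    {
        "@id": "journalismPIIremoval",
        "prov:wasAssociatedWith": [{"@id": "niloofar_mireshghallah"}],
    },
    {
        "@id": "toxicContentRemoval",
        "prov:wasAssociatedWith": ["openai_moderation_api"],
    },
]

print(undeclared_agent_refs(nodes))
```

Note that in JSON-LD a bare string is only interpreted as a node reference when the context coerces the term to `@id`; otherwise the explicit `{"@id": ...}` form is needed.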

benjelloun · 2025-11-25T10:50:10Z

datasets/1.1/huggingface-wildchat/metadata.json

+      "@id": "toxicContentRemoval",
+      "prov:description":"All toxic conversations identified by the OpenAI Moderations API or Detoxify have been removed from this version of the dataset.",
+      "prov:startedAtTime": "2024-07-22",
+      "prov:wasAssociatedWith": ["openai_moderation_api"]


Same question

benjelloun · 2025-11-25T10:50:18Z

datasets/1.1/huggingface-wildchat/metadata.json

+      "@type": "prov:Activity",
+      "@id": "acitivity:PIIremoval",
+      "description":"The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.",
+      "wasAssociatedWith": ["presidio"]


Same question

This PR adds support for external vocabularies, following #885, #955, and #738. It relies on the datasets kindly provided in #970 for e2e testing.

JoanGi · 2025-11-27T17:00:26Z

Comments fixed - please double-check.

Regarding data-level provenance, I’ve added, as an example, an updated Croissant description of the Data Provenance Initiative Dataset from the Common Pile project (available on Hugging Face), which implements the proposed data-level PROV mechanism.

This dataset is composed of datums from 61 source datasets. At the datum (or "row") level, it includes details about the source dataset and its license. Specifically, it provides a metadata field with subFields that reference the original datasets and their corresponding licenses.

The Field "metadata" of the Croissant description is now annotated to represent the following logical sequence:

metadata is of type prov:Entity, and an equivalent property of "prov:wasDerivedFrom"

The attributes of this prov:Entity are:
  @id -> SubField metadata/dataset_id
  prov:atLocation -> SubField metadata/url
  prov:wasAttributedTo -> SubField metadata/license_url

I still have doubts about how to declare that the subField metadata/dataset_id is the @id of the prov:Entity. Please take a look!
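To make the intended mapping concrete, here is a minimal Python sketch of how one row could be turned into a prov:Entity under the mapping listed above. The helper `row_to_prov_entity`, the row shape, and all sample values are hypothetical. Only the subField names (dataset_id, url, license_url) come from the description above, and the sketch uses the PROV-O spelling `prov:atLocation`. This is not how mlcroissant processes the annotation.

```python
def row_to_prov_entity(row):
    """Build a prov:Entity dict from a row's "metadata" subFields."""
    meta = row["metadata"]
    return {
        "@type": "prov:Entity",
        "@id": meta["dataset_id"],                    # metadata/dataset_id
        "prov:atLocation": meta["url"],               # metadata/url
        "prov:wasAttributedTo": meta["license_url"],  # metadata/license_url
    }


# Hypothetical row; the values are placeholders, not real dataset content.
row = {
    "text": "example text",
    "metadata": {
        "dataset_id": "example-source-dataset",
        "url": "https://example.org/source-dataset",
        "license_url": "https://example.org/license",
    },
}

# The row is described as derived from the source entity.
lineage = {"prov:wasDerivedFrom": row_to_prov_entity(row)}
print(lineage)
```

Under this reading, the open question is only how the Croissant description should state that dataset_id plays the role of `@id`; the other two subFields map to ordinary properties.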

This example shows how the mechanism points to source datasets and thereby captures data lineage. However, other datasets from the Common Pile, such as the GitHub Archive, Library of Congress, or the Gutenberg project, have the same structure of data-level provenance, pointing to different kinds of entities such as code repositories, politicians and laws, or authors and media. I recommend uploading examples using these datasets to demonstrate different uses of the mechanism. (I can do it in a few days.)

benjelloun

Thanks for adding this example Joan!

benjelloun · 2025-12-01T08:49:05Z