@JoanGi JoanGi commented Nov 25, 2025

Summary

Beyond the guidelines provided in the upcoming specification for the provenance mechanism, it would be helpful to include examples of popular datasets that implement this mechanism.

These examples will simplify the process of adapting the mlcroissant library and assist adopters in integrating this feature effectively.

Changes in this PR

WildChat-1M dataset: Enhanced with provenance information.

SQuAD v2 dataset: Added a minimal example demonstrating the linkage to the previous version of the SQuAD dataset.

Common Pile (Data Provenance Initiative): Implements the data-level provenance mechanism.

These additions provide practical references for implementing the provenance mechanism in other datasets.

@JoanGi JoanGi requested a review from a team as a code owner November 25, 2025 10:39
github-actions bot commented Nov 25, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@benjelloun benjelloun left a comment

Thanks for adding these datasets!

Do you have an example with data-level lineage?

"usage": [
"squad1"
],
"isAssociatedWith": [
Contributor

prov:isAssociatedWith ?

Contributor

Also, how does this one relate to the prov:isAssociatedWith property below?

Contributor Author

Yes, that's prov:isAssociatedWith! This property relates the Agent to the Activity, while the property below relates the agent to the whole dataset. Updating it!

"@id": "journalismPIIremoval",
"prov:description":"Conversations flagged by Niloofar Mireshghallah and her collaborators in 'Breaking News: Case Studies of Generative AI's Use in Journalism' for containing PII or sensitive information have been removed from this version of the dataset.",
"prov:startedAtTime": "2024-10-17",
"prov:wasAssociatedWith": ["niloofar_mireshghallah"]
Contributor

Should this be a URL?

Contributor

or an @id reference?

Contributor Author

It can be a URL; here it is an @id reference to a prov:Person declared below. Fixing the @id reference.
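The point about @id references can be illustrated with a small check that every agent named in prov:wasAssociatedWith resolves to a node declared in the same document. This is only a sketch: the "@graph" wrapper, the prov:Person node, and the helper name dangling_agent_refs are assumptions made for illustration and are not taken from the PR's files or from mlcroissant.

```python
# Hypothetical fragment modelled on the snippets quoted in this review.
# The "@graph" wrapper and the prov:Person node are assumptions.
doc = {
    "@graph": [
        {
            "@type": "prov:Activity",
            "@id": "journalismPIIremoval",
            "prov:startedAtTime": "2024-10-17",
            # Reference written as an explicit @id object.
            "prov:wasAssociatedWith": [{"@id": "niloofar_mireshghallah"}],
        },
        {
            "@type": "prov:Person",
            "@id": "niloofar_mireshghallah",
        },
        {
            "@type": "prov:Activity",
            "@id": "toxicContentRemoval",
            "prov:startedAtTime": "2024-07-22",
            # Reference written as a bare string; no matching node is declared.
            "prov:wasAssociatedWith": ["openai_moderation_api"],
        },
    ]
}


def dangling_agent_refs(document):
    """Return (activity_id, agent_id) pairs whose agent @id is not declared."""
    nodes = document["@graph"]
    declared = {node["@id"] for node in nodes if "@id" in node}
    missing = []
    for node in nodes:
        for ref in node.get("prov:wasAssociatedWith", []):
            # Accept both {"@id": "..."} objects and bare strings.
            agent_id = ref["@id"] if isinstance(ref, dict) else ref
            # URLs point outside the document, so they are not checked here.
            if agent_id.startswith(("http://", "https://")):
                continue
            if agent_id not in declared:
                missing.append((node["@id"], agent_id))
    return missing


print(dangling_agent_refs(doc))
# prints [('toxicContentRemoval', 'openai_moderation_api')]
```

A check of this kind flags exactly the case raised in the review: an agent identifier that is neither a URL nor the @id of a declared node.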

"@id": "toxicContentRemoval",
"prov:description":"All toxic conversations identified by the OpenAI Moderations API or Detoxify have been removed from this version of the dataset.",
"prov:startedAtTime": "2024-07-22",
"prov:wasAssociatedWith": ["openai_moderation_api"]
Contributor

Same question

"@type": "prov:Activity",
"@id": "acitivity:PIIremoval",
"description":"The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.",
"wasAssociatedWith": ["presidio"]
Contributor

Same question

@ccl-core ccl-core mentioned this pull request Nov 25, 2025
ccl-core added a commit that referenced this pull request Nov 26, 2025
This PR adds support for external vocabularies, following
#885,
#955, and
#738.

It relies on the datasets kindly provided in
#970 for e2e testing.

JoanGi commented Nov 27, 2025

Comments fixed - please double-check.

Regarding data-level provenance, I’ve added, as an example, an updated Croissant description of the Data Provenance Initiative Dataset from the Common Pile project (available on Hugging Face), which implements the proposed data-level PROV mechanism.

This dataset is composed of datums from 61 source datasets. At the datum ("row") level, it includes details about the source dataset and its license. Specifically, it provides a metadata field with subFields that reference the original datasets and their corresponding licenses.

The "metadata" Field of the Croissant description is now annotated to represent the following logical sequence:

metadata is of type prov:Entity and is declared as an equivalent property of "prov:wasDerivedFrom".

The attributes of this prov:Entity are:
  @id -> SubField metadata/dataset_id
  prov:atLocation -> SubField metadata/url
  prov:wasAttributedTo -> SubField metadata/license_url
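
The mapping above can be sketched in Python as a function that turns one row's metadata into a prov:Entity node. The sub-field names (dataset_id, url, license_url) come from this comment; the function names and the example values are hypothetical, and the sketch only mirrors the mapping as proposed here, not a finalized part of the Croissant specification.

```python
import json


def metadata_to_prov_entity(metadata):
    """Map one row's metadata dict to a prov:Entity node (illustrative only)."""
    return {
        "@type": "prov:Entity",
        "@id": metadata["dataset_id"],
        "prov:atLocation": metadata["url"],
        "prov:wasAttributedTo": metadata["license_url"],
    }


def row_lineage(row):
    """Express the row-level lineage: the row was derived from its source entity."""
    return {"prov:wasDerivedFrom": metadata_to_prov_entity(row["metadata"])}


# Hypothetical row; the values are placeholders, not taken from the dataset.
row = {
    "text": "...",
    "metadata": {
        "dataset_id": "example-source",
        "url": "https://example.org/source",
        "license_url": "https://example.org/license",
    },
}

lineage = row_lineage(row)
print(json.dumps(lineage, indent=2))
```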

I still have doubts about how to declare that the subField metadata/dataset_id is the @id of the prov:Entity. Please take a look!

This example shows how the mechanism points to source datasets and thereby captures data lineage. However, other datasets from the Common Pile, such as the GitHub Archive, the Library of Congress, or Project Gutenberg, have the same data-level provenance structure while pointing to different kinds of entities, such as code repositories, politicians and laws, or authors and media. I recommend uploading examples using these datasets to demonstrate different uses of the mechanism. (I can do it in a few days.)

@benjelloun benjelloun left a comment

Thanks for adding this example Joan!

}
},
{
"@type": "cr:Field",
Contributor

This should be an annotation instead of a field.

"@id": "default/metadata",
"equivalentProperty": "prov:wasDerivedFrom",
"dataType": [
"prov:Entity"
Contributor

Is this the right type to use? Or do you want to say that the type is Dataset?

"dataType": [
"prov:Entity"
],
"subField": [
Contributor

I wonder if we should be breaking down the metadata json into separate fields, or just keep it as a single field. I definitely see the value of doing that, but it makes the example a bit complicated...

},
"extract": {
"column": "metadata"
}
Contributor

Shouldn't this Field also have a transform and jsonPath? Same comment for license below.
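
To make the question concrete, here is a rough sketch of what a jsonPath transform on the metadata column would do for the license_url sub-field. The field dictionary is an assumed shape (it omits the file reference and is not copied from the PR), and extract_simple_json_path is a hypothetical helper that handles only "$" followed by dot-separated keys; it is not how mlcroissant evaluates JSONPath.

```python
import json

# Assumed shape of the sub-field with a transform added; not the PR's actual text.
field = {
    "@type": "cr:Field",
    "@id": "default/metadata/license_url",
    "dataType": "sc:Text",
    "source": {
        "extract": {"column": "metadata"},
        "transform": {"jsonPath": "$.license_url"},
    },
}


def extract_simple_json_path(raw, path):
    """Resolve a path such as "$.license_url" against a JSON string.

    Simplification: supports only "$" followed by dot-separated object keys.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    value = json.loads(raw)
    for key in path[1:].split("."):
        if key:  # skip the empty segment produced by the leading dot
            value = value[key]
    return value


# Hypothetical cell content of the "metadata" column for one row.
raw_cell = json.dumps(
    {
        "dataset_id": "example-source",
        "url": "https://example.org/source",
        "license_url": "https://example.org/license",
    }
)

value = extract_simple_json_path(raw_cell, field["source"]["transform"]["jsonPath"])
print(value)
# prints https://example.org/license
```

Without the transform, the sub-field would receive the whole JSON string from the column rather than the single value it describes.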

"@type": "cr:Field",
"@id": "default/metadata/license_url",
"dataType": "sc:Text",
"equivalentProperty": "prov:wasAttributedTo",
Contributor

Is this the right equivalentProperty?

"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform"
Contributor

Please also add "containedIn": "cr:containedIn" as we're adding that to the context in the 1.1 spec.

@benjelloun benjelloun left a comment

Thanks for the quick turn-around!

@JoanGi JoanGi changed the title from "Adding WildChat-1M and sQuad V2 dataset examples implementing the provenance mechanism." to "Adding WildChat-1M, sQuad V2 and, Data Provenance initaitive dataset examples implementing the provenance mechanism." Dec 1, 2025
@JoanGi JoanGi merged commit 6488f42 into mlcommons:main Dec 2, 2025
9 of 12 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 2, 2025

2 participants