Adding WildChat-1M, sQuad V2, and Data Provenance Initiative dataset examples implementing the provenance mechanism. #970
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
benjelloun
left a comment
Thanks for adding these datasets!
Do you have an example with data-level lineage?
"usage": [
    "squad1"
],
"isAssociatedWith": [
prov:isAssociatedWith ?
Also, how does this one relate to the prov:isAssociatedWith property below?
Yes, that's prov:isAssociatedWith! This property relates the Agent to the Activity, while the property below relates the agent to the whole dataset. Updating it!
"@id": "journalismPIIremoval",
"prov:description":"Conversations flagged by Niloofar Mireshghallah and her collaborators in 'Breaking News: Case Studies of Generative AI's Use in Journalism' for containing PII or sensitive information have been removed from this version of the dataset.",
"prov:startedAtTime": "2024-10-17",
"prov:wasAssociatedWith": ["niloofar_mireshghallah"]
Should this be a URL?
or an @id reference?
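To illustrate the @id option raised here, the following is a minimal sketch and not the PR's final form. It assumes a hypothetical `@graph` layout in which the agent is declared once as a `prov:Agent` node and the activity refers to it by `@id`. The `name` key and the helper `unresolved_agent_refs` are invented for illustration and are not part of Croissant or mlcroissant.

```python
# Hypothetical JSON-LD fragment: the agent is declared once as a prov:Agent
# node, and the activity points to it by @id instead of a bare string.
doc = {
    "@graph": [
        {
            "@type": "prov:Agent",
            "@id": "niloofar_mireshghallah",
            "name": "Niloofar Mireshghallah",  # assumed key, for illustration
        },
        {
            "@type": "prov:Activity",
            "@id": "journalismPIIremoval",
            "prov:startedAtTime": "2024-10-17",
            "prov:wasAssociatedWith": [{"@id": "niloofar_mireshghallah"}],
        },
    ]
}


def unresolved_agent_refs(document):
    """Return agent @ids used by activities that no node in the graph declares."""
    declared = {node["@id"] for node in document["@graph"]}
    missing = []
    for node in document["@graph"]:
        for ref in node.get("prov:wasAssociatedWith", []):
            # Accept both {"@id": "..."} references and bare strings.
            ref_id = ref["@id"] if isinstance(ref, dict) else ref
            if ref_id not in declared:
                missing.append(ref_id)
    return missing


print(unresolved_agent_refs(doc))
```

A check like this makes dangling agent names visible, which a plain string value would hide.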
"@id": "toxicContentRemoval",
"prov:description":"All toxic conversations identified by the OpenAI Moderations API or Detoxify have been removed from this version of the dataset.",
"prov:startedAtTime": "2024-07-22",
"prov:wasAssociatedWith": ["openai_moderation_api"]
Same question
"@type": "prov:Activity",
"@id": "acitivity:PIIremoval",
"description":"The data has been de-identified with Microsoft Presidio and hand-written rules by the authors.",
"wasAssociatedWith": ["presidio"]
Same question
Comments fixed - please double-check.

Regarding data-level provenance, I've added, as an example, an updated Croissant description of the Data Provenance Initiative dataset from the Common Pile project (available on Hugging Face), which implements the proposed data-level PROV mechanism. This dataset is composed of datums from 61 source datasets. At the datum (or "row") level, it includes details about the source dataset and license information. Specifically, it provides a metadata field with subFields that reference the original datasets and their corresponding licenses.

The Field "metadata" of the Croissant description is now annotated, aiming to represent the following logical sequence. I still have doubts about how to declare that the subField metadata/dataset_id is the @id of the prov:Entity. Please take a look!

This example shows how the mechanism works for pointing to source datasets and thereby captures the data lineage. However, other datasets from the Common Pile, such as the GitHub Archive, Library of Congress, or the Gutenberg project, have the same structure of data-level provenance, pointing to different kinds of entities such as code repositories, politicians and laws, or authors and media. I recommend uploading examples using these datasets to demonstrate different uses of the mechanism. (I can do it in a few days.)
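The row-level lineage described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the dataset's actual schema: it assumes each row carries a JSON string in a `metadata` column with `dataset_id` and `license_url` keys, and the rows, URLs, and the helper `lineage_by_source` are made up for the example.

```python
import json
from collections import defaultdict

# Hypothetical rows: each one carries a JSON "metadata" column naming its source.
rows = [
    {"text": "example a", "metadata": '{"dataset_id": "source-1", "license_url": "https://example.org/license-a"}'},
    {"text": "example b", "metadata": '{"dataset_id": "source-2", "license_url": "https://example.org/license-b"}'},
    {"text": "example c", "metadata": '{"dataset_id": "source-1", "license_url": "https://example.org/license-a"}'},
]


def lineage_by_source(records):
    """Group row indices by the source dataset recorded in each row's metadata."""
    grouped = defaultdict(lambda: {"license_url": None, "rows": []})
    for index, record in enumerate(records):
        meta = json.loads(record["metadata"])
        entry = grouped[meta["dataset_id"]]
        entry["license_url"] = meta.get("license_url")
        entry["rows"].append(index)
    return dict(grouped)


for source_id, info in lineage_by_source(rows).items():
    # Each row in info["rows"] can be read as "derived from" source_id.
    print(source_id, info["license_url"], info["rows"])
```

Read this way, every row index listed under a source is one `prov:wasDerivedFrom` statement pointing at that source dataset.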
benjelloun
left a comment
Thanks for adding this example Joan!
    }
},
{
    "@type": "cr:Field",
This should be an annotation instead of a field.
"@id": "default/metadata",
"equivalentProperty": "prov:wasDerivedFrom",
"dataType": [
    "prov:Entity"
Is this the right type to use? Or do you want to say that the type is Dataset?
"dataType": [
    "prov:Entity"
],
"subField": [
I wonder if we should be breaking down the metadata json into separate fields, or just keep it as a single field. I definitely see the value of doing that, but it makes the example a bit complicated...
},
"extract": {
    "column": "metadata"
}
Shouldn't this Field also have a transform and jsonPath? Same comment for license below.
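As a rough sketch of what this comment asks for, a subField could pair `extract.column` with a `transform.jsonPath`. The Field `@id`, the `fileObject` id, and the path `$.dataset_id` below are assumptions for illustration. The helper `apply_simple_json_path` is hypothetical: it handles only dotted key paths and is not how mlcroissant evaluates JSONPath.

```python
import json

# Assumed shape of the subField once a transform is added (illustrative only).
field = {
    "@type": "cr:Field",
    "@id": "default/metadata/dataset_id",  # assumed id
    "dataType": "sc:Text",
    "source": {
        "fileObject": {"@id": "hypothetical-file"},  # placeholder id
        "extract": {"column": "metadata"},
        "transform": {"jsonPath": "$.dataset_id"},  # assumed path
    },
}


def apply_simple_json_path(raw_json, path):
    """Resolve a dotted path such as '$.a.b' against a JSON string.

    Supports only plain object keys, not full JSONPath syntax.
    """
    if not path.startswith("$."):
        raise ValueError("only paths of the form '$.key[.key...]' are supported")
    value = json.loads(raw_json)
    for key in path[2:].split("."):
        value = value[key]
    return value


raw = '{"dataset_id": "source-1", "license_url": "https://example.org/license-a"}'
print(apply_simple_json_path(raw, field["source"]["transform"]["jsonPath"]))
```

Without the transform, the Field would yield the whole JSON string from the `metadata` column instead of the single value it names.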
"@type": "cr:Field",
"@id": "default/metadata/license_url",
"dataType": "sc:Text",
"equivalentProperty": "prov:wasAttributedTo",
Is this the right equivalentProperty?
"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform"
Please also add "containedIn": "cr:containedIn" as we're adding that to the context in the 1.1 spec.
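A quick way to catch a missing context term such as this one is sketched below. The set of expected terms is a partial, illustrative list built from the entries quoted in this thread, not the full 1.1 context, and `missing_context_terms` is a hypothetical helper.

```python
# Partial list of expected @context terms, taken from the entries quoted in
# this review (illustrative, not the complete Croissant 1.1 context).
EXPECTED_TERMS = {
    "containedIn": "cr:containedIn",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform",
}


def missing_context_terms(context, expected=EXPECTED_TERMS):
    """Return the expected terms that are absent or mapped differently."""
    return {term: iri for term, iri in expected.items() if context.get(term) != iri}


# Context as shown in the diff above, before adding containedIn.
context = {
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform",
}
print(missing_context_terms(context))
```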
…e compliant with 1.1 spec
…ty of the example.
benjelloun
left a comment
Thanks for the quick turn-around!
Summary
Beyond the guidelines provided in the upcoming specification for the provenance mechanism, it would be helpful to include examples of popular datasets that implement this mechanism.
These examples will simplify the process of adapting the mlcroissant library and assist adopters in integrating this feature effectively.
Changes in this PR
WildChat-1M dataset: Enhanced with provenance information.
sQuad V2 dataset: Added a minimal example demonstrating the linkage to the previous version of the sQuad dataset.
Common Pile: Data Provenance Initiative: Implements the data-level provenance mechanism.
These additions provide practical references for implementing the provenance mechanism in other datasets.