CONTRIBUTING.md (1 addition, 1 deletion)
@@ -1,4 +1,4 @@
-##Contributing
+# Contributing
 
 The best way to contribute to MLCommons is to get involved with one of our many project communities. You can find more information about getting involved with MLCommons [here](https://mlcommons.org/en/get-involved/#getting-started).
README.md (1 addition, 0 deletions)
@@ -159,6 +159,7 @@ Here is an extremely simple example of the Croissant format, with comments showi
 - Via a `Croissant` tag button on the dataset's page (ex: <https://huggingface.co/datasets/CohereForAI/aya_collection>)
 - Via their API (ex: <https://huggingface.co/api/datasets/CohereForAI/aya_collection/croissant>)
 - [TFDS](https://www.tensorflow.org/datasets/overview) has a [`CroissantBuilder`](https://www.tensorflow.org/datasets/format_specific_dataset_builders#croissantbuilder) to transform any JSON-LD file into a TFDS dataset, which makes it possible to load the data into TensorFlow, JAX and PyTorch.
+- [CKAN](https://ckan.org) supports Croissant through the [ckanext-dcat](https://github.com/ckan/ckanext-dcat) extension starting from version 2.3.0. The metadata is embedded in the dataset's page source and is also accessible through a dedicated endpoint. For datasets imported into the CKAN DataStore, the resources will expose Croissant's RecordSet objects, detailing data fields like column names and types.
 - [Dataverse](https://dataverse.org) offers an [addon](https://github.com/gdcc/exporter-croissant) to export datasets in Croissant format and embed Croissant directly in the HTML of dataset landing pages.
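
To make the access paths above concrete, here is a minimal sketch (not from the original README) that fetches a dataset's Croissant JSON-LD through the Hugging Face API and loads it with the `mlcroissant` Python library from this repository. The record-set name passed to `records()` is a placeholder, not a real record set of the aya_collection dataset.

```python
# Sketch: fetch Croissant JSON-LD from the Hugging Face API, then load it
# with mlcroissant. The record-set id "default" is hypothetical.
import requests
import mlcroissant as mlc

url = "https://huggingface.co/api/datasets/CohereForAI/aya_collection/croissant"
jsonld = requests.get(url).json()
print(jsonld.get("name"), "-", jsonld.get("license"))

# mlcroissant can consume the same URL directly.
dataset = mlc.Dataset(jsonld=url)
for record in dataset.records(record_set="default"):  # placeholder record-set id
    print(record)
    break
```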
docs/croissant-rai-spec.md (8 additions, 7 deletions)
@@ -27,7 +27,7 @@ The [Croissant format](http://mlcommons.org/croissant/1.0) by design helps with
 
 2. On the other hand, it records at a granular level how a dataset was created, processed and enriched throughout its lifecycle - this process is meant to be automated as much as possible by integrating Croissant with popular AI development environments.
 
-One of the main instruments to operationalise RAI is dataset documentation.This document describes the responsible AI (RAI) aspects of Croissant, which were defined through a multi-step vocabulary engineering process as follows:
+One of the main instruments to operationalise RAI is dataset documentation. This document describes the responsible AI (RAI) aspects of Croissant, which were defined through a multi-step vocabulary engineering process as follows:
 
 1. Define use cases for the RAI-Croissant extension.
 2. Compare and contrast existing dataset documentation vocabularies to identify overlaps and gaps with the Croissant core vocabulary.
@@ -269,7 +269,7 @@ Compliance officers and legal teams require data-related information to **assess
 
 - _Sensitive and personally identifiable information_: A description of the types of data present in the dataset, such as personally identifiable information, sensitive data, or any other categories that may be subject to privacy regulations such as GDPR Art. 5 (rai:personalSensitiveInformation).
 - _Data purposes and limitations_: Information about the intended use of the data and the specific purposes for which it was collected (rai:dataUseCases), and the potential generalization limits and warnings (rai:dataLimitations).
-- _Data collection processes_: Information about how the data have been collected. For instance, the fields rai:dataCollection\_\_and rai:dataCollectionType give the user space to explain the collection process, and the rai:dataCollectionTimeFrame describes the collection's time span.
+- _Data collection processes_: Information about how the data have been collected. For instance, the fields rai:dataCollection and rai:dataCollectionType give the user space to explain the collection process, and the rai:dataCollectionTimeFrame describes the collection's time span.
 - _Data annotation processes_: Information about the annotation process (rai:annotationProtocol), the platforms used during it (rai:dataAnnotationPlatform), and the guidelines and validation methods applied to the labels (rai:dataAnnotationAnalysis).
 - _Data retention policies_: The duration for which the data will be stored and retained, considering the legal requirements and data protection laws.
 - _Data access control_: Information about who has access to the data, the level of access privileges, and any measures implemented to control data access.
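
As a companion to this hunk, here is a hedged sketch of how the fields listed above might look inside a dataset's Croissant metadata. Every value is invented for illustration, and the `rai` namespace URL is an assumption, not copied from this diff.

```python
# Sketch: a JSON-LD fragment using the RAI fields discussed above.
# All values are invented; the "rai" context URL is an assumption.
import json

rai_fragment = {
    "@context": {"rai": "http://mlcommons.org/croissant/RAI/"},  # assumed namespace
    "rai:personalSensitiveInformation": "Names and emails were removed before release.",
    "rai:dataUseCases": "Benchmarking sentiment classifiers.",
    "rai:dataLimitations": "English-only; not representative of social media text.",
    "rai:dataCollection": "Product reviews gathered from public web pages.",
    "rai:dataCollectionType": "Web Scraping",
    "rai:dataCollectionTimeFrame": "2020-01/2021-06",
}
print(json.dumps(rai_fragment, indent=2))
```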
@@ -370,7 +370,7 @@ In relation to the creation of the datasets, as well as the labeling and annotat
-      <td>Considerations related to the process of converting the “raw” annotations into the labels that are ultimately packaged in a dataset - Uncertainty or disagreement between annotations on each instance as a signal in the dataset, analysis of systematic disagreements between annotators of different sociodemographic group, how the final dataset annotations will relate to individual annotator responses</td>
+      <td>Considerations related to the process of converting the “raw” annotations into the labels that are ultimately packaged in a dataset - Uncertainty or disagreement between annotations on each instance as a signal in the dataset, analysis of systematic disagreements between annotators of different socio-demographic group, how the final dataset annotations will relate to individual annotator responses</td>
   </tr>
   <tr>
     <td>rai:dataReleaseMaintenancePlan</td>
@@ -447,9 +447,10 @@ Geospatial AI (also GeoAI) refers to the integration of artificial intelligence
 
 In this regard, Responsible AI (RAI) emphasizes ethical, transparent, and accountable practices in the development and deployment of artificial intelligence systems, ensuring fair and unbiased outcomes. Geospatial Responsible AI, or Geospatial RAI, involves ethical considerations in the acquisition and utilization of geospatial data, addressing potential biases, environmental impact, and privacy concerns. It also emphasizes transparency and fairness, ensuring that the application of AI in geospatial analysis aligns with ethical principles and societal values. Two examples showcasing the significance of RAI properties with respect to the GeoAI use cases are discussed below.
 
-1. _Importance of location_: Location or spatial properties are extremely important for the credibility of AI-ready datasets for GeoAI. AI based predictions and estimations pertaining to a location can change with the change in locational accuracy. For eg. for a task with AI-based crop yield prediction, ground truth for validating the AI model results are acquired from agricultural farms. Hence, in order to develop a robust and accurate AI model, it is important to annotate the training labels precisely. However, most of the time these annotations are approximated due to privacy concerns. Using these labeled datasets with AI models can lead to inaccurate predictions and estimations. RAI properties related to data lifecycle and data labeling such as annotator demographics and details about data preprocessing and manipulation respectively can increase support and confidence in the AI based modeling.\
+1. _Importance of location_: Location or spatial properties are extremely important for the credibility of AI-ready datasets for GeoAI. AI-based predictions and estimations pertaining to a location can change with the change in locational accuracy. For example, for a task with AI-based crop yield prediction, the ground truth for validating the AI model results is acquired from agricultural farms. Hence, in order to develop a robust and accurate AI model, it is important to annotate the training labels precisely. However, most of the time these annotations are approximated due to privacy concerns. Using these labeled datasets with AI models can lead to inaccurate predictions and estimations. RAI properties related to data lifecycle and data labeling, such as annotator demographics and details about data preprocessing and manipulation respectively, can increase support and confidence in AI-based modeling.
 
 2. _Importance of Sampling Strategy and biases_: Due to the large volume of the training data, sampling is a necessary step, especially in tasks utilizing petabytes of AI-ready datasets. Conventionally, sampling is performed to reduce the training data size with the idea of masking redundant data samples from the training process. Uninformed sampling strategies can lead to biases in raw training data, leading to inaccuracies in training the AI models. Training datasets with imbalanced class information are an example of such biases. RAI properties describing such data biases and limitations raise awareness before training the AI model, and proper techniques can be adopted for better representation of the datasets (see the sampling sketch after this section).
+
 3. _GeoAI Training Data life cycle_: The temporal specificity of numerous GeoAI applications renders training data obsolete once the designated time window elapses, limiting the continued relevance and effectiveness of the acquired datasets. This is prominent in GeoAI applications such as disaster monitoring and assessment or seasonal agricultural crop yield estimation. For such use cases, data life cycle RAI properties defining the description of the data collection process, the description of missing data, and the timeframe of the collected data play a key role in improving the applicability of the AI models for such applications.
 
 Below is an example of RAI properties in a Geospatial AI-ready dataset, the HLS Burn Scar Scenes [2] dataset, in the Croissant format. This dataset is openly available on [Hugging Face](https://huggingface.co/datasets/ibm-nasa-geospatial/hls_burn_scars) and contains Harmonized Landsat and Sentinel-2 imagery of burn scars and the associated masks for the years 2018-2021 over the contiguous United States.
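
The sampling sketch referenced in item 2 above: a minimal, purely illustrative example of class-aware subsampling, one of the "proper techniques" that a documented class imbalance can prompt. The labels and class proportions are synthetic, not taken from any real GeoAI dataset.

```python
# Sketch: class-aware subsampling to counter the class-imbalance bias
# described in item 2 above. Labels and proportions are synthetic.
import numpy as np

rng = np.random.default_rng(seed=0)
labels = rng.choice(["burn_scar", "background"], size=100_000, p=[0.02, 0.98])

def stratified_indices(labels: np.ndarray, per_class: int) -> np.ndarray:
    """Pick up to `per_class` samples per class; uniform subsampling
    would almost entirely miss the rare class."""
    picks = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        picks.append(rng.choice(idx, size=min(per_class, idx.size), replace=False))
    return np.concatenate(picks)

subset = stratified_indices(labels, per_class=1_000)
print({cls: int(np.sum(labels[subset] == cls)) for cls in np.unique(labels)})
```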
@@ -519,7 +520,7 @@ As the size of language models continues to increase, there is a growing demand
-  "rai:dataCollection": "The first part of the corpus, accounting for 62% of the final dataset size (in bytes), is made up of a collection of monolingual and multilingual language resources that were selected and documented collaboratively through various efforts of the BigScience Data Sourcing working group. The 38& remaining is get from the OSCAR version 21.09, based on the Common Crawl snapshot of February.",
+  "rai:dataCollection": "The first part of the corpus, accounting for 62% of the final dataset size (in bytes), is made up of a collection of monolingual and multilingual language resources that were selected and documented collaboratively through various efforts of the BigScience Data Sourcing working group. The 38% remaining is taken from the OSCAR version 21.09, based on the Common Crawl snapshot of February.",
   "rai:dataCollectionType": [
     "Web Scraping",
     "Secondary Data Analysis",
@@ -536,8 +537,8 @@ As the size of language models continues to increase, there is a growing demand
     "The reliance on medium to large sources of digitized content still over-represents privileged voices and language varieties."
   ],
   "rai:dataBiases": "Dataset includes multiple sub-ratings which specify the type of safety concern, such as type of hate speech and the type of bias or misinformation, for each conversation. A limitation of the dataset is the selection of demographic characteristics. The number of demographic categories was limited to four (race/ethnicity, gender and age group). Within these demographic axes, the number of subgroups was further limited (i.e., two locales, five main ethnicity groups, three age groups and two genders); this constrained the insights from systematic differences between different groupings of raters.",
-  "rai:personalSensitiveInformation": "We used a rule-based approach leveraging regular expressions (Appendix C). The elements redacted were instances of KEY (numeric & alphanumeric identifiers such as phone numbers, credit card numbers, hexadecimal hashes and the like, while skipping instances of years and simple numbers), EMAIL (email addresses), USER (a social media handle) and IP_ADDRESS (an IPv4 or IPv6 address).“,",
-  "rai:dataSocialImpact": "The authors emphasized that the BigScience Research Workshop, under which the dataset was developed, was conceived as a collaborative and value-driven endeavor from the beginning. This approach significantly influenced the project's decisions, leading to numerous discussions aimed at aligning the project’s core values with those of the data contributors, as well as considering the social impact on individuals directly and indirectly impacted by the project. These discussions and the project's governance strategy highlighted the importance of: Centre human selection of the data, suggesting a conscientious approach to choosing what data to include in the corpus based on ethical considerations and the potential social impact. Data release and governance strategies that would responsibly manage the distribution and use of the data. Although the document does not explicitly list specific potential social impacts, the emphasis on value-driven efforts, ethical considerations, and the human-centered approach to data selection suggests a keen awareness and proactive stance on mitigating negative impacts while enhancing positive social outcomes through responsible data collection and usage practices",
+  "rai:personalSensitiveInformation": "We used a rule-based approach leveraging regular expressions (Appendix C). The elements redacted were instances of KEY (numeric & alphanumeric identifiers such as phone numbers, credit card numbers, hexadecimal hashes and the like, while skipping instances of years and simple numbers), EMAIL (email addresses), USER (a social media handle) and IP_ADDRESS (an IPv4 or IPv6 address).",
+  "rai:dataSocialImpact": "The authors emphasized that the BigScience Research Workshop, under which the dataset was developed, was conceived as a collaborative and value-driven endeavor from the beginning. This approach significantly influenced the project's decisions, leading to numerous discussions aimed at aligning the project’s core values with those of the data contributors, as well as considering the social impact on individuals directly and indirectly impacted by the project. These discussions and the project's governance strategy highlighted the importance of: Centre human selection of the data, suggesting a conscientious approach to choosing what data to include in the corpus based on ethical considerations and the potential social impact. Data release and governance strategies that would responsibly manage the distribution and use of the data. Although the document does not explicitly list specific potential social impacts, the emphasis on value-driven efforts, ethical considerations, and the human-centered approach to data selection suggests a keen awareness and proactive stance on mitigating negative impacts while enhancing positive social outcomes through responsible data collection and usage practices.",
   "rai:dataManipulationProtocol": [
     "Pseudocode to recreate the text structure from the HTML code. The HTML code of a web page provides information about the structure of the text. The final structure of a web page is, however, the one produced by the rendering engine of the web browser and any CSS instructions. The latter two elements, which can vary enormously from one situation to another, always use the tag types for their rendering rules. Therefore, we have used a fairly simple heuristic on tag types to reconstruct the structure of the text extracted from an HTML code. To reconstruct the text, the HTML DOM, which can be represented as a tree, is traversed with a depth-first search algorithm. The text is initially empty and each time a new node with textual content is reached its content is concatenated according to the rules presented in the Algorithm 1 of the accompanying paper.",
     "Data cleaning and filtering: documents were filtered with: • Too high character repetition or word repetition as a measure of repetitive content. • Too high ratios of special characters to remove page code or crawling artifacts. • Insufficient ratios of closed class words to filter out SEO pages. • Too high ratios of flagged words to filter out pornographic spam. We asked contributors to tailor the word list in their language to this criterion (as opposed to generic terms related to sexuality) and to err on the side of high precision. • Too high perplexity values to filter out non-natural language. • Insufficient number of words, as LLM training requires extensive context sizes.",
 1. print extra information, like the generated nodes;
 2. save the generated structure graph to a folder indicated in the logs.

@@ -205,4 +207,3 @@ To publish a package,
 
 1. Bump the version in `croissant/python/mlcroissant/pyproject.toml`, and merge your PR.
 2. Publish a [new release](https://github.com/mlcommons/croissant/releases) in GitHub, and add a tag to it with the newest version in `pyproject.toml`. Ensure that the new release is marked as `latest`. The workflow script `python-publish.yml` will trigger and publish the package to [PyPI](https://pypi.org/project/mlcroissant/).