CONTRIBUTING.md (1 addition, 1 deletion)
@@ -1,4 +1,4 @@
-##Contributing
+# Contributing
 
 The best way to contribute to MLCommons is to get involved with one of our many project communities. You can find more information about getting involved with MLCommons [here](https://mlcommons.org/en/get-involved/#getting-started).
README.md (1 addition, 0 deletions)
@@ -159,6 +159,7 @@ Here is an extremely simple example of the Croissant format, with comments showi
 - Via a `Croissant` tag button on the dataset's page (ex: <https://huggingface.co/datasets/CohereForAI/aya_collection>)
 - Via their API (ex: <https://huggingface.co/api/datasets/CohereForAI/aya_collection/croissant>)
 - [TFDS](https://www.tensorflow.org/datasets/overview) has a [`CroissantBuilder`](https://www.tensorflow.org/datasets/format_specific_dataset_builders#croissantbuilder) to transform any JSON-LD file into a TFDS dataset, which makes it possible to load the data into TensorFlow, JAX and PyTorch.
+- [CKAN](https://ckan.org) supports Croissant through the [ckanext-dcat](https://github.com/ckan/ckanext-dcat) extension starting from version 2.3.0. The metadata is embedded in the dataset's page source and is also accessible through a dedicated endpoint. For datasets imported into the CKAN DataStore, the resources will expose Croissant's RecordSet objects, detailing data fields like column names and types.
 - [Dataverse](https://dataverse.org) offers an [addon](https://github.com/gdcc/exporter-croissant) to export datasets in Croissant format and embed Croissant directly in the HTML of dataset landing pages.
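
To make the access paths above concrete, here is a minimal sketch (not from the original README) that fetches a dataset's Croissant JSON-LD through the Hugging Face API and loads it with the `mlcroissant` Python library from this repository. The record-set name passed to `records()` is a placeholder, not a real record set of the aya_collection dataset.

```python
# Sketch: fetch Croissant JSON-LD from the Hugging Face API, then load it
# with mlcroissant. The record-set id "default" is hypothetical.
import requests
import mlcroissant as mlc

url = "https://huggingface.co/api/datasets/CohereForAI/aya_collection/croissant"
jsonld = requests.get(url).json()
print(jsonld.get("name"), "-", jsonld.get("license"))

# mlcroissant can consume the same URL directly.
dataset = mlc.Dataset(jsonld=url)
for record in dataset.records(record_set="default"):  # placeholder record-set id
    print(record)
    break
```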
docs/croissant-rai-spec.md (8 additions, 7 deletions)
@@ -27,7 +27,7 @@ The [Croissant format](http://mlcommons.org/croissant/1.0) by design helps with
 
 2. On the other hand, it records at a granular level how a dataset was created, processed and enriched throughout its lifecycle - this process is meant to be automated as much as possible by integrating Croissant with popular AI development environments.
 
-One of the main instruments to operationalise RAI is dataset documentation.This document describes the responsible AI (RAI) aspects of Croissant, which were defined through a multi-step vocabulary engineering process as follows:
+One of the main instruments to operationalise RAI is dataset documentation. This document describes the responsible AI (RAI) aspects of Croissant, which were defined through a multi-step vocabulary engineering process as follows:
 
 1. Define use cases for the RAI-Croissant extension.
 2. Compare and contrast existing dataset documentation vocabularies to identify overlaps and gaps with the Croissant core vocabulary.
@@ -269,7 +269,7 @@ Compliance officers and legal teams require data-related information to **assess
 
 - _Sensitive and personally identifiable information_: A description of the types of data present in the dataset, such as personally identifiable information, sensitive data, or any other categories that may be subject to privacy regulations such as GDPR Art. 5 (rai:personalSensitiveInformation).
 - _Data purposes and limitations_: Information about the intended use of the data and the specific purposes for which it was collected (rai:dataUseCases), and the potential generalization limits and warnings (rai:dataLimitations).
-- _Data collection processes_: Information about how the data have been collected. For instance, the fields rai:dataCollection\_\_and rai:dataCollectionType give the user space to explain the collection process, and the rai:dataCollectionTimeFrame describes the collection's time span.
+- _Data collection processes_: Information about how the data have been collected. For instance, the fields rai:dataCollection and rai:dataCollectionType give the user space to explain the collection process, and the rai:dataCollectionTimeFrame describes the collection's time span.
 - _Data annotation processes_: Information about the annotation process (rai:annotationProtocol), the platforms used during it (rai:dataAnnotationPlatform), and the guidelines and validation methods applied to the labels (rai:dataAnnotationAnalysis).
 - _Data retention policies_: The duration for which the data will be stored and retained, considering the legal requirements and data protection laws.
 - _Data access control_: Information about who has access to the data, the level of access privileges, and any measures implemented to control data access.
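
As a companion to this hunk, here is a hedged sketch of how the fields listed above might look inside a dataset's Croissant metadata. Every value is invented for illustration, and the `rai` namespace URL is an assumption, not copied from this diff.

```python
# Sketch: a JSON-LD fragment using the RAI fields discussed above.
# All values are invented; the "rai" context URL is an assumption.
import json

rai_fragment = {
    "@context": {"rai": "http://mlcommons.org/croissant/RAI/"},  # assumed namespace
    "rai:personalSensitiveInformation": "Names and emails were removed before release.",
    "rai:dataUseCases": "Benchmarking sentiment classifiers.",
    "rai:dataLimitations": "English-only; not representative of social media text.",
    "rai:dataCollection": "Product reviews gathered from public web pages.",
    "rai:dataCollectionType": "Web Scraping",
    "rai:dataCollectionTimeFrame": "2020-01/2021-06",
}
print(json.dumps(rai_fragment, indent=2))
```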
@@ -370,7 +370,7 @@ In relation to the creation of the datasets, as well as the labeling and annotat
-      <td>Considerations related to the process of converting the “raw” annotations into the labels that are ultimately packaged in a dataset - Uncertainty or disagreement between annotations on each instance as a signal in the dataset, analysis of systematic disagreements between annotators of different sociodemographic group, how the final dataset annotations will relate to individual annotator responses</td>
+      <td>Considerations related to the process of converting the “raw” annotations into the labels that are ultimately packaged in a dataset - Uncertainty or disagreement between annotations on each instance as a signal in the dataset, analysis of systematic disagreements between annotators of different socio-demographic group, how the final dataset annotations will relate to individual annotator responses</td>
   </tr>
   <tr>
     <td>rai:dataReleaseMaintenancePlan</td>
@@ -447,9 +447,10 @@ Geospatial AI (also GeoAI) refers to the integration of artificial intelligence
 
 In this regard, Responsible AI (RAI) emphasizes ethical, transparent, and accountable practices in the development and deployment of artificial intelligence systems, ensuring fair and unbiased outcomes. Geospatial Responsible AI, or Geospatial RAI, involves ethical considerations in the acquisition and utilization of geospatial data, addressing potential biases, environmental impact, and privacy concerns. It also emphasizes transparency and fairness, ensuring that the application of AI in geospatial analysis aligns with ethical principles and societal values. Two examples showcasing the significance of RAI properties with respect to the GeoAI use cases are discussed below.
 
-1. _Importance of location_: Location or spatial properties are extremely important for the credibility of AI-ready datasets for GeoAI. AI based predictions and estimations pertaining to a location can change with the change in locational accuracy. For eg. for a task with AI-based crop yield prediction, ground truth for validating the AI model results are acquired from agricultural farms. Hence, in order to develop a robust and accurate AI model, it is important to annotate the training labels precisely. However, most of the time these annotations are approximated due to privacy concerns. Using these labeled datasets with AI models can lead to inaccurate predictions and estimations. RAI properties related to data lifecycle and data labeling such as annotator demographics and details about data preprocessing and manipulation respectively can increase support and confidence in the AI based modeling.\
+1. _Importance of location_: Location or spatial properties are extremely important for the credibility of AI-ready datasets for GeoAI. AI-based predictions and estimations pertaining to a location can change with the change in locational accuracy. For example, for a task with AI-based crop yield prediction, the ground truth for validating the AI model results is acquired from agricultural farms. Hence, in order to develop a robust and accurate AI model, it is important to annotate the training labels precisely. However, most of the time these annotations are approximated due to privacy concerns. Using these labeled datasets with AI models can lead to inaccurate predictions and estimations. RAI properties related to data lifecycle and data labeling, such as annotator demographics and details about data preprocessing and manipulation respectively, can increase support and confidence in AI-based modeling.
 
 2. _Importance of Sampling Strategy and biases_: Due to the large volume of the training data, sampling is a necessary step, especially in tasks utilizing petabytes of AI-ready datasets. Conventionally, sampling is performed to reduce the training data size with the idea of masking redundant data samples from the training process. Uninformed sampling strategies can lead to biases in raw training data, leading to inaccuracies in training the AI models. Training datasets with imbalanced class information are an example of such biases. RAI properties describing such data biases and limitations raise awareness before training the AI model, and proper techniques can be adopted for better representation of the datasets (see the sampling sketch after this section).
+
 3. _GeoAI Training Data life cycle_: The temporal specificity of numerous GeoAI applications renders training data obsolete once the designated time window elapses, limiting the continued relevance and effectiveness of the acquired datasets. This is prominent in GeoAI applications such as disaster monitoring and assessment or seasonal agricultural crop yield estimation. For such use cases, data life cycle RAI properties defining the description of the data collection process, the description of missing data, and the timeframe of the collected data play a key role in improving the applicability of the AI models for such applications.
 
 Below is an example of RAI properties in a Geospatial AI-ready dataset, the HLS Burn Scar Scenes [2] dataset, in the Croissant format. This dataset is openly available on [Hugging Face](https://huggingface.co/datasets/ibm-nasa-geospatial/hls_burn_scars) and contains Harmonized Landsat and Sentinel-2 imagery of burn scars and the associated masks for the years 2018-2021 over the contiguous United States.
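
The sampling sketch referenced in item 2 above: a minimal, purely illustrative example of class-aware subsampling, one of the "proper techniques" that a documented class imbalance can prompt. The labels and class proportions are synthetic, not taken from any real GeoAI dataset.

```python
# Sketch: class-aware subsampling to counter the class-imbalance bias
# described in item 2 above. Labels and proportions are synthetic.
import numpy as np

rng = np.random.default_rng(seed=0)
labels = rng.choice(["burn_scar", "background"], size=100_000, p=[0.02, 0.98])

def stratified_indices(labels: np.ndarray, per_class: int) -> np.ndarray:
    """Pick up to `per_class` samples per class; uniform subsampling
    would almost entirely miss the rare class."""
    picks = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        picks.append(rng.choice(idx, size=min(per_class, idx.size), replace=False))
    return np.concatenate(picks)

subset = stratified_indices(labels, per_class=1_000)
print({cls: int(np.sum(labels[subset] == cls)) for cls in np.unique(labels)})
```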
@@ -519,7 +520,7 @@ As the size of language models continues to increase, there is a growing demand
-  "rai:dataCollection": "The first part of the corpus, accounting for 62% of the final dataset size (in bytes), is made up of a collection of monolingual and multilingual language resources that were selected and documented collaboratively through various efforts of the BigScience Data Sourcing working group. The 38& remaining is get from the OSCAR version 21.09, based on the Common Crawl snapshot of February.",
+  "rai:dataCollection": "The first part of the corpus, accounting for 62% of the final dataset size (in bytes), is made up of a collection of monolingual and multilingual language resources that were selected and documented collaboratively through various efforts of the BigScience Data Sourcing working group. The 38% remaining is taken from the OSCAR version 21.09, based on the Common Crawl snapshot of February.",
   "rai:dataCollectionType": [
     "Web Scraping",
     "Secondary Data Analysis",
@@ -536,8 +537,8 @@ As the size of language models continues to increase, there is a growing demand
     "The reliance on medium to large sources of digitized content still over-represents privileged voices and language varieties."
   ],
   "rai:dataBiases": "Dataset includes multiple sub-ratings which specify the type of safety concern, such as type of hate speech and the type of bias or misinformation, for each conversation. A limitation of the dataset is the selection of demographic characteristics. The number of demographic categories was limited to four (race/ethnicity, gender and age group). Within these demographic axes, the number of subgroups was further limited (i.e., two locales, five main ethnicity groups, three age groups and two genders); this constrained the insights from systematic differences between different groupings of raters.",
-  "rai:personalSensitiveInformation": "We used a rule-based approach leveraging regular expressions (Appendix C). The elements redacted were instances of KEY (numeric & alphanumeric identifiers such as phone numbers, credit card numbers, hexadecimal hashes and the like, while skipping instances of years and simple numbers), EMAIL (email addresses), USER (a social media handle) and IP_ADDRESS (an IPv4 or IPv6 address).“,",
-  "rai:dataSocialImpact": "The authors emphasized that the BigScience Research Workshop, under which the dataset was developed, was conceived as a collaborative and value-driven endeavor from the beginning. This approach significantly influenced the project's decisions, leading to numerous discussions aimed at aligning the project’s core values with those of the data contributors, as well as considering the social impact on individuals directly and indirectly impacted by the project. These discussions and the project's governance strategy highlighted the importance of: Centre human selection of the data, suggesting a conscientious approach to choosing what data to include in the corpus based on ethical considerations and the potential social impact. Data release and governance strategies that would responsibly manage the distribution and use of the data. Although the document does not explicitly list specific potential social impacts, the emphasis on value-driven efforts, ethical considerations, and the human-centered approach to data selection suggests a keen awareness and proactive stance on mitigating negative impacts while enhancing positive social outcomes through responsible data collection and usage practices",
+  "rai:personalSensitiveInformation": "We used a rule-based approach leveraging regular expressions (Appendix C). The elements redacted were instances of KEY (numeric & alphanumeric identifiers such as phone numbers, credit card numbers, hexadecimal hashes and the like, while skipping instances of years and simple numbers), EMAIL (email addresses), USER (a social media handle) and IP_ADDRESS (an IPv4 or IPv6 address).",
+  "rai:dataSocialImpact": "The authors emphasized that the BigScience Research Workshop, under which the dataset was developed, was conceived as a collaborative and value-driven endeavor from the beginning. This approach significantly influenced the project's decisions, leading to numerous discussions aimed at aligning the project’s core values with those of the data contributors, as well as considering the social impact on individuals directly and indirectly impacted by the project. These discussions and the project's governance strategy highlighted the importance of: Centre human selection of the data, suggesting a conscientious approach to choosing what data to include in the corpus based on ethical considerations and the potential social impact. Data release and governance strategies that would responsibly manage the distribution and use of the data. Although the document does not explicitly list specific potential social impacts, the emphasis on value-driven efforts, ethical considerations, and the human-centered approach to data selection suggests a keen awareness and proactive stance on mitigating negative impacts while enhancing positive social outcomes through responsible data collection and usage practices.",
   "rai:dataManipulationProtocol": [
     "Pseudocode to recreate the text structure from the HTML code. The HTML code of a web page provides information about the structure of the text. The final structure of a web page is, however, the one produced by the rendering engine of the web browser and any CSS instructions. The latter two elements, which can vary enormously from one situation to another, always use the tag types for their rendering rules. Therefore, we have used a fairly simple heuristic on tag types to reconstruct the structure of the text extracted from an HTML code. To reconstruct the text, the HTML DOM, which can be represented as a tree, is traversed with a depth-first search algorithm. The text is initially empty and each time a new node with textual content is reached its content is concatenated according to the rules presented in the Algorithm 1 of the accompanying paper.",
     "Data cleaning and filtering: documents were filtered with: • Too high character repetition or word repetition as a measure of repetitive content. • Too high ratios of special characters to remove page code or crawling artifacts. • Insufficient ratios of closed class words to filter out SEO pages. • Too high ratios of flagged words to filter out pornographic spam. We asked contributors to tailor the word list in their language to this criterion (as opposed to generic terms related to sexuality) and to err on the side of high precision. • Too high perplexity values to filter out non-natural language. • Insufficient number of words, as LLM training requires extensive context sizes.",
 1. print extra information, like the generated nodes;
 2. save the generated structure graph to a folder indicated in the logs.

@@ -205,4 +207,3 @@ To publish a package,
 
 1. Bump the version in `croissant/python/mlcroissant/pyproject.toml`, and merge your PR.
 2. Publish a [new release](https://github.com/mlcommons/croissant/releases) in GitHub, and add a tag to it with the newest version in `pyproject.toml`. Ensure that the new release is marked as `latest`. The workflow script `python-publish.yml` will trigger and publish the package to [PyPI](https://pypi.org/project/mlcroissant/).