A Rust implementation for working with the ML Commons Croissant metadata format—a standardized way to describe machine learning datasets using JSON-LD.
Croissant is an open metadata standard designed to improve dataset documentation, searchability, and usage in machine learning workflows. This library simplifies the creation of Croissant-compatible metadata from CSV data sources by:
- Automatically inferring schema types from dataset content
- Generating complete, valid JSON-LD metadata
- Providing validation tools to ensure compatibility
- Supporting the full Croissant specification
This project provides both a command-line interface and a Rust library for converting CSV files to Croissant metadata format.
- Rust 1.88 or later
- Nix 2.25.4 or later
- Clone the repository:
git clone https://github.com/beyondcivic/rustcroissant.git
cd rustcroissant
- Prepare the environment using Nix flakes (recommended):
nix develop
Build the project using Nix:
nix build
The resulting binary will be in the result/bin/ directory.
✨ GOOD TO KNOW: nix build can be used in place of cargo build; Nix then manages the dependencies and builds the project.
Run the CLI directly with Nix:
nix run
Or, specify arguments:
nix run . -- generate data.csv -o metadata.jsonld
✨ GOOD TO KNOW: nix run can be used in place of cargo run; Nix then manages the dependencies and runs the project.
# Generate metadata with default output path
nix run . -- generate data.csv
# Specify output path
nix run . -- generate data.csv -o metadata.jsonld
use rustcroissant::generate_metadata;

fn main() {
    let output_path = generate_metadata("data.csv", Some("dataset.jsonld"))
        .expect("Error generating metadata");
    println!("Metadata saved to: {}", output_path);
}
- Automatically infers field data types from CSV content
- Calculates SHA-256 hash for file verification
- Generates Croissant metadata in JSON-LD format
- Configurable output path
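To make the type inference feature concrete, here is a minimal sketch of how a column type could be derived from CSV cell values. It is not the library's actual implementation: the function names `infer_cell_type` and `infer_column_type`, the labels `sc:Integer`, `sc:Float`, and `sc:Boolean`, and the widening rule are all assumptions made for illustration. Only the `sc:Text` fallback comes from this README.

```rust
/// Hypothetical helper: classify one CSV cell.
/// The labels other than "sc:Text" are assumed, not taken from the library.
fn infer_cell_type(value: &str) -> &'static str {
    let v = value.trim();
    if v.is_empty() {
        return "sc:Text";
    }
    if v.parse::<i64>().is_ok() {
        return "sc:Integer";
    }
    // Note: Rust's f64 parser also accepts "inf" and "NaN".
    if v.parse::<f64>().is_ok() {
        return "sc:Float";
    }
    if v.eq_ignore_ascii_case("true") || v.eq_ignore_ascii_case("false") {
        return "sc:Boolean";
    }
    "sc:Text"
}

/// Hypothetical helper: pick one type for a whole column.
/// Empty cells are skipped, Integer and Float widen to Float,
/// and any other disagreement falls back to Text.
fn infer_column_type(cells: &[&str]) -> &'static str {
    let mut result: Option<&'static str> = None;
    for cell in cells.iter().filter(|c| !c.trim().is_empty()) {
        let t = infer_cell_type(cell);
        result = Some(match (result, t) {
            (None, t) => t,
            (Some(prev), t) if prev == t => t,
            (Some("sc:Integer"), "sc:Float") | (Some("sc:Float"), "sc:Integer") => "sc:Float",
            _ => "sc:Text",
        });
    }
    // A column with no usable values defaults to Text.
    result.unwrap_or("sc:Text")
}
```

Under these assumptions, a column containing `1`, `2.5`, and an empty cell would be reported as `sc:Float`, while a column mixing `1` and `x` would fall back to `sc:Text`.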
The application supports configuration through environment variables with the prefix CROISSANT_.
Currently, only CROISSANT_OUTPUT_PATH is supported to specify the output file path for generated metadata.
If no output path is provided explicitly, the default output path metadata.jsonld will be used.
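The lookup described above could be sketched as follows. This is an illustration only: the helper names `pick_output_path` and `resolve_output_path` are hypothetical, and the precedence (explicit argument first, then `CROISSANT_OUTPUT_PATH`, then the default) is an assumption about how the pieces fit together, not confirmed behavior of the tool.

```rust
use std::env;
use std::path::PathBuf;

/// Default named in this README.
const DEFAULT_OUTPUT_PATH: &str = "metadata.jsonld";

/// Hypothetical helper: choose the output path from an explicit value,
/// an environment-provided value, or the default (assumed precedence).
fn pick_output_path(explicit: Option<&str>, from_env: Option<String>) -> PathBuf {
    explicit
        .map(PathBuf::from)
        // Treat an empty environment value as unset (an assumption).
        .or_else(|| from_env.filter(|s| !s.is_empty()).map(PathBuf::from))
        .unwrap_or_else(|| PathBuf::from(DEFAULT_OUTPUT_PATH))
}

/// Hypothetical wrapper that reads CROISSANT_OUTPUT_PATH from the process environment.
fn resolve_output_path(explicit: Option<&str>) -> PathBuf {
    pick_output_path(explicit, env::var("CROISSANT_OUTPUT_PATH").ok())
}
```

Splitting the pure decision (`pick_output_path`) from the environment read keeps the precedence rule easy to test without touching process-wide state.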
nix run . -- generate data.csv -o metadata.jsonld
nix run . -- generate data.csv -o metadata.jsonld -v
Validation passed with no issues.
Croissant metadata generated and saved to: metadata.json
nix run . -- generate data.csv -v
Validation passed with no issues.
nix run . -- validate metadata.json
Validation passed with no issues.
nix run . -- validate ./samples_jsonld/missing_fields.jsonld
Found the following 3 error(s) during the validation:
- [Metadata(mydataset) > FileObject(a-csv-table)] Property "https://schema.org/contentUrl" is mandatory, but does not exist.
- [Metadata(mydataset) > RecordSet(a-record-set) > Field(first-field)] The field does not specify a valid http://mlcommons.org/croissant/dataType, neither does any of its predecessor. Got:
- [Metadata(mydataset)] The current JSON-LD doesn't extend https://schema.org/Dataset.
Found the following 1 warning(s) during the validation:
- [Metadata(mydataset)] Property "http://purl.org/dc/terms/conformsTo" is recommended, but does not exist.
exit status 1
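The shape of the report above (a context path in brackets, errors listed before warnings, a non-zero exit status when errors exist) can be modeled with a small sketch. The types `Severity` and `Issue` and the function `render_report` are hypothetical and are not the library's API; the rule that warnings alone leave the exit status at 0 is also an assumption.

```rust
/// Hypothetical severity levels, mirroring the two groups in the report.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Severity {
    Error,
    Warning,
}

/// Hypothetical issue record: where it occurred and what went wrong.
struct Issue {
    severity: Severity,
    context: Vec<String>, // e.g. ["Metadata(mydataset)", "FileObject(a-csv-table)"]
    message: String,
}

/// Render one issue as "- [A > B] message".
fn format_issue(issue: &Issue) -> String {
    format!("- [{}] {}", issue.context.join(" > "), issue.message)
}

/// Build the report text and an exit code (assumed: 1 only if errors exist).
fn render_report(issues: &[Issue]) -> (String, i32) {
    let errors: Vec<&Issue> = issues.iter().filter(|i| i.severity == Severity::Error).collect();
    let warnings: Vec<&Issue> = issues.iter().filter(|i| i.severity == Severity::Warning).collect();
    let mut out = String::new();
    if errors.is_empty() && warnings.is_empty() {
        out.push_str("Validation passed with no issues.\n");
    }
    if !errors.is_empty() {
        out.push_str(&format!(
            "Found the following {} error(s) during the validation:\n",
            errors.len()
        ));
        for e in &errors {
            out.push_str(&format_issue(e));
            out.push('\n');
        }
    }
    if !warnings.is_empty() {
        out.push_str(&format!(
            "Found the following {} warning(s) during the validation:\n",
            warnings.len()
        ));
        for w in &warnings {
            out.push_str(&format_issue(w));
            out.push('\n');
        }
    }
    let exit_code = if errors.is_empty() { 0 } else { 1 };
    (out, exit_code)
}
```

Keeping the context as a list of path segments makes it straightforward to point at a nested element such as a field inside a record set.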
To add support for new data types, modify the infer_data_type function in src/croissant/core.rs:
fn infer_data_type(value: &str) -> &'static str {
    // Existing data type detection...

    // Add your new data type detection here
    if my_custom_type_detector(value) {
        return "sc:MyCustomType";
    }

    // Default to Text
    "sc:Text"
}
TODO.
Use Nix flakes to set up the build environment:
nix develop
Check the build arguments in your Nix flake or shell.nix file as needed.
Then build and run the project using:
nix build
nix