@david4096
Contributor

This PR implements features from issue #850 to enhance croissant-rdf with round-trip RDF conversion and multi-provider graph merging capabilities.

Changes

1. Fix test for varying @context formats

Some datasets return @context as a string while others return it as a dict. Updated test_fetch_data_workflow to handle both formats correctly.
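
A minimal sketch of how a test could tolerate both shapes. The helper name `context_iri` and the sample documents are hypothetical and are not taken from this PR; the sketch only assumes that `@context` is either a plain string or a dict carrying `@vocab`.

```python
from typing import Any, Dict, Optional


def context_iri(doc: Dict[str, Any]) -> Optional[str]:
    """Return an IRI a test can assert on, whichever shape @context has.

    Hypothetical helper for illustration; not part of croissant-rdf.
    """
    context = doc.get("@context")
    if isinstance(context, str):
        # String form: the value is itself the IRI of the context.
        return context
    if isinstance(context, dict):
        # Dict form: the default vocabulary is stored under @vocab
        # (it may be absent, in which case None is returned).
        return context.get("@vocab")
    return None


# Hypothetical sample documents covering both shapes.
string_form = {"@context": "https://schema.org", "name": "a"}
dict_form = {
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "name": "b",
}

for doc in (string_form, dict_form):
    iri = context_iri(doc)
    # Both shapes yield a schema.org IRI, so one assertion covers both.
    assert iri is not None and iri.startswith("https://schema.org")
```

Checking the type before indexing avoids the `TypeError` that a dict-only access would raise on the string form.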

2. RDF to JSON-LD conversion

Implements the reverse operation to regenerate Croissant JSON-LD from RDF files.

  • New convert_from_rdf() method in CroissantHarvester
  • New CLI tool: rdf-to-jsonld
  • Supports all RDF formats (Turtle, N-Triples, RDF/XML, etc.)
  • Comprehensive round-trip conversion tests

3. Multi-provider RDF merging

Enables combining RDF files from multiple Croissant providers into unified knowledge graphs.

  • New merge-rdf CLI tool
  • Automatic deduplication
  • Wildcard support for batch operations
  • Multiple output format options
  • Full test coverage
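
To illustrate what triple-level deduplication does and does not cover, here is a small sketch that models an RDF graph as a plain Python set of (subject, predicate, object) tuples. It does not use rdflib or the code in this PR; the `merge_graphs` helper and the example IRIs are hypothetical. It relies only on the fact that an RDF graph is a set of triples, so identical triples collapse on merge.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def merge_graphs(*graphs: Iterable[Triple]) -> Set[Triple]:
    """Union several triple collections; identical triples collapse."""
    merged: Set[Triple] = set()
    for graph in graphs:
        merged.update(graph)
    return merged


NAME = "https://schema.org/name"

# Hypothetical provider exports: the first triple appears in both files.
huggingface = [("https://example.org/hf/mnist", NAME, "MNIST")]
openml = [
    ("https://example.org/hf/mnist", NAME, "MNIST"),    # exact duplicate
    ("https://example.org/openml/554", NAME, "MNIST"),  # same label, different IRI
]

merged = merge_graphs(huggingface, openml)
print(len(merged))  # 2: the exact duplicate is dropped, the second node is kept
```

Under this model, two nodes that share a label but have different identifiers stay distinct after merging; treating them as one dataset would need an explicit entity-matching step on top of the set union.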

4. Documentation improvements

  • Comprehensive Quick Start section for all providers
  • CLI tools reference table
  • Practical use case examples (cross-platform catalogs, bioinformatics KG)
  • Improved SPARQL query examples
  • Architecture diagram

Test Results

All tests passing: 21/21 (71% code coverage)

CLI Tools Added

  • rdf-to-jsonld: Convert RDF back to Croissant JSON-LD
  • merge-rdf: Merge multiple RDF files into unified graphs

Example Usage

# Convert RDF to JSON-LD
rdf-to-jsonld datasets.ttl

# Merge multiple sources
merge-rdf huggingface.ttl openml.ttl kaggle.ttl -o unified.ttl

Addresses objectives from #850. @stefanches7

Some datasets return @context as a simple string (e.g., "https://schema.org")
while others return it as a dict with @vocab and namespace prefixes.
Updated test_fetch_data_workflow to handle both formats correctly.

Implements the reverse operation to regenerate Croissant JSON-LD from RDF files.
This addresses one of the key objectives from issue mlcommons#850.

Changes:
- Add convert_from_rdf() method to CroissantHarvester
- Create new rdf-to-jsonld CLI tool for easy conversion
- Add comprehensive tests for round-trip conversion
- Supports all RDF formats (Turtle, N-Triples, RDF/XML, etc.)

Implements the ability to merge RDF files from multiple Croissant providers
into a unified knowledge graph. This addresses an objective of issue mlcommons#850.

Features:
- Merge multiple RDF files with automatic deduplication
- Support for various RDF formats (Turtle, N-Triples, RDF/XML, etc.)
- CLI tool 'merge-rdf' for easy merging
- Wildcard support for batch merging (e.g., *.ttl)
- Output format selection (turtle, json-ld, n3, nt, xml)
- Comprehensive tests for merging and deduplication

Example:
  merge-rdf huggingface.ttl openml.ttl kaggle.ttl -o unified.ttl

Major improvements:
- Added comprehensive Quick Start section with all providers
- Documented new CLI tools: rdf-to-jsonld and merge-rdf
- Added CLI tools reference table
- Included practical use cases (cross-platform catalogs, bioinformatics KG)
- Improved SPARQL query examples with better descriptions
- Added architecture diagram
- Reorganized development section for better clarity
- Highlighted multi-provider and knowledge graph merging capabilities

The README now reflects all new features implemented for issue mlcommons#850.

@david4096 david4096 requested a review from a team as a code owner October 8, 2025 15:20
@github-actions

github-actions bot commented Oct 8, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@david4096 david4096 force-pushed the croissant-rdf-investigation branch from eeb7446 to f297487 Compare October 9, 2025 07:53

Changed rdf-to-jsonld and merge-rdf from standalone commands to
subcommands (to-jsonld and merge) under a unified croissant-rdf CLI.
The old standalone commands remain for backward compatibility.

Added documentation for the new croissant-rdf command with to-jsonld
and merge subcommands. Updated all usage examples to show the new
unified CLI while noting that legacy commands remain available.

Only the unified croissant-rdf CLI with to-jsonld and merge
subcommands is now available. Updated documentation accordingly.

@stefanches7
Contributor

Looks good, @david4096! I only wonder whether the simple dispatches to rdflib will do what we want here, especially when checking for duplicates: if two nodes have the same label but different rdflib identifiers, will they not be counted as duplicates? Apart from this, I have seen no substantial errors. Let me know if you want me to test locally too or write tests (I am not an official reviewer here; @mlcommons/wg-croissant is).
