Harsh Shinde edited this page Aug 31, 2025 · 20 revisions

AI-ready Dataset Metadata as a Service

ZOO-Project MLC-Croissant GSoC OSGeo

Brief Description

The project aims to add native GeoCroissant metadata support to the ZOO-Project so that it can serve AI-ready geospatial datasets. It will provide tools for metadata generation and validation, integration with STAC, Earth Engine, and HuggingFace, and data-centric AI workflows for improving dataset quality.

State of the Project Before GSoC

Before this project, the ZOO-Project offered solid support for OGC-compliant geoprocessing but had no built-in support for GeoCroissant, a metadata standard designed specifically for AI-ready geospatial datasets. ZOO provided no tools for creating or validating this kind of metadata, or for connecting with STAC, Earth Engine, or machine learning hubs such as HuggingFace and Kaggle. It also lacked workflows for checking the quality of training data or for fixing common issues such as annotation errors and bias. This project aims to fill those gaps.

Deliverables

  • Integration of GeoCroissant metadata support into OGC API – Processes
  • Services for metadata generation, validation, and conversion from STAC, Earth Engine, HuggingFace, and Kaggle
  • REST endpoints for metadata hosting and JSON-LD-based service chaining
  • Implementation of Data-Centric AI workflows using Cleanlab for label noise and bias detection
  • Interoperability tools for STAC, OGC TrainingDML, and MLCommons Croissant formats
  • Full test suite, example datasets, and usage tutorials
  • Comprehensive documentation and project wiki with deployment guides
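
To make the "conversion from STAC" and "validation" deliverables more concrete, here is a minimal sketch of the general idea. It is not the project's implementation and does not use the GeoCroissant vocabulary: it maps a STAC Item (assumed to have a 2D `bbox`) to a plain schema.org `Dataset` record in JSON-LD. The function names `stac_item_to_jsonld` and `missing_keys`, and the `REQUIRED_KEYS` list, are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical minimum set of keys for this sketch; the real GeoCroissant
# requirements are defined by the project, not here.
REQUIRED_KEYS = ("@context", "@type", "name", "url")


def stac_item_to_jsonld(item: dict) -> dict:
    """Map a STAC Item dict to a schema.org Dataset record (JSON-LD)."""
    props = item.get("properties", {})
    # Assumes a 2D bbox in STAC order: [west, south, east, north].
    west, south, east, north = item["bbox"]
    self_links = [
        link["href"] for link in item.get("links", []) if link.get("rel") == "self"
    ]
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": props.get("title", item["id"]),
        "description": props.get("description", ""),
        "url": self_links[0] if self_links else "",
        "temporalCoverage": props.get("datetime"),
        "spatialCoverage": {
            "@type": "Place",
            # schema.org GeoShape boxes list latitude before longitude.
            "geo": {"@type": "GeoShape", "box": f"{south} {west} {north} {east}"},
        },
    }


def missing_keys(record: dict) -> list:
    """Return the required keys that are absent or empty in the record."""
    return [key for key in REQUIRED_KEYS if not record.get(key)]


if __name__ == "__main__":
    # Made-up example item; the URL is a placeholder.
    example_item = {
        "id": "scene-001",
        "bbox": [2.0, 48.5, 2.5, 49.0],
        "properties": {"datetime": "2025-01-01T00:00:00Z"},
        "links": [{"rel": "self", "href": "https://example.com/items/scene-001"}],
    }
    record = stac_item_to_jsonld(example_item)
    print(json.dumps(record, indent=2))
    print("missing:", missing_keys(record))
```

In a deployment along the lines described above, functions like these would sit behind an OGC API – Processes service, so that conversion and validation could be invoked and chained as separate processes.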

Detailed Proposal

Detailed Proposal Link (Google Doc)

Participants

Title              Name            GitHub Handle
1st Mentor         Chetan Mahajan  @cOsprey
2nd Mentor         Gérald Fenoy    @gfenoy
Student Developer  Harsh Shinde    @HarshShinde0
