This repository defines the standard used by the glossAPI team for creating AI-ready textual datasets. It is based entirely on our internal study of modern data engineering practices for large-scale text pipelines.
The standard covers:
- File formats (JSONL for ingestion, Parquet for storage and training)
- Pipeline architecture (normalization, heuristic filtering, deduplication, PII handling, sharding)
- Metadata requirements (Data Card fields)
- Safety considerations
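To sketch how the formats and pipeline stages above fit together, the following Python snippet reads JSONL records, normalizes the text, drops exact duplicates, and writes a Parquet shard. The file names and record fields (`id`, `text`, `source`) are illustrative assumptions, not part of the standard.

```python
import hashlib
import json
import unicodedata

import pyarrow as pa
import pyarrow.parquet as pq


def normalize(text: str) -> str:
    # Unicode NFC normalization plus whitespace collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())


records, seen = [], set()
with open("ingest.jsonl", encoding="utf-8") as f:  # one JSON object per line
    for line in f:
        rec = json.loads(line)
        rec["text"] = normalize(rec["text"])
        # Exact-match deduplication on a content hash; real pipelines
        # typically layer near-duplicate detection (e.g. MinHash) on top.
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        records.append(rec)

# Columnar Parquet output for storage and training.
pq.write_table(pa.Table.from_pylist(records), "shard-00000.parquet")
```

Parquet's columnar layout makes the text column cheap to scan during training, which is why the standard separates the line-oriented ingestion format (JSONL) from the storage format (Parquet).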
We keep this project intentionally simple. There are several ways to contribute:
- Use GitHub Discussions for general questions, brainstorming, and community engagement
  - Great for exploring ideas before formal proposals
- Open a new Issue on GitHub
- Add one of the following labels:
  - `proposal` for new ideas or enhancements
  - `change-request` for modifications to existing standards
  - `question` for clarifications
- Fork this repository
- Make your changes to files in `/standards`
- Submit a Pull Request with a clear description
- Maintainers will review and provide feedback
We only review proposals related to:
- file formats (JSONL / Parquet)
- pipeline stages (normalization, filtering, deduplication, safety, sharding)
- metadata specification
- safety and redaction practices
- The glossAPI maintainers review each Discussion, Issue, and Pull Request
- Accepted changes are integrated into the `/standards` directory
- Changes are documented in commit history (no RFC process, no complex governance)
For more details, see CONTRIBUTING.md.
Version: v1.0 (initial release).