eellak/glossapi-data-standardization

glossAPI Text Dataset Standard

This repository defines the standard used by the glossAPI team for creating AI-ready textual datasets. It is based entirely on our internal study of modern data engineering practices for large-scale text pipelines.

The standard covers:

  • File formats (JSONL for ingestion, Parquet for storage and training)
  • Pipeline architecture (normalization, heuristic filtering, deduplication, PII handling, sharding)
  • Metadata requirements (Data Card fields)
  • Safety considerations
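To make the pipeline stages listed above more concrete, here is a minimal sketch in Python. It is illustrative only and is not the glossAPI implementation: the function names, the `text` field, the 20-character threshold, the email-only PII pattern, and the hash-based shard assignment are all assumptions chosen for the example. It uses exact-match deduplication on normalized text and emits JSONL lines per shard; a later conversion to Parquet (for example with a library such as pyarrow) is left out.

```python
import hashlib
import json
import re
import unicodedata

# Hypothetical PII pattern: only email addresses, for illustration.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def passes_filter(text: str, min_chars: int = 20) -> bool:
    """Heuristic filter: drop very short documents (threshold is an assumption)."""
    return len(text) >= min_chars


def redact_pii(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("[EMAIL]", text)


def run_pipeline(records, num_shards: int = 2):
    """Normalize, filter, deduplicate, redact, and shard records.

    Returns a list of shards; each shard is a list of JSONL lines.
    """
    seen = set()
    shards = [[] for _ in range(num_shards)]
    for record in records:
        text = normalize(record["text"])
        if not passes_filter(text):
            continue
        # Exact deduplication on the normalized text.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        clean = {**record, "text": redact_pii(text)}
        # Deterministic shard assignment derived from the content hash.
        shard_index = int(digest, 16) % num_shards
        shards[shard_index].append(json.dumps(clean, ensure_ascii=False))
    return shards


if __name__ == "__main__":
    sample = [
        {"id": 1, "text": "An open   dataset standard for text."},
        {"id": 2, "text": "An open dataset standard for text."},
        {"id": 3, "text": "Contact us at info@example.org for details."},
    ]
    for i, shard in enumerate(run_pipeline(sample)):
        for line in shard:
            print(i, line)
```

In this sketch the second record is dropped as a duplicate of the first once whitespace is normalized, and the email address in the third record is replaced with `[EMAIL]`.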

🔧 How to Contribute

We keep this project intentionally simple. There are three ways to contribute:

💬 Discussions (Recommended for questions & ideas)

  • Use GitHub Discussions for general questions, brainstorming, and community engagement
  • Great for exploring ideas before formal proposals

πŸ“ Issues (For tracked proposals)

  1. Open a new Issue on GitHub
  2. Add one of the following labels:
    • proposal → for new ideas or enhancements
    • change-request → for modifications to existing standards
    • question → for clarifications

🔀 Pull Requests (For direct contributions)

  1. Fork this repository
  2. Make your changes to files in /standards
  3. Submit a Pull Request with a clear description
  4. Maintainers will review and provide feedback

βœ”οΈ Scope of accepted contributions:

We only review proposals related to:

  • file formats (JSONL / Parquet)
  • pipeline stages (normalization, filtering, deduplication, safety, sharding)
  • metadata specification
  • safety and redaction practices

βœ”οΈ Review Process

  • The glossAPI maintainers review each Discussion, Issue, and Pull Request
  • Accepted changes are integrated into the /standards directory
  • Changes are documented in the commit history (no RFC process, no complex governance)

For more details, see CONTRIBUTING.md.


Version: v1.0 (initial release).
