This repository defines the standard used by the glossAPI team for creating AI-ready textual datasets. It is based entirely on our internal study of modern data engineering practices for large-scale text pipelines.
The standard covers:
- File formats (JSONL for ingestion, Parquet for storage and training)
- Pipeline architecture (normalization, heuristic filtering, deduplication, PII handling, sharding)
- Metadata requirements (Data Card fields)
- Safety considerations
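To sketch how the formats and pipeline stages above fit together, the following Python snippet reads JSONL records, normalizes the text, drops exact duplicates, and writes a Parquet shard. The file names and record fields (`id`, `text`, `source`) are illustrative assumptions, not part of the standard.

```python
import hashlib
import json
import unicodedata

import pyarrow as pa
import pyarrow.parquet as pq


def normalize(text: str) -> str:
    # Unicode NFC normalization plus whitespace collapsing.
    return " ".join(unicodedata.normalize("NFC", text).split())


records, seen = [], set()
with open("ingest.jsonl", encoding="utf-8") as f:  # one JSON object per line
    for line in f:
        rec = json.loads(line)
        rec["text"] = normalize(rec["text"])
        # Exact-match deduplication on a content hash; real pipelines
        # typically layer near-duplicate detection (e.g. MinHash) on top.
        digest = hashlib.sha256(rec["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        records.append(rec)

# Columnar Parquet output for storage and training.
pq.write_table(pa.Table.from_pylist(records), "shard-00000.parquet")
```

Parquet's columnar layout makes the text column cheap to scan during training, which is why the standard separates the line-oriented ingestion format (JSONL) from the storage format (Parquet).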
We keep this project intentionally simple. There are several ways to contribute:
- Use GitHub Discussions for general questions, brainstorming, and community engagement
  - Great for exploring ideas before formal proposals
- Open a new Issue on GitHub
- Add one of the following labels:
  - `proposal` for new ideas or enhancements
  - `change-request` for modifications to existing standards
  - `question` for clarifications
- Fork this repository
- Make your changes to files in `/standards`
- Submit a Pull Request with a clear description
- Maintainers will review and provide feedback
We only review proposals related to:
- file formats (JSONL / Parquet)
- pipeline stages (normalization, filtering, deduplication, safety, sharding)
- metadata specification
- safety and redaction practices
- The glossAPI maintainers review each Discussion, Issue, and Pull Request
- Accepted changes are integrated into the `/standards` directory
- Changes are documented in commit history (no RFC process, no complex governance)
For more details, see CONTRIBUTING.md.
Version: v1.0 (initial release).