🤖 Part of #5005.
Description
Add PPL/gap eval sets for noisy text produced by ASR, OCR, screen scraping, and vision-to-text tools. Agents increasingly consume imperfect transcriptions rather than clean web text.
Initial sources:
Definition of Done
- Add at least one ASR-noise source and one OCR-noise source.
- Preserve noisy text and clean reference text where available.
- Report clean-vs-noisy bits-per-byte (BPB) deltas.
- Document licensing/access constraints, especially for Common Voice.
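A minimal sketch of how the clean-vs-noisy BPB delta could be computed, assuming each sample already has a summed negative log-likelihood in nats from the model under evaluation. The names `ScoredText`, `bits_per_byte`, and `bpb_delta` are hypothetical and not part of any existing eval harness. Normalizing each side by its own UTF-8 byte count is also an assumption, since the issue does not say which normalization to use.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScoredText:
    """Hypothetical container: a text plus its model score."""
    text: str
    nll_nats: float  # summed token negative log-likelihood, in nats


def bits_per_byte(samples: Sequence[ScoredText]) -> float:
    """Corpus-level BPB: total bits divided by total UTF-8 bytes."""
    total_bytes = sum(len(s.text.encode("utf-8")) for s in samples)
    if total_bytes == 0:
        raise ValueError("cannot compute BPB over zero bytes")
    # Convert nats to bits by dividing by ln(2).
    total_bits = sum(s.nll_nats for s in samples) / math.log(2)
    return total_bits / total_bytes


def bpb_delta(clean: Sequence[ScoredText], noisy: Sequence[ScoredText]) -> float:
    """Noisy BPB minus clean BPB; positive means the noisy text is harder."""
    return bits_per_byte(noisy) - bits_per_byte(clean)


# Illustrative values only, not real model scores.
clean = [ScoredText("hello world", 11 * math.log(2))]
noisy = [ScoredText("helo wrld", 18 * math.log(2))]
delta = bpb_delta(clean, noisy)
```

Because noisy and clean texts usually differ in length, the delta compares per-byte rates rather than totals. Where paired references exist, the same function can be applied per pair to get a distribution of deltas instead of a single corpus-level number.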