@SATVIKsynopsis
Contributor

Fixes #6881

Replaces one normalizer benchmark input with plain text content
derived from unicode-org/test-corpora.

Member

@sffc left a comment

components/normalizer/benches/data/en.txt looks garbled. I don't think it came from unicode-org/test-corpora.

@@ -0,0 +1 @@
Gutenbergih| |cugayso| |kaxxa| |Gatsbi| |ekitaabaGutenbergih| |cugayso| |kaxxa| |Gatsbi| |ekitaabaTama| |ebook| |Ameerikal| |kee| |baadal| |tan| |baaxooxal| |faxe| |numih| |gabat| |geytimam| |kee| |kalah| |tan| |baaxooxal| |faxe| |numih| |gabat| |geytima|.Atu| |koppi| |abtam|,| |tet| |taceem|,| |hinnay| |kaadu| |qagitaak| |edde| |tantifiqem| |duddah|,| |Gutenbergih| |Projektih| |Laysensih| |addat| |tan| |ebook| |lih| |hinnay| |kaadu| |www.gutenberg.org| |.Atu| |Ameerikal| |geytimtah| |tan| |baaxoh| |addal| |geytimtah| |tan| |baaxooxah| |madqooqi| |wagsiisak|,| |tama| |eBook| |wagsiisak| |geytimtah| |tan| |baaxoh| |madqooqi| |cubbussam| |faxximta|.Ammunta|:| |Kaxxa| |GatsbiOrbise|:| |F.| |Iskot| |FitzgeraldCabiyyi| |ayro|:| |Agda| |baxissok| |17|,| |2021| |[|ekitab| |#|64317|]mangom| |xayi| |uddurut| |yusqussubime|:| |qunxa| |garabluk| |26|,| |2025Af| |:| |EngeloCredits| |:| |Alex| |Cabal| |Standard| |Ebooks| |porojektih| |addat| |geytima|,| |Gutenberg| |Awustiraaliyah| |porojektih| |addat| |geytimah| |yan| |kutbeytal| |rakitak|.***| |CUGAAYSO| |QIMMIS| |GUTENBERG| |EBOOK| |KAXXA| |GATSBY| |***Kaxxa| |GatsbieddeF.| |Iskot| |FitzgeraldAddatinoh| |QarwaliAnuIIIIIIVVVIVIIVIIIIXQagitakwohuhZeldaTokkeek| |dahab| |koofiyat| |haysit|,| |toh| |tet| |tasgayye| |koo| |tekkek|;Atu| |fayyal| |kaqitte| |koo| |tekkek|,| |teetih| |kaadu| |kaqit|,is| |weqtam| |fanah| |"| |kacnoyta|,| |dahab|-|koofiya|,| |fayyale| |kacnoyta|,Anu| |koo| |aallem| |faxximta|!|”Toomas| |parki| |d'|inviliyersAnuyok| |qunxa| |mariino| |kee| |mango| |gadamsite| |liggidittet| |yabba| |dagoo| |fayu| |yoh| |yeceeh| |too| |anu| |inni| |mesenkacat| |fakkiimeh| |suge| |tohuk| |qemmissa| |haanam|."|Faxe| |way| |atu| |faxe| |num| |sadak| |sugte| |wak|,|"| |usuk| |yoh| |warse|,| |"|bas| |kassit| |ta| |baadal| |tan| |ummatta| |inkih| |atu| |luk| |sugte| |tuxxiq| |alle| |waytam|.|"Usuk| |mangom| |maxacinna|,| |takkay| |immay| |nanu| |umman| |way| |qaadah| |ane| |wayta| |angaaraw| |abneh| |sugneeh|,| |anu| |edde| 
|radeh| |usuk| |tohuk| |muxxi| |iyyam| |faxam|.Tohih| |taagah|,| |anu| |inkih| |tan| |cokmitte| |reserve| |abak|,| |mango| |curious| |natures| |yoh| |fakkiimeeh|,| |kaadu| |veteran| |bores| |lih| |angaaraw| |le|. No newline at end of file
Member

Please at least add, at the top of the file, the source path within test-corpora, such as

# Source within unicode-org/test-corpora: gutenberg/Fitzgerald-64317/out/google/txt/es/7069925737426509887_64317-h-1.htm.txt

@SATVIKsynopsis
Contributor Author

I added a header at the top of the file containing the source and its path.

@sffc
Member

sffc commented Jan 18, 2026

The source header looks good. Please don't concatenate the lines together, though. Also, I think you should use the txt file rather than the breakpoints file, since we are testing the normalizer, not the segmenter.
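
For illustration only, here is a minimal sketch of how a benchmark could consume a data file that carries a `#` source header while keeping the original line breaks intact. It assumes the header consists of one or more leading lines starting with `#`. The helper `split_header` is hypothetical and not part of ICU4X or the existing normalizer benches.

```rust
/// Splits bench data into its leading `#` comment lines and the remaining body.
///
/// Hypothetical helper for illustration; it is not an ICU4X API.
/// Only comment lines at the very top of the input are treated as header.
/// The body is returned untouched, so its line breaks are preserved.
fn split_header(data: &str) -> (Vec<&str>, &str) {
    let mut header = Vec::new();
    let mut rest = data;
    while rest.starts_with('#') {
        // Peel off one line, dropping its trailing newline if present.
        let (line, tail) = match rest.find('\n') {
            Some(i) => (&rest[..i], &rest[i + 1..]),
            None => (rest, ""),
        };
        // Tolerate CRLF line endings in the header.
        header.push(line.trim_end_matches('\r'));
        rest = tail;
    }
    (header, rest)
}
```

A bench could then pass only the body to the normalizer, so the header comment does not affect the measured input, while the file still records where the text came from.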

Development

Successfully merging this pull request may close these issues.

Replace normalizer bench data with data from https://github.com/unicode-org/test-corpora/

2 participants