Cleanup Dictionaries - Reduce False Positives #17

Terrtia · 2025-04-09T08:31:01Z

Methodology:

graph TD
    A[Language Dictionary] --> A1[Normalize Dictionary<br>Sort and lowercase all entries]
    A1 --> B[Remove words not in kaikki.org dictionary]
    B --> C[Use GlotLID on removed words<br>Re-add if detected language matches]
    C --> D[Remove person/fictional names]
    D --> E[Remove common interjections<br>used in multiple languages]
    E --> F[Remove country names<br>used in multiple languages]
    F --> G[Cleaned Language Dictionary]

Some names were also present in kaikki.org dictionaries. I manually removed some of them, but some might still remain.
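
The steps in the graph can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the scripts used for this PR: `clean_dictionary`, its parameters, and the `detect_language` callable are hypothetical names. The reference set stands in for a word list extracted from kaikki.org, and `detect_language` stands in for a wrapper around a language identifier such as GlotLID that returns a language code for a word.

```python
from typing import Callable, Iterable, List, Optional, Set


def normalize(words: Iterable[str]) -> List[str]:
    """Strip, lowercase, de-duplicate and sort dictionary entries."""
    return sorted({w.strip().lower() for w in words if w.strip()})


def clean_dictionary(
    words: Iterable[str],
    reference: Set[str],                 # assumed: word list taken from kaikki.org
    expected_lang: str,                  # assumed: code returned by detect_language
    detect_language: Optional[Callable[[str], str]] = None,
    names: Iterable[str] = (),           # person/fictional names to drop
    shared: Iterable[str] = (),          # interjections/country names shared across languages
) -> List[str]:
    entries = normalize(words)
    reference_lc = {w.lower() for w in reference}

    # Step 1: keep only entries found in the reference dictionary.
    kept = [w for w in entries if w in reference_lc]
    removed = [w for w in entries if w not in reference_lc]

    # Step 2: re-add removed entries whose detected language matches.
    if detect_language is not None:
        kept.extend(w for w in removed if detect_language(w) == expected_lang)

    # Step 3: drop names and words shared across multiple languages.
    blocked = {w.lower() for w in names} | {w.lower() for w in shared}
    return sorted(w for w in set(kept) if w not in blocked)
```

Passing the language identifier in as a callable keeps the sketch independent of any particular model; a real run would plug in a function that queries GlotLID and maps its label to the code expected for the dictionary being cleaned.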

I noticed that the Serbian dictionary isn’t Serbian but Serbo-Croatian.

https://kaikki.org/index.html
https://github.com/cisnlp/GlotLID

pierotofy · 2025-04-09T15:36:52Z

Hey @Terrtia thanks for the PR. This looks like a good effort. When you mention you've reduced false positives, have you benchmarked the results? Can you share those results and your methodology?

Terrtia added 30 commits March 10, 2025 16:44

chg: [bengali] remove english words

6b78d08

chg: [vietnamese] remove other languages words part1

32ffc18

chg: [vietnamese] remove other languages words part2

662b2d6

chg: [vietnamese] remove other languages words

34ae0a2

chg: [english] sort dict

5c068d4

fix: [english language] cleanup

afb9a25

chg:[dictionaries] sort dictionaries

5709dfd

chg:[catalan] remove other languages words/characters

871ebd6

chg:[dutch] remove other languages words/characters

2db0c43

fix: [esperanto language] cleanup

47a6834

fix: [estoninan language] cleanup

d4e383e

fix: [finnish language] cleanup

8eb4478

fix: [french language] cleanup

b9347a6

fix: [german language] sort dict

1c0489c

fix: [german language] lowercase + uniq

e13e6cb

fix: [german language] cleanup

fe77a61

fix: [greek language] cleanup

38c7de9

fix: [hungarian language] cleanup

539fae4

fix: [indonesian language] cleanup

038df54

fix: [italian language] cleanup

f09988d

fix: [bulgarian language] cleanup

4f5fa9f

fix: [afrikkans language] cleanup

de2c09d

fix: [afrikaans language] cleanup

4be8d48

fix: [portuguese language] cleanup

c545788

fix: [spanish language] cleanup

d4dca9e

fix: [polish language] cleanup

4b446a3

fix: [albanian language] cleanup

9a5fb90

chg: [czech language] cleanup

f4ee4a3

chg: [danish language] cleanup

3052c4c

chg: [vietnamese language] filter removed content with glotlid

42d5655

Terrtia added 13 commits March 26, 2025 15:06

chg: [romanian language] cleanup

2370d59

chg: [turkish language] cleanup

9384906

chg: [slovak language] cleanup

7544cab

chg: [swedish language] cleanup

c7c319b

chg: [norwegian language] cleanup

58d60fa

chg: [slovenian language] cleanup

40e0ef4

chg: [lithuanian language] cleanup

656692b

chg: [latvian language] cleanup

1d31367

chg: [Serbo-Croatian language] cleanup. serbian is Serbo-Croatian language

47bef8c

chg: [languages] remove shared interjections

a6f58d3

chg: [french language] lowercase words

7e0bc2c

chg: [languages] remove shared interjections + improve english and russian dict

305c36a

chg: [dictionaries] cleanup + add words

29ea1d6
