Kyrgyz language support #1344

alexeyev · 2024-12-05T06:50:45Z

Hello, thank you for your fantastic work.

Please add support for the Kyrgyz language. How can I help?

In this pull request I provide a list of characters and a list of words built from two corpora from here, using this hacky script:

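import re

# Each input line is expected to be tab-separated, with the word in
# column 1 and its frequency count in column 2 (0-based).
paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",  # not used
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]

tokens = []
# Tokens matching this pattern are discarded: anything containing Latin or
# Greek letters, digits, or stray punctuation; anything starting with "Ё";
# and tokens that start with a single character followed by a hyphen.
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)

for path in paths:
    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if line:
                split_line = line.split("\t")
                count = int(split_line[2])
                # Skip rare tokens (fewer than 6 occurrences).
                if count < 6:
                    continue
                # Map look-alike characters to the proper Kyrgyz letters.
                token = split_line[1].strip() \
                    .replace("ɵ", "ө") \
                    .replace("ϴ", "Ө") \
                    .replace("ʏ", "ү")
                # Trim leading and trailing punctuation and symbols.
                token = token.strip("•₣‰ʿ°—‘»²¬/µ«£:;“”„'()´`$%–№.,-")
                if len(token) > 2 and not removable.match(token):
                    tokens.append(token)

# Deduplicate and sort.
tokens = sorted(set(tokens))
tokens_clipped_tail = []

# Keep only the tokens that sort before "өөө"; the rest of the sorted
# list is dropped.
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)

with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))

print(f"A total of {len(tokens_clipped_tail)} tokens.")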
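
As a sanity check on the resulting word list, a sketch like the one below could count any characters that fall outside the Kyrgyz Cyrillic alphabet (the 33 Russian letters plus ң, ө, ү). The name `unexpected_chars` and the decision to allow hyphens are assumptions made for illustration; they are not part of the script above or of the project's code.

```python
from collections import Counter

# Kyrgyz Cyrillic alphabet, lowercase: 33 Russian letters plus ң, ө, ү.
KYRGYZ_LOWER = "абвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя"


def unexpected_chars(words, allowed=KYRGYZ_LOWER + "-"):
    """Count characters in `words` that are not in `allowed`.

    Comparison is case-insensitive. Allowing "-" is an assumption,
    since hyphenated tokens can survive the filtering step.
    """
    allowed_set = set(allowed)
    counts = Counter()
    for word in words:
        for ch in word.lower():
            if ch not in allowed_set:
                counts[ch] += 1
    return counts
```

Passing the stripped lines of ky.txt to `unexpected_chars` and inspecting `.most_common()` would show which leftover characters, if any, the filter still lets through.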

Best regards,
Anton.

Hellomik2002 · 2024-12-16T14:10:44Z

Can you help me with the Kazakh language? Please write to me at https://t.me/hellomik

alexeyev added 2 commits December 5, 2024 08:52

[ENH] Added Kyrgyz characters

2712959

[ENH] Added Kyrgyz dictionary

63462e8

alexeyev mentioned this pull request Dec 5, 2024

List of languages in development #91

Closed
