@alexeyev alexeyev commented Dec 5, 2024

Hello, thank you for your fantastic work.

Please add support for the Kyrgyz language. How can I help?

In this pull request I provide a list of characters and a list of words built from the two corpora from here using this hacky script:

import re

paths = [
    # "data/kir_community_2017/kir_community_2017-words.txt",
    "data/kir_newscrawl_2016_1M/kir_newscrawl_2016_1M-words.txt",
    "data/kir_wikipedia_2021_300K/kir_wikipedia_2021_300K-words.txt",
]

tokens = []
removable = re.compile(r"(.*[′…ЇЈЎ&')¤/´˅(\"A-Za-z0-9Α-Ωα-ω.úƒƖ½ö+ЄІ,:;?!>< ]+.*|Ё.*|\w-\w+)", re.UNICODE)

for path in paths:
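    with open(path, "r", encoding="utf-8") as rf:
        for line in rf:
            line = line.strip()
            if line:
                # Tab-separated columns: index 1 is the token, index 2 its count.
                split_line = line.split("\t")
                count = int(split_line[2])
                # Skip rare tokens (fewer than 6 occurrences).
                if count < 6:
                    continue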
                token = split_line[1].strip() \
                    .replace("ɵ", "ө") \
                    .replace("ϴ", "Ө") \
                    .replace("ʏ", "ү")
                token = token.strip("​•₣‰ʿ°—­‘»²¬/µ«£:;“”„'()´`$%–№.,-")
                if len(token) > 2 and not removable.match(token):
                    tokens.append(token)

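# Deduplicate and sort (by Unicode code point).
tokens = sorted(set(tokens))
tokens_clipped_tail = []

# Keep only the tokens that sort before "өөө"; drop it and everything after it.
for token in tokens:
    if token == "өөө":
        break
    tokens_clipped_tail.append(token)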

with open("ky.txt", "w", encoding="utf-8") as wf:
    wf.write("\n".join(tokens_clipped_tail))

print(f"A total of {len(tokens_clipped_tail)} tokens.")
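
The script above only writes the word list. As a minimal sketch of how the accompanying character list could be derived from it, assuming the characters are simply every distinct character occurring in `ky.txt`: the helper names and the output filename `ky_char.txt` are hypothetical and are not taken from this pull request.

```python
from pathlib import Path


def collect_characters(words):
    """Return the distinct characters used in `words`, sorted by code point."""
    return sorted({ch for word in words for ch in word})


def write_character_list(words_path="ky.txt", chars_path="ky_char.txt"):
    """Read one word per line and write one character per line.

    Both default filenames are assumptions made for illustration only.
    """
    words = Path(words_path).read_text(encoding="utf-8").splitlines()
    chars = collect_characters(words)
    Path(chars_path).write_text("\n".join(chars), encoding="utf-8")
    return chars
```

Note that sorting is by Unicode code point, so Kyrgyz-specific letters such as "ө" land after "я" rather than in alphabetical order.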

Best regards,
Anton.

@Hellomik2002

Can you help me with the Kazakh language? Please write to me at https://t.me/hellomik
