Feature/mecab [/tokenize] support for mecab #2254
base: master
Conversation
…horeels/yomitan into feature/mecab-tokenizer-improvements
I've added handling of translated POS tags, like unidic-mecab-translate. There's a small caveat in that the "aux-verb" pos1 has been cut down to "aux" by the mecab.py script in https://github.com/yomidevs/yomitan-mecab-installer (see the normalization sketch below).

I'm pretty satisfied with the results for 40 sentences: 10 I crafted/extracted from books where I saw the original tokenizer parsing things incorrectly (like がやる into がや+る), and 30 extra sentences I generated. All the cases are in the file attached below, and here is the summary:

Full results: tokenize_test.txt

The mecab ipadic/unidic differences are related to the granularity of some words, but this has to be expected since one dictionary is more granular than the other. For mecab vs scan, sometimes it's just a matter of a few details, like punctuation being aggregated with the scan method, and sometimes it's more about the greediness of the scan method.
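For illustration, here is a minimal sketch of how that truncated tag could be normalized on the Yomitan side; the alias table and the function name are hypothetical, not the exact code in this branch:

```js
// Hypothetical normalization of translated POS tags.
// The mecab.py script in yomitan-mecab-installer emits 'aux' where
// unidic-mecab-translate uses 'aux-verb', so both are mapped to one tag here.
const TRANSLATED_POS_ALIASES = new Map([
    ['aux', 'aux-verb'],
]);

function normalizeTranslatedPos(pos1) {
    return TRANSLATED_POS_ALIASES.get(pos1) ?? pos1;
}
```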
Small update: I'm currently integrating this change with asbplayer's latest merged PR killergerbah/asbplayer#813. I'm still doing my best to achieve parsing that is as consistent as possible between ipadic/unidic, but for some cases I'll have to modify the mecab-api to preserve certain fields, to be able to differentiate between だ as a copula and だ as the past tense.
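As a rough idea of what preserving those fields would enable, here is a sketch assuming the conjugation-type (活用型) column is passed through; the `conjugatedType` field name is hypothetical, and the 特殊・ダ/特殊・タ values reflect my reading of IPAdic output, so treat them as an assumption:

```js
// Hypothetical disambiguation of だ, assuming the mecab-api is extended to
// forward the conjugation-type column as `conjugatedType` (an assumed name).
// In IPAdic output (as I understand it) the copula だ carries 特殊・ダ while
// the past-tense auxiliary (as in 読んだ) carries 特殊・タ.
function classifyDa(token) {
    if (token.term !== 'だ' || token.pos !== '助動詞') { return null; }
    if (token.conjugatedType === '特殊・ダ') { return 'copula'; }
    if (token.conjugatedType === '特殊・タ') { return 'past-tense'; }
    return 'unknown';
}
```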
Hello,
For people with mecab installed, I've been working on using mecab as another way for the /tokenize endpoint to parse sentences.
The main benefit is performance, along with its non-greediness, which helps avoid "eating" この世界 into この世+界 or がやる into がや+る (see more examples below). There are some examples of simple vs mecab tokenize with this branch.
As you can see, there are a few differences, typically for things like そっか suddenly getting split into two, but this won't break lookups made on the そ of そっか during real lookups; it simply lets the tools using the /tokenize endpoint know that those are two different entries.
mecab normally also splits things like ます, the た form, and so on, but I added some logic to make its output as close as possible to the existing tokenizer's (minus the greediness).
Mecab output:

もう一度、聞くわ。──どうして私を、『嫉妬の魔女』の名で呼ぶの
もう一度    副詞,一般,*,*,*,*,もう一度,モウイチド,モーイチド
、          記号,読点,*,*,*,*,、,、,、
聞く        動詞,自立,*,*,五段・カ行イ音便,基本形,聞く,キク,キク
わ          助詞,終助詞,*,*,*,*,わ,ワ,ワ
。          記号,句点,*,*,*,*,。,。,。
─          記号,一般,*,*,*,*,─,─,─
─          記号,一般,*,*,*,*,─,─,─
どうして    副詞,一般,*,*,*,*,どうして,ドウシテ,ドーシテ
私          名詞,代名詞,一般,*,*,*,私,ワタシ,ワタシ
を          助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
、          記号,読点,*,*,*,*,、,、,、
『          記号,括弧開,*,*,*,*,『,『,『
嫉妬        名詞,サ変接続,*,*,*,*,嫉妬,シット,シット
の          助詞,連体化,*,*,*,*,の,ノ,ノ
魔女        名詞,一般,*,*,*,*,魔女,マジョ,マジョ
』          記号,括弧閉,*,*,*,*,』,』,』
の          助詞,連体化,*,*,*,*,の,ノ,ノ
名          名詞,一般,*,*,*,*,名,ナ,ナ
で          助詞,格助詞,一般,*,*,*,で,デ,デ
呼ぶ        動詞,自立,*,*,五段・バ行,基本形,呼ぶ,ヨブ,ヨブ
の          助詞,終助詞,*,*,*,*,の,ノ,ノ

Handling of merges:
const shouldMerge = (
    // 助動詞 or 動詞-接尾 (but not after 記号)
    ((tokenPos === '助動詞' || (tokenPos === '動詞' && tokenPos2 === '接尾')) && last_token.pos !== '記号') ||
    // て/で particle after verb
    (tokenPos === '助詞' && tokenPos2 === '接続助詞' && (term === 'て' || term === 'で') && last_token.pos === '動詞')
);
if (shouldMerge) {
    line.pop();
    term = last_token.term + term;
    reading = last_token.reading + reading;
    source = last_token.source + source;
}

Another big perk is how fast it can be to parse huge texts. Even with some optimizations like block-tokenizing with the simple endpoint, I was able to parse the full Oppenheimer srt files in about 600ms instead of ~95s with the simple tokenizer.
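To make the effect of that rule concrete, here is a hypothetical before/after of what the merge does for a verb followed by the て connective; the token shapes are illustrative, not verbatim output from this branch:

```js
// Hypothetical illustration of the merge rule above.
// Raw mecab tokens for 聞いて: the verb stem and the て particle are separate.
const before = [
    {term: '聞い', reading: 'キイ', source: '聞い', pos: '動詞'},
    {term: 'て', reading: 'テ', source: 'て', pos: '助詞'},
];
// After the merge, て is folded back into the preceding verb, which is what
// keeps the mecab output close to the existing simple tokenizer.
const after = [
    {term: '聞いて', reading: 'キイテ', source: '聞いて', pos: '動詞'},
];
```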
It also allows near-realtime tokenization of sentences, useful for projects like asbplayer or, in this case, my fork of ebook-reader (ttsu-reader):
https://github.com/user-attachments/assets/3847cd0f-e3b8-41d7-a3e1-e02de35500a5
Each tokenize call takes between 2-3ms instead of 25-100ms for the simple one on my computer.
To keep things backward compatible, if the parser is not set in the query, I fall back to the simple parser. But if you add "parser: mecab", it will use mecab.
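For illustration, a request opting into mecab might look like the sketch below; the URL, port, and text field name are assumptions here, only the parser field comes from this PR:

```js
// Hypothetical request shape: only `parser: 'mecab'` is what this PR adds;
// the port, path, and `text` field name are assumed for illustration.
const response = await fetch('http://127.0.0.1:19633/tokenize', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({
        text: 'もう一度、聞くわ。',
        parser: 'mecab', // omit this field to keep the existing simple-parser behaviour
    }),
});
const tokens = await response.json();
```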
Something I could take a look at is to, by default (if no parser is specified), use mecab if the user has selected mecab in their options.
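A rough sketch of what that default could look like, with hypothetical names for the option lookup and the resolver function:

```js
// Hypothetical sketch of the proposed default: an explicit request parameter wins,
// otherwise fall back to whatever the user configured in their options.
// `options.parsing.enableMecabParser` is an assumed option path, for illustration only.
function resolveParser(requestParser, options) {
    if (requestParser === 'mecab' || requestParser === 'simple') {
        return requestParser;
    }
    return options.parsing.enableMecabParser ? 'mecab' : 'simple';
}
```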
Any thoughts/recommendations are of course welcome.