Speed up tokenization for heavy workflows #2251
Conversation
Force-pushed from da71e73 to e93838a
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Gave this a test and it seems fine. The only thing is I'm a bit wary of adding all this logic into backend.js and the parser itself if there's not going to be any use outside the API. I'll take a closer look at that (hopefully tomorrow) and if it looks fine I'll merge.
I didn't try to use it elsewhere, as I don't think Yomitan usually handles this much text, and it probably would have made this PR much bigger. I'm not sure if we need to ignore the cache when there is no …
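As context for the cache-invalidation question above, here is a minimal sketch of what a profile-aware term-entry cache with a one-hour lifetime and settings-change clearing (the behavior described in this PR) could look like. The class name, key scheme, and method names are assumptions for illustration, not the actual backend.js implementation.

```typescript
// Minimal sketch only: the class name, key scheme, and method names
// are illustrative assumptions, not Yomitan's actual implementation.
class TermEntryCache<V> {
    private map = new Map<string, {value: V, expires: number}>();

    // The PR describes the cache as clearing after 1 hour.
    constructor(private ttlMs: number = 60 * 60 * 1000) {}

    // Profile-aware: the key includes the active profile, so entries
    // looked up under one profile's settings are never served to another.
    private key(profileId: string, term: string): string {
        return `${profileId}\u0000${term}`;
    }

    get(profileId: string, term: string): V | undefined {
        const k = this.key(profileId, term);
        const entry = this.map.get(k);
        if (entry === undefined) { return undefined; }
        if (Date.now() > entry.expires) {
            // Expired entries are dropped lazily on access.
            this.map.delete(k);
            return undefined;
        }
        return entry.value;
    }

    set(profileId: string, term: string, value: V): void {
        this.map.set(this.key(profileId, term), {value, expires: Date.now() + this.ttlMs});
    }

    // The PR also clears the cache on settings changes, since cached
    // lookups may no longer match the new dictionary configuration.
    onSettingsChanged(): void {
        this.map.clear();
    }
}
```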
Kuuuube left a comment
Seems to work fine. Performance in Yomitan looks good. I will trust you on the more intensive benchmarking results.
This will need an update to the API docs. I won't hold this change back from the next dev build since it's not breaking, but if you can get to that it would be appreciated.
With the new Yomitan API, asbplayer is now using it for long-requested features (killergerbah/asbplayer#813). The issue is that since Yomitan's tokenize wasn't designed for heavy workflows such as parsing an entire subtitle, it leaves a few simple optimizations on the table.
This PR includes 3 optimizations:
- Batched input (`text` can be a `string[]`; see the sketch below)
- Reusing the `tokenize`-internal `termEntries` call for lemmatize
- Caching term entries (~320MB at 10k Japanese entries; clears after 1 hour and on settings change; profile aware)

These give a performance improvement of 2.7x with no changes to the algorithm, taking the time to process a 3-hour Japanese subtitle from 5m18s to 1m58s, which significantly improves the asbplayer experience. Here is a full breakdown:

| Tokenize | Lemmatize | Total | Memory |
| --- | --- | --- | --- |
| 222.453s | 96.623s | 319.076s | 0.4GB |
| 117.552s | 0.835s | 118.387s | 1.7GB* |
| 139.642s | 93.357s | 232.999s | 1.5GB* |
| 244.836s | 0.875s | 245.711s | 0.4GB |
| 190.404s | 97.331s | 287.735s | 0.6GB |

\* Memory usage depends on how much the API user batches.
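To make the batching optimization concrete, here is a minimal sketch of how an API consumer like asbplayer could use it. `yomitanTokenize` and the `Token` type are placeholders assumed for illustration; they are not Yomitan's actual external API surface.

```typescript
// Placeholder for Yomitan's external tokenize API: the function name,
// payload shape, and Token type here are assumptions for illustration.
type Token = {text: string, lemma?: string};
declare function yomitanTokenize(text: string | string[]): Promise<Token[][]>;

async function tokenizeSubtitle(lines: string[]): Promise<Token[][]> {
    // Before this PR, a consumer had to issue one request per subtitle
    // line, paying the per-message and per-lookup overhead each time:
    //
    //   const results: Token[][] = [];
    //   for (const line of lines) {
    //       results.push((await yomitanTokenize(line))[0]);
    //   }
    //   return results;
    //
    // With batched input (`text` can be a `string[]`), an entire
    // subtitle can go out in a single request.
    return yomitanTokenize(lines);
}
```

Per the asterisked rows in the breakdown, memory usage scales with how aggressively the consumer batches, so a consumer may want to cap batch size rather than send a whole subtitle in one request.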