
Conversation

@ShanaryS

asbplayer now uses the new Yomitan API for long-requested features (killergerbah/asbplayer#813). The issue is that, since tokenize wasn't designed for heavy workflows such as parsing an entire subtitle, it leaves a few simple optimizations on the table.

This PR includes 3 optimizations:

  • Support concurrent tokenization (the API's text parameter can now be a string[]; see the sketch after this list)
  • Return headwords when tokenizing (callers no longer need to duplicate tokenize's internal termEntries lookup just to lemmatize)
  • Use an LRU cache for tokenize (320MB at 10k Japanese entries; cleared after 1 hour and on settings changes; profile-aware)
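
Below is a hedged sketch of what the batched call could look like from an API consumer's (e.g. asbplayer's) perspective. The action name, parameter names, and result fields are illustrative assumptions, not the exact Yomitan API surface.

```ts
// Hedged sketch of batched tokenization from an API consumer's perspective.
// The action name and result shape below are assumptions for illustration.

type TokenizeToken = {
    text: string;           // surface form as it appears in the subtitle line
    headwords?: string[];   // dictionary forms returned directly, so the caller
                            // no longer needs a second lookup just to lemmatize
};

// One token list per input string, so whole subtitle files can be sent in batches.
type TokenizeResult = TokenizeToken[][];

// `sendToYomitan` stands in for whatever messaging channel the consumer uses
// (for example chrome.runtime.sendMessage with Yomitan's extension id).
async function tokenizeSubtitleLines(
    lines: string[],
    sendToYomitan: (message: {action: string, params: unknown}) => Promise<{result: TokenizeResult}>,
): Promise<TokenizeResult> {
    // Passing string[] instead of a single string lets the backend tokenize
    // many lines concurrently rather than paying one round trip per line.
    const response = await sendToYomitan({action: 'tokenize', params: {text: lines}});
    return response.result;
}
```

How much memory batching costs is in the caller's hands, which is the trade-off noted under the benchmark table below.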

Together these give a 2.7x performance improvement with no changes to the algorithm, cutting the time to process a 3-hour Japanese subtitle from 5m18s to 1m58s, which significantly improves the asbplayer experience. Here is a full breakdown:

| Configuration | Tokenize | Lemmatize | Total | Memory Usage |
| --- | --- | --- | --- | --- |
| Baseline | 222.453s | 96.623s | 319.076s | 0.4GB |
| Parallel (1k) + Caching (10k) + Headwords | 117.552s | 0.835s | 118.387s | 1.7GB* |
| Parallel (1k) | 139.642s | 93.357s | 232.999s | 1.5GB* |
| Headwords | 244.836s | 0.875s | 245.711s | 0.4GB |
| Caching (10k) | 190.404s | 97.331s | 287.735s | 0.6GB |

* Memory usage depends on how much the API user batches
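
For the caching optimization, here is a minimal sketch of a profile-aware, time-bounded LRU cache, assuming a Map-based implementation keyed on profile plus source text; the names, sizes, and exact eviction policy are illustrative, not necessarily what this PR ships.

```ts
// Minimal sketch of a profile-aware LRU cache for tokenize results.
// Names and constants are illustrative; the PR's actual implementation may differ.

const MAX_ENTRIES = 10_000;         // roughly 320MB of Japanese entries per the PR description
const MAX_AGE_MS = 60 * 60 * 1000;  // entries older than one hour are discarded on access

type CacheEntry<T> = {value: T, storedAt: number};

class TokenizeCache<T> {
    // A Map preserves insertion order, so the first key is the least recently used.
    private _map = new Map<string, CacheEntry<T>>();

    get(profile: string, text: string): T | null {
        const key = `${profile}\u0000${text}`;
        const entry = this._map.get(key);
        if (typeof entry === 'undefined') { return null; }
        if (Date.now() - entry.storedAt > MAX_AGE_MS) {
            this._map.delete(key);
            return null;
        }
        // Re-insert to mark the entry as most recently used.
        this._map.delete(key);
        this._map.set(key, entry);
        return entry.value;
    }

    set(profile: string, text: string, value: T): void {
        const key = `${profile}\u0000${text}`;
        this._map.delete(key);
        this._map.set(key, {value, storedAt: Date.now()});
        if (this._map.size > MAX_ENTRIES) {
            // Evict the least recently used entry.
            const oldest = this._map.keys().next().value;
            if (typeof oldest !== 'undefined') { this._map.delete(oldest); }
        }
    }

    clear(): void {
        // Called on settings changes so stale parse options never leak into results.
        this._map.clear();
    }
}
```

Using a plain Map for LRU ordering keeps the cache dependency-free, and clearing it wholesale on settings changes is simpler than invalidating individual entries whose parse options changed.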

@ShanaryS ShanaryS requested a review from a team as a code owner November 29, 2025 17:51
@jamesmaa
Collaborator

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@Kuuuube
Member

Kuuuube commented Dec 10, 2025

Gave this a test and it seems fine. Only thing is I'm a bit wary of adding all this logic into backend.js and the parser itself if there's not going to be any use outside the api. I'll take a closer look at that (hopefully tomorrow) and if it looks fine I'll merge.

@ShanaryS
Author

ShanaryS commented Dec 10, 2025

> Only thing is I'm a bit wary of adding all this logic into backend.js and the parser itself if there's not going to be any use outside the api.

I didn't try to use it elsewhere since I don't think Yomitan usually handles this much text, and it probably would have made this PR much bigger.

I'm not sure we actually need to ignore the cache when no profileCurrent is passed, though, since the cache will be reset on settings changes anyway. That check could be removed so the cache is always used, provided _textParseScanning() implicitly applies only to the current profile when profileCurrent is omitted. I don't feel confident assessing that, so I left it explicit.
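
For concreteness, here is a rough sketch of the explicit behavior described above, using hypothetical names: the cache is consulted only when the current profile is known, and the call falls through to an uncached parse otherwise.

```ts
// Rough sketch of the explicit profileCurrent handling described above,
// with hypothetical names; not the PR's actual code.
async function tokenizeWithCache<T>(
    text: string,
    profileCurrent: string | null,
    parse: (text: string) => Promise<T>,
    cache: {get: (profile: string, text: string) => T | null, set: (profile: string, text: string, value: T) => void},
): Promise<T> {
    if (profileCurrent === null) {
        // Skip the cache entirely when the current profile is unknown (kept explicit in the PR).
        return await parse(text);
    }
    const cached = cache.get(profileCurrent, text);
    if (cached !== null) { return cached; }
    const result = await parse(text);
    cache.set(profileCurrent, text, result);
    return result;
}
```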

Member

@Kuuuube Kuuuube left a comment


Seems to work fine. Performance in Yomitan looks good. I will trust you on the more intense benchmarking results.

This will need an update to the API docs. I won't hold this change back from the next dev build over it since it's not breaking, but if you can get to that it would be appreciated.

@Kuuuube Kuuuube added this pull request to the merge queue Dec 14, 2025
Merged via the queue into yomidevs:master with commit d82684d Dec 14, 2025
25 checks passed
@ShanaryS ShanaryS deleted the tokenize-speedup branch December 14, 2025 19:56
@Kuuuube Kuuuube added the kind/enhancement and area/performance labels Dec 15, 2025