Speed up tokenization for heavy workflows #2251
Conversation
Force-pushed from da71e73 to e93838a
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Gave this a test and it seems fine. The only thing is I'm a bit wary of adding all this logic into backend.js and the parser itself if there's not going to be any use outside the API. I'll take a closer look at that (hopefully tomorrow) and if it looks fine I'll merge.
I didn't try to use it elsewhere, as I don't think Yomitan usually handles this much text, and it probably would have made this PR much bigger. I'm not sure if we need to ignore the cache when there is no …
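As context for the cache-invalidation question above, here is a minimal sketch of what a profile-aware term-entry cache with a one-hour lifetime and settings-change clearing (the behavior described in this PR) could look like. The class name, key scheme, and method names are assumptions for illustration, not the actual backend.js implementation.

```typescript
// Minimal sketch only: the class name, key scheme, and method names
// are illustrative assumptions, not Yomitan's actual implementation.
class TermEntryCache<V> {
    private map = new Map<string, {value: V, expires: number}>();

    // The PR describes the cache as clearing after 1 hour.
    constructor(private ttlMs: number = 60 * 60 * 1000) {}

    // Profile-aware: the key includes the active profile, so entries
    // looked up under one profile's settings are never served to another.
    private key(profileId: string, term: string): string {
        return `${profileId}\u0000${term}`;
    }

    get(profileId: string, term: string): V | undefined {
        const k = this.key(profileId, term);
        const entry = this.map.get(k);
        if (entry === undefined) { return undefined; }
        if (Date.now() > entry.expires) {
            // Expired entries are dropped lazily on access.
            this.map.delete(k);
            return undefined;
        }
        return entry.value;
    }

    set(profileId: string, term: string, value: V): void {
        this.map.set(this.key(profileId, term), {value, expires: Date.now() + this.ttlMs});
    }

    // The PR also clears the cache on settings changes, since cached
    // lookups may no longer match the new dictionary configuration.
    onSettingsChanged(): void {
        this.map.clear();
    }
}
```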
Kuuuube left a comment
Seems to work fine. Performance in Yomitan looks good. I will trust you on the more intensive benchmarking results.
This will need an update to the API docs. I won't hold this change back from the next dev build since it's not breaking, but if you can get to that it would be appreciated.
With the new Yomitan API, asbplayer is now using it for long-requested features (killergerbah/asbplayer#813). The issue is that since Yomitan's tokenize wasn't designed for heavy workflows such as parsing an entire subtitle, it leaves a few simple optimizations on the table.
This PR includes 3 optimizations:
- Batched input (`text` can be a `string[]`; see the sketch below)
- Reusing the `tokenize`-internal `termEntries` call for lemmatize
- Caching term entries (~320MB at 10k Japanese entries; clears after 1 hour and on settings change; profile aware)

These give a performance improvement of 2.7x with no changes to the algorithm, taking the time to process a 3-hour Japanese subtitle from 5m18s to 1m58s, which significantly improves the asbplayer experience. Here is a full breakdown:

| Tokenize | Lemmatize | Total | Memory |
| --- | --- | --- | --- |
| 222.453s | 96.623s | 319.076s | 0.4GB |
| 117.552s | 0.835s | 118.387s | 1.7GB* |
| 139.642s | 93.357s | 232.999s | 1.5GB* |
| 244.836s | 0.875s | 245.711s | 0.4GB |
| 190.404s | 97.331s | 287.735s | 0.6GB |

\* Memory usage depends on how much the API user batches.
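To make the batching optimization concrete, here is a minimal sketch of how an API consumer like asbplayer could use it. `yomitanTokenize` and the `Token` type are placeholders assumed for illustration; they are not Yomitan's actual external API surface.

```typescript
// Placeholder for Yomitan's external tokenize API: the function name,
// payload shape, and Token type here are assumptions for illustration.
type Token = {text: string, lemma?: string};
declare function yomitanTokenize(text: string | string[]): Promise<Token[][]>;

async function tokenizeSubtitle(lines: string[]): Promise<Token[][]> {
    // Before this PR, a consumer had to issue one request per subtitle
    // line, paying the per-message and per-lookup overhead each time:
    //
    //   const results: Token[][] = [];
    //   for (const line of lines) {
    //       results.push((await yomitanTokenize(line))[0]);
    //   }
    //   return results;
    //
    // With batched input (`text` can be a `string[]`), an entire
    // subtitle can go out in a single request.
    return yomitanTokenize(lines);
}
```

Per the asterisked rows in the breakdown, memory usage scales with how aggressively the consumer batches, so a consumer may want to cap batch size rather than send a whole subtitle in one request.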