Description
I'd like to try to implement OCR support for Japanese (when time permits). I don't expect to finish anything soon, as I'm very inexperienced with OCR. I'm mainly opening this issue to track and coordinate my work in case anyone else is interested in contributing or wants to offer advice.
I'd personally like to focus the OCR on manga content initially. However, I see a general-purpose OCR as the final goal. If we get to that point, I'm not sure whether a separate model optimized for manga would be within the scope of this project's direction.
Related
1. Challenges
There are a few major differences from Latin scripts that will need to be addressed.
a) Kanji
There are many more kanji than there are Latin characters: probably around 2,000 common kanji, and on the order of 10,000 kanji in current use.
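To get a feel for how large the output alphabet would need to be, one option is to count the distinct characters in a text corpus and bucket them by script. The sketch below is only an illustration: `script_of` and `alphabet_stats` are hypothetical helper names, and the kanji range covers just the main CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer kanji in the extension blocks and marks such as 々 would land in "other".

```python
from collections import Counter

# Unicode block ranges (inclusive). Only the main CJK Unified Ideographs
# block is treated as kanji here; extension blocks are ignored (assumption).
SCRIPT_RANGES = {
    "hiragana": (0x3040, 0x309F),
    "katakana": (0x30A0, 0x30FF),
    "kanji": (0x4E00, 0x9FFF),
}


def script_of(ch):
    """Return the script bucket for a single character."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    return "other"


def alphabet_stats(corpus):
    """Return (sorted distinct non-space characters, count per script)."""
    alphabet = sorted({ch for ch in corpus if not ch.isspace()})
    counts = Counter(script_of(ch) for ch in alphabet)
    return alphabet, counts


alphabet, counts = alphabet_stats("漫画を読む。マンガ")
print(len(alphabet), dict(counts))
```

Running this over a large manga or news corpus would give a rough, data-driven estimate of how many kanji the model's alphabet actually needs to cover.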
b) Layout: Horizontal / Vertical
Text can be written either vertically (縦書き) or horizontally (横書き).
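As a rough illustration of how a layout step might tell the two directions apart, here is a minimal heuristic sketch. It assumes glyph centre points for a single text line are already available from detection (that input format and the `guess_orientation` name are my own assumptions, not part of any existing API), and it simply compares how far the centres spread along each axis.

```python
from statistics import pstdev


def guess_orientation(centers):
    """Guess the writing direction of one text line.

    centers: list of (x, y) glyph centre points belonging to the same line.
    Returns "vertical", "horizontal", or "unknown" when there are too few
    glyphs to decide.
    """
    if len(centers) < 2:
        return "unknown"
    xs = [x for x, _ in centers]
    ys = [y for _, y in centers]
    # Glyphs in a vertical line vary mostly in y; in a horizontal line, in x.
    return "vertical" if pstdev(ys) > pstdev(xs) else "horizontal"


print(guess_orientation([(10, 10), (10, 30), (10, 50)]))
print(guess_orientation([(10, 10), (30, 10), (50, 10)]))
```

A heuristic like this would break down for single-character lines and for grids of evenly spaced text, so a learned approach is probably needed in practice.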
c) Annotations: Furigana / Ruby text
Text can have annotations either on the right (for vertical text) or above (for horizontal text).
This text usually explains how certain words written in kanji should be read, but authors can also use it to provide synonyms, nuances, etc. It is therefore valuable to extract in the OCR. However, since it only adds information to the base text, it should be possible to separate it from the base text in the OCR's output.
This is definitely going to require the WIP layout engine.
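One way the output could keep annotations separate from the base text is to store them as spans over the base string. The following is a sketch of a possible data structure, not a proposal for the actual API: `RubyAnnotation`, `OcrLine`, and the parenthesised rendering are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RubyAnnotation:
    start: int  # index of the first annotated base character
    end: int    # index one past the last annotated base character
    text: str   # the furigana / ruby text itself


@dataclass
class OcrLine:
    base: str
    ruby: List[RubyAnnotation] = field(default_factory=list)

    def plain_text(self):
        """Base text only, with annotations ignored."""
        return self.base

    def annotated_text(self):
        """Base text with each annotation appended in parentheses."""
        out = []
        pos = 0
        for r in sorted(self.ruby, key=lambda r: r.start):
            out.append(self.base[pos:r.start])
            out.append(f"{self.base[r.start:r.end]}({r.text})")
            pos = r.end
        out.append(self.base[pos:])
        return "".join(out)


line = OcrLine("漢字を読む", [RubyAnnotation(0, 2, "かんじ"), RubyAnnotation(3, 4, "よ")])
print(line.plain_text())
print(line.annotated_text())
```

With this shape, callers who only want the base text never see the furigana, while those who want it can map each annotation back to the characters it belongs to. The sketch assumes annotations do not overlap.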
d) Fonts / Handwriting
I think various fonts should be supported. However, I think handwriting would be too difficult initially, as there can be quite a big difference between digital and handwritten characters. I would propose implementing a working OCR engine/model for digital text first; handwritten text can then be trained and optimized for later.
2. Training Data
I will need to conduct more research into this...
a) Datasets
- Manga109-s
- Available for commercial use (with some nuances)
- Cannot be redistributed (should such a dataset even be considered, given ocrs' requirements for datasets?)
b) Synthetic data
In the absence of a good dataset, one possibility is to generate synthetic data. This approach was used by the synthetic data generator in robertknight/mana-ocr. I'm thinking we could start with this until a good dataset is found, made, or becomes available for use.
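To make the idea concrete, here is a minimal sketch of the first stage of such a pipeline: producing randomized sample specifications (text, orientation, font size) that a separate rendering step would turn into images. This is not how the generator mentioned above works; `make_samples` and its parameters are hypothetical, and drawing characters uniformly at random is a simplification, since realistic data would more likely sample sentences from a corpus.

```python
import random


def make_samples(alphabet, count, min_len=1, max_len=12, seed=0):
    """Generate `count` randomized sample specs from `alphabet`.

    alphabet: a string or list of characters to draw from.
    Each spec is a dict that a renderer (not shown) could turn into a
    labelled training image.
    """
    rng = random.Random(seed)  # seeded so a dataset can be regenerated
    samples = []
    for _ in range(count):
        length = rng.randint(min_len, max_len)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        samples.append({
            "text": text,
            "orientation": rng.choice(["horizontal", "vertical"]),
            "font_size": rng.randint(16, 48),
        })
    return samples


for spec in make_samples("あいうえおアイウ漢字", 3, seed=1):
    print(spec)
```

Keeping the specification step separate from rendering means the same labels can be re-rendered later with different fonts, backgrounds, or noise.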

