What should be the norm_mode for different languages?

I see that norm_mode is documented with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103:

1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
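
To make the question concrete, here is how I read that list, written as a small Python lookup. This is only a sketch of my understanding: the table contains nothing beyond the scripts named in the three lines above, and the names `NORM_MODE_BY_SCRIPT` and `norm_mode_for_script` are made up for illustration; they are not part of Tesseract or tesstrain.

```python
# Hypothetical lookup built only from the three descriptions quoted above.
# It is not taken from Tesseract or tesstrain source code.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,      # combine graphemes
    "Indic": 2,      # split graphemes
    "Khmer": 2,
    "Myanmar": 2,
    "Arabic": 3,     # pure unicode
    "Hebrew": 3,
    "Thai": 3,
    "Tibetan": 3,
}


def norm_mode_for_script(script):
    """Return the norm_mode listed for a script, or None if it is not listed."""
    return NORM_MODE_BY_SCRIPT.get(script)


if __name__ == "__main__":
    for name in ("Latin", "Khmer", "Hebrew", "Cyrillic"):
        print(name, norm_mode_for_script(name))
```

Any script that the list does not name comes back as `None`, which is exactly the gap I am asking about.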

Can someone clarify in the documentation the exact mapping for all the languages available in the tessdata repos?

I find it confusing that the NORM_MODE set in the tesstrain Makefile almost never uses the value meant for Latin-script languages: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101

According to the Makefile, should norm_mode be 2 even for English?

What should be the norm_mode for different languages? #99