What should be the norm_mode for different languages? #99

Open
@girikum

I see that norm_mode is defined with the following values in https://github.com/tesseract-ocr/tesseract/blob/master/src/training/unicharset_extractor.cpp#L103:

1 - combine graphemes (use for Latin and other simple scripts)
2 - split graphemes (use for Indic/Khmer/Myanmar)
3 - pure unicode (use for Arabic/Hebrew/Thai/Tibetan)
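
To make the question concrete, here is a minimal sketch of the mapping as I read those three descriptions. It is not taken from Tesseract itself: the table `NORM_MODE_BY_SCRIPT` and the helper `suggested_norm_mode` are hypothetical names, and the table covers only the scripts named in the help text, so anything else returns `None` rather than a guess.

```python
from typing import Optional

# Hypothetical lookup built only from the three descriptions quoted above.
# It is NOT an official Tesseract mapping; scripts not named there are absent.
NORM_MODE_BY_SCRIPT = {
    "Latin": 1,    # combine graphemes
    "Indic": 2,    # split graphemes
    "Khmer": 2,
    "Myanmar": 2,
    "Arabic": 3,   # pure unicode
    "Hebrew": 3,
    "Thai": 3,
    "Tibetan": 3,
}


def suggested_norm_mode(script: str) -> Optional[int]:
    """Return the norm_mode the help text suggests, or None if the script is not listed."""
    return NORM_MODE_BY_SCRIPT.get(script)


if __name__ == "__main__":
    for name in ("Latin", "Khmer", "Thai", "Cyrillic"):
        print(name, suggested_norm_mode(name))
```

By this reading English (Latin script) would get 1, which is why a different value in the Makefile is surprising.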

Can someone clarify in the documentation the exact norm_mode mapping for all the languages available in the tessdata repos?

I find it confusing that the NORM_MODE defined in the tesstrain Makefile almost never uses the value recommended for Latin-script languages: https://github.com/tesseract-ocr/tesstrain/blob/main/Makefile#L86-L101

Should norm_mode be 2 even for English according to the Makefile?
