LSTM Engine Diplopia Issue and Inaccurate HOCR Character Level Box Dimensions #3477

Open
@woodjohndavid

Environment: Tesseract Latest Master from GitHub, Ubuntu 20.04.2

User References: @bertsky @stweil

Background

The problem named Diplopia (courtesy of @bertsky) is that more than one character appears in the LSTM output character stream for the same physical area of the original image.

I encountered this issue early on in my use of Tesseract, and reported it in the earlier thread #2738. It has also been reported by many others. I then attempted to implement a workaround outside of the Tesseract code itself, using the HOCR output format character level box dimensions to try to identify overlapping characters. This was unsuccessful because, as it turns out, the character level box dimensions are inaccurate for LSTM generally, and are in fact guaranteed to be inaccurate when diplopia occurs.

So I then downloaded the latest Tesseract Master code and embarked on an expedition to try to understand how it works and see if I could come up with a fix for diplopia. The rest of this post documents the key results of my investigation.

Initial Diplopia Fix

I have just now created Pull Request #3476 which I hope is an adequate fix for most diplopia cases. See the PR for more details.

This fix generally follows the current style of RecodeBeamSearch, which attempts to assemble the character level output stream from the lower level LSTM NetworkIO matrix output. This matrix output delivers a set of entries for each timestep in the LSTM process, each entry consisting of a potential matching character and a likelihood score (key) in the range from 0.0 to 1.0.

There is nothing in the current matrix output that identifies the physical location of the possible match in the source image. Consequently, my fix attempts to identify possible diplopia by looking for two matrix output entries in a given timestep which have what could be called a 'meaningful' score, that is, a score that is high enough to indicate it is likely a 'real' match. If two such entries are found in the same timestep, then the fix tries to prevent any beam from subsequently containing both.
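The detection idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not the code in the PR: the list-of-lists layout, the function name `diplopia_candidates`, and the 0.25 threshold are all assumptions made up for the example.

```python
from typing import List, Sequence, Tuple

# One entry per candidate character in a timestep: (character, score in 0.0..1.0).
Entry = Tuple[str, float]


def diplopia_candidates(
    timesteps: Sequence[Sequence[Entry]], threshold: float = 0.25
) -> List[Tuple[int, List[str]]]:
    """Return (timestep index, characters) for every timestep in which
    two or more entries have a score at or above the threshold."""
    flagged = []
    for t, entries in enumerate(timesteps):
        strong = [ch for ch, score in entries if score >= threshold]
        if len(strong) >= 2:
            flagged.append((t, strong))
    return flagged


# Hypothetical matrix output for three timesteps.
timesteps = [
    [("a", 0.92), ("o", 0.03)],
    [("r", 0.48), ("n", 0.41)],  # two 'meaningful' scores in one timestep
    [("e", 0.88), ("c", 0.05)],
]
print(diplopia_candidates(timesteps))
```

A beam search using this information would then refuse to extend any beam with both of the flagged characters for that position; what counts as 'meaningful' is a tuning decision.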

Inaccurate LSTM HOCR Character Level Box Dimensions

I had originally tried to use the HOCR dimensions as a workaround to fix diplopia, but found them inaccurate. I then pursued the diplopia fix above separately from this issue, but I have looked at how these dimensions are created and am of the opinion that the current implementation cannot ever be successful. What it does now consists of three sequential stages:

  1. In the initial image segmenting (before either legacy or LSTM engine is called for recognition) the image is divided into words and blobs within words. I assume (but have not verified) that for the legacy engine each blob is handled individually to try to identify which single character it may match. For the LSTM engine, the blobs are re-assembled back into a word level image, and then handed to the engine for recognition. In both cases, the original blob dimensions are saved for later.
  2. During its recognition process, the LSTM engine processes the word level image in a series of so-called timesteps. In effect, each timestep advances a certain number of pixels from left to right across the image. During the RecodeBeamSearch processing to assemble the output character stream, there is a process which attempts to calculate the character dimensions using the known overall dimensions of the word image plus the known number of timesteps. At the currently defined timestep size, the granularity is too coarse for this process ever to be accurate. I have experimented with reducing the size of the timesteps, which does result in some improvement in the character box dimension accuracy, but at the expense of extra processing time and reduced recognition accuracy. Perhaps retraining is necessary if the timestep size is reduced.
  3. The final stage is the assembly of the HOCR output. If the legacy engine was used, then the original blob dimensions from step 1 are used, which is as good as it can get in that case. If the LSTM engine was used, then there is code which tries to decide whether the original blob dimensions or the calculated ones from step 2 are 'better'. In general, if the number of characters found in a word by LSTM is the same as the original blob count, then it seems to use the blob dimensions and ignore the calculated ones. When diplopia occurs, this mechanism is completely unsuccessful.
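To make the granularity point in stage 2 concrete, here is a simplified sketch of mapping timesteps back to pixels. It assumes a plain linear mapping from timestep index to x coordinate; the function name and the numbers are hypothetical and this is not the actual Tesseract calculation.

```python
from typing import Tuple


def timestep_to_box(
    word_left: int, word_width: int, num_timesteps: int, t_start: int, t_end: int
) -> Tuple[int, int]:
    """Estimate a character's horizontal extent (x0, x1) from the
    half-open timestep range [t_start, t_end), assuming every timestep
    covers the same number of pixels."""
    px_per_step = word_width / num_timesteps
    x0 = word_left + round(t_start * px_per_step)
    x1 = word_left + round(t_end * px_per_step)
    return x0, x1


# Hypothetical word: starts at x=100, 120 px wide, recognised in 30 timesteps,
# so each timestep spans 4 px and a box edge can only land on a 4 px grid.
print(timestep_to_box(100, 120, 30, 5, 8))
```

With these numbers a character edge can be off by up to one whole timestep, which is a large error for narrow glyphs such as 'i' or 'l'.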

Long Term Solution to LSTM Diplopia and Character Box Dimensions

So as it turns out, these issues are in fact related, or at least the solution for both is. What both of them really need is the precise physical image location of the character match being attempted. If the character box dimensions were accurate, then diplopia could be solved either during the RecodeBeamSearch, or after the LSTM engine has done its thing. It would have to be determined how much of a physical overlap would mean we have diplopia, but that could be an easy configuration setting.

As I see it, therefore, the LSTM matrix processing using the NetworkIO interface needs to add to its return values (in addition to the possible character and the likelihood score) the starting pixel location of the possible match, and the horizontal size of the potential match image from the training data. Once that is done, the rest should be relatively straightforward.

Having said that, I have spent a fair bit of time trying to understand the matrix operations, but so far have not worked out how to accomplish the above suggestion. It MUST be the case that somewhere down in there that location information can be retrieved, and I intend to continue to look. But if anybody can give me some hints, it would be appreciated.
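As a rough sketch of what such an overlap check could look like once accurate boxes are available: the function names, the (char, x0, x1) tuple layout, and the 0.5 default for the overlap setting below are all assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

# (character, left x, right x) in reading order.
CharBox = Tuple[str, int, int]


def horizontal_overlap(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Fraction of the narrower box's width that is covered by the
    intersection of the two horizontal extents."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    narrower = min(a[1] - a[0], b[1] - b[0])
    if inter <= 0 or narrower <= 0:
        return 0.0
    return inter / narrower


def find_diplopia(
    chars: Sequence[CharBox], min_overlap: float = 0.5
) -> List[Tuple[str, str]]:
    """Return pairs of adjacent characters whose boxes overlap by at
    least min_overlap (the hypothetical configuration setting)."""
    hits = []
    for (c1, a0, a1), (c2, b0, b1) in zip(chars, chars[1:]):
        if horizontal_overlap((a0, a1), (b0, b1)) >= min_overlap:
            hits.append((c1, c2))
    return hits


# Hypothetical word where 'h' and 'n' were both emitted for the same glyph.
chars = [("T", 10, 30), ("h", 32, 50), ("n", 34, 51), ("e", 53, 70)]
print(find_diplopia(chars))
```

Measuring the overlap against the narrower of the two boxes is one possible choice; it keeps a thin character sitting entirely inside a wide one from slipping under the threshold.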
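For illustration, the extended per-entry return value I have in mind might carry fields like the following. The class name `LocatedMatch` and its field names are hypothetical; nothing like this exists in NetworkIO today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocatedMatch:
    """A candidate match with its (hypothetical) physical location attached."""

    char: str      # possible character
    score: float   # likelihood score, 0.0 to 1.0
    x_start: int   # starting pixel location of the possible match
    width: int     # horizontal size of the potential match, in pixels

    @property
    def x_end(self) -> int:
        return self.x_start + self.width

    def overlaps(self, other: "LocatedMatch") -> bool:
        """True if the two matches share at least one pixel column."""
        return self.x_start < other.x_end and other.x_start < self.x_end


m1 = LocatedMatch("r", 0.48, x_start=40, width=12)
m2 = LocatedMatch("n", 0.41, x_start=42, width=14)
print(m1.overlaps(m2))
```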
