Description
Is your feature request related to a problem? Please describe.
Yes. When parsing a PDF whose font uses ligatures (two or more characters rendered as a single glyph), the ligature is not recognised by either the fast or the hi-res strategy.
In the example PDF (attached), the letters fi are rendered by the font as a single conjoined glyph.
When using the fast strategy, the fi ligature is replaced with \u0000. For example, the word "verifies" becomes "veri\u0000es" and the word "specification" becomes "speci\u0000cation".
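As a possible interim workaround for the fast output, the \u0000 placeholder could be repaired after parsing by trying common ligature expansions against a word list. This is only a minimal sketch under assumptions: the names `LIGATURES`, `repair_word`, and `repair_text` are hypothetical, the vocabulary is something the caller has to supply, and none of this is part of the library.

```python
import re

# Common Latin ligatures, longest first so "ffi" is tried before "fi".
LIGATURES = ["ffi", "ffl", "ff", "fi", "fl"]


def repair_word(word, vocabulary, placeholder="\u0000"):
    """Swap the placeholder for the first ligature expansion that gives a known word."""
    if placeholder not in word:
        return word
    for ligature in LIGATURES:
        candidate = word.replace(placeholder, ligature)
        if candidate.lower() in vocabulary:
            return candidate
    # No expansion produced a known word, so leave the token untouched.
    return word


def repair_text(text, vocabulary, placeholder="\u0000"):
    """Apply repair_word to every run of letters and placeholders in the text."""
    token = re.compile("[A-Za-z" + re.escape(placeholder) + "]+")
    return token.sub(lambda m: repair_word(m.group(0), vocabulary, placeholder), text)


# Example vocabulary; in practice this would be a much larger word list.
vocabulary = {"verifies", "specification"}
repaired = repair_text("It veri\u0000es the speci\u0000cation.", vocabulary)
```

The lookup is deliberately conservative: a token is changed only when an expansion matches a known word, so unrelated \u0000 characters are left as they are.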
When using the hi-res strategy, the fi ligature is omitted entirely. For example, the word "verifies" becomes "veries" and the word "specification" becomes "specication".
Describe the solution you'd like
I understand it is probably not possible to make the fast strategy recognise the ligature.
It would be nice if the hi-res OCR strategy could infer ligatures. It may not get them right every time, but that is probably better than dropping them altogether.
Describe alternatives you've considered
I tried the hi-res strategy, thinking OCR might work, but it didn't.
Additional context
I have attached the source PDF as an example, along with the two JSON result files (fast.json and hi-res.json). In the PDF, every instance of the character combination fi is rendered as a single-character ligature, which results in suboptimal output.