ToUnicode patch corrupts Tibetan text into Latin/ASCII gibberish (GID mismatch) #3

@10zYmon

Description

When running pdf-cmap-fix on the document TI1844-01-001.pdf, the script degrades the text extraction rather than fixing it.
Folder name: IE3KG730
File name: TI1844-01-001.pdf
Fonts: BookmanOldStyle, Ededris-a, Ededris-a1, Ededris-b, Ededris-b1, Ededris-vowa, Jomolhari, Kailasa, MonlamUniOuChan2, Cambria, Kailasa, Calibri, #e6#96#87#e9#bc#8e#e7#b2#97#e6#af#9b#e6#a5#

RAW Output:

(RAW): ༄༅། ། ༧ ས་་དམ་པ་ད་བན་ར་་འགས་ད་ན་གས་

PATCHED Output:

(PATCHED): ༄༅# # ( Tú!Ö!f∞!u!çf!£Áq!të!¤!ÎV<ú!≥f!yq!∫<ú!
The script overwrites the existing ToUnicode map with incorrect Latin/ASCII characters, turning the text into complete gibberish.
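
One way to catch this kind of regression before the patch is written back is to compare how much of the extracted text falls inside the Tibetan Unicode block (U+0F00 to U+0FFF) before and after patching. The sketch below is only an illustration: `tibetan_ratio` and `should_keep_patch` are hypothetical helper names, not part of pdf-cmap-fix, and the check assumes the page is predominantly Tibetan, so it would need adjusting for mixed-script pages.

```python
def tibetan_ratio(text: str) -> float:
    """Fraction of non-whitespace characters inside the Tibetan block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for ch in chars if 0x0F00 <= ord(ch) <= 0x0FFF)
    return tibetan / len(chars)


def should_keep_patch(raw: str, patched: str) -> bool:
    """Hypothetical guard: accept the patch only if it does not lower
    the share of Tibetan characters in the extracted text."""
    return tibetan_ratio(patched) >= tibetan_ratio(raw)


# The two extraction samples from this issue.
raw = "༄༅། ། ༧ ས་་དམ་པ་ད་བན་ར་་འགས་ད་ན་གས་"
patched = "༄༅# # ( Tú!Ö!f∞!u!çf!£Áq!të!¤!ÎV<ú!≥f!yq!∫<ú!"

print(should_keep_patch(raw, patched))  # False
```

With the samples above, the patched text keeps only the two leading Tibetan marks, so the guard rejects the patch and the original ToUnicode map would be left in place.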

Sample page for testing:
sample.pdf

Subtasks

  • Implement glyph name (gname) to Unicode mapping as a fallback when GID to Unicode mapping fails.
  • Implement glyph shape (gshape) to Unicode mapping, keyed on a hash of the glyph outline curves, as a fallback when both of the above fail.
  • Use gname to find the Unicode value for Private Use Area (PUA) cases.
  • Join on gname and gshape to update the gshape mappings.
  • Update the CLI endpoints.
  • Run tests.
  • Document the repo.
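
The fallback order in the subtasks above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repo's implementation: every function name and the dictionary-based maps (`gid_map`, `gname_map`, `shape_map`) are hypothetical, glyph names are parsed only in the common `uniXXXX` / `uXXXX` forms, and the shape key is a plain SHA-1 over rounded outline coordinates, so it only matches glyphs whose outlines are effectively identical.

```python
import hashlib
import re
from typing import Dict, Iterable, Optional, Sequence, Tuple

# One outline segment: (path operator, coordinates). Hypothetical representation.
Segment = Tuple[str, Tuple[float, ...]]

_UNI_RE = re.compile(r"^uni((?:[0-9A-F]{4})+)$")
_U_RE = re.compile(r"^u([0-9A-F]{4,6})$")


def is_pua(text: str) -> bool:
    """True if any character lies in the BMP Private Use Area."""
    return any(0xE000 <= ord(ch) <= 0xF8FF for ch in text)


def unicode_from_gname(gname: str) -> Optional[str]:
    """Decode 'uni0F40' / 'u0F40' style names; ignores suffixes such as '.alt'."""
    base = gname.split(".", 1)[0]
    m = _UNI_RE.match(base)
    if m:
        hexes = m.group(1)
        return "".join(chr(int(hexes[i:i + 4], 16)) for i in range(0, len(hexes), 4))
    m = _U_RE.match(base)
    if m and int(m.group(1), 16) <= 0x10FFFF:
        return chr(int(m.group(1), 16))
    return None


def shape_key(outline: Sequence[Segment]) -> str:
    """Hash the outline with coordinates rounded to one decimal place."""
    parts = []
    for op, coords in outline:
        parts.append(op + ":" + ",".join("%.1f" % c for c in coords))
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def resolve_unicode(
    gid: int,
    gname: Optional[str],
    outline: Optional[Sequence[Segment]],
    gid_map: Dict[int, str],
    gname_map: Dict[str, str],
    shape_map: Dict[str, str],
) -> Tuple[Optional[str], str]:
    """Try GID, then glyph name, then glyph shape; PUA results fall through."""
    text = gid_map.get(gid)
    if text and not is_pua(text):
        return text, "gid"
    if gname:
        text = gname_map.get(gname) or unicode_from_gname(gname)
        if text and not is_pua(text):
            return text, "gname"
    if outline:
        text = shape_map.get(shape_key(outline))
        if text:
            return text, "gshape"
    return None, "unresolved"


def extend_shape_map(
    glyphs: Iterable[Tuple[str, Sequence[Segment]]],
    shape_map: Dict[str, str],
) -> None:
    """Join gname and gshape: record shape -> Unicode for name-resolvable glyphs."""
    for gname, outline in glyphs:
        text = unicode_from_gname(gname)
        if text and not is_pua(text):
            shape_map.setdefault(shape_key(outline), text)
```

Returning the source of each mapping (`"gid"`, `"gname"`, `"gshape"`, or `"unresolved"`) alongside the text makes it easier to report which fallback produced a given character when comparing RAW and PATCHED output.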

Metadata

Labels

No labels

Type

No fields configured for Task.

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests