Erroneous Whitespace in Text Extraction #3219

Open
@hackowitz-af

Description

Erroneous whitespace is added within words during text extraction. The errors are inconsistent and not always possible (and never easy) to resolve post-extraction. For example, the text WORD NAME: MISSION STORE MONITOR.RESERVED (verified by copy/pasting from Acrobat) is extracted as WO RD NAME: M ISSION S TORE MONITOR.RESERVED.
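As a rough way to quantify the problem, the extracted text can be compared against a trusted reference (for example, the Acrobat copy) to find exactly where whitespace was inserted. The sketch below is only an illustration: `spurious_space_positions` is a hypothetical helper, not part of pypdf, and it assumes the two strings differ only by extra whitespace in the extracted one.

```python
from typing import List


def spurious_space_positions(reference: str, extracted: str) -> List[int]:
    """Return indices into `extracted` of whitespace with no counterpart in `reference`.

    Hypothetical helper (not a pypdf API). Raises ValueError if the strings
    differ by anything other than whitespace added to `extracted`.
    """
    positions = []
    i = 0  # index of the next unmatched character in `reference`
    for j, ch in enumerate(extracted):
        if i < len(reference) and ch == reference[i]:
            i += 1  # characters agree, advance in the reference
        elif ch.isspace():
            positions.append(j)  # extra whitespace only present in `extracted`
        else:
            raise ValueError(f"non-whitespace mismatch at extracted[{j}]: {ch!r}")
    if i != len(reference):
        raise ValueError("extracted text is missing characters from the reference")
    return positions


reference = "WORD NAME: MISSION STORE MONITOR.RESERVED"
extracted = "WO RD NAME: M ISSION S TORE MONITOR.RESERVED"
print(spurious_space_positions(reference, extracted))  # prints [2, 13, 22]
```

This only helps when a clean reference exists, which is exactly what is missing for most pages, so it is useful for building test cases rather than for repairing output.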

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-5.10.228-219.884.amzn2.x86_64-x86_64-with-glibc2.35

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.1, crypt_provider=('cryptography', '3.4.8'), PIL=10.2.0

Code + PDF

This is a minimal, complete example that shows the issue:

import pypdf

reader = pypdf.PdfReader('./spacey-clean.pdf')
page, = reader.pages  # the sample PDF has exactly one page
page.extract_text()
'  01 January 1969 \n \n5-\n208 \nABCD-01234-012 Revision A  \n5.16.6   WO RD NAME: MISSION STORE MONITOR.RESERVED \n  CATEGORY: N/A \nWORD ID: MAX VALUE: N/A \nSOURCE(s): MIN VALUE: N/A \nDEST(s): RESOLUTION: N/A \nCOMP RATE: ACCURACY: N/A \nXMIT RATE: MSB: N/A \nSIGNAL TYPE: LSB: N/A \nUNITS: \n69Q-04...20 \nWeapon \nA/C or Carriage System \nN/A \nAperiodic \nN/A \nN/A \nFULL SCALE: N/A \n FIELD NAME BIT NO. DESCRIPTION \n \nReserved  - 00 -0  \n \n  - 01 -0 \n \n  - 02 -0 \n \n  - 03 -0 \n \n  - 04 -0 \n \n  - 05 -0 \n \n  - 06 -0 \n \n  - 07 -0 \n    \n  - 08 -0 \n \n  - 09 -0 \n \n  - 10 -0 \n \n  - 11 -0 \n \n  - 12 -0 \n \n  - 13 -0 \n \n  - 14 -0 \n \n  - 15 -0  \n \nREMARKS/NOTES: \n1.  Reserved per MIL-STD-1760 \n \n \n  '

Notice the whitespace within "WORD". The issue gets worse when extraction_mode='layout':

page.extract_text(extraction_mode='layout')
'ABCD-01234-012 Revision              A                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               01 January     1969\n\n5.16.6   WO  RD NAME: M ISSION S  TORE MONITOR.RESERVED\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                           CATEGORY:                                                                                                                                                                                                                                                                                                                         N/A\n                WORD ID:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          MAX VALUE:                                                                                                                                                                            N/A69Q-04...20\n                SOURCE(s):                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     MIN VALUE:                                                                                                                                                                                            N/AWeapon\n                DEST(s):                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     RESOLUTION:                                                                                                                                                                N/AA/C or Ca rriage S ystem\n                COMP RAT E:                                                                          
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ACCURACY:                                                                                                                                                                                             N/AN/A\n                XMIT RATE:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           MSB:                                                                                                                                                                   
                                                                                                                                                                    N/AAperiodic\n                SIGNAL TYPE:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   LSB:                                                                                                                                                                                                                                                                                                                                                  N/AN/A\n                UNITS:                                                                                                                                                                                                                                                                                                  N/A                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                    FULL SCALE:                                                                                                                                                                               N/A\n                                                                                                                                   FIELD NAME                                                                                                                                                                                                                                                                           BIT NO.                                                                                                                                                                                                                                                                                                                                                                                                                                                DESCRIPTION\n\n                Reserved                                                                                                                                                                                                                                                                                                                                                                   - 00 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                         
                                                                                                                                               - 01 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 02 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 03 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 04 -0\n\n                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                           - 05 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 06 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 07 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 08 -0\n\n                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                       - 09 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 10 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 11 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 12 -0\n\n     
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   - 13 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 14 -0\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        - 15 -0\n\n                                REMARKS/NOTES:\n                1.  
Reserved per MIL                              -STD-1760\n\n\n\n\n\n\n\n\n\n\n\n\n\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           5-208'

Not only are there more whitespace errors, but the horizontal spacing is also not representative of the source document.
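To compare the layout output against the plain output without the column padding getting in the way, a small helper along these lines may help. It is only a sketch: `collapse_layout_padding` is a hypothetical name, not part of pypdf, and it cannot tell an erroneous intra-word space from a legitimate one. It only shrinks runs of two or more spaces or tabs.

```python
import re


def collapse_layout_padding(text: str) -> str:
    """Collapse runs of 2+ spaces/tabs on each line to a single space.

    Hypothetical helper (not a pypdf API). Single spaces, including the
    erroneous ones inside words, are left untouched.
    """
    lines = [re.sub(r"[ \t]{2,}", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


sample = "5.16.6   WO  RD NAME: M ISSION S  TORE MONITOR.RESERVED"
print(collapse_layout_padding(sample))
# prints: 5.16.6 WO RD NAME: M ISSION S TORE MONITOR.RESERVED
```

The intra-word spaces survive the collapse, which shows that they cannot be removed by whitespace normalization alone.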

I have modified the original text of the document to make it publicly releasable. Feel free to use it in tests or ask for more examples. I have roughly 100k pages of examples.
spacey-clean.pdf
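For anyone triaging this, one possible way to see how the text arrives in pieces is the `visitor_text` callback of `extract_text`, which pypdf calls with the text, the current transformation matrix, the text matrix, the font dictionary, and the font size. The sketch below assumes `page` is a pypdf `PageObject` and that plain extraction mode is used; `collect_fragments` is a hypothetical helper name. Whether the fragment boundaries actually line up with the spurious spaces in this file is something I have not confirmed.

```python
def collect_fragments(page):
    """Record every text fragment reported during plain-mode extraction.

    Hypothetical helper: `page` is assumed to be a pypdf PageObject (or any
    object whose extract_text accepts a `visitor_text` callback).
    Returns a list of (text, cm, tm, font_size) tuples.
    """
    fragments = []

    def visitor(text, cm, tm, font_dict, font_size):
        # Keep copies of both matrices so positions can be inspected later.
        fragments.append((text, list(cm), list(tm), font_size))

    page.extract_text(visitor_text=visitor)
    return fragments


def dump_fragments(fragments):
    """Print each fragment with repr() so stray spaces are visible."""
    for text, cm, tm, font_size in fragments:
        print(repr(text), cm, tm, font_size)
```

With the attached file this would be called as `dump_fragments(collect_fragments(page))`, using the `page` object from the example above.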

    Labels

    whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
    workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.
