Description
Erroneous whitespace is added within words during text extraction. The errors are inconsistent and not always possible (and never easy) to resolve post-extraction. For example, the text MISSION STORE MONITOR.RESERVED (verified by copy/pasting from Acrobat) is extracted as WO RD NAME: M ISSION S TORE MONITOR.RESERVED
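To illustrate why post-extraction cleanup is unreliable, here is a rough sketch of a naive repair heuristic. The helper `naive_join` is hypothetical (it is not part of pypdf), and the "PLAN A FAILED" input is a made-up string used only to show a false positive.

```python
import re

def naive_join(text: str) -> str:
    """Hypothetical cleanup: glue a lone capital letter onto the
    uppercase fragment that follows it (e.g. "M ISSION" -> "MISSION")."""
    return re.sub(r"\b([A-Z]) (?=[A-Z]{2,})", r"\1", text)

# Fixes the single-letter splits from the layout-mode output, but
# cannot repair "WO RD", where neither fragment is a single letter.
print(naive_join("WO RD NAME: M ISSION S TORE"))

# Made-up input: legitimate text is damaged by the same rule.
print(naive_join("PLAN A FAILED"))
```

Any rule broad enough to catch "WO RD" would damage even more legitimate text, which is why the fix seems to belong in the extraction step rather than in cleanup afterwards.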
Environment
Which environment were you using when you encountered the problem?
$ python -m platform
Linux-5.10.228-219.884.amzn2.x86_64-x86_64-with-glibc2.35
$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.3.1, crypt_provider=('cryptography', '3.4.8'), PIL=10.2.0
Code + PDF
This is a minimal, complete example that shows the issue:
import pypdf

reader = pypdf.PdfReader('./spacey-clean.pdf')
page, = reader.pages  # unpacking requires the PDF to have exactly one page
page.extract_text()
' 01 January 1969 \n \n5-\n208 \nABCD-01234-012 Revision A \n5.16.6 WO RD NAME: MISSION STORE MONITOR.RESERVED \n CATEGORY: N/A \nWORD ID: MAX VALUE: N/A \nSOURCE(s): MIN VALUE: N/A \nDEST(s): RESOLUTION: N/A \nCOMP RATE: ACCURACY: N/A \nXMIT RATE: MSB: N/A \nSIGNAL TYPE: LSB: N/A \nUNITS: \n69Q-04...20 \nWeapon \nA/C or Carriage System \nN/A \nAperiodic \nN/A \nN/A \nFULL SCALE: N/A \n FIELD NAME BIT NO. DESCRIPTION \n \nReserved - 00 -0 \n \n - 01 -0 \n \n - 02 -0 \n \n - 03 -0 \n \n - 04 -0 \n \n - 05 -0 \n \n - 06 -0 \n \n - 07 -0 \n \n - 08 -0 \n \n - 09 -0 \n \n - 10 -0 \n \n - 11 -0 \n \n - 12 -0 \n \n - 13 -0 \n \n - 14 -0 \n \n - 15 -0 \n \nREMARKS/NOTES: \n1. Reserved per MIL-STD-1760 \n \n \n '
Notice the whitespace within "WORD". The issue gets worse when extraction_mode='layout':
page.extract_text(extraction_mode='layout')
'ABCD-01234-012 Revision A 01 January 1969\n\n5.16.6 WO RD NAME: M ISSION S TORE MONITOR.RESERVED\n CATEGORY: N/A\n WORD ID: MAX VALUE: N/A69Q-04...20\n SOURCE(s): MIN VALUE: N/AWeapon\n DEST(s): RESOLUTION: N/AA/C or Ca rriage S ystem\n COMP RAT E: ACCURACY: N/AN/A\n XMIT RATE: MSB: N/AAperiodic\n SIGNAL TYPE: LSB: N/AN/A\n UNITS: N/A FULL SCALE: N/A\n FIELD NAME BIT NO. DESCRIPTION\n\n Reserved - 00 -0\n\n - 01 -0\n\n - 02 -0\n\n - 03 -0\n\n - 04 -0\n\n - 05 -0\n\n - 06 -0\n\n - 07 -0\n\n - 08 -0\n\n - 09 -0\n\n - 10 -0\n\n - 11 -0\n\n - 12 -0\n\n - 13 -0\n\n - 14 -0\n\n - 15 -0\n\n REMARKS/NOTES:\n 1. Reserved per MIL -STD-1760\n\n\n\n\n\n\n\n\n\n\n\n\n\n 5-208'
Not only are there more whitespace errors, but the horizontal spacing is also not representative of the source document.
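When the expected text is known, affected pages can be flagged automatically. The following is a minimal sketch, not a pypdf feature: `whitespace_mismatch` is a hypothetical helper, and the sample strings are short excerpts of the two outputs above.

```python
import re

def whitespace_mismatch(expected: str, extracted: str) -> bool:
    """Return True if `expected` occurs in `extracted` only when
    all whitespace is ignored (hypothetical helper)."""
    # Collapse whitespace runs so ordinary line breaks do not count.
    normalized = re.sub(r"\s+", " ", extracted)
    if expected in normalized:
        return False
    # Remove all whitespace on both sides; a match here means the
    # characters are present but whitespace was added or dropped.
    def squeeze(s: str) -> str:
        return re.sub(r"\s+", "", s)
    return squeeze(expected) in squeeze(extracted)

plain = "5.16.6 WO RD NAME: MISSION STORE MONITOR.RESERVED \n CATEGORY: N/A"
layout = "5.16.6 WO RD NAME: M ISSION S TORE MONITOR.RESERVED\n CATEGORY: N/A"

print(whitespace_mismatch("WORD NAME:", plain))      # True
print(whitespace_mismatch("MISSION STORE", plain))   # False
print(whitespace_mismatch("MISSION STORE", layout))  # True
```

Note that the check cannot tell added whitespace from dropped whitespace; it only reports that the spacing differs from the expected text.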
I have modified the original text of the document to make it publicly releasable. Feel free to use it in tests or ask for more examples. I have something around 100k pages of examples.
spacey-clean.pdf