@eureka928
Contributor

@eureka928 eureka928 commented Jan 27, 2026

Closes #3983


Summary

This PR fixes an issue where newline characters were being incorrectly stripped from Table and TableChunk elements during PDF partitioning. The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied indiscriminately to all Text elements, including tables, which removed newlines that carry structural meaning (such as row separation).

Changes

  • unstructured/partition/pdf.py: Added conditional logic to skip whitespace normalization for Table and TableChunk elements, preserving newlines that convey tabular structure
  • CHANGELOG.md: Added entry documenting the fix
  • unstructured/__version__.py: Version bump to 0.18.33

Problem

When processing PDFs (especially image-based PDFs with tables), the code applied this regex substitution to all Text elements:

el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()

This stripped meaningful line breaks from table content, degrading the structural representation of tabular data.
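The effect can be reproduced outside the library with a minimal sketch. It assumes the pattern behaves like `\s+` (any run of whitespace, newlines included); the constant below is a stand-in defined locally, not imported from unstructured, and the sample table text is hypothetical.

```python
import re

# Assumption: local stand-in for unstructured's RE_MULTISPACE_INCLUDING_NEWLINES,
# taken here to match any run of whitespace, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Hypothetical extracted text for a small table, one row per line.
table_text = "Item  Qty\nApple  3\nPear  5"

# Same substitution as in the PR description: every whitespace run,
# including the row-separating newlines, becomes a single space.
flattened = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", table_text or "").strip()

print(repr(flattened))  # 'Item Qty Apple 3 Pear 5'
```

Once the rows are flattened into one line, there is no reliable way to tell where one row ended and the next began.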

Solution

Added a check to exclude Table and TableChunk elements from the whitespace normalization:

# Skip newline normalization for Table/TableChunk - newlines carry structural meaning
if not isinstance(el, (Table, TableChunk)):
    el.text = re.sub(
        RE_MULTISPACE_INCLUDING_NEWLINES,
        " ",
        el.text or "",
    ).strip()

@eureka928 eureka928 force-pushed the fix/preserve-table-newlines branch 2 times, most recently from 3fc5a33 to 318330b Compare January 27, 2026 18:21
@badGarnet
Collaborator

The ingest test is failing because runs of multiple whitespace characters used to be replaced with a single space, but now they are left as they are, which changes the extracted text.

  • The ticket only asked for newlines to be preserved, and that seems reasonable for tables.
  • But multiple whitespace characters (excluding newlines, e.g. two consecutive spaces) should still be replaced with a single space to improve the readability of the extracted content.
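One way to satisfy both points could look like the sketch below. It is not the change that was merged: the `Text`, `Table`, and `TableChunk` classes are minimal stand-ins for the library's element types, `RE_HORIZONTAL_SPACE` and `normalize_whitespace` are hypothetical names, and `RE_MULTISPACE_INCLUDING_NEWLINES` is again assumed to behave like `\s+`.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: stand-in for the library constant (any whitespace run).
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")
# Hypothetical pattern: runs of whitespace that are NOT newlines.
RE_HORIZONTAL_SPACE = re.compile(r"[^\S\n]+")


@dataclass
class Text:  # stand-in for the library's Text element
    text: Optional[str] = None


class Table(Text):  # stand-in
    pass


class TableChunk(Table):  # stand-in
    pass


def normalize_whitespace(el: Text) -> None:
    """Collapse whitespace in place, keeping newlines for table elements."""
    text = el.text or ""
    if isinstance(el, (Table, TableChunk)):
        # Collapse spaces/tabs within each row, but keep the row-separating newlines.
        rows = (RE_HORIZONTAL_SPACE.sub(" ", row).strip() for row in text.split("\n"))
        el.text = "\n".join(rows).strip()
    else:
        # Non-table text: collapse every whitespace run, newlines included.
        el.text = RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", text).strip()


table = Table("Item   Qty\nApple \t 3\n")
para = Text("Hello   \n world")
normalize_whitespace(table)
normalize_whitespace(para)
print(repr(table.text))  # 'Item Qty\nApple 3'
print(repr(para.text))   # 'Hello world'
```

Splitting on `"\n"` and stripping each row also removes trailing spaces before a line break, so row boundaries stay intact while the text within each row is normalized the same way as before.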

eureka928 and others added 3 commits January 28, 2026 03:55
…rtitioning

The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied to all Text
elements, including Table and TableChunk. This incorrectly removed newline
characters that carry structural meaning in tables (row separation).

Fixes Unstructured-IO#3983
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@eureka928 eureka928 force-pushed the fix/preserve-table-newlines branch from 9fb28db to 4be5dc7 Compare January 28, 2026 03:06
Development

Successfully merging this pull request may close these issues.

bug/RE_MULTISPACE_INCLUDING_NEWLINES was incorrectly used for Table or TableChunk
