bug/RE_MULTISPACE_INCLUDING_NEWLINES was incorrectly used for Table or TableChunk #3983

Open
@JIAQIA

🐞 Describe the bug

RE_MULTISPACE_INCLUDING_NEWLINES is applied to all elements of the Text category after partitioning PDF files. The relevant code is shown below:

out_elements = []
for el in elements:
    if isinstance(el, PageBreak) and not include_page_breaks:
        continue

    if isinstance(el, Image):
        out_elements.append(cast(Element, el))
    # NOTE(crag): this is probably always a Text object, but check for the sake of typing
    elif isinstance(el, Text):
        el.text = re.sub(
            RE_MULTISPACE_INCLUDING_NEWLINES,
            " ",
            el.text or "",
        ).strip()
        if el.text or isinstance(el, PageBreak):
            out_elements.append(cast(Element, el))

File path: unstructured/partition/pdf.py

However, when the element is a Table or TableChunk, the newline character "\n" is significant and should be preserved rather than collapsed into a space.
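To illustrate the effect on table text, here is a minimal, stdlib-only sketch. It assumes the pattern behaves like `\s+` (the real definition of RE_MULTISPACE_INCLUDING_NEWLINES lives in unstructured and may differ), and the sample table text is invented for the demonstration.

```python
import re

# Assumption: stand-in for unstructured's RE_MULTISPACE_INCLUDING_NEWLINES.
# Any run of whitespace, newlines included, counts as one match.
MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Hypothetical table text with one row per line.
table_text = "Name  Qty\nApple  3\nPear   5"

# Same substitution as in the snippet quoted above.
flattened = MULTISPACE_INCLUDING_NEWLINES.sub(" ", table_text).strip()

print(repr(table_text))
print(repr(flattened))
```

After the substitution the three lines are merged into a single line, so the line boundaries in the table text can no longer be recovered.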


🔁 To Reproduce

  1. Use partition_pdf on an image-based PDF that includes a table.
  2. Observe that the newline characters within table content are removed by the above code.

Expected behavior

Newlines should not be removed from Table or TableChunk elements.
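One possible shape for the expected behavior is sketched below. This is not a patch against unstructured: `Text`, `Table`, and `TableChunk` are simplified stand-in classes, both regex patterns are assumptions, and `normalize_text` is a hypothetical helper. The idea is to collapse whitespace within each line for table elements while keeping the line breaks, and to leave the handling of other Text elements unchanged.

```python
import re
from dataclasses import dataclass


# Simplified stand-ins for the element classes (assumption: Table and
# TableChunk are kinds of Text, which is why they hit the Text branch).
@dataclass
class Text:
    text: str = ""


@dataclass
class Table(Text):
    pass


@dataclass
class TableChunk(Table):
    pass


# Assumption: stand-in for RE_MULTISPACE_INCLUDING_NEWLINES.
MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")
# Hypothetical pattern: runs of whitespace that do not contain a newline.
MULTISPACE_WITHIN_LINE = re.compile(r"[^\S\n]+")


def normalize_text(el: Text) -> Text:
    """Collapse whitespace, preserving newlines for table elements."""
    if isinstance(el, Table):  # also matches TableChunk
        lines = (
            MULTISPACE_WITHIN_LINE.sub(" ", line).strip()
            for line in el.text.split("\n")
        )
        # Drop lines that are empty after stripping; keep one "\n" per row.
        el.text = "\n".join(line for line in lines if line)
    else:
        el.text = MULTISPACE_INCLUDING_NEWLINES.sub(" ", el.text).strip()
    return el


paragraph = normalize_text(Text("first line \n  second line"))
table = normalize_text(Table("Name  Qty\nApple   3\n"))
print(repr(paragraph.text))
print(repr(table.text))
```

A simpler alternative would be to skip the substitution entirely for table elements; which of the two is preferable depends on whether intra-line spacing in table text matters to downstream consumers.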


🧰 Environment Info

  • OS version: macOS 13.6.7 (arm64)
  • Python version: 3.10.14
  • unstructured version: None
  • unstructured-inference version: 0.7.36
  • pytesseract version: 0.3.10
  • Torch version: 2.3.0
  • Detectron2: Not installed
  • PaddleOCR: Not installed
  • Libmagic version: libmagic 5.46 (bottled)
  • LibreOffice version: 25.2.1

⚠️ Note: There were warnings about pip version checks failing.


📌 Additional context

The issue likely arises from applying the regex substitution to all Text elements indiscriminately, including those derived from tables, where "\n" conveys meaningful structure.
