🐞 Describe the bug
RE_MULTISPACE_INCLUDING_NEWLINES is applied to all elements of the Text category after partitioning PDF files. The relevant code is shown below:

    out_elements = []
    for el in elements:
        if isinstance(el, PageBreak) and not include_page_breaks:
            continue
        if isinstance(el, Image):
            out_elements.append(cast(Element, el))
        # NOTE(crag): this is probably always a Text object, but check for the sake of typing
        elif isinstance(el, Text):
            el.text = re.sub(
                RE_MULTISPACE_INCLUDING_NEWLINES,
                " ",
                el.text or "",
            ).strip()
            if el.text or isinstance(el, PageBreak):
                out_elements.append(cast(Element, el))

File path: unstructured/partition/pdf.py
However, if the element is a Table or TableChunk, the newline character "\n" is important and should not be removed in this context.
🔁 To Reproduce
- Use partition_pdf on an image-based PDF that includes a table.
- Observe that the newline characters within table content are removed by the above code.
✅ Expected behavior
Newlines should not be removed from Table or TableChunk elements.
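
One possible way to get this behavior is to skip the whitespace normalization for table elements. The following is a minimal, self-contained sketch and not the library's actual code: the Text, Table, and TableChunk classes are simplified stand-ins for the ones in unstructured, the regex is assumed to be equivalent to \s+, and clean_elements is a hypothetical helper name. The page-break and image handling from the original loop is left out for brevity.

```python
import re
from dataclasses import dataclass

# Assumption: the library's pattern matches any whitespace run, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


@dataclass
class Text:
    """Simplified stand-in for unstructured's Text element."""
    text: str


class Table(Text):
    """Simplified stand-in for Table."""


class TableChunk(Table):
    """Stand-in for TableChunk, modeled here as a Table subclass (assumption)."""


def clean_elements(elements):
    """Hypothetical helper: collapse whitespace in Text, but leave tables alone."""
    out_elements = []
    for el in elements:
        if isinstance(el, (Table, TableChunk)):
            # Only trim the ends so row-separating newlines survive.
            el.text = (el.text or "").strip()
        else:
            el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()
        if el.text:
            out_elements.append(el)
    return out_elements


cleaned = clean_elements([
    Text("Some  paragraph\ntext"),
    Table("Name Qty\nApple 3\n"),
])
print([el.text for el in cleaned])
```

With this approach, ordinary text is still flattened to single spaces, while the table keeps its internal "\n". Whether tables should also have repeated spaces within a row collapsed is a separate design choice that this sketch does not address.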
🧰 Environment Info
- OS version: macOS 13.6.7 (arm64)
- Python version: 3.10.14
- unstructured version: None
- unstructured-inference version: 0.7.36
- pytesseract version: 0.3.10
- Torch version: 2.3.0
- Detectron2: Not installed
- PaddleOCR: Not installed
- Libmagic version: libmagic 5.46 (bottled)
- LibreOffice version: 25.2.1
⚠️ Note: There were warnings about pip version checks failing.
📌 Additional context
The issue likely arises from applying the regex substitution to all Text elements indiscriminately, including those derived from tables, where "\n" conveys meaningful structure.
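
To make the effect concrete, here is a small illustration that uses only the standard library. It assumes RE_MULTISPACE_INCLUDING_NEWLINES behaves like the pattern \s+ (the actual definition in unstructured is not quoted in this report), and the sample table text is made up.

```python
import re

# Assumption: stand-in for the library constant; matches any run of whitespace.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Made-up OCR output for a two-column table, one row per line.
table_text = "Name Qty\nApple 3\nPear 12"

flattened = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", table_text).strip()

print(table_text.split("\n"))  # rows are recoverable before cleaning
print(flattened.split("\n"))   # a single item afterwards: row boundaries are gone
```

After the substitution, nothing in the string distinguishes a row break from a cell separator, so downstream code can no longer split the table text back into rows.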