Open
Labels: bug (Something isn't working)
Description
🐞 Describe the bug
RE_MULTISPACE_INCLUDING_NEWLINES is applied to all elements of the Text category after partitioning PDF files. The relevant code is shown below:
    out_elements = []
    for el in elements:
        if isinstance(el, PageBreak) and not include_page_breaks:
            continue
        if isinstance(el, Image):
            out_elements.append(cast(Element, el))
        # NOTE(crag): this is probably always a Text object, but check for the sake of typing
        elif isinstance(el, Text):
            el.text = re.sub(
                RE_MULTISPACE_INCLUDING_NEWLINES,
                " ",
                el.text or "",
            ).strip()
            if el.text or isinstance(el, PageBreak):
                out_elements.append(cast(Element, el))

File path: unstructured/partition/pdf.py
However, if the element is a Table or TableChunk, the newline character "\n" is important and should not be removed in this context.
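To make the effect concrete, here is a minimal sketch of what the substitution does to table-like text. The pattern is an assumed stand-in for RE_MULTISPACE_INCLUDING_NEWLINES (any run of whitespace, newlines included), and the sample string is made up for illustration; it is not output from a real PDF.

```python
import re

# Assumed stand-in for RE_MULTISPACE_INCLUDING_NEWLINES:
# any run of whitespace, newlines included.
MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")

# Hypothetical OCR text for a small table, one row per line.
table_text = "Name  Qty\nApple  3\nPear   5"

# Same call shape as in the snippet above: collapse whitespace runs, then strip.
flattened = re.sub(MULTISPACE_INCLUDING_NEWLINES, " ", table_text).strip()

print(flattened)  # Name Qty Apple 3 Pear 5
```

After the substitution the row boundaries are gone, so the table can no longer be split back into rows from the text alone.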
🔁 To Reproduce
- Use partition_pdf on an image-based PDF that includes a table.
- Observe that the newline characters within table content are removed by the above code.
✅ Expected behavior
Newlines should not be removed from Table or TableChunk elements.
🖼 Screenshots
If applicable, add screenshots to help illustrate the issue.
🧰 Environment Info
- OS version: macOS 13.6.7 (arm64)
- Python version: 3.10.14
- unstructured version: None
- unstructured-inference version: 0.7.36
- pytesseract version: 0.3.10
- Torch version: 2.3.0
- Detectron2: Not installed
- PaddleOCR: Not installed
- Libmagic version: libmagic 5.46 (bottled)
- LibreOffice version: 25.2.1
⚠️ Note: There were warnings about pip version checks failing.
📌 Additional context
The issue likely arises from applying the regex substitution to all Text elements indiscriminately, including those derived from tables, where "\n" conveys meaningful structure.
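One possible direction, sketched under stated assumptions: branch on the element type and keep newlines for table elements while still collapsing spaces and tabs inside each row. The classes below are simplified stand-ins for the library's Text, Table, and TableChunk, the first regex is an assumed equivalent of RE_MULTISPACE_INCLUDING_NEWLINES, and both RE_MULTISPACE_EXCLUDING_NEWLINES and normalize_whitespace are hypothetical names. Dropping blank rows is also an assumption, not something the report asks for.

```python
import re
from dataclasses import dataclass

# Assumed equivalent of the library constant: any whitespace run, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")
# Hypothetical helper pattern: whitespace runs that contain no newline.
RE_MULTISPACE_EXCLUDING_NEWLINES = re.compile(r"[^\S\n]+")


# Simplified stand-ins for the real element classes (illustration only).
@dataclass
class Text:
    text: str


class Table(Text):
    pass


class TableChunk(Text):
    pass


def normalize_whitespace(el: Text) -> Text:
    """Collapse whitespace, but keep row-separating newlines in table elements."""
    if isinstance(el, (Table, TableChunk)):
        rows = [
            RE_MULTISPACE_EXCLUDING_NEWLINES.sub(" ", row).strip()
            for row in (el.text or "").splitlines()
        ]
        # Assumption: empty rows carry no structure and can be dropped.
        el.text = "\n".join(row for row in rows if row)
    else:
        el.text = RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", el.text or "").strip()
    return el


table = normalize_whitespace(Table("Name  Qty\nApple   3\n\nPear\t5  "))
para = normalize_whitespace(Text("Hello \n  world"))
print(repr(table.text))
print(repr(para.text))
```

With this split, ordinary Text elements are normalized as before, and only table elements keep their line structure.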