Commit 3fc5a33

fix: preserve newlines in Table and TableChunk elements during PDF partitioning
The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied to all Text elements, including Table and TableChunk. This incorrectly removed newline characters that carry structural meaning in tables (row separation). Fixes #3983
1 parent 4bbb1ff commit 3fc5a33

2 files changed: +12 -5 lines changed

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -3,6 +3,9 @@
 ### Enhancements
 - put `pdfium` calls behind a thread lock

+### Fixes
+- **Preserve newlines in Table/TableChunk elements during PDF partitioning**: Skip whitespace normalization for Table and TableChunk elements so newlines that carry structural meaning (row separation) are preserved (fixes #3983)
+
 ## 0.18.31

 ### Enhancements

unstructured/partition/pdf.py

Lines changed: 9 additions & 5 deletions
@@ -34,6 +34,8 @@
     Link,
     ListItem,
     PageBreak,
+    Table,
+    TableChunk,
     Text,
     Title,
 )
@@ -823,11 +825,13 @@ def _partition_pdf_or_image_local(
                 out_elements.append(cast(Element, el))
             # NOTE(crag): this is probably always a Text object, but check for the sake of typing
             elif isinstance(el, Text):
-                el.text = re.sub(
-                    RE_MULTISPACE_INCLUDING_NEWLINES,
-                    " ",
-                    el.text or "",
-                ).strip()
+                # Skip newline normalization for Table/TableChunk - newlines carry structural meaning
+                if not isinstance(el, (Table, TableChunk)):
+                    el.text = re.sub(
+                        RE_MULTISPACE_INCLUDING_NEWLINES,
+                        " ",
+                        el.text or "",
+                    ).strip()
                 if el.text or isinstance(el, PageBreak):
                     out_elements.append(cast(Element, el))
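
The effect of the new guard in `unstructured/partition/pdf.py` can be sketched in isolation. The snippet below is a minimal illustration, not the library's code: the `Text`, `Table`, and `TableChunk` classes are hypothetical stand-ins (assuming, as the commit message implies, that table elements are also `Text` instances), `normalize` is a made-up helper name, and the `\s+` pattern is an assumed substitute for `RE_MULTISPACE_INCLUDING_NEWLINES`, whose definition this diff does not show.

```python
import re

# Assumption: stand-in for RE_MULTISPACE_INCLUDING_NEWLINES. The real pattern
# lives in the library and is not part of this diff.
MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


class Text:
    """Hypothetical minimal stand-in for a text element."""

    def __init__(self, text: str) -> None:
        self.text = text


class Table(Text):
    """Stand-in: assumed to subclass Text, which is why the old code reached it."""


class TableChunk(Table):
    """Stand-in for a chunked table element."""


def normalize(el: Text) -> Text:
    """Collapse whitespace runs to one space, except in table elements."""
    if not isinstance(el, (Table, TableChunk)):
        el.text = MULTISPACE_INCLUDING_NEWLINES.sub(" ", el.text or "").strip()
    return el


paragraph = normalize(Text("Quarterly  results\nwere strong.\n"))
table = normalize(Table("Region Sales\nNorth 10\nSouth 12"))

print(repr(paragraph.text))  # 'Quarterly results were strong.'
print(repr(table.text))      # 'Region Sales\nNorth 10\nSouth 12'
```

Ordinary text still has its line breaks collapsed, while the table keeps one row per line, so downstream consumers can continue to split rows on `"\n"`.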
