Commit 318330b

fix: preserve newlines in Table and TableChunk elements during PDF partitioning
The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied to all Text elements, including Table and TableChunk. This incorrectly removed newline characters that carry structural meaning in tables (row separation). Fixes #3983
1 parent 4bbb1ff commit 318330b
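
To make the problem in the commit message concrete, here is a minimal sketch of what whitespace normalization does to table-like text. The definition of `RE_MULTISPACE_INCLUDING_NEWLINES` does not appear in this diff, so the pattern below (`\s+`) is an assumption, and `normalize_whitespace` is a hypothetical helper that mirrors the `re.sub(...).strip()` call in `pdf.py`.

```python
import re

# Assumption: the real constant is not shown in this commit; `\s+` is a guess
# at a pattern that matches runs of whitespace, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


def normalize_whitespace(text):
    """Hypothetical helper mirroring the re.sub(...).strip() call in pdf.py."""
    return re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", text or "").strip()


# Three table rows separated by newlines.
table_text = "Name  Qty\nApple  3\nPear   5"

flattened = normalize_whitespace(table_text)
print(flattened)  # Name Qty Apple 3 Pear 5
```

After normalization the row boundaries are gone, which is the behavior reported in #3983.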

3 files changed: +15 -6 lines changed

CHANGELOG.md

Lines changed: 5 additions & 0 deletions
@@ -1,3 +1,8 @@
+## 0.18.33
+
+### Fixes
+- **Preserve newlines in Table/TableChunk elements during PDF partitioning**: Skip whitespace normalization for Table and TableChunk elements so newlines that carry structural meaning (row separation) are preserved (fixes #3983)
+
 ## 0.18.32
 
 ### Enhancements

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.18.32"  # pragma: no cover
+__version__ = "0.18.33"  # pragma: no cover

unstructured/partition/pdf.py

Lines changed: 9 additions & 5 deletions
@@ -34,6 +34,8 @@
     Link,
     ListItem,
     PageBreak,
+    Table,
+    TableChunk,
     Text,
     Title,
 )
@@ -823,11 +825,13 @@ def _partition_pdf_or_image_local(
                 out_elements.append(cast(Element, el))
             # NOTE(crag): this is probably always a Text object, but check for the sake of typing
             elif isinstance(el, Text):
-                el.text = re.sub(
-                    RE_MULTISPACE_INCLUDING_NEWLINES,
-                    " ",
-                    el.text or "",
-                ).strip()
+                # Skip newline normalization for Table/TableChunk - newlines carry structural meaning
+                if not isinstance(el, (Table, TableChunk)):
+                    el.text = re.sub(
+                        RE_MULTISPACE_INCLUDING_NEWLINES,
+                        " ",
+                        el.text or "",
+                    ).strip()
                 if el.text or isinstance(el, PageBreak):
                     out_elements.append(cast(Element, el))
 
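
The guard added in `pdf.py` can be tried in isolation with a small sketch. The `Text`, `Table`, and `TableChunk` classes below are simplified stand-ins rather than the library's real element classes; the regex pattern and the `normalize_elements` helper are likewise assumptions made for illustration. The only property carried over from the diff is that table elements pass the `isinstance(el, Text)` check.

```python
import re
from dataclasses import dataclass

# Assumption: the real constant is not shown in this commit.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


@dataclass
class Text:
    """Simplified stand-in for the library's Text element."""
    text: str = ""


class Table(Text):
    """Stand-in: a Table is treated as a kind of Text, as the diff implies."""


class TableChunk(Table):
    """Stand-in for a chunked Table."""


def normalize_elements(elements):
    """Hypothetical helper: collapse whitespace in every element except tables."""
    out_elements = []
    for el in elements:
        # Same guard as the commit: tables keep their newlines untouched.
        if not isinstance(el, (Table, TableChunk)):
            el.text = re.sub(RE_MULTISPACE_INCLUDING_NEWLINES, " ", el.text or "").strip()
        if el.text:  # drop elements whose text is empty after normalization
            out_elements.append(el)
    return out_elements


elements = [Text("Quarterly\n  report"), Table("a b\nc d"), Text(" \n ")]
result = normalize_elements(elements)
for el in result:
    print(type(el).__name__, repr(el.text))
```

Because the whole `re.sub(...).strip()` call sits inside the guard, table text also skips the trailing `.strip()`, so any leading or trailing whitespace in a table's text passes through unchanged.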
