5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,8 +1,11 @@
-## 0.18.33-dev0
+## 0.18.33-dev1
 
+### Enhancements
+- **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489)
+
 ### Fixes
 - **Preserve newlines in Table/TableChunk elements during PDF partitioning**: Skip whitespace normalization for Table and TableChunk elements so newlines that carry structural meaning (row separation) are preserved (fixes #3983)
 
 ## 0.18.32
 
 ### Enhancements
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
-__version__ = "0.18.33-dev0"  # pragma: no cover
+__version__ = "0.18.33-dev1"  # pragma: no cover
17 changes: 12 additions & 5 deletions unstructured/partition/pdf.py
@@ -34,6 +34,8 @@
     Link,
     ListItem,
     PageBreak,
+    Table,
+    TableChunk,
     Text,
     Title,
 )
@@ -823,11 +825,16 @@ def _partition_pdf_or_image_local(
                 out_elements.append(cast(Element, el))
             # NOTE(crag): this is probably always a Text object, but check for the sake of typing
             elif isinstance(el, Text):
-                el.text = re.sub(
-                    RE_MULTISPACE_INCLUDING_NEWLINES,
-                    " ",
-                    el.text or "",
-                ).strip()
+                if isinstance(el, (Table, TableChunk)):
+                    # For Table/TableChunk, preserve newlines (they carry structural meaning)
+                    # but still collapse multiple horizontal whitespace (spaces, tabs) to single space
+                    el.text = re.sub(r"[^\S\n]+", " ", el.text or "").strip()
+                else:
+                    el.text = re.sub(
+                        RE_MULTISPACE_INCLUDING_NEWLINES,
+                        " ",
+                        el.text or "",
+                    ).strip()
                 if el.text or isinstance(el, PageBreak):
                     out_elements.append(cast(Element, el))

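For reviewers who want to see the effect of the two normalization branches in isolation, here is a minimal standalone sketch. It does not import `unstructured`. `ANY_WHITESPACE` is an assumption standing in for `RE_MULTISPACE_INCLUDING_NEWLINES` (taken here to match any whitespace run, newlines included), and `normalize_text` and its `keep_newlines` flag are hypothetical names used only for illustration. The `[^\S\n]+` pattern is the one from the diff.

```python
import re

# Assumption: stand-in for the library's RE_MULTISPACE_INCLUDING_NEWLINES,
# taken here to match any run of whitespace, newlines included.
ANY_WHITESPACE = re.compile(r"\s+")

# Pattern from the diff: runs of whitespace characters other than "\n".
HORIZONTAL_WHITESPACE = re.compile(r"[^\S\n]+")


def normalize_text(text, keep_newlines):
    """Collapse whitespace runs to one space, optionally leaving newlines in place.

    Hypothetical helper: keep_newlines=True mirrors the Table/TableChunk branch,
    keep_newlines=False mirrors the branch for other Text elements.
    """
    pattern = HORIZONTAL_WHITESPACE if keep_newlines else ANY_WHITESPACE
    return pattern.sub(" ", text or "").strip()


table_text = "Name   Qty\nApple\t\t3\nPear  12\n"

flattened = normalize_text(table_text, keep_newlines=False)
preserved = normalize_text(table_text, keep_newlines=True)

print(repr(flattened))  # 'Name Qty Apple 3 Pear 12'
print(repr(preserved))  # 'Name Qty\nApple 3\nPear 12'
```

Two properties of the `[^\S\n]+` pattern may be worth a look in review. A carriage return counts as non-newline whitespace, so `"a\r\nb"` becomes `"a \nb"`. A space that sits directly before or after a newline is collapsed but not removed, so individual rows can keep a leading or trailing space.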