Closed
Changes from 9 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -1,6 +1,7 @@
## 0.18.15

### Enhancements
+- Speed up function _assign_hash_ids by 34% (codeflash)
- Speed up function ElementHtml._get_children_html by 234% (codeflash)
- Speed up function group_broken_paragraphs by 30% (codeflash)

14 changes: 5 additions & 9 deletions unstructured/partition/common/metadata.py
@@ -5,7 +5,6 @@
 import copy
 import datetime as dt
 import functools
-import itertools
 import os
 from typing import Any, Callable, Iterator, Sequence

@@ -252,15 +251,12 @@ def _assign_hash_ids(elements: list[Element]) -> list[Element]:
     or more fragments for parallel processing.
     """
     # -- generate sequence number for each element on a page --
-    page_numbers = [e.metadata.page_number for e in elements]
-    page_seq_numbers = [
-        seq_on_page
-        for _, group in itertools.groupby(page_numbers)
-        for seq_on_page, _ in enumerate(group)
-    ]
-
-    for element, seq_on_page_counter in zip(elements, page_seq_numbers):
+    page_seq_counts = {}
+    for element in elements:
+        page_number = element.metadata.page_number
+        seq_on_page_counter = page_seq_counts.get(page_number, 0)
         element.id_to_hash(seq_on_page_counter)
+        page_seq_counts[page_number] = seq_on_page_counter + 1
Bug: Hash ID Assignment Breaks Page Consistency

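The `_assign_hash_ids` function now keeps one running counter per page number across the whole element list, instead of restarting the count for each consecutive run of elements that share a page number. When elements from the same page are not contiguous (for example, page numbers 1, 2, 1), the sequence numbers passed to `id_to_hash` differ from those of the previous implementation. The generated hash IDs therefore change for such inputs, which may affect downstream systems that rely on stable IDs.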
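To make the difference concrete, here is a minimal standalone sketch of the two counting strategies from the diff. The helper names `seq_numbers_groupby` and `seq_numbers_dict` are hypothetical, and plain integers stand in for `element.metadata.page_number`; this is not the library's code, only an illustration of the behavior described above.

```python
import itertools


def seq_numbers_groupby(page_numbers):
    """Old strategy: the counter restarts for every consecutive run of a page number."""
    return [
        seq_on_page
        for _, group in itertools.groupby(page_numbers)
        for seq_on_page, _ in enumerate(group)
    ]


def seq_numbers_dict(page_numbers):
    """New strategy: one running counter per page number over the whole list."""
    page_seq_counts = {}
    result = []
    for page_number in page_numbers:
        seq_on_page_counter = page_seq_counts.get(page_number, 0)
        result.append(seq_on_page_counter)
        page_seq_counts[page_number] = seq_on_page_counter + 1
    return result


# Pages appear in contiguous runs: both strategies agree.
contiguous = [1, 1, 2, 2, 2]
print(seq_numbers_groupby(contiguous))  # [0, 1, 0, 1, 2]
print(seq_numbers_dict(contiguous))     # [0, 1, 0, 1, 2]

# Page 1 reappears after page 2: the last element gets a different number.
interleaved = [1, 1, 2, 1]
print(seq_numbers_groupby(interleaved))  # [0, 1, 0, 0]
print(seq_numbers_dict(interleaved))     # [0, 1, 0, 2]
```

If the input is always ordered by page, the two strategies produce the same sequence numbers and the change is purely a speedup. The divergence only shows up for interleaved page numbers, so whether this is a regression depends on whether such inputs occur in practice.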


     return elements
