Description
Describe the bug
When using the partition function with overlap and chunking parameters, the overlapped content from the previous chunk is included in the text of the new chunk, but the coordinates metadata for this overlapped portion is missing. This creates an inconsistency between the chunk's text content and its available coordinate information.
To Reproduce
- Create a PDF document with multiple sections
- Use the following code to partition the document:
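from unstructured.partition.auto import partition

filename = "example.pdf"  # placeholder: path to the multi-section PDF

elements = partition(
    filename=filename,
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=1500,
    combine_text_under_n_chars=300,
    unique_element_ids=True,
    overlap=170,
    overlap_all=True,
    skip_infer_table_types=[],
)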
- Examine the resulting chunks, particularly focusing on the overlapped portions
Expected behavior
When text content is included in a chunk due to overlap settings, its corresponding coordinate metadata should also be included in the chunk's metadata. This ensures consistency between the text content and the available coordinate information for each chunk.
For example, the current output looks like this:
{
  "type": "CompositeElement",
  "text": "Built the entire app infrastructure with Flutter... [overlapped content] ... SKILLS\n\nProgramming Languages Go, Python...",
  "metadata": [
    {
      "type": "Title",
      "text": "SKILLS",
      "metadata": {
        "coordinates": {
          "points": [[75.8, 1981.9], ...]
        }
      }
    },
    {
      "type": "NarrativeText",
      "text": "Programming Languages Go, Python...",
      "metadata": {
        "coordinates": {
          "points": [[75.8, 2033.6], ...]
        }
      }
    }
  ]
}
The coordinates for the overlapped portion ("Built the entire app infrastructure...") are missing, though the text is present in the chunk.
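To make the gap easy to spot, here is a minimal sketch that reports the leading part of a chunk's text that none of the listed elements accounts for. It assumes each chunk has been serialized to a dict shaped like the output above (a top-level "text" plus a "metadata" list of elements). The helper name `uncovered_prefix` and the abbreviated `sample_chunk` are illustrative only and are not part of the library.

```python
def uncovered_prefix(chunk: dict) -> str:
    """Return the leading chunk text that no listed element accounts for."""
    text = chunk["text"]
    # Position in the chunk text where each element's text first appears.
    starts = [text.find(el["text"]) for el in chunk["metadata"] if el["text"]]
    starts = [s for s in starts if s >= 0]
    if not starts:
        # No element text was found at all, so nothing is covered.
        return text
    # Everything before the earliest element has no coordinates attached.
    return text[:min(starts)].strip()


# Abbreviated version of the chunk from this report (only the first
# coordinate point of each element is kept; text is shortened).
sample_chunk = {
    "type": "CompositeElement",
    "text": "Built the entire app infrastructure with Flutter SKILLS\n\n"
            "Programming Languages Go, Python",
    "metadata": [
        {
            "type": "Title",
            "text": "SKILLS",
            "metadata": {"coordinates": {"points": [[75.8, 1981.9]]}},
        },
        {
            "type": "NarrativeText",
            "text": "Programming Languages Go, Python",
            "metadata": {"coordinates": {"points": [[75.8, 2033.6]]}},
        },
    ],
}

print(repr(uncovered_prefix(sample_chunk)))
```

A non-empty result for a chunk means that part of its text has no element, and therefore no coordinates, behind it.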
Screenshots
N/A
Environment Info
# Please run `python scripts/collect_env.py` and paste the output here
Additional context
This issue is particularly important when the coordinate information is needed for downstream tasks such as:
- Highlighting text in the original document
- Maintaining spatial relationships between text elements
- Performing layout-aware text processing
The missing coordinates for overlapped content can lead to inconsistencies in applications that rely on both the text content and its spatial information.
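Until the coordinates are carried over, a downstream consumer could at least identify which part of a chunk is overlap by comparing it with the previous chunk. The sketch below is a possible workaround under that assumption, not the library's behavior: it finds the longest suffix of the previous chunk's text that is also a prefix of the current chunk's text. The function name `overlap_length`, the `max_overlap` default (taken from the `overlap=170` setting above), and the sample strings are all hypothetical.

```python
def overlap_length(prev_text: str, cur_text: str, max_overlap: int = 170) -> int:
    """Length of the longest suffix of prev_text that is a prefix of cur_text."""
    limit = min(len(prev_text), len(cur_text), max_overlap)
    # Try the longest candidate first and shrink until a match is found.
    for n in range(limit, 0, -1):
        if prev_text[-n:] == cur_text[:n]:
            return n
    return 0


# Made-up texts for two consecutive chunks.
prev_chunk_text = "Led the mobile team. Built the entire app infrastructure"
cur_chunk_text = "Built the entire app infrastructure SKILLS"

n = overlap_length(prev_chunk_text, cur_chunk_text)
print(repr(cur_chunk_text[:n]))
```

The matched prefix could then be mapped back to the previous chunk's elements to recover their coordinates. This relies on the overlap being copied verbatim, which may not hold if whitespace is normalized during chunking.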