feat: Include coordinates of overlapped text in elements #3810

Open
@darrayes

Describe the bug
When the partition function is called with chunking and overlap parameters, the text carried over from the previous chunk is included in the new chunk's text, but the coordinate metadata for that overlapped portion is missing. As a result, the chunk's text and its coordinate information are inconsistent.

To Reproduce

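  1. Create a PDF document with multiple sections
  2. Use the following code to partition the document:

from unstructured.partition.auto import partition

filename = "example.pdf"  # placeholder: path to the PDF created in step 1

elements = partition(
    filename=filename,
    strategy="hi_res",
    chunking_strategy="by_title",
    max_characters=1500,
    combine_text_under_n_chars=300,
    unique_element_ids=True,
    overlap=170,       # characters carried over from the previous chunk
    overlap_all=True,  # apply overlap between all chunks
    skip_infer_table_types=[],
)

  3. Examine the resulting chunks, particularly focusing on the overlapped portions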

Expected behavior
When text content is included in a chunk due to overlap settings, its corresponding coordinate metadata should also be included in the chunk's metadata. This ensures consistency between the text content and the available coordinate information for each chunk.

For example, in the current output:

{
    "type": "CompositeElement",
    "text": "Built the entire app infrastructure with Flutter... [overlapped content] ... SKILLS\n\nProgramming Languages Go, Python...",
    "metadata": [
        {
            "type": "Title",
            "text": "SKILLS",
            "metadata": {
                "coordinates": {
                    "points": [[75.8, 1981.9], ...]
                }
            }
        },
        {
            "type": "NarrativeText",
            "text": "Programming Languages Go, Python...",
            "metadata": {
                "coordinates": {
                    "points": [[75.8, 2033.6], ...]
                }
            }
        }
    ]
}

The coordinates for the overlapped portion ("Built the entire app infrastructure...") are missing, though the text is present in the chunk.
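The gap can be checked mechanically. The following is a minimal sketch, assuming each chunk is available as a plain dictionary shaped like the output above; the helper name `find_uncovered_text` is hypothetical and not part of the library, and the sample text and points are shortened from the example.

```python
def find_uncovered_text(chunk: dict) -> str:
    """Return the leading chunk text that no listed element accounts for."""
    text = chunk["text"]
    first_start = len(text)
    for element in chunk["metadata"]:
        pos = text.find(element["text"])
        if pos != -1:
            # Track where the earliest element with coordinates begins.
            first_start = min(first_start, pos)
    # Everything before that point has text but no coordinates.
    return text[:first_start].strip()


# Sample chunk: shape taken from the output above, values abbreviated.
chunk = {
    "type": "CompositeElement",
    "text": (
        "Built the entire app infrastructure with Flutter\n\n"
        "SKILLS\n\nProgramming Languages Go, Python"
    ),
    "metadata": [
        {
            "type": "Title",
            "text": "SKILLS",
            "metadata": {"coordinates": {"points": [[75.8, 1981.9]]}},
        },
        {
            "type": "NarrativeText",
            "text": "Programming Languages Go, Python",
            "metadata": {"coordinates": {"points": [[75.8, 2033.6]]}},
        },
    ],
}

print(repr(find_uncovered_text(chunk)))
```

A non-empty result marks text that a highlighting tool could not locate in the original document.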

Screenshots
N/A

Environment Info

# Please run `python scripts/collect_env.py` and paste the output here

Additional context
This issue is particularly important when the coordinate information is needed for downstream tasks such as:

  • Highlighting text in the original document
  • Maintaining spatial relationships between text elements
  • Performing layout-aware text processing

The missing coordinates for overlapped content can lead to inconsistencies in applications that rely on both the text content and its spatial information.
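To illustrate the requested behavior, here is a rough sketch of how the elements that supply an overlap could be identified so their coordinates travel with the text. It is an assumption-laden illustration, not the library's chunking implementation: the function `overlap_sources`, the dictionary layout, and the `"\n\n"` separator are all hypothetical.

```python
def overlap_sources(
    prev_elements: list[dict], overlap: int, separator: str = "\n\n"
) -> list[dict]:
    """Pick the trailing elements of the previous chunk that cover `overlap` characters."""
    selected = []
    remaining = overlap
    # Walk backwards from the end of the previous chunk, since the overlap
    # is taken from its tail.
    for element in reversed(prev_elements):
        if remaining <= 0:
            break
        selected.append(element)
        remaining -= len(element["text"]) + len(separator)
    selected.reverse()  # restore document order
    return selected


# Hypothetical previous-chunk elements (coordinates omitted for brevity).
prev_elements = [
    {"type": "NarrativeText", "text": "a" * 100},
    {"type": "NarrativeText", "text": "b" * 50},
    {"type": "NarrativeText", "text": "c" * 30},
]

sources = overlap_sources(prev_elements, overlap=60)
print([len(e["text"]) for e in sources])
```

The selected elements, with their coordinate metadata, could then be prepended to the new chunk's element list so that every part of the chunk text has a spatial reference.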

Metadata

Labels: enhancement (New feature or request)