Suggestion: include consolidated bounding box coordinates in chunk metadata when using "by_title" chunking strategy #3194

Open
@m-kemarskyi

Problem
Currently, when the "by_title" chunking strategy is used with the coordinates = true parameter (to return the coordinates of PDF chunks), no coordinates are returned. This is because the strategy joins separate chunks under the hood, and a joined chunk may span multiple pages.

The "by_title" strategy is very useful because the "default" strategy often returns very small chunks (a single word or just a few words). As a result, the inability to use coordinates with the "by_title" strategy blocks use cases that require the coordinates of text blocks in PDF files.

Suggestion
The suggestion is to return consolidated bounding box coordinates when the "by_title" chunking strategy is used: a rectangle defined by the extreme coordinates of the included chunks. This would apply when the multipage_sections = False parameter is passed, since chunks then cannot span multiple pages and the Unstructured.io API can calculate the bounding box on a single page.
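
To illustrate what "a rectangle with extreme coordinates" could mean, here is a minimal sketch in plain Python. It does not use the Unstructured library. The function name consolidated_bbox is hypothetical, and it assumes each included chunk's box is available as a sequence of (x, y) corner points on the same page, in the same coordinate system, with y increasing downward. The corner order of the result is also just a choice made for this example.

```python
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]


def consolidated_bbox(
    element_boxes: Iterable[Sequence[Point]],
) -> Tuple[Point, Point, Point, Point]:
    """Return the smallest axis-aligned rectangle enclosing all given boxes.

    Hypothetical helper: assumes every box is on the same page and is given
    as a sequence of (x, y) corner points in one shared coordinate system.
    """
    xs = []
    ys = []
    for points in element_boxes:
        for x, y in points:
            xs.append(x)
            ys.append(y)
    if not xs:
        raise ValueError("no coordinates to consolidate")

    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    # Corner order chosen for this sketch:
    # top-left, bottom-left, bottom-right, top-right.
    return ((left, top), (left, bottom), (right, bottom), (right, top))


# Made-up boxes for a title and the paragraph below it.
title_box = [(72, 90), (72, 110), (300, 110), (300, 90)]
paragraph_box = [(72, 120), (72, 200), (540, 200), (540, 120)]

chunk_box = consolidated_bbox([title_box, paragraph_box])
print(chunk_box)
```

With multipage_sections = False, all boxes passed to such a helper would come from one page, so a single rectangle is well defined.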

Additional context
The issue was discussed here: #1698

Labels: chunking, enhancement