feat/add optional flag to disable image text extraction under chunking strategy 'hi-res' #3520

Open
@ajpanyteam

Is your feature request related to a problem? Please describe.
I would like an optional flag for the 'hi-res' strategy so that tables are extracted but images are not. When an image contains text, that text is extracted as gibberish, which degrades RAG results.

Describe the solution you'd like
An optional flag to switch off image text extraction while keeping table extraction.

Describe alternatives you've considered
I tried partitioning first and then filtering out the elements I do not want, but the remaining elements still need to be chunked, and the TS SDK does not support the chunk_by_title function that is available in Python:

from unstructured.chunking.title import chunk_by_title
# `elements` is the list of partitioned elements left after filtering
chunks = chunk_by_title(elements)
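
For reference, here is a rough sketch of the workaround described above: drop image elements, then chunk what remains. It assumes the partition output has been serialized to JSON-style dicts with "type" and "text" keys, and that image elements carry the type "Image". The helpers `drop_images` and `chunk_on_titles` and the sample data are hypothetical names made up for illustration. `chunk_on_titles` is only a simplified stand-in for title-based chunking, not the library's chunk_by_title algorithm.

```python
from typing import Dict, List

# JSON-style element, e.g. {"type": "Table", "text": "..."} (assumed shape).
Element = Dict[str, object]


def drop_images(elements: List[Element]) -> List[Element]:
    """Return every element whose "type" is not "Image"; tables and text are kept."""
    return [el for el in elements if el.get("type") != "Image"]


def chunk_on_titles(elements: List[Element], max_chars: int = 500) -> List[str]:
    """Simplified stand-in for title-based chunking (hypothetical helper).

    A new chunk starts at every "Title" element, or when adding the next
    element's text would push the current chunk's text past max_chars.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for el in elements:
        text = str(el.get("text", "")).strip()
        if not text:
            continue  # skip elements that carry no text
        starts_section = el.get("type") == "Title"
        too_long = size + len(text) > max_chars
        if current and (starts_section or too_long):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical sample input, shaped like serialized partition output.
sample = [
    {"type": "Title", "text": "Architecture"},
    {"type": "NarrativeText", "text": "Sensors stream data to the hub."},
    {"type": "Image", "text": "Realtime - irk v * Realime"},
    {"type": "Table", "text": "Source Rate Sensor 400Hz"},
    {"type": "Title", "text": "Sample Workflow"},
    {"type": "NarrativeText", "text": "Files land in the data lake."},
]

chunks = chunk_on_titles(drop_images(sample))
```

The same two steps are easy to port to TypeScript since they only touch plain JSON fields. This still runs OCR on the images during partitioning, so it removes the gibberish from the output but does not save the processing time a real flag would.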

Additional context
Example of image text extracted:

text: « Realtime - irk v * Realime . & ta to Delta rigger = 1 Realtim: ond: o ager =1 second: JSON to Dota (clean up) Watermar oving A rersoes whh — I e L - @8 Azure Data Laj JISON files DBX Autoloader DBX DBSQL Warehouse Structyred Streaming Workflows n Data Apps f A DELTA LAKE DELTA LAKE DELTA LAKE Azure loT Hub T 'Raspberry Pi On site sensors. @400Hz\n + \n + 'Sample Workflow:\n

Labels: enhancement