bug/text-as-html-missing-content #3358

Open
@mpolomdeepsense

Describe the bug
When chunking is enabled, the text_as_html of a Table element is sometimes missing content that is present in the element's text property.
Reasoning:

  • Text for a table can only come from within the cells of the table.
  • Therefore, if a Table element has text, it must have come from one or more of the table cells.
  • Therefore the text_as_html table should be populated with text in those same cells.
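The reasoning above can be turned into a quick check. This is only a minimal sketch: `CellTextParser` and `missing_tokens` are hypothetical helpers (not part of unstructured), and the sketch assumes that comparing whitespace-separated tokens is a good enough approximation of "same content". It uses only the Python standard library.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Hypothetical helper: collect the text found inside <td> and <th> cells."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while the parser is inside a td/th cell
        self.cells = []   # one string per cell, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._depth += 1
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self.cells[-1] += data


def missing_tokens(text, text_as_html):
    """Return the tokens of `text` that appear in no cell of `text_as_html`."""
    parser = CellTextParser()
    parser.feed(text_as_html or "")  # treat a missing text_as_html as empty
    parser.close()
    html_tokens = set(" ".join(parser.cells).split())
    return [tok for tok in text.split() if tok not in html_tokens]


# Made-up example: the last row is present in the text but absent from the HTML.
text = "Name Qty Apple 3 Pear 5"
html = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr></table>"
)
print(missing_tokens(text, html))
```

For a Table element, a non-empty result from `missing_tokens(table.text, table.metadata.text_as_html)` would indicate the behavior described in this issue. The token comparison is deliberately lenient: it ignores cell order and repeated values.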

To Reproduce

from io import StringIO

import pandas as pd
import unstructured_client
from unstructured_client.models import operations, shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import elements_from_dicts

client = unstructured_client.UnstructuredClient(
    api_key_auth="...",
    server_url=" ...",
)

filename_a = r"doc.pdf"

with open(filename_a, "rb") as f:
    data = f.read()

req = operations.PartitionRequest(
    partition_parameters=shared.PartitionParameters(
        files=shared.Files(
            content=data,
            file_name=filename_a,
        ),
        strategy="hi_res",
        coordinates=True,
        hi_res_model_name="yolox",
        chunking_strategy="by_page",
        split_pdf_page=False,
        include_page_breaks=True,
        output_format="application/json",
        languages=["eng"],
    ),
)

resp = client.general.partition(req)

elements = elements_from_dicts(resp.elements)
tables = [e for e in elements if e.category == "Table"]
for table in tables:
    # pd.read_html returns a list of DataFrames, one per <table> in the HTML
    dataframes = pd.read_html(StringIO(table.metadata.text_as_html))
    print(table.text)  # plain text, for comparison with the parsed HTML
    print(dataframes)

Expected behavior
For every chunked Table element, text and text_as_html contain the same content (text_as_html presents that content as an HTML table).


    Labels

    bug, pdf
