bug/'text_as_html' result contains few incorrect/invalid characters #3523

Open
@MuruganDurai


When extracting tables from a PDF, the open-source unstructured library produces mixed results: the extracted text (el.text) matches the PDF exactly, but metadata.text_as_html does not. The HTML contains stray characters such as '{', ';', or 'F . '.

How can I resolve or clean up these artifacts in text_as_html?

To Reproduce
from unstructured.partition.pdf import partition_pdf

# filepath is the path to the input PDF
elements = partition_pdf(
    filename=filepath,
    strategy="hi_res",
    infer_table_structure=True,
    include_orig_elements=True,
    max_characters=6000,
    split_pdf_page=True,
    hi_res_model_name="yolox",
    overlap=True,
    languages=["eng"],
    chunking_strategy="by_title",
)

# Print the plain text and the HTML rendering of every table element
for el in elements:
    if el.category == "Table" and el.metadata.text_as_html is not None:
        print(el.text)
        print(el.metadata.text_as_html)

Text Result
'Regd Id 1 172644600000101 81 2 225698300000100 40 Entity Name Entity ID DOS EFP EPS INCUBATOR BUSINESS SERVICES LIMITED 726446000 01/07/2019 01/07/2019 PUZZLE Limited 2256983000 22/02/2021 22/02/2021 FSP NOT AVAILABLE NOT AVAILABLE EFP 09/02/2021 01/02/2022 DOC EPS FSP 09/02/2021 NOT AVAILABLE 01/02/2022 NOT AVAILABLE'

Text_as_html Result

Sn. | Regd Id | Entity Name; Entity ID | DOS | DOC
EFP | EPS | FSP | EFP | EPS | FSP
1 | 172644600000101 81 | INCUBATOR BUSINESS SERVICES LIMITED | 1726446000 | 01/07/2019 | 01/07/2019 | NOT AVAILABLE | 09/02/2021 | 09/02/2021 | NOT AVAILABLE
2 | 225698300000100 40 | F . PUZZLE Limited | 2256983000 | {22/02/2021 | 22/02/2021 | NOT AVAILABLE | 01/02/2022 | 01/02/2022 | NOT AVAILABLE
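
One possible workaround, offered only as a rough sketch: since el.text appears to be correct, it can serve as a reference for post-processing text_as_html. The helper names below (clean_cell_text, clean_table_html) and the NOISE_EDGE_CHARS set are hypothetical and not part of unstructured. The heuristic assumes the stray characters are either punctuation stuck to the edge of a token (such as '{' or ';') or very short tokens that never occur in el.text (such as 'F' and '.'). It would not catch other kinds of errors, and it could wrongly drop a legitimate short value that is missing from el.text.

```python
import re

# Assumption: characters that show up as noise at the edges of cell tokens.
NOISE_EDGE_CHARS = "{}[];|"


def clean_cell_text(cell: str, reference_tokens: set) -> str:
    """Remove likely noise from one text node, using el.text tokens as reference."""
    kept = []
    for token in cell.split():
        # Strip suspicious punctuation from both ends of the token.
        token = token.strip(NOISE_EDGE_CHARS)
        if not token:
            continue
        # Drop very short tokens that never occur in the plain-text result.
        if len(token) <= 2 and token not in reference_tokens:
            continue
        kept.append(token)
    return " ".join(kept)


def clean_table_html(table_html: str, reference_text: str) -> str:
    """Apply clean_cell_text to every text node between HTML tags."""
    reference_tokens = set(reference_text.split())

    def _clean(match):
        return ">" + clean_cell_text(match.group(1), reference_tokens) + "<"

    # Text nodes are the non-empty runs between a closing '>' and the next '<'.
    return re.sub(r">([^<>]+)<", _clean, table_html)
```

Inside the loop from the reproduction code, this could be applied as clean_table_html(el.metadata.text_as_html, el.text). Longer tokens are deliberately kept even when they are absent from el.text, so real values are not discarded just because the two outputs disagree.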
