Description
When extracting tables from a PDF, the open-source unstructured library produces mixed results. The extracted text (el.text) matches the PDF, but text_as_html does not: it contains stray characters such as '{', ';' or 'F . '.
How can these artifacts in text_as_html be resolved or cleaned up?
To Reproduce
from unstructured.partition.pdf import partition_pdf

# filepath is the path to the PDF being partitioned
elements = partition_pdf(
    filename=filepath,
    strategy="hi_res",
    infer_table_structure=True,
    include_orig_elements=True,
    max_characters=6000,
    split_pdf_page=True,
    hi_res_model_name="yolox",
    overlap=True,
    languages=["eng"],
    chunking_strategy="by_title",
)

# Print the plain text and the HTML rendering of every table element
for el in elements:
    if el.category == "Table" and el.metadata.text_as_html is not None:
        print(el.text)
        print(el.metadata.text_as_html)
Text Result
'Regd Id 1 172644600000101 81 2 225698300000100 40 Entity Name Entity ID DOS EFP EPS INCUBATOR BUSINESS SERVICES LIMITED 726446000 01/07/2019 01/07/2019 PUZZLE Limited 2256983000 22/02/2021 22/02/2021 FSP NOT AVAILABLE NOT AVAILABLE EFP 09/02/2021 01/02/2022 DOC EPS FSP 09/02/2021 NOT AVAILABLE 01/02/2022 NOT AVAILABLE'
Text_as_html Result
Sn. | Regd Id | Entity Name | ; Entity ID | DOS | DOC | ||||
---|---|---|---|---|---|---|---|---|---|
EFP | EPS | FSP | EFP | EPS | FSP | ||||
1 | 172644600000101 81 | INCUBATOR BUSINESS SERVICES LIMITED | 1726446000 | 01/07/2019 | 01/07/2019 | NOT AVAILABLE | 09/02/2021 | 09/02/2021 | NOT AVAILABLE |
2 | 225698300000100 40 | F . PUZZLE Limited | 2256983000 | {22/02/2021 | 22/02/2021 | NOT AVAILABLE | 01/02/2022 | 01/02/2022 | NOT AVAILABLE |
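
One possible workaround, until the root cause is understood, is to post-process text_as_html cell by cell. The sketch below is not part of unstructured: clean_cell, clean_table_rows and the noise pattern are hypothetical helpers, and the pattern is an assumption based only on the three artifacts reported here (a leading '{', a leading ';', and a leading single letter followed by ' . '). It uses only the Python standard library and could strip legitimate content such as real initials, so it would need tuning against more documents.

```python
import re
from html.parser import HTMLParser

# Assumed noise: leading '{', '}', ';', '|' runs, or a lone letter followed by " . ".
# Derived only from the samples in this report, not from any unstructured behaviour.
_LEADING_NOISE = re.compile(r"^(?:[{};|]+\s*|[A-Za-z]\s\.\s+)+")


def clean_cell(text: str) -> str:
    """Strip the assumed leading noise from one table cell."""
    return _LEADING_NOISE.sub("", text.strip()).strip()


class _CellCollector(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def clean_table_rows(html: str) -> list[list[str]]:
    """Parse an HTML table string and return its rows with cleaned cell text."""
    parser = _CellCollector()
    parser.feed(html)
    parser.close()
    return [[clean_cell(cell) for cell in row] for row in parser.rows]
```

It could be applied as clean_table_rows(el.metadata.text_as_html) for each table element. Working on parsed cells rather than on the raw HTML string keeps the regex from touching tags or attributes.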