bug/'text_as_html' result contains a few incorrect/invalid characters #3523

When extracting tables from a PDF with the open-source unstructured library, the results are mixed: the extracted text (`el.text`) matches the PDF, but `text_as_html` does not. The HTML contains stray characters such as '{', ';', or 'F . '.

How can I resolve or clean up these stray characters in `text_as_html`?

**To Reproduce**
from unstructured.partition.pdf import partition_pdf

# filepath is the path to the PDF being processed
elements = partition_pdf(
    filename=filepath,
    strategy="hi_res",
    infer_table_structure=True,
    include_orig_elements=True,
    max_characters=6000,
    split_pdf_page=True,
    hi_res_model_name="yolox",
    overlap=True,
    languages=["eng"],
    chunking_strategy="by_title",
)

# Print the plain text and the HTML rendering of every extracted table
for el in elements:
    if el.category == "Table" and el.metadata.text_as_html is not None:
        print(el.text)
        print(el.metadata.text_as_html)
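
To narrow down which cells are affected, something like the following could compare each HTML cell against the plain text. This is only a rough sketch using the standard library: `CellCollector` and `suspicious_tokens` are hypothetical helper names, not part of unstructured, and the whitespace-token comparison is an assumption that will also flag legitimate cells whenever `el.text` itself is missing a token.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text content of every <td>/<th> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def suspicious_tokens(table_html, plain_text):
    """Return (cell, tokens) pairs for cells holding tokens absent from plain_text."""
    known = set(plain_text.split())
    parser = CellCollector()
    parser.feed(table_html)
    parser.close()
    report = []
    for cell in parser.cells:
        missing = [tok for tok in cell.split() if tok not in known]
        if missing:
            report.append((cell, missing))
    return report
```

Inside the loop above it could be called as `suspicious_tokens(el.metadata.text_as_html, el.text)`.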

**Text Result**
'Regd Id 1 172644600000101 81 2 225698300000100 40 Entity Name Entity ID DOS EFP EPS INCUBATOR BUSINESS SERVICES  LIMITED 726446000 01/07/2019 01/07/2019 PUZZLE  Limited 2256983000 22/02/2021 22/02/2021 FSP NOT AVAILABLE NOT AVAILABLE EFP 09/02/2021 01/02/2022 DOC EPS FSP 09/02/2021 NOT AVAILABLE 01/02/2022 NOT AVAILABLE'

**Text_as_html Result**
<table><thead><tr><th rowspan="2">Sn.</th><th rowspan="2">Regd Id</th><th rowspan="2">Entity Name</th><th rowspan="2">; Entity ID</th><th colspan="3">DOS</th><th colspan="3">DOC</th></tr><tr><th>EFP</th><th>EPS</th><th>FSP</th><th>EFP</th><th>EPS</th><th>FSP</th></tr></thead><tbody><tr><td>1</td><td>172644600000101 81</td><td>INCUBATOR BUSINESS SERVICES  LIMITED</td><td>1726446000</td><td>01/07/2019</td><td>01/07/2019</td><td>NOT AVAILABLE</td><td>09/02/2021</td><td>09/02/2021</td><td>NOT AVAILABLE</td></tr><tr><td>2</td><td>225698300000100 40</td><td>F . PUZZLE  Limited</td><td>2256983000</td><td>{22/02/2021</td><td>22/02/2021</td><td>NOT AVAILABLE</td><td>01/02/2022</td><td>01/02/2022</td><td>NOT AVAILABLE</td></tr></tbody></table>
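
One possible workaround is to post-process `text_as_html` using `el.text` as a reference. The sketch below is not an unstructured feature: `clean_table_html`, `clean_cell`, and the `STRAY` character set are hypothetical names and assumptions of mine. It keeps tokens that also appear in the plain text, strips stray punctuation when the remainder matches a known token, and drops unknown tokens that are a single character or contain no letters or digits. Longer unknown tokens such as 'Sn.' are kept, because the plain text is not always complete either.

```python
import html
import re

# Assumed set of noise characters; adjust for the documents at hand.
STRAY = "{}[]|;`~"


def clean_cell(cell, known):
    """Remove tokens that look like noise, using `known` tokens as a reference."""
    kept = []
    for tok in cell.split():
        if tok in known:
            kept.append(tok)
            continue
        stripped = tok.strip(STRAY)
        if stripped in known:
            # e.g. '{22/02/2021' becomes '22/02/2021'
            kept.append(stripped)
        elif len(tok) > 1 and any(ch.isalnum() for ch in tok):
            # Unknown but plausible token: keep it rather than lose data
            kept.append(tok)
        # Otherwise drop: lone punctuation or a single stray character
    return " ".join(kept)


def clean_table_html(table_html, plain_text):
    """Clean the text between tags in table_html; tags and attributes are untouched."""
    known = set(plain_text.split())

    def repl(match):
        cleaned = clean_cell(html.unescape(match.group(1)), known)
        return ">" + html.escape(cleaned, quote=False) + "<"

    return re.sub(r">([^<>]+)<", repl, table_html)
```

Usage would be `clean_table_html(el.metadata.text_as_html, el.text)`. The heuristic is deliberately conservative, so it can still miss noise that happens to look like a real token, and it would remove a genuine single-character cell value that is absent from the plain text.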




| Sn. | Regd Id | Entity Name | ; Entity ID | DOS EFP | DOS EPS | DOS FSP | DOC EFP | DOC EPS | DOC FSP |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 172644600000101 81 | INCUBATOR BUSINESS SERVICES LIMITED | 1726446000 | 01/07/2019 | 01/07/2019 | NOT AVAILABLE | 09/02/2021 | 09/02/2021 | NOT AVAILABLE |
| 2 | 225698300000100 40 | F . PUZZLE Limited | 2256983000 | {22/02/2021 | 22/02/2021 | NOT AVAILABLE | 01/02/2022 | 01/02/2022 | NOT AVAILABLE |
