The confidence score #3903

Open
@ranjit-tiger

Description

Describe the bug
After parsing a PDF, how can the parsing results be validated?

To Reproduce
The detection_class_prob key is not consistent: it is not available for all extracted elements.

Expected behavior
Say I am parsing a PDF that contains images, text, tables rendered as images, and so on. I call partition_pdf() with hi_res as the strategy. I expect the metadata of every element to include a detection_class_prob key that gives the confidence score. However, the key is missing for some elements. For example, detection_class_prob is present for a Table element but not for an Image element, and it is similarly unavailable for some other elements. The expected behavior is for this key to be present on all elements.
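To make the inconsistency easier to report, here is a minimal sketch that counts, per element type, how many elements carry detection_class_prob. It works on plain dictionaries of the shape produced by calling to_dict() on each element. The helper name summarize_detection_probs and the sample values are hypothetical and for illustration only; they are not output from the PDF in this issue.

```python
from collections import defaultdict


def summarize_detection_probs(element_dicts):
    """Count, per element type, how many elements have detection_class_prob."""
    summary = defaultdict(lambda: {"with_prob": 0, "without_prob": 0})
    for el in element_dicts:
        metadata = el.get("metadata", {})
        prob = metadata.get("detection_class_prob")
        bucket = "with_prob" if prob is not None else "without_prob"
        summary[el.get("type", "Unknown")][bucket] += 1
    return dict(summary)


# Hypothetical sample data (made-up values, not real parser output).
sample = [
    {"type": "Table", "metadata": {"detection_class_prob": 0.93}},
    {"type": "Image", "metadata": {}},
    {"type": "NarrativeText", "metadata": {"detection_class_prob": 0.88}},
    {"type": "Image", "metadata": {}},
]

report = summarize_detection_probs(sample)
for element_type, counts in sorted(report.items()):
    print(element_type, counts)
```

With real output, the same helper could be called as summarize_detection_probs([el.to_dict() for el in raw_pdf_elements]) to see which element types are missing the key.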

Environment Info
unstructured version: 0.16.23

from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="/content/data/Cocktails_Spirits.pdf",
    strategy="hi_res",
    infer_table_structure=True,  # Infer table structure from content
    extract_images_in_pdf=True,  # Extract images from the PDF
    extract_image_block_types=["Image", "Table"],  # Extract Image and Table blocks
    extract_image_block_to_payload=True,  # Return images in the element payload
    output_format="application/json",  # JSON output format
    extract_image_block_output_dir="extracted_data_test",
)

Additional context
A probability value should be returned for every element.

Labels: bug (Something isn't working)
