0.10.28

@cragwolfe cragwolfe released this 31 Oct 06:02
ecbc454

Enhancements

  • Add element type CI evaluation workflow. Adds element type frequency evaluation metrics to the current ingest workflow to measure performance for each extracted file as well as in aggregate.
  • Add table structure evaluation helpers. Adds functions to evaluate the similarity between predicted table structure and actual table structure.
  • Use yolox by default for table extraction when partitioning pdf/image. The yolox model provides higher recall of table regions than the quantized version, and it is now the default element detection model when infer_table_structure=True for partitioning pdf/image files.
  • Remove pdfminer elements from inside tables. Previously, when using hi_res, some elements were also extracted using pdfminer, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
  • Fsspec downstream connectors. New destination connector added to the ingest CLI; users may now use unstructured-ingest to write to any of the following:
    • Azure
    • Box
    • Dropbox
    • Google Cloud Service
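
As a rough illustration of the element type frequency evaluation mentioned in the first item above, the sketch below counts element types in one document's output and scores them against an expected count. The function names, the dict-based element shape, and the scoring rule are all assumptions made for illustration; they are not taken from the unstructured codebase.

```python
from collections import Counter


def element_type_frequency(elements):
    """Count how often each element type occurs in one document's output."""
    return Counter(el["type"] for el in elements)


def frequency_match(predicted, expected):
    """Share of expected element occurrences reproduced by the prediction (0.0 to 1.0).

    Hypothetical scoring rule: per type, credit min(predicted, expected).
    """
    total = sum(expected.values())
    if total == 0:
        return 1.0 if not predicted else 0.0
    matched = sum(min(count, predicted.get(kind, 0)) for kind, count in expected.items())
    return matched / total


# Hypothetical partition output for a single file.
predicted = element_type_frequency([
    {"type": "Title"},
    {"type": "NarrativeText"},
    {"type": "NarrativeText"},
    {"type": "Table"},
])
# Hypothetical ground-truth frequencies for the same file.
expected = Counter({"Title": 1, "NarrativeText": 3, "Table": 1})

score = frequency_match(predicted, expected)
print(score)
```

A per-file score like this could then be averaged across files to get an aggregate figure, which is one plausible reading of "aggregated-level performance".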

Features

  • Update ocr_only strategy in partition_pdf(). Adds the functionality to get accurate coordinate data when partitioning PDFs and images with the ocr_only strategy.

Fixes

  • Fixes issue where tables from markdown documents were being treated as text. Problem: Tables from markdown documents were being treated as text, and not being extracted as tables. Solution: Enable the tables extension when instantiating the python-markdown object. Importance: This will allow users to extract structured data from tables in markdown documents.
  • Fix wrong logger for paddle info. Replace the logger from unstructured-inference with the logger from unstructured for the paddle_ocr.py module.
  • Fix ingest pipeline to be able to use chunking and embedding together. Problem: When the ingest pipeline used chunking and embedding together, embedding outputs were empty and the outputs of chunking couldn't be re-read into memory and forwarded to embedding. Fix: Added the CompositeElement type to TYPE_TO_TEXT_ELEMENT_MAP so that CompositeElements can be processed with unstructured.staging.base.isd_to_elements.
  • Fix unnecessary mid-text chunk-splitting. The "pre-chunker" did not consider separator blank-line ("\n\n") length when grouping elements for a single chunk. As a result, sections were frequently over-populated, producing an oversized chunk that required mid-text splitting.
  • Fix frequent dissociation of title from chunk. The sectioning algorithm included the title of the next section with the prior section whenever it would fit, frequently producing association of a section title with the prior section and dissociating it from its actual section. Fix this by performing combination of whole sections only.
  • Fix PDF attempt to get dict value from string. Fixes a rare edge case that prevented some PDFs from being partitioned. The get_uris_from_annots function tried to access the dictionary value of a string instance variable. Assign None to the annotation variable if the instance type is not dictionary to avoid the erroneous attempt.
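
The mid-text chunk-splitting fix above comes down to simple arithmetic: joining n texts with "\n\n" adds 2 × (n − 1) characters, so a group whose raw text lengths fit the limit can still overflow once joined. The following is a minimal sketch of separator-aware grouping. The function name, the greedy strategy, and the max_characters parameter name are assumptions for illustration and do not reproduce the library's actual pre-chunker.

```python
SEPARATOR = "\n\n"


def group_for_chunks(texts, max_characters):
    """Greedily group texts so each joined group fits within max_characters.

    A single text longer than max_characters still gets its own group;
    this sketch does not split it further.
    """
    groups, current, current_len = [], [], 0
    for text in texts:
        # Joining adds one separator before every text except the first.
        added = len(text) + (len(SEPARATOR) if current else 0)
        if current and current_len + added > max_characters:
            # Close the current group and start a new one with this text.
            groups.append(current)
            current, current_len = [], 0
            added = len(text)
        current.append(text)
        current_len += added
    if current:
        groups.append(current)
    return groups


# Raw lengths sum to 99, which looks like it fits in 100 characters,
# but joining all three with two separators would yield 103.
texts = ["a" * 40, "b" * 40, "c" * 19]
groups = group_for_chunks(texts, max_characters=100)
chunks = [SEPARATOR.join(group) for group in groups]
print([len(chunk) for chunk in chunks])
```

Because the separator is counted while grouping, the third text starts a new group instead of producing an oversized chunk that would later need to be split mid-text.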