Releases: Unstructured-IO/unstructured
0.15.0
Enhancements
- Improve text clearing process in email partitioning. Updated the email partitioner to remove both `=\n` and `=\r\n` characters during the clearing process. Previously, only `=\n` characters were removed.
- Bump unstructured.paddleocr to 2.8.0.1.
- Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. `<p>`, `<div>`) nested inside a phrasing element (e.g. `<strong>` or `<cite>`). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
- Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
- CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
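The soft line break removal described in the email partitioning item above can be sketched as follows. This is a minimal illustration, not the partitioner's actual code: the helper name `remove_soft_line_breaks` and the regular expression are assumptions made for the example.

```python
import re

# Hypothetical helper (not the library's implementation): a quoted-printable
# soft line break is "=" followed by LF or CRLF, so one pattern covers both.
_SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Remove "=\\n" and "=\\r\\n" sequences from raw email text."""
    return _SOFT_BREAK.sub("", text)


sample = "Hello wo=\r\nrld, this is a lo=\nng line."
cleaned = remove_soft_line_breaks(sample)
print(cleaned)
```

Matching the optional `\r` keeps emails with Windows-style line endings from leaving stray `=\r` fragments behind.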
Features
- Add support for specifying OCR language to `partition_pdf()`. Extend language specification capability to `PaddleOCR` in addition to `TesseractOCR`. Users can now specify OCR languages for both OCR engines when using `partition_pdf()`.
- Add AstraDB source connector. Adds support for ingesting documents from AstraDB.
Fixes
- Remedy error on Windows when `nltk` binaries are downloaded. Work around a quirk in the Windows implementation of `tempfile.NamedTemporaryFile` where accessing the temporary file by name raises `PermissionError`.
- Move Astra embedded_dimension to write config.
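One common way to avoid the Windows `NamedTemporaryFile` quirk mentioned above is to create the file with `delete=False`, close it, and only then reopen it by name. The sketch below shows that pattern under the assumption that this is an acceptable workaround; it is not necessarily the approach the library took, and `write_then_reopen` is a hypothetical name.

```python
import os
import tempfile


def write_then_reopen(data: bytes) -> bytes:
    """Write data to a temp file, then read it back through its path."""
    # delete=False keeps the file on disk after the handle is closed, so it
    # can be reopened by name without holding a second open handle on it.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        with tmp:
            tmp.write(data)
        # The original handle is closed here; reopening by name is now safe.
        with open(tmp.name, "rb") as f:
            return f.read()
    finally:
        # With delete=False, cleanup is the caller's responsibility.
        os.unlink(tmp.name)


result = write_then_reopen(b"punkt")
print(result)
```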
0.14.10
Enhancements
- Update unstructured-client dependency. Change unstructured-client dependency pin back to greater than min version and update tests that were failing given the update.
- `.doc` files are now supported in the `arm64` image. `libreoffice24` is added to the `arm64` image, meaning `.doc` files are now supported. We have follow-on work planned to investigate adding `.ppt` support for `arm64` as well.
- Add table detection metrics: recall, precision and f1
- Remove unused _with_spans metrics
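For reference, the three table detection metrics named above follow the standard definitions based on true positives, false positives, and false negatives. The sketch below uses those textbook formulas; the function name and the choice to return 0.0 when a denominator is zero are assumptions, not the evaluation module's behavior.

```python
def detection_scores(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall, f1) from raw detection counts."""
    predicted = true_positives + false_positives  # tables the model reported
    actual = true_positives + false_negatives     # tables in the ground truth
    # Assumption: an empty denominator yields 0.0 rather than raising.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = detection_scores(8, 2, 4)
print(precision, recall, f1)
```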
Features
Fixes
- Fix counting false negatives and false positives in table structure evaluation
- Fix Slack CI test. Change channel that Slack test is pointing to because previous test bot expired.
- Remove NLTK download. Removes `nltk.download` in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.
0.14.9
Enhancements
- Added visualization and OD model result dump for PDF. In PDF `hi_res` strategy the `analysis` parameter can be used to visualize the result of the OD model and dump the result to a file. Additionally, the visualization of bounding boxes of each layout source is rendered and saved for each page.
- `partition_docx()` distinguishes "file not found" from "not a ZIP archive" error. `partition_docx()` now provides different error messages for "file not found" and "file is not a ZIP archive (and therefore not a DOCX file)". This aids diagnosis since these two conditions generally point in different directions as to the cause and fix.
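The two error conditions that `partition_docx()` now distinguishes can be told apart with standard-library checks. The sketch below only illustrates the idea: `check_docx_path`, the exception types, and the message wording are assumptions and may differ from what the library raises.

```python
import os
import tempfile
import zipfile


def check_docx_path(path: str) -> None:
    """Raise a distinct error for a missing file versus a non-ZIP file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"no such file: {path!r}")
    if not zipfile.is_zipfile(path):
        # A DOCX file is a ZIP archive, so a non-ZIP file cannot be DOCX.
        raise ValueError(f"not a ZIP archive (so not a DOCX file): {path!r}")


with tempfile.TemporaryDirectory() as tmp_dir:
    bogus = os.path.join(tmp_dir, "notes.docx")
    with open(bogus, "w") as f:
        f.write("plain text, not a ZIP")
    outcomes = []
    for candidate in (os.path.join(tmp_dir, "missing.docx"), bogus):
        try:
            check_docx_path(candidate)
        except (FileNotFoundError, ValueError) as exc:
            outcomes.append(type(exc).__name__)

print(outcomes)
```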
Features
Fixes
- Fix a bug where multiple `soffice` processes could be attempted. Add a wait mechanism in `convert_office_doc` so that the function first checks if another `soffice` is running already: if yes, wait till the other process finishes or till the wait timeout before spawning a subprocess to run `soffice`.
- `partition()` now forwards `strategy` arg to `partition_docx()`, `partition_pptx()`, and their brokering partitioners for DOC, ODT, and PPT formats. A `strategy` argument passed to `partition()` (or the default value "auto" assigned by `partition()`) is now forwarded to `partition_docx()`, `partition_pptx()`, and their brokering partitioners when those filetypes are detected.
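The wait mechanism described for `convert_office_doc` amounts to polling with a deadline. The sketch below shows that general pattern only; `wait_until_idle`, its parameters, and the injected `is_busy` callable are hypothetical, and how the library actually detects a running `soffice` process is not shown here.

```python
import time
from typing import Callable


def wait_until_idle(is_busy: Callable[[], bool], timeout: float = 10.0, poll: float = 0.1) -> bool:
    """Poll is_busy() until it returns False or the timeout elapses.

    Returns True if the resource became idle, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while is_busy():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True


# Simulated checks: busy twice, then idle.
states = iter([True, True, False])
became_idle = wait_until_idle(lambda: next(states), timeout=1.0, poll=0.01)

# A resource that never becomes idle hits the timeout.
timed_out = wait_until_idle(lambda: True, timeout=0.05, poll=0.01)
print(became_idle, timed_out)
```

Per the release note, the caller proceeds to spawn the subprocess in either case; the return value here only reports which condition ended the wait.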
0.14.8
Enhancements
- Move arm64 image to wolfi-base. The `arm64` image now runs on `wolfi-base`. The `arm64` build for `wolfi-base` does not yet include `libreoffice`, and so `arm64` does not currently support processing `.doc`, `.ppt`, or `.xls` files. If you need to process those files on `arm64`, use the legacy `rockylinux` image.
Features
Fixes
- Bump unstructured-inference==0.7.36. Fix `ValueError` when converting cells to html.
- `partition()` now forwards `strategy` arg to `partition_docx()`, `partition_ppt()`, and `partition_pptx()`. A `strategy` argument passed to `partition()` (or the default value "auto" assigned by `partition()`) is now forwarded to `partition_docx()`, `partition_ppt()`, and `partition_pptx()` when those filetypes are detected.
- Fix missing sensitive field markers for embedders
0.14.7
Enhancements
- Pull from `wolfi-base` image. The amd64 image now pulls from the `unstructured` `wolfi-base` image to avoid duplication of dependency setup steps.
- Fix windows temp file. Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py windows compatible.
Features
- Expose conversion functions for tables. Adds public functions to convert tables from HTML to the Deckerd format and back.
Fixes
- Fix an error publishing docker images. Update user in docker-smoke-test to reflect changes made by the amd64 image pull from the "unstructured" "wolfi-base" image.
- Fix an IndexError when partitioning a PDF with values for both `extract_image_block_types` and `starting_page_number`.
0.14.6
Enhancements
- Bump unstructured-inference==0.7.35. Fix syntax for generated HTML tables.
Features
- tqdm ingest support. Add an optional flag to the ingest flow to print a progress bar for each step in the process.
Fixes
- Remove deprecated `overwrite_schema` kwarg from Delta Table connector. The `overwrite_schema` kwarg is deprecated in `deltalake>=0.18.0`. `schema_mode=` should be used now instead. `schema_mode="overwrite"` is equivalent to `overwrite_schema=True` and `schema_mode="merge"` is equivalent to `overwrite_schema="False"`. `schema_mode` defaults to `None`. You can also now specify `engine`, which defaults to `"pyarrow"`. You need to specify `engine="rust"` to use `schema_mode`.
- Fix passing parameters to python-client. Remove parsing list arguments to strings in passing arguments to python-client in Ingest workflow and `partition_via_api`.
- Table metric bug fix. `get_element_level_alignment()` will now find all the matched indices in predicted table data instead of only returning the first match in the case of multiple matches for the same gt string.
- fsspec connector path/permissions bug. V2 fsspec connectors were failing when defined relative filepaths had a leading slash. This strips that slash to guarantee the relative path never has it.
- Dropbox connector internal file path bugs. Dropbox source connector currently raises exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library and a divergence in the definition of the Dropbox library's fs.info method, expecting a 'url' parameter rather than 'path'.
- Update table metric evaluation to handle corrected HTML syntax for tables. This change is connected to the update in unstructured-inference and fixes transforming an HTML table to the deckerd and internal cells formats.
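The table metric fix above changes a "first match" lookup into an "all matches" lookup. The sketch below illustrates that difference on a plain list of strings. It assumes exact string equality for simplicity; the real `get_element_level_alignment()` may use a different matching rule, and `all_match_indices` is a hypothetical helper.

```python
def all_match_indices(predicted: list[str], target: str) -> list[int]:
    """Return every index in predicted whose text equals target."""
    return [i for i, text in enumerate(predicted) if text == target]


predicted_cells = ["Total", "12", "Total", "7"]

# list.index() stops at the first hit, which hides the second "Total".
first_only = predicted_cells.index("Total")

# Collecting all indices keeps both candidates available for alignment.
all_matches = all_match_indices(predicted_cells, "Total")
print(first_only, all_matches)
```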
0.14.5
Enhancements
- Filtering for tar extraction. Adds tar filtering to the compression module for connectors to avoid decompressing malicious content in `.tar.gz` files. This was added to the Python `tarfile` lib in Python 3.12. The change only applies when using Python 3.12 and above.
- Use `python-oxmsg` for `partition_msg()`. Outlook MSG emails are now partitioned using the `python-oxmsg` package which resolves some shortcomings of the prior MSG parser.
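The tar filtering item above relies on the extraction filters in Python's `tarfile` module. A minimal sketch of filtered extraction follows. The helper name `safe_extract` and the choice of the `"data"` filter are assumptions; the sketch also checks for the feature with `hasattr` rather than by version number, which may differ from how the compression module decides.

```python
import io
import os
import tarfile
import tempfile


def safe_extract(archive_path: str, dest: str) -> None:
    """Extract a .tar.gz archive, using the "data" filter when available."""
    os.makedirs(dest, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # The "data" filter rejects members such as absolute paths or
            # links that point outside the destination directory.
            tar.extractall(dest, filter="data")
        else:
            # Older interpreters without filters: extraction is unfiltered.
            tar.extractall(dest)


with tempfile.TemporaryDirectory() as tmp_dir:
    archive = os.path.join(tmp_dir, "sample.tar.gz")
    payload = b"hello"
    # Build a tiny archive in place so the example is self-contained.
    with tarfile.open(archive, "w:gz") as tar:
        info = tarfile.TarInfo(name="docs/readme.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    out_dir = os.path.join(tmp_dir, "out")
    safe_extract(archive, out_dir)
    with open(os.path.join(out_dir, "docs", "readme.txt"), "rb") as f:
        extracted = f.read()

print(extracted)
```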
Features
Fixes
- 8-bit string Outlook MSG files are parsed. `partition_msg()` is now able to parse non-unicode Outlook MSG emails.
- Attachments to Outlook MSG files are extracted intact. `partition_msg()` is now able to extract attachments without corruption.
0.14.4
Enhancements
- Move logger error to debug level when PDFminer fails to extract text which includes error message for Invalid dictionary construct.
- Add support for Pinecone serverless. Adds Pinecone serverless to the connector tests. Pinecone serverless will work with versions >=0.14.2, but hadn't been tested until now.
Features
- Allow configuration of the Google Vision API endpoint. Add an environment variable to select the Google Vision API in the US or the EU.
Fixes
- Address the issue of unrecognized tables in `UnstructuredTableTransformerModel`. When a table is not recognized, the `element.metadata.text_as_html` attribute is set to an empty string.
- Remove root handlers in ingest logger. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
- Fix V2 S3 Destination Connector authentication. Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
- Clarified dependence on particular version of `python-docx`. Pinned `python-docx` version to ensure a particular method `unstructured` uses is included.
- Ingest preserves original file extension. Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.
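Removing handlers from the root logger, as described in the ingest logger item above, can be done with the standard `logging` module. The sketch below is a generic illustration; `remove_root_handlers` is a hypothetical name and the ingest logger's actual setup code may differ.

```python
import logging


def remove_root_handlers() -> int:
    """Detach every handler from the root logger and return how many were removed."""
    root = logging.getLogger()
    removed = 0
    # Iterate over a copy because removeHandler() mutates root.handlers.
    for handler in list(root.handlers):
        root.removeHandler(handler)
        removed += 1
    return removed


# Simulate an environment (such as a notebook) that pre-installs a root handler.
logging.getLogger().addHandler(logging.StreamHandler())
removed_count = remove_root_handlers()
remaining = list(logging.getLogger().handlers)
print(removed_count, remaining)
```

With no root handlers attached, records from library loggers are emitted only through handlers those loggers configure themselves.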
0.14.3
Enhancements
- Move `category` field from Text class to Element class.
- `partition_docx()` now supports pluggable picture sub-partitioners. A subpartitioner that accepts a DOCX `Paragraph` and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
- Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.
Features
Fixes
- Fix `partition_pdf()` to keep spaces in the text. The control character `\t` is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
- Turn off XML resolve entities. Sets `resolve_entities=False` for XML parsing with `lxml` to avoid text being dynamically injected into the XML document.
- Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
- Add the missing `form_extraction_skip_tables` argument to the `partition_pdf_or_image` call.
- Chromadb change from Add to Upsert using element_id to make idempotent
- Disable `table_as_cells` output by default to reduce overhead in partition; now `table_as_cells` is only produced when the env `EXTACT_TABLE_AS_CELLS` is `true`
- Reduce excessive logging. Change per-page OCR info-level logging into detail-level trace logging.
- Replace try block in `document_to_element_list` for handling HTMLDocument. Use `getattr(element, "type", "")` to get the `type` attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
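The `getattr(element, "type", "")` idiom from the last item reads an optional attribute with a default instead of wrapping the access in a try block. The sketch below uses two stand-in classes, `HtmlLikeElement` and `PlainElement`, which are hypothetical and not part of the library.

```python
class HtmlLikeElement:
    """Stand-in for an element that carries a `type` attribute."""

    type = "Title"


class PlainElement:
    """Stand-in for an element without a `type` attribute."""


def element_type(element) -> str:
    # Falls back to "" only when the `type` attribute itself is missing.
    return getattr(element, "type", "")


with_type = element_type(HtmlLikeElement())
without_type = element_type(PlainElement())
print(repr(with_type), repr(without_type))
```

Because the default applies to this single lookup, an `AttributeError` raised by unrelated code nearby is no longer caught by accident.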
0.14.2
Enhancements
- Bump unstructured-inference==0.7.33.
Features
- Add attribution to the `pinecone` connector.