Releases: Unstructured-IO/unstructured
0.10.14
0.10.13
Enhancements
- Updated documentation: added back the supported doc types for partitioning, more Python code examples in the API page, a RAG definition, and a use case.
- Updated Hi-Res Metadata: PDFs and Images using Hi-Res strategy now have layout model class probabilities added to metadata.
- Updated the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of a file when it is not a zip file.
- Tesseract minor version bump to 5.3.2
Features
- Add Jira Connector to be able to pull issues from a Jira organization
- Add `clean_ligatures` function to expand ligatures in text
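For illustration, ligature expansion can be sketched in a few lines. The mapping table and the function name `expand_ligatures` below are assumptions made for this example; they are not taken from the library's `clean_ligatures` implementation, which may cover a different set of characters.

```python
# Hypothetical sketch of ligature expansion; not the library's implementation.
# Each key is a single ligature code point, each value its expanded letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\u00e6": "ae",
    "\u0153": "oe",
}


def expand_ligatures(text: str) -> str:
    """Replace each known ligature character with its expanded letters."""
    return text.translate(str.maketrans(LIGATURES))
```

Text without any of the listed code points passes through unchanged.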
Fixes
- `partition_html` breaks on `<br>` elements.
- Ingest error handling to properly raise errors when wrapped
- GH issue 1361: fixes a sorting error that prevented some PDFs from being parsed
- Bump unstructured-inference
- Brings back embedded images in PDFs (0.5.23)
0.10.12
Enhancements
- Removed PIL pin as issue has been resolved upstream
- Bump unstructured-inference
- Support for yolox_quantized layout detection model (0.5.20)
- YoloX element types added
Features
- Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead
Fixes
- Bump unstructured-inference
- Avoid divide-by-zero errors with `safe_division` (0.5.21)
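As a rough idea of what a guarded division helper can look like, here is a minimal sketch. Only the name `safe_division` comes from the release note; the body, the `EPSILON` constant, and its value are assumptions for illustration, and the sketch presumes non-negative denominators such as areas.

```python
EPSILON = 1e-6  # assumed lower bound for the denominator (hypothetical value)


def safe_division(numerator: float, denominator: float) -> float:
    """Divide after clamping the denominator to at least EPSILON.

    Assumes the denominator is non-negative (for example, an area),
    so clamping never changes its sign.
    """
    return numerator / max(denominator, EPSILON)
```

With this shape, a zero denominator yields a large finite value instead of raising `ZeroDivisionError`.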
0.10.11
Enhancements
- Bump unstructured-inference
- Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)
Features
- Add an S3 writer to the ingest CLI
Fixes
- Fix a bug where `xy-cut` sorting attempts to sort elements without valid coordinates; now `xy-cut` sorting only runs when all elements have valid coordinates
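The guard described in this fix can be sketched as follows. The `Element` class and the function name are hypothetical, and the ordering used here (top to bottom, then left to right) is a simple stand-in, not the xy-cut algorithm itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""
    text: str
    # (x, y) of the top-left corner, or None when no coordinates are available
    top_left: Optional[Tuple[float, float]] = None


def sort_if_all_have_coordinates(elements: List[Element]) -> List[Element]:
    """Sort only when every element has coordinates; otherwise keep input order."""
    if any(e.top_left is None for e in elements):
        return list(elements)
    # Stand-in ordering (top to bottom, then left to right); NOT xy-cut.
    return sorted(elements, key=lambda e: (e.top_left[1], e.top_left[0]))
```

If even one element lacks coordinates, the input order is returned untouched rather than partially sorted.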
0.10.10
Enhancements
- Adds `text` as an input parameter to `partition_xml`.
- `partition_xml` no longer runs through `partition_text`, avoiding incorrect splitting on carriage returns in the XML. Since `partition_xml` no longer calls `partition_text`, `min_partition` and `max_partition` are no longer supported in `partition_xml`.
- Bump `unstructured-inference==0.5.18`, change non-default detectron2 classification threshold
- Upgrade base image from rockylinux 8 to rockylinux 9
- Serialize IngestDocs to JSON when passing to subprocesses
Features
Fixes
- Fix a bug where mismatched `elements` and `bboxes` are passed into `add_pytesseract_bbox_to_elements`
0.10.9
Enhancements
- Fix `test_json` to handle only non-extra dependencies file types (plain-text)
Features
- Adds `chunk_by_title` to break a document into sections based on the presence of `Title` elements.
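The idea of sectioning on `Title` elements can be illustrated with a small sketch. The `Element` class and `split_on_titles` function below are hypothetical and self-contained; they are not the library's `chunk_by_title`, which may apply additional rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    category: str  # e.g. "Title" or "NarrativeText"
    text: str


def split_on_titles(elements: List[Element]) -> List[List[Element]]:
    """Start a new section each time a Title element is encountered."""
    sections: List[List[Element]] = []
    for element in elements:
        # Open a new section on a Title, or for leading non-title content.
        if element.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections
```

Content that appears before the first title ends up in its own leading section.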
Fixes
- Make cv2 dependency optional
- Edit `add_pytesseract_bbox_to_elements`'s (`ocr_only` strategy) `metadata.coordinates.points` return type to `Tuple` for consistency.
- Re-enable test-ingest-confluence-diff for ingest tests
- Fix syntax for ingest test check number of files
0.10.8
0.10.7
0.10.6
Enhancements
- Enable `partition_email` and `partition_msg` to detect if an email is PGP encrypted. If an email is PGP encrypted, the functions will return an empty list of elements and emit a warning about the encrypted content.
- Add threaded Slack conversations into Slack connector output
- Add functionality to sort elements using `xy-cut` sorting approach in `partition_pdf` for `hi_res` and `fast` strategies
- Bump unstructured-inference
- Set OMP_THREAD_LIMIT to 1 if not set for better tesseract perf (0.5.17)
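Detecting a PGP-encrypted email, as described for `partition_email` and `partition_msg` above, might look like the following sketch. It assumes the message uses the PGP/MIME content type from RFC 3156; the function name `looks_pgp_encrypted` is hypothetical, the library's own check may differ, and inline ASCII-armored PGP inside a plain-text body is not covered here.

```python
from email import message_from_string
from email.message import Message


def looks_pgp_encrypted(msg: Message) -> bool:
    """Return True for a PGP/MIME encrypted message (RFC 3156 content type)."""
    if msg.get_content_type() != "multipart/encrypted":
        return False
    protocol = msg.get_param("protocol")
    # get_param may return a tuple for RFC 2231 encoded values; treat that as no match.
    return isinstance(protocol, str) and protocol.lower() == "application/pgp-encrypted"


def parse_and_check(raw: str) -> bool:
    """Parse a raw email string and report whether it looks PGP encrypted."""
    return looks_pgp_encrypted(message_from_string(raw))
```

A caller could return an empty list of elements and emit a warning whenever `parse_and_check` returns `True`.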
Features
- Extract coordinates from PDFs and images when using OCR only strategy and add to metadata
Fixes
- Update `partition_html` to respect the order of `<pre>` tags.
- Fix bug in `partition_pdf_or_image` where two partitions were called if `strategy == "ocr_only"`.
- Bump unstructured-inference
- Fix issue where temporary files were being left behind (0.5.16)
- Adds deprecation warning for the `file_filename` kwarg to `partition`, `partition_via_api`, and `partition_multiple_via_api`.
- Fix documentation build workflow by pinning dependencies
0.10.5
Enhancements
- `partition` raises an error and tells the user to install the appropriate extra if a filetype is detected that is missing dependencies.
- Add custom errors to ingest
- Bump `unstructured-ingest==0.5.15`
- Handle an uncaught TesseractError (0.5.15)
- Add TIFF test file and TIFF filetype to `test_from_image_file` in `test_layout` (0.5.14)
- Use `entire_page` ocr mode for pdfs and images
- Add notes on extra installs to docs
Features
- Add delta table connector