Releases: Unstructured-IO/unstructured

0.10.14

11 Sep 19:28
59e850b

Enhancements

  • Update all connectors to use new downstream architecture
    • New click type added to parse comma-delimited string inputs
    • Some CLI options renamed

0.10.13

11 Sep 02:31
d0749d1

Enhancements

  • Updated documentation: Added back supported doc types for partitioning, more Python code examples in the API page, a RAG definition, and a use case.
  • Updated Hi-Res Metadata: PDFs and Images using Hi-Res strategy now have layout model class probabilities added to metadata.
  • Updated the _detect_filetype_from_octet_stream() function to use libmagic to infer the content type of the file when it is not a zip file.
  • Tesseract minor version bump to 5.3.2

Features

  • Add Jira Connector to be able to pull issues from a Jira organization
  • Add clean_ligatures function to expand ligatures in text
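
As a rough illustration of what expanding ligatures means, the sketch below maps a few Unicode ligature code points to their letter sequences. The function name expand_ligatures and the mapping table are assumptions made for this example; they are not the library's clean_ligatures implementation, whose table may cover more characters.

```python
# Hypothetical table for illustration: Latin ligatures from the
# Unicode "Alphabetic Presentation Forms" block and their expansions.
_LIGATURE_TABLE = str.maketrans(
    {
        "\ufb00": "ff",
        "\ufb01": "fi",
        "\ufb02": "fl",
        "\ufb03": "ffi",
        "\ufb04": "ffl",
    }
)


def expand_ligatures(text: str) -> str:
    """Replace each known ligature character with its letter sequence."""
    return text.translate(_LIGATURE_TABLE)


# "\ufb01" is the single-character "fi" ligature, "\ufb02" is "fl".
cleaned = expand_ligatures("The bene\ufb01ts of \ufb02uoride")
```

Text extracted from PDFs often contains these single-character ligatures, which can break keyword search and tokenization until they are expanded.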

Fixes

  • Fix partition_html breaking on <br> elements.
  • Fix ingest error handling to properly raise errors when wrapped
  • GH issue 1361: fixes a sorting error that prevented some PDFs from being parsed
  • Bump unstructured-inference
    • Brings back embedded images in PDFs (0.5.23)

0.10.12

04 Sep 02:10
c72014f

Enhancements

  • Removed PIL pin as issue has been resolved upstream
  • Bump unstructured-inference
    • Support for yolox_quantized layout detection model (0.5.20)
  • YoloX element types added

Features

  • Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead

Fixes

  • Bump unstructured-inference
    • Avoid divide-by-zero errors with safe_division (0.5.21)
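
The notes do not show how safe_division is written in unstructured-inference. The helper below is a minimal sketch of the general idea, under the assumption that a zero denominator should yield a caller-chosen default; the name safe_divide and its signature are hypothetical.

```python
def safe_divide(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    if denominator == 0:
        return default
    return numerator / denominator


# Example: fraction of a bounding box covered by an intersection.
# A degenerate box has zero area, which would otherwise raise ZeroDivisionError.
intersection_area = 0.0
box_area = 0.0
overlap = safe_divide(intersection_area, box_area)
```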

0.10.11

01 Sep 04:30
6534411

Enhancements

  • Bump unstructured-inference
    • Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)

Features

  • Add an S3 writer to the ingest CLI

Fixes

  • Fix a bug where xy-cut sorting attempts to sort elements without valid coordinates; now xy-cut sorting only runs when all elements have valid coordinates
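
The guard described in this fix can be sketched as follows. The Element class and the function name are hypothetical, and the ordering used here (top to bottom, then left to right) is only a stand-in; it is not the xy-cut algorithm itself.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Element:
    text: str
    # (x, y) of the top-left corner; None when no coordinates were extracted.
    top_left: Optional[Tuple[float, float]] = None


def sort_if_all_located(elements: List[Element]) -> List[Element]:
    """Sort by position only if every element has coordinates."""
    if any(el.top_left is None for el in elements):
        # At least one element cannot be placed: keep the original order.
        return list(elements)
    # Stand-in ordering, NOT xy-cut: top to bottom, then left to right.
    return sorted(elements, key=lambda el: (el.top_left[1], el.top_left[0]))


header = Element("header", (0.0, 0.0))
footer = Element("footer", (0.0, 100.0))
unplaced = Element("unplaced")

ordered = sort_if_all_located([footer, header])
untouched = sort_if_all_located([footer, unplaced, header])
```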

0.10.10

31 Aug 02:14
a4ec43a

Enhancements

  • Adds text as an input parameter to partition_xml.
  • partition_xml no longer runs through partition_text, avoiding incorrect splitting
    on carriage returns in the XML. Since partition_xml no longer calls partition_text,
    min_partition and max_partition are no longer supported in partition_xml.
  • Bump unstructured-inference==0.5.18, change non-default detectron2 classification threshold
  • Upgrade base image from rockylinux 8 to rockylinux 9
  • Serialize IngestDocs to JSON when passing to subprocesses
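
To see why routing XML through a line-based text splitter can split content incorrectly, consider the sketch below. It assumes one output string per leaf node, read with the standard library's ElementTree; this is an illustration of the idea and not the partition_xml implementation.

```python
import xml.etree.ElementTree as ET
from typing import List


def leaf_texts(xml_text: str) -> List[str]:
    """Return the text of each leaf node, with internal whitespace collapsed."""
    root = ET.fromstring(xml_text)
    texts = []
    for node in root.iter():
        is_leaf = len(node) == 0  # no child elements
        if is_leaf and node.text and node.text.strip():
            # A line break inside one node does not start a new item.
            texts.append(" ".join(node.text.split()))
    return texts


SAMPLE = "<doc><item>first line\nstill the same item</item><item>second</item></doc>"
items = leaf_texts(SAMPLE)
naive_lines = [line for line in "".join(ET.fromstring(SAMPLE).itertext()).splitlines() if line]
```

Here the node-based pass keeps the first item whole, while splitting the flattened text on line breaks cuts it at the embedded newline and also merges text across the item boundary.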

Features

Fixes

  • Fix a bug where mismatched elements and bboxes are passed into add_pytesseract_bbox_to_elements

0.10.9

30 Aug 04:20
e4535d2

Enhancements

  • Fix test_json to handle only file types that need no extra dependencies (plain text)

Features

  • Adds chunk_by_title to break a document into sections based on the presence of Title
    elements.
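
The idea of sectioning on Title elements can be sketched as below. The Element class and split_on_titles are hypothetical names for this example, and the sketch ignores any size limits or other options the real chunk_by_title may apply.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    category: str  # for example "Title" or "NarrativeText"
    text: str


def split_on_titles(elements: List[Element]) -> List[List[Element]]:
    """Group elements into sections, starting a new section at each Title."""
    sections: List[List[Element]] = []
    for element in elements:
        # A Title opens a new section; so does the very first element.
        if element.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections


document = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Some opening text."),
    Element("NarrativeText", "More opening text."),
    Element("Title", "Methods"),
    Element("NarrativeText", "How it was done."),
]
sections = split_on_titles(document)
```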

Fixes

  • Make cv2 dependency optional
  • Change the metadata.coordinates.points return type of add_pytesseract_bbox_to_elements (ocr_only strategy) to Tuple for consistency.
  • Re-enable test-ingest-confluence-diff for ingest tests
  • Fix syntax of the ingest test that checks the number of files

0.10.8

28 Aug 01:32
ba70828

Enhancements

  • Release docker image that installs Python 3.10 rather than 3.8

Features

Fixes

0.10.7

27 Aug 17:28
4c13d12

Enhancements

Features

Fixes

  • Remove overly aggressive ListItem chunking for images and PDFs, which typically resulted in incoherent elements.

0.10.6

26 Aug 01:12
3f1c90e

Enhancements

  • Enable partition_email and partition_msg to detect if an email is PGP encrypted. If
    an email is PGP encrypted, the functions will return an empty list of elements and
    emit a warning about the encrypted content.
  • Add threaded Slack conversations into Slack connector output
  • Add functionality to sort elements using the xy-cut sorting approach in partition_pdf for hi_res and fast strategies
  • Bump unstructured-inference
    • Set OMP_THREAD_LIMIT to 1 if not set for better tesseract perf (0.5.17)
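
The PGP behavior described in the first enhancement above (empty result plus a warning) might look roughly like this. The sketch assumes detection via the RFC 3156 wrapper, a multipart/encrypted message whose protocol parameter is application/pgp-encrypted; the library may detect encryption differently, and both function names are hypothetical.

```python
import warnings
from email import message_from_string
from email.message import Message
from typing import List


def is_pgp_encrypted(msg: Message) -> bool:
    """Check for the RFC 3156 multipart/encrypted PGP wrapper."""
    if msg.get_content_type() != "multipart/encrypted":
        return False
    return msg.get_param("protocol") == "application/pgp-encrypted"


def partition_email_sketch(raw: str) -> List[str]:
    """Return non-empty body lines, or [] with a warning for PGP mail."""
    msg = message_from_string(raw)
    if is_pgp_encrypted(msg):
        warnings.warn("Encrypted email detected; returning an empty list.")
        return []
    payload = msg.get_payload()
    if not isinstance(payload, str):
        return []  # other multipart messages are out of scope for this sketch
    return [line for line in payload.splitlines() if line.strip()]


ENCRYPTED = (
    'Content-Type: multipart/encrypted; protocol="application/pgp-encrypted"; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: application/pgp-encrypted\n"
    "\n"
    "Version: 1\n"
    "--b\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "-----BEGIN PGP MESSAGE-----\n"
    "-----END PGP MESSAGE-----\n"
    "--b--\n"
)
PLAIN = "Subject: hello\n\nFirst line\nSecond line\n"
```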

Features

  • Extract coordinates from PDFs and images when using OCR only strategy and add to metadata

Fixes

  • Update partition_html to respect the order of <pre> tags.
  • Fix bug in partition_pdf_or_image where two partitions were called if strategy == "ocr_only".
  • Bump unstructured-inference
    • Fix issue where temporary files were being left behind (0.5.16)
  • Adds deprecation warning for the file_filename kwarg to partition, partition_via_api,
    and partition_multiple_via_api.
  • Fix documentation build workflow by pinning dependencies
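
A deprecation warning for a renamed keyword argument, like the one added for file_filename above, is commonly implemented along these lines. The function below is a hypothetical stand-in for partition, and the replacement name metadata_filename is an assumption, since the notes do not say which argument supersedes file_filename.

```python
import warnings
from typing import Optional


def partition_sketch(
    metadata_filename: Optional[str] = None,
    file_filename: Optional[str] = None,  # deprecated spelling
) -> Optional[str]:
    """Return the filename that would be recorded in element metadata."""
    if file_filename is not None:
        warnings.warn(
            "The file_filename kwarg is deprecated; "
            "use metadata_filename instead (assumed replacement).",
            DeprecationWarning,
            stacklevel=2,
        )
        # Keep old callers working: fall back to the deprecated value.
        if metadata_filename is None:
            metadata_filename = file_filename
    return metadata_filename
```

Accepting both spellings during a deprecation window lets existing callers keep working while the warning points them to the new argument.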

0.10.5

22 Aug 23:11
e7d189f

Enhancements

  • partition raises an error and tells the user to install the appropriate extra if a filetype
    is detected that is missing dependencies.
  • Add custom errors to ingest
  • Bump unstructured-inference==0.5.15
    • Handle an uncaught TesseractError (0.5.15)
    • Add TIFF test file and TIFF filetype to test_from_image_file in test_layout (0.5.14)
  • Use entire_page OCR mode for PDFs and images
  • Add notes on extra installs to docs
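
The missing-dependency check described in the first enhancement above could be sketched as follows. The REQUIREMENTS table, the module names in it, and the function name are all assumptions for illustration; the actual mapping of file types to extras lives in the library and may differ.

```python
import importlib.util

# Hypothetical mapping: detected file type -> (module to import, pip extra).
REQUIREMENTS = {
    "docx": ("docx", "docx"),
    "pdf": ("pdf2image", "pdf"),
}


def require_extra(filetype: str) -> None:
    """Raise ImportError with an install hint if the file type's module is missing."""
    required = REQUIREMENTS.get(filetype)
    if required is None:
        return  # no optional dependency registered for this file type
    module_name, extra = required
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Partitioning {filetype} files requires extra dependencies. "
            f'Install them with: pip install "unstructured[{extra}]"'
        )
```

Checking with find_spec before importing lets the error name the extra to install, instead of surfacing a bare ModuleNotFoundError from deep inside a partitioning function.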

Features

  • Add delta table connector

Fixes