Releases: Unstructured-IO/unstructured
0.6.9
Enhancements
- fast strategy for pdf now keeps element bounding box data
- setup.py refactor
Features
Fixes
- Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
- Adds additional MIME types for CSV
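
The encoding fallback described in the first fix can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the function name `decode_with_fallback` and the candidate list `COMMON_ENCODINGS` are assumptions made for the example, and the encodings the library actually tries may differ.

```python
from typing import Optional, Tuple

# Hypothetical candidate list, tried in order. latin-1 maps every byte,
# so it acts as a last resort that never fails.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) for the given bytes."""
    if encoding is not None:
        # The caller specified an encoding: respect it and let errors propagate.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeError:
            continue
    raise UnicodeError("none of the candidate encodings could decode the input")
```

The fallback only applies when no encoding is passed, which matches the condition stated in the fix: an explicit user choice is never silently overridden.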
0.6.8
Enhancements
Features
- Add `partition_csv` for CSV files.
Fixes
0.6.7
Enhancements
- Deprecate `--s3-url` in favor of `--remote-url` in CLI
- Refactor out non-connector-specific config variables
- Add `file_directory` to metadata
- Add `page_name` to metadata. Currently used for the sheet name in XLSX documents.
- Added a `--partition-strategy` parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, `--partition-strategy fast`.
- Added metadata for filetype.
- Add Discord connector to pull messages from a list of channels
- Refactor `unstructured/file-utils/filetype.py` to better utilise hashmap to return mime type.
- Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for `test_filetype.py`.
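
The hashmap-based MIME type lookup mentioned in the filetype refactor can be pictured as a plain dictionary lookup. The sketch below is an assumption about the general shape only: `EXT_TO_MIME` and `mime_type_for` are hypothetical names, the mapping is a small illustrative subset, and the library may key its map differently.

```python
import os
from typing import Optional

# Illustrative subset; not the library's actual table.
EXT_TO_MIME = {
    ".csv": "text/csv",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xml": "application/xml",
}


def mime_type_for(filename: str) -> Optional[str]:
    """Look up a MIME type by file extension, or return None if unknown."""
    _, ext = os.path.splitext(filename)
    return EXT_TO_MIME.get(ext.lower())
```

A dictionary keeps the lookup constant-time and puts every supported type in one place, which is easier to extend than a chain of conditionals.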
Features
- Add `partition_xml` for XML files.
- Add `partition_xlsx` for Microsoft Excel documents.
Fixes
- Supports `hml` filetype for partition as a variation of html filetype.
- Makes `pytesseract` a function level import in `partition_pdf` so you can use the `"fast"` or `"hi_res"` strategies if `pytesseract` is not installed. Also adds the `required_dependencies` decorator for the `"hi_res"` and `"ocr_only"` strategies.
- Fix to ensure `filename` is tracked in metadata for `docx` tables.
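
The combination of a function-level import and a dependency-checking decorator, as described in the `pytesseract` fix, might look roughly like the sketch below. The decorator name `requires` and both decorated functions are hypothetical stand-ins written for this example; they are not the library's `required_dependencies` implementation, and `json` stands in for an optional package so that the snippet runs anywhere.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires(dependencies: Sequence[str]) -> Callable:
    """Hypothetical decorator: fail with a clear error if a package is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Checked at call time, so importing the module never fails.
            missing = [name for name in dependencies
                       if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(f"{func.__name__} needs: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires(["json"])
def parse_payload(raw: str) -> dict:
    import json  # function-level import: loaded only when this code path runs

    return json.loads(raw)


@requires(["surely_not_an_installed_package_xyz"])
def needs_missing_dependency() -> None:
    pass  # never reached, because the dependency check fails first
```

Because the import lives inside the function body, code paths that never call it work even when the optional package is absent.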
0.6.6
Enhancements
- Adds an `"auto"` strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for `partition_pdf` and `partition_image`. Users can maintain existing behavior by explicitly setting `strategy="hi_res"`.
- Added an additional trace logger for NLP debugging.
- Add `get_date` method to `ElementMetadata` for converting the datestring to a `datetime` object.
- Cleanup the `filename` attribute on `ElementMetadata` to remove the full filepath.
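
The last two enhancements (date conversion and filename cleanup) can be illustrated with a small stand-in class. The `Metadata` dataclass below is hypothetical and far smaller than `ElementMetadata`, and it assumes the date string is ISO 8601, which the notes above do not state.

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Metadata:  # hypothetical stand-in, not the library's ElementMetadata
    filename: Optional[str] = None
    date: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep only the final path component, dropping the directory part.
        if self.filename is not None:
            self.filename = os.path.basename(self.filename)

    def get_date(self) -> Optional[datetime]:
        """Convert the stored date string to a datetime (ISO 8601 assumed)."""
        if self.date is None:
            return None
        return datetime.fromisoformat(self.date)
```

For example, `Metadata(filename="/tmp/docs/report.pdf")` stores `report.pdf` as the filename.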
Features
- Added table reading as html with URL parsing to `partition_docx` in docx
- Added metadata field for text_as_html for docx files
Fixes
- `fileutils/file_type` check json and eml decode ignore error
- `partition_email` was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns `None` if the time does not match RFC-2822 at all.
- Include all metadata fields when converting to dataframe or CSV
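
The "return `None` when the date does not match RFC-2822" behaviour can be approximated with the standard library alone. This is a sketch under that assumption; `parse_email_date` is a hypothetical helper, not the code inside `partition_email`.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an RFC 2822 date header, returning None if it cannot be parsed."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones raise ValueError; treat both as "no usable date".
        return None
```

Returning `None` instead of raising lets the rest of the message be processed even when the `Date` header is malformed.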
0.6.5
Enhancements
- Added support for SpooledTemporaryFile file argument.
Features
Fixes
0.6.4
Enhancements
- Added an "ocr_only" strategy for `partition_pdf`. Refactored the strategy decision logic into its own module.
Features
Fixes
0.6.3
Enhancements
- Add an "ocr_only" strategy for `partition_image`.
Features
- Added `partition_multiple_via_api` for partitioning multiple documents in a single REST API call.
- Added `stage_for_baseplate` function to prepare outputs for ingestion into Baseplate.
- Added `partition_odt` for processing Open Office documents.
Fixes
- Updates the grouping logic in the `partition_pdf` fast strategy to group together text in the same bounding box.
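
Grouping text that shares a bounding box, as in the fix above, can be sketched as below. The input shape (a list of `(bbox, text)` pairs) and the name `group_by_bbox` are assumptions made for illustration; the library's internal representation and grouping rules are not described in these notes.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def group_by_bbox(fragments: List[Tuple[BBox, str]]) -> List[Tuple[BBox, str]]:
    """Join text fragments that share a bounding box, keeping first-seen order."""
    grouped: Dict[BBox, List[str]] = {}
    for bbox, text in fragments:
        grouped.setdefault(bbox, []).append(text)
    return [(bbox, " ".join(parts)) for bbox, parts in grouped.items()]
```

Grouping this way yields one element per box rather than one element per extracted line.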
0.6.2
Enhancements
- Added logic to `partition_pdf` for detecting copy protected PDFs and falling back to the hi res strategy when necessary.
Features
- Add `partition_via_api` for partitioning documents through the hosted API.
Fixes
- Fix how `exceeds_cap_ratio` handles empty (returns `True` instead of `False`)
- Updates `detect_filetype` to properly detect JSONs when the MIME type is `text/plain`.
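
One way to tell JSON apart from other content reported as `text/plain` is to attempt a parse. The helper below is a hypothetical sketch of that idea, not the check `detect_filetype` performs; restricting the match to objects and arrays is a choice made for this example.

```python
import json


def looks_like_json(text: str) -> bool:
    """Return True if the text parses as a JSON object or array."""
    stripped = text.lstrip()
    # Bare numbers and strings are valid JSON but are far more likely
    # to be ordinary plain text, so only objects and arrays count here.
    if not stripped or stripped[0] not in "{[":
        return False
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True
```

Checking the first character before parsing avoids a full parse attempt on most plain-text inputs.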
0.6.1
0.6.0
Enhancements
- Adds an `ssl_verify` kwarg to `partition` and `partition_html` to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
- Allows users to pass in ocr language to `partition_pdf` and `partition_image` through the `ocr_language` kwarg. `ocr_language` corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.
Features
- Table extraction is now possible for pdfs from `partition` and `partition_pdf`.
- Adds support for extracting attachments from `.msg` files