Releases: Unstructured-IO/unstructured
0.10.4
0.10.2
0.10.1
Enhancements
- Bump unstructured-inference==0.5.12:
  - fix to avoid trace for certain PDFs (0.5.12)
  - better defaults for DPI for hi_res and Chipper (0.5.11)
  - implement full-page OCR (0.5.10)
Features
Fixes
- Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)
- Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)
0.10.0
0.9.3
Enhancements
- Pinned dependency cleanup.
- Update partition_csv to always use soupparser_fromstring to parse html text
- Update partition_tsv to always use soupparser_fromstring to parse html text
- Add metadata.section to capture epub table of contents data
- Add unique_element_ids kwarg to partition functions. If True, will use a UUID for element IDs instead of a SHA-256 hash.
- Update partition_xlsx to always use soupparser_fromstring to parse html text
- Add functionality to switch html text parser based on whether the html text contains emoji
- Add functionality to check if a string contains any emoji characters
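The emoji check and parser switch described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the library's own code: the names `contains_emoji` and `choose_html_parser`, the parser labels, and the code-point ranges are all assumptions made for the example, and the ranges are far from exhaustive.

```python
import re

# Hypothetical pattern covering a few common emoji blocks (not exhaustive).
_EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "]"
)


def contains_emoji(text: str) -> bool:
    """Return True if any character in `text` falls in the ranges above."""
    return _EMOJI_PATTERN.search(text) is not None


def choose_html_parser(html_text: str) -> str:
    """Pick a parser label based on emoji presence.

    The labels are placeholders; which real parser handles which case
    is not stated in the release notes.
    """
    return "emoji_safe_parser" if contains_emoji(html_text) else "default_parser"
```

A caller would run the check once per document and dispatch on the result, so the emoji handling stays out of the individual partition functions.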
Features
- Add Airtable Connector to be able to pull views/tables/bases from an Airtable organization
Fixes
- make notion module discoverable
- fix emails with Content-Disposition: inline and Content-Disposition: attachment with no filename
- Fix email attachment filenames which had = in the filename itself
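The two email cases above (a disposition header with no filename, and a filename that itself contains =) can be reproduced with the standard library's email package. This is only a sketch of the input shapes involved, not the library's partitioning code; the sample message and the helper name `list_attachments` are made up for illustration.

```python
from email import message_from_string

# Hypothetical message with three parts: inline, an attachment without a
# filename, and an attachment whose filename contains an equals sign.
RAW_EMAIL = """\
From: sender@example.com
To: recipient@example.com
Subject: attachments
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain
Content-Disposition: inline

Body text.
--BOUNDARY
Content-Type: text/plain
Content-Disposition: attachment

Attachment without a filename.
--BOUNDARY
Content-Type: text/plain
Content-Disposition: attachment; filename="report=final.txt"

Attachment whose filename contains an equals sign.
--BOUNDARY--
"""


def list_attachments(raw: str):
    """Return (disposition, filename) for every non-multipart part."""
    message = message_from_string(raw)
    results = []
    for part in message.walk():
        if part.is_multipart():
            continue  # skip the container itself
        # get_filename() returns None when no filename parameter is present,
        # so callers must not assume a string here.
        results.append((part.get_content_disposition(), part.get_filename()))
    return results
```

Code that splits the header on every = rather than only the first one would truncate the third filename, and code that assumes a filename is always present would fail on the first two parts.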
0.9.2
Enhancements
- Update table extraction section in API documentation to sync with change in Prod API
- Update Notion connector to extract to html
- Bump unstructured-inference==0.5.9:
  - better caching of models
  - another version of detectron2 available, though the default layout model is unchanged
- Added UUID option for element_id
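The practical difference between a hash-based element ID and a UUID can be seen in a small standard-library sketch. The helper names below are hypothetical, and the exact input the library hashes is an assumption; here only the element text is hashed.

```python
import hashlib
import uuid


def hashed_element_id(text: str) -> str:
    """Deterministic ID: identical text always yields the same SHA-256 digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def unique_element_id() -> str:
    """Random ID: each call yields a fresh UUID regardless of content."""
    return str(uuid.uuid4())


repeated = ["Page 1", "Page 1"]
hash_ids = [hashed_element_id(t) for t in repeated]
uuid_ids = [unique_element_id() for _ in repeated]

# Identical text collides under hashing but not under UUIDs.
same_hash = hash_ids[0] == hash_ids[1]
same_uuid = uuid_ids[0] == uuid_ids[1]
```

Hash IDs are reproducible across runs, which helps deduplication. UUIDs keep repeated text (such as running headers) distinguishable.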
Features
- Adds Sharepoint connector.
Fixes
- Bump unstructured-inference==0.5.9:
  - ignores Tesseract errors where no text is extracted for tiles that indeed have no text
0.9.1
Enhancements
- Adds --partition-pdf-infer-table-structure to unstructured-ingest.
- Enable partition_html to skip headers and footers with the skip_headers_and_footers flag.
- Update partition_doc and partition_docx to track emphasized texts in the output
- Adds post processing function filter_element_types
- Set the default strategy for partitioning images to hi_res
- Add page break parameter section in API documentation to sync with change in Prod API
- Update partition_html to track emphasized texts in the output
- Update XMLDocument._read_xml to create <p> tag element for the text enclosed in the <pre> tag
- Add parameter include_tail_text to _construct_text to enable (skip) tail text inclusion
- Add Notion connector
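A post-processing filter over element types, in the spirit of filter_element_types above, might look like the sketch below. The classes are stand-ins defined here so the example runs on its own, and the function name `filter_by_type` and its `include`/`exclude` parameters are assumptions rather than the library's actual signature.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Type


@dataclass
class Element:
    """Stand-in for a partitioned element; not the library's class."""
    text: str


class Title(Element):
    pass


class NarrativeText(Element):
    pass


class Footer(Element):
    pass


def filter_by_type(
    elements: Sequence[Element],
    include: Optional[Sequence[Type[Element]]] = None,
    exclude: Optional[Sequence[Type[Element]]] = None,
) -> List[Element]:
    """Keep only `include` types, or drop `exclude` types (not both)."""
    if include and exclude:
        raise ValueError("Pass either include or exclude, not both.")
    if include:
        return [e for e in elements if type(e) in include]
    if exclude:
        return [e for e in elements if type(e) not in exclude]
    return list(elements)


elements = [Title("Intro"), NarrativeText("Some prose."), Footer("Page 1")]
without_footers = filter_by_type(elements, exclude=[Footer])
titles_only = filter_by_type(elements, include=[Title])
```

Matching on the exact type keeps the behaviour predictable. A variant based on isinstance would also match subclasses, which may or may not be wanted.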
Features
Fixes
- Remove unused _partition_via_api function
- Fixed emoji bug in partition_xlsx.
- Pass file_filename metadata when partitioning file object
- Skip ingest test on missing Slack token
- Add Dropbox variables to CI environments
- Remove default encoding for ingest
- Adds new element type EmailAddress for recognizing email address in the text
- Simplifies min_partition logic; makes partitions falling below the min_partition less likely.
- Fix bug where ingest test check for number of files fails in smoke test
- Fix unstructured-ingest entrypoint failure
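Recognizing email addresses in text, as the EmailAddress element type above is meant to do, is commonly done with a regular expression. The pattern and the function name `find_email_addresses` below are illustrative assumptions. The pattern is deliberately simple, does not cover every valid address, and is not taken from the library.

```python
import re

# Simplified, hypothetical pattern: local part, "@", domain, dot, TLD.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_email_addresses(text: str) -> list:
    """Return all substrings of `text` that look like email addresses."""
    return EMAIL_PATTERN.findall(text)


found = find_email_addresses(
    "Contact jane.doe@example.com or support@example.org."
)
```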
0.9.0
Enhancements
- Dependencies are now split by document type, creating a slimmer base installation.
0.8.8
0.8.7
Enhancements
- Put back useful function split_by_paragraph
Features
Fixes
- Fix argument order in NLTK download step