Releases: Unstructured-IO/unstructured
0.5.2
Enhancements
- unstructured-ingest now uses a default --download_dir of $HOME/.cache/unstructured/ingest rather than a "tmp-ingest-" dir in the working directory.
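The new default location can be illustrated with a short sketch. This is not the library's code: default_download_dir is a hypothetical helper, and the only detail taken from the note above is the path $HOME/.cache/unstructured/ingest.

```python
from pathlib import Path


def default_download_dir(home=None):
    """Hypothetical helper: build the default ingest cache path under a home directory."""
    # Fall back to the current user's home directory when none is given.
    base = Path(home) if home is not None else Path.home()
    return base / ".cache" / "unstructured" / "ingest"


if __name__ == "__main__":
    print(default_download_dir())
```

Keeping downloads under the home directory cache rather than the working directory means repeated runs started from different directories can, in principle, share the same location.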
Features
Fixes
- setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command
- unstructured-ingest no longer re-downloads files when --preserve-downloads is used without --download-dir.
- Fixed an issue that was causing text to be skipped in some HTML documents.
0.5.1
Enhancements
Features
Fixes
- Fixes an error that sometimes caused JavaScript to appear in the output of partition_html.
- Fix several issues with the requires_dependencies decorator, including the error message and how it was used, which had caused an error for unstructured-ingest --github-url ...
0.5.0
Enhancements
- Add requires_dependencies Python decorator to check dependencies are installed before instantiating a class or running a function
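A decorator of this kind can be sketched as follows. This is a minimal illustration of the idea, not the library's implementation: the signature, the error message, and the use of importlib.util.find_spec are all assumptions made for the example.

```python
import functools
import importlib.util


def requires_dependencies(dependencies):
    """Illustrative decorator: raise ImportError early if a named module is missing.

    Assumption: `dependencies` is a top-level module name or a list of them.
    """
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(
                    f"Missing required dependencies: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")
def load_config(raw):
    import json

    return json.loads(raw)
```

The check runs at call time rather than at import time, so a module that defines optional connectors can still be imported when their dependencies are absent. To guard instantiation in the same way, the sketch could be applied to a class's __init__ method.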
Features
- Added Wikipedia connector for ingest cli.
Fixes
- Fix process_document file cleaning on failure
- Fixes an error introduced in the metadata tracking commit that caused NarrativeText and FigureCaption elements to be represented as Text in HTML documents.
0.4.16
Enhancements
- Fallback to using file extensions for filetype detection if libmagic is not present
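The fallback pattern can be sketched roughly as below. This is an assumption-laden illustration, not the library's detection code: it assumes the python-magic bindings (whose import fails when libmagic is unavailable) and uses the standard-library mimetypes module for the extension lookup. The names guess_from_extension and detect_mime_type are hypothetical.

```python
import mimetypes

try:
    import magic  # python-magic bindings; the import fails without libmagic

    LIBMAGIC_AVAILABLE = True
except ImportError:
    LIBMAGIC_AVAILABLE = False


def guess_from_extension(filename):
    """Hypothetical fallback: map the file extension to a MIME type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def detect_mime_type(filename):
    """Prefer content-based detection; fall back to the extension otherwise."""
    if LIBMAGIC_AVAILABLE:
        # Inspects the file's bytes, so the file must exist on disk.
        return magic.from_file(filename, mime=True)
    return guess_from_extension(filename)
```

Extension-based detection is less reliable than inspecting file contents, since a mislabeled file will be misclassified, but it lets partitioning proceed on systems where libmagic is not installed.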
Features
- Added setup script for Ubuntu
- Added GitHub connector for ingest cli.
- Added partition_md partitioner.
- Added Reddit connector for ingest cli.
Fixes
- Initializes connector properly in ingest.main::MainProcess
- Restricts version of unstructured-inference to avoid multithreading issue
0.4.15
Enhancements
- Added elements_to_json and elements_from_json for easier serialization/deserialization
- convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions that use the ISD terminology.
Fixes
- Update to ensure all elements are preserved during serialization/deserialization
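The round-trip property described in this release can be illustrated with a self-contained sketch. The classes and the to_json / from_json functions below are hypothetical stand-ins written for this example; they are not the library's element classes or its elements_to_json / elements_from_json functions, and the dictionary layout is an assumption.

```python
import json
from dataclasses import dataclass


@dataclass
class Text:
    """Stand-in element type holding a piece of text."""

    text: str


@dataclass
class Title(Text):
    pass


@dataclass
class NarrativeText(Text):
    pass


# Lookup table used to rebuild the right class from its serialized name.
TYPE_TO_CLASS = {cls.__name__: cls for cls in (Text, Title, NarrativeText)}


def to_json(elements):
    """Serialize elements to a JSON string, recording each element's type."""
    return json.dumps(
        [{"type": type(e).__name__, "text": e.text} for e in elements], indent=2
    )


def from_json(payload):
    """Rebuild elements from JSON, restoring the original element types."""
    return [TYPE_TO_CLASS[item["type"]](text=item["text"]) for item in json.loads(payload)]
```

Recording the type name alongside the text is what allows every element to come back as the same kind of object it started as, rather than collapsing to a generic text element.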
0.4.14
- Automatically install nltk models in the tokenize module.
0.4.13
0.4.12
0.4.11
- Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
- Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.
0.4.10
- Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.
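The underlying problem is that Python's json module cannot encode pathlib.Path objects directly. One common way to handle this, sketched below, is to convert the path to a string before serializing. The metadata_to_dict helper and its fields are hypothetical and only approximate the idea; this is not the ElementMetadata implementation.

```python
import json
from pathlib import Path


def metadata_to_dict(filename, page_number=None):
    """Hypothetical helper: build a JSON-safe metadata dictionary.

    Path objects are converted to plain strings so json.dumps accepts them.
    """
    return {
        "filename": str(filename) if filename is not None else None,
        "page_number": page_number,
    }


if __name__ == "__main__":
    print(json.dumps(metadata_to_dict(Path("docs") / "report.pdf", page_number=1)))
```

Passing a dictionary that still contains a raw Path to json.dumps raises a TypeError, which is the kind of failure the fix above addresses.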