Releases: Unstructured-IO/unstructured

0.8.6

28 Jul 06:47

cragwolfe

Enhancements

Features

Fixes

Remove debug print lines and non-functional code

0.8.5

27 Jul 18:34

yuming-long

Enhancements

Add parameter skip_infer_table_types to enable (skip) table extraction for other doc types
Adds optional Unstructured API unit tests in CI
Tracks last modified date for all document types.

Features

Fixes

NLTK now only gets downloaded if necessary.
Handling for empty tables in Word Documents and PowerPoints.

0.8.4

26 Jul 18:09

Klaijan

Enhancements

Additional tests and refactor of JSON detection.
Update functionality to retrieve image metadata from a page for document_to_element_list
Links are now tracked in partition_html output.
Set the file's current position to the beginning after reading the file in convert_to_bytes
Add min_partition kwarg that combines elements below a specified threshold, and modify splitting of strings longer than max_partition so words are not split.
Add slide notes to pptx
Add --encoding directive to ingest
Improve json detection by detect_filetype
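
The min_partition and max_partition behavior described above can be illustrated with a minimal sketch. This is not the library's implementation; the helper names split_on_words and combine_short are hypothetical, and the sketch assumes sizes are measured in characters and that short chunks are merged into the chunk that follows them. It also does not re-check the upper limit after merging.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters without cutting words."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # The word fits, or it is a single over-long word that is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_short(chunks: List[str], min_partition: int) -> List[str]:
    """Merge chunks until each merged piece reaches min_partition characters."""
    combined: List[str] = []
    pending = ""
    for chunk in chunks:
        pending = f"{pending} {chunk}" if pending else chunk
        if len(pending) >= min_partition:
            combined.append(pending)
            pending = ""
    if pending:
        # A trailing remainder below the threshold is kept as its own piece.
        combined.append(pending)
    return combined
```

For example, split_on_words("the quick brown fox jumps", 10) breaks only at spaces, and combine_short(["a", "bb", "cccccc", "d"], 3) joins the first two pieces because each is shorter than the threshold.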

Features

Adds Outlook connector
Add support for dpi parameter in inference library
Adds Onedrive connector.
Add Confluence connector for ingest cli to pull the body text from all documents from all spaces in a confluence domain.

Fixes

Fixes issue with email partitioning where From field was being assigned the To field value.
Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
Adds .txt, .text, and .tab to list of extensions to check if file has a text/plain MIME type.
Enables filters to be passed to partition_doc so it doesn't error with LibreOffice7.
Removed old error message that's superseded by requires_dependencies.
Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

0.8.1

11 Jul 14:35

rbiseck3

Enhancements

Add support for Python 3.11

Features

Fixes

Fixed auto strategy detecting a scanned document as having extractable text and using the fast strategy, resulting in no output.
Fix list detection in MS Word documents.
Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

0.8.0

07 Jul 15:41

rbiseck3

Enhancements

Allow the model used for the hi_res PDF partition strategy to be chosen when called.
Updated inference package

Features

Add metadata_filename parameter across all partition functions

Fixes

Adjust encoding recognition threshold value in detect_file_encoding
Fix KeyError when isd_to_elements doesn't find a type
Fix _output_filename for local connector, allowing single files to be written correctly to the disk
Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
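
For code that reads location data, the breaking change above means looking under the element's metadata rather than on the element itself. The following is a minimal sketch using stand-in objects instead of real library elements; the helper get_coordinates and the tuple of points are hypothetical, and the actual shape of the coordinates value is an assumption not stated in these notes.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_coordinates(element: Any) -> Optional[Any]:
    """Return element.metadata.coordinates, or None if either attribute is missing."""
    metadata = getattr(element, "metadata", None)
    return getattr(metadata, "coordinates", None)


# Stand-in shaped like an element after this change (hypothetical values).
element = SimpleNamespace(
    text="Hello",
    metadata=SimpleNamespace(coordinates=((0, 0), (0, 10), (10, 10), (10, 0))),
)

# Stand-in for an element that carries no location data at all.
element_without_location = SimpleNamespace(text="Hello")
```

Reading through a helper like this keeps callers from raising AttributeError on elements that have no location data.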

0.7.12

01 Jul 02:32

tabossert

Enhancements

Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

Adds Dropbox connector

Fixes

Fix tests that call unstructured-api by passing through an api-key
Fixed page breaks being given (incorrect) page numbers
Fix skipping download on ingest when a source document exists locally

0.7.11

30 Jun 01:42

cragwolfe

Enhancements

More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
partition_email and partition_msg will now process attachments if process_attachments=True and an attachment partitioning function is passed through with attachment_partitioner=partition.

Features

Fixes

Fix tests that call unstructured-api by passing through an api-key
Fixed page breaks being given (incorrect) page numbers
Fix skipping download on ingest when a source document exists locally

0.7.10

28 Jun 19:27

MthwRobinson

Enhancements

Adds a max_partition parameter to partition_text, partition_pdf, partition_email, partition_msg and partition_xml that sets a limit for the size of individual document elements. Defaults to 1500 for everything except partition_xml, which has a default value of None.
DRY connector refactor

Features

hi_res model for pdfs and images is selectable via environment variable.

Fixes

CSV check now ignores escaped commas.
Fix for filetype exploration util when file content does not have a comma.
Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like ------- as list items.
Fix pre tag parsing for partition_html
Fix lookup error for annotated Arabic and Hebrew encodings
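
The negative lookahead fix for the bullet pattern can be sketched as follows. The character set and the names BULLET_CHARS, BULLET_RE, and is_bullet_line are hypothetical simplifications for illustration, not the library's actual pattern.

```python
import re

# Hypothetical, simplified set of bullet characters; the real list is likely longer.
BULLET_CHARS = r"[-*\u2022]"

# A leading bullet counts only if it is NOT immediately followed by another
# bullet character, so a run like "-------" is not treated as a list item.
BULLET_RE = re.compile(rf"^\s*{BULLET_CHARS}(?!{BULLET_CHARS})")


def is_bullet_line(line: str) -> bool:
    """Return True if the line starts with a single bullet character."""
    return BULLET_RE.match(line) is not None
```

With this pattern, is_bullet_line("- item") matches while is_bullet_line("-------") does not, because the lookahead rejects a bullet character that is followed by another one.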

0.7.9

26 Jun 21:54

cragwolfe

Enhancements

Improvements to string check for leaf nodes in partition_xml.
Adds --partition-ocr-languages to unstructured-ingest.

Features

Adds partition_org for processing Org Mode documents.

Fixes

0.7.8

23 Jun 02:23

cragwolfe

Enhancements

Features

Adds Google Cloud Service connector

Fixes

Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
partition_email now works if there is no message content
Updates the "fast" strategy for partition_pdf so that it's able to recursively
Adds recursive functionality to all fsspec connectors
Adds generic --recursive ingest flag
