Unstructured-IO / unstructured Public

Releases: Unstructured-IO/unstructured

0.7.0

31 May 20:13

MthwRobinson

Enhancements

Installing detectron2 from source is no longer required when using the local-inference extra.
Updates .pptx parsing to include text in tables.

Features

Fixes

Fixes an issue in _add_element_metadata that caused all elements to have page_number=1 in the element metadata.
Adds .log as a file extension for TXT files.
Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
Allows a passed encoding to be used in replace_mime_encodings.
Fixes page metadata for partition_html when include_metadata=False.
Raises a ValueError if file_filename is not specified when you use partition_via_api with a file-like object.

0.6.11

30 May 13:47

yuming-long

Enhancements

Enables the EPUB tests now that pandoc has been updated in the base image.

Features

Fixes

0.6.10

26 May 08:57

cragwolfe

Enhancements

Adds XLS support to auto partition.

Features

Fixes

0.6.9

24 May 22:31

qued

Enhancements

The fast strategy for PDFs now keeps element bounding box data.
Refactors setup.py.

Features

Fixes

Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
Adds additional MIME types for CSV
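
The encoding fallback described in the first fix can be sketched with the standard library alone. This is a minimal illustration, not the library's code: the function name `decode_with_fallback` and the list of candidate encodings are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the encodings the library actually tries may differ.
# latin-1 comes last because it can decode any byte sequence.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode bytes and return (text, encoding_used)."""
    if encoding is not None:
        # The caller chose an encoding, so respect it and let errors propagate.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeError:
            continue
    raise UnicodeError("none of the candidate encodings could decode the input")
```

The fallback only runs when no encoding is passed, which mirrors the condition in the release note: an explicit choice by the user is never silently overridden.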

0.6.8

19 May 19:58

MthwRobinson

Enhancements

Features

Add partition_csv for CSV files.

Fixes

0.6.7

19 May 17:31

MthwRobinson

Enhancements

Deprecate --s3-url in favor of --remote-url in the CLI.
Refactor out non-connector-specific config variables.
Add file_directory to metadata.
Add page_name to metadata. Currently used for the sheet name in XLSX documents.
Added a --partition-strategy parameter to unstructured-ingest so that users can specify the partition strategy in the CLI. For example, --partition-strategy fast.
Added metadata for filetype.
Add Discord connector to pull messages from a list of channels.
Refactor unstructured/file-utils/filetype.py to make better use of a hashmap to return the MIME type.
Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
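
As a rough illustration of the hashmap idea in the filetype refactor, a dictionary lookup can replace a chain of conditionals. The table contents and the names `EXTENSION_TO_MIME_TYPE` and `mime_type_for` below are hypothetical and are not taken from the library.

```python
import os
from typing import Optional

# Hypothetical, abbreviated table; the real module covers many more types.
EXTENSION_TO_MIME_TYPE = {
    ".csv": "text/csv",
    ".html": "text/html",
    ".json": "application/json",
    ".txt": "text/plain",
    ".xml": "application/xml",
}


def mime_type_for(filename: str) -> Optional[str]:
    """Return the MIME type for a filename, or None if the extension is unknown."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MIME_TYPE.get(extension.lower())
```

Adding a new file type then means adding one dictionary entry instead of another branch.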

Features

Add partition_xml for XML files.
Add partition_xlsx for Microsoft Excel documents.

Fixes

Supports hml filetype for partition as a variation of html filetype.
Makes pytesseract a function level import in partition_pdf so you can use the "fast" or "hi_res" strategies if pytesseract is not installed. Also adds the required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
Fix to ensure filename is tracked in metadata for docx tables.
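
The combination of a function-level import and a dependency-checking decorator might look roughly like the sketch below. The decorator name `requires_dependencies_sketch` and the decorated function are hypothetical; only `importlib.util.find_spec` and `functools.wraps` are real standard-library calls, and the library's own decorator may behave differently.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires_dependencies_sketch(dependencies: Sequence[str]) -> Callable:
    """Raise a clear ImportError at call time if a top-level module is missing."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None for a top-level module that is not installed.
            missing = [name for name in dependencies if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(
                    f"{func.__name__} needs these packages: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])
def dump_elements(elements):
    # Function-level import: the module is loaded only when this function runs.
    import json

    return json.dumps(elements)
```

Because the check and the import both happen at call time, importing the enclosing module succeeds even when an optional package is absent. Code paths that never call the decorated function are unaffected.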

0.6.6

12 May 17:47

MthwRobinson

Enhancements

Adds an "auto" strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for partition_pdf and partition_image. Users can maintain existing behavior by explicitly setting strategy="hi_res".
Added an additional trace logger for NLP debugging.
Add get_date method to ElementMetadata for converting the datestring to a datetime object.
Clean up the filename attribute on ElementMetadata to remove the full filepath.
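
The two ElementMetadata changes can be pictured with a small stand-in class. `MetadataSketch` is hypothetical and carries only two fields, and the sketch assumes the stored date string is in ISO 8601 format. It is not the library's class.

```python
import os
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for ElementMetadata with only two fields."""

    filename: Optional[str] = None
    date: Optional[str] = None

    def __post_init__(self) -> None:
        # Keep only the final path component, dropping any directories.
        if self.filename is not None:
            self.filename = os.path.basename(self.filename)

    def get_date(self) -> Optional[datetime]:
        """Parse the stored ISO 8601 date string, if there is one."""
        if self.date is None:
            return None
        return datetime.fromisoformat(self.date)
```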

Features

Added table reading as HTML, with URL parsing, to partition_docx.
Added a text_as_html metadata field for docx files.

Fixes

The fileutils/file_type check now ignores decode errors for JSON and EML files.
partition_email was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata is None if the time does not match RFC-2822 at all.
Include all metadata fields when converting to dataframe or CSV.
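
The "None if the time does not match RFC-2822" behavior can be approximated with the standard library's email utilities. The helper name `rfc2822_to_iso` is an assumption for this sketch, and the library's own parsing may be more lenient than `parsedate_to_datetime`.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def rfc2822_to_iso(date_header: str) -> Optional[str]:
    """Convert an RFC 2822 date header to ISO 8601, or None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(date_header).isoformat()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError here, newer ones raise ValueError.
        return None
```

Returning None instead of raising lets a malformed Date header degrade to missing metadata without failing the whole partitioning call.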

0.6.5

10 May 04:40

Enhancements

Added support for SpooledTemporaryFile file argument.
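
One way a function could accept a SpooledTemporaryFile alongside ordinary binary file objects is to copy its contents into an in-memory buffer first. The helper name `as_bytes_io` is hypothetical and this is only a sketch of the general approach, not the library's implementation.

```python
import io
import tempfile
from typing import IO


def as_bytes_io(file: IO[bytes]) -> IO[bytes]:
    """Return a BytesIO copy of a SpooledTemporaryFile; pass other files through."""
    if isinstance(file, tempfile.SpooledTemporaryFile):
        # Rewind first so the copy starts at the beginning of the data.
        file.seek(0)
        return io.BytesIO(file.read())
    return file
```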

Features

Fixes

0.6.4

08 May 17:57

MthwRobinson

Enhancements

Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision logic into its own module.

Features

Fixes

0.6.3

04 May 20:25

MthwRobinson

Enhancements

Add an "ocr_only" strategy for partition_image.

Features

Added partition_multiple_via_api for partitioning multiple documents in a single REST API call.
Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
Added partition_odt for processing Open Office documents.

Fixes

Updates the grouping logic in the partition_pdf fast strategy to group together text in the same bounding box.
