Skip to content

Navigation Menu

Appearance settings

Explore
By company size
By use case
By industry
View all solutions
Topics
- AI
- DevOps
- Security
- Software Development
- View all
Explore
- GitHub Sponsors
  Fund open source developers
- The ReadME Project
  GitHub community articles
Repositories
- Enterprise platform
  AI-powered developer platform
Available add-ons
Pricing

Search code, repositories, users, issues, pull requests...

Search

Clear

Search syntax tips

Provide feedback

We read every piece of feedback, and take your input very seriously.

Include my email address so I can be contacted

Saved searches

Use saved searches to filter your results more quickly

Name

Query

To see all available qualifiers, see our documentation.

Appearance settings

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session. You switched accounts on another tab or window. Reload to refresh your session.

Dismiss alert

Unstructured-IO / unstructured Public

Notifications You must be signed in to change notification settings
Fork 976
Star 11.8k

Code
Issues 167
Pull requests 51
Discussions
Actions
Projects 1
Security
Insights

Additional navigation options

Code
Issues
Pull requests
Discussions
Actions
Projects
Security
Insights

Releases: Unstructured-IO/unstructured

Releases · Unstructured-IO/unstructured

0.12.0

10 Jan 14:48

awalker4

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.12.0

Drop support for python3.8 All dependencies are now built off of the minimum version of python being 3.10

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.8

03 Jan 22:44

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.8

0.11.8

Enhancements

Add SaaS API User Guide. This documentation serves as a guide for Unstructured SaaS API users to register, receive an API key and URL, and manage your account and billing information.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.7

03 Jan 20:59

awalker4

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.7

Enhancements

Add intra-chunk overlap capability. Implement overlap for split-chunks where text-splitting is used to divide an oversized chunk into two or more chunks that fit in the chunking window. Note this capability is not yet available from the API but will shortly be made accessible using a new overlap kwarg on partition functions.
Update encoders to leverage dataclasses All encoders now follow a class approach which get annotated with the dataclass decorator. Similar to the connectors, it uses a nested dataclass for the configs required to configure a client as well as a field/property approach to cache the client. This makes sure any variable associated with the class exists as a dataclass field.

Features

Add Qdrant destination connector. Adds support for writing documents and embeddings into a Qdrant collection.
Store base64 encoded image data in metadata fields. Rather than saving to file, stores base64 encoded data of the image bytes and the mimetype for the image in metadata fields: image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.

Fixes

Fix table structure metric script Update the call to table agent to now provide OCR tokens as required
Fix element extraction not working when using "auto" strategy for pdf and image If element extraction is specified, the "auto" strategy falls back to the "hi_res" strategy.
Fix a bug passing a custom url to partition_via_api Users that self host the api were not able to pass their custom url to partition_via_api.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.6

20 Dec 21:37

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.6

0.11.6

Enhancements

Update the layout analysis script. The previous script only supported annotating final elements. The updated script also supports annotating inferred and extracted elements.
AWS Marketplace API documentation: Added the user guide, including setting up VPC and CloudFormation, to deploy Unstructured API on AWS platform.
Azure Marketplace API documentation: Improved the user guide to deploy Azure Marketplace API by adding references to Azure documentation.
Integration documentation: Updated URLs for the staging_for bricks

Features

Partition emails with base64-encoded text. Automatically handles and decodes base64 encoded text in emails with content type text/plain and text/html.
Add Chroma destination connector Chroma database connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned/embedded data to a Chroma vector database.
Add Elasticsearch destination connector. Problem: After ingesting data from a source, users might want to move their data into a destination. Elasticsearch is a popular storage solution for various functionality such as search, or providing intermediary caches within data pipelines. Feature: Added Elasticsearch destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Elasticsearch.

Fixes

Enable --fields argument omission for elasticsearch connector Solves two bugs where removing the optional parameter --fields broke the connector due to an integer processing error and using an elasticsearch config for a destination connector resulted in a serialization issue when optional parameter --fields was not provided.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.5

17 Dec 02:27

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.5

0.11.5

Enhancements

Features

Fixes

Fix partition_pdf() and partition_image() importation issue. Reorganize pdf.py and image.py modules to be consistent with other types of document import code.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.4

15 Dec 01:12

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.4

0.11.4

Enhancements

Refactor image extraction code. The image extraction code is moved from unstructured-inference to unstructured.
Refactor pdfminer code. The pdfminer code is moved from unstructured-inference to unstructured.
Improve handling of auth data for fsspec connectors. Leverage an extension of the dataclass paradigm to support a sensitive annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
Add glob support for fsspec connectors Similar to the glob support in the ingest local source connector, similar filters are now enabled on all fsspec based source connectors to limit files being partitioned.
Define a constant for the splitter "+" used in tesseract ocr languages.

Features

Save tables in PDF's separately as images. The "table" elements are saved as table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. The default would be to not do this.
Add Weaviate destination connector Weaviate connector added to ingest CLI. Users may now use unstructured-ingest to write partitioned data from over 20 data sources (so far) to a Weaviate object collection.
Sftp Source Connector. New source connector added to support downloading/partitioning files from Sftp.

Fixes

Fix pdf hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to the "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
Fix a bug where image can be scaled too large for tesseract Adds a limit to prevent auto-scaling an image beyond the maximum size tesseract can handle for ocr layout detection
Update partition_csv to handle different delimiters CSV files containing both non-comma delimiters and commas in the data were throwing an error in Pandas. partition_csv now identifies the correct delimiter before the file is processed.
partition returning cid code in hi_res occasionally pdfminer can fail to decode the text in an pdf file and return cid code as text. Now when this happens the text from OCR is used.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.2

30 Nov 04:40

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.2

0.11.2

Enhancements

Updated Documentation: (i) Added examples, and (ii) API Documentation, including Usage, SDKs, Azure Marketplace, and parameters and validation errors.

Features

Add Pinecone destination connector. Problem: After ingesting data from a source, users might want to produce embeddings for their data and write these into a vector DB. Pinecone is an option among these vector databases. Feature: Added Pinecone destination connector to be able to ingest documents from any supported source, embed them and write the embeddings / documents into Pinecone.

Fixes

Process chunking parameter names in ingest correctly Solves a bug where chunking parameters weren't being processed and used by ingest cli by renaming faulty parameter names and prepends; adds relevant parameters to ingest pinecone test to verify that the parameters are functional.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.1

29 Nov 21:48

pravin-unstructured

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.1

0.11.1

Enhancements

Use pikepdf to repair invalid PDF structure for PDFminer when we see error PSSyntaxError when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
Batch Source Connector support For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which created multiple ingest docs after reading them in in batches per process.

Features

Staging Brick for Coco Format Staging brick which converts a list of Elements into Coco Format.
Adds HubSpot connector Adds connector to retrieve call, communications, emails, notes, products and tickets from HubSpot

Fixes

Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
Fix tables not extracted from DOCX header/footers. Headers and footers in DOCX documents skip tables defined in the header and commonly used for layout/alignment purposes. Extract text from tables as a string and include in the Header and Footer document elements.
Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.11.0

20 Nov 19:07

yuming-long

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.11.0

0.11.0

Enhancements

Add a class for the strategy constants. Add a class PartitionStrategy for the strategy constants and use the constants to replace strategy strings.
Temporary Support for paddle language parameter. User can specify default langage code for paddle with ENV DEFAULT_PADDLE_LANG before we have the language mapping for paddle.
Improve DOCX page-break fidelity. Improve page-break fidelity such that a paragraph containing a page-break is split into two elements, one containing the text before the page-break and the other the text after. Emit the PageBreak element between these two and assign the correct page-number (n and n+1 respectively) to the two textual elements.

Features

Add ad-hoc fields to ElementMetadata instance. End-users can now add their own metadata fields simply by assigning to an element-metadata attribute-name of their choice, like element.metadata.coefficient = 0.58. These fields will round-trip through JSON and can be accessed with dotted notation.
MongoDB Destination Connector New destination connector added to all CLI ingest commands to support writing partitioned json output to mongodb.

Fixes

Fix TYPE_TO_TEXT_ELEMENT_MAP Updated Figure mapping from FigureCaption to Image.
Handle errors when extracting PDF text Certain pdfs throw unexpected errors when being opened by pdfminer, causing partition_pdf() to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling to handle unexpected errors when extracting pdf text and to help determine pdf strategy.
Fix fast strategy fall back to ocr_only The fast strategy should not fall back to a more expensive strategy.
Remove default user ./ssh folder The default notebook user during image build would create the known_hosts file with incorrect ownership, this is legacy and no longer needed so it was removed.
Include languages in metadata when partitioning strategy=hi_res or fast User defined languages was previously used for text detection, but not included in the resulting element metadata for some strategies. languages will now be included in the metadata regardless of partition strategy for pdfs and images.
Handle a case where Paddle returns a list item in ocr_data as None In partition, while parsing PaddleOCR data, it was assumed that PaddleOCR does not return None for any list item in ocr_data. Removed the assumption by skipping the text region whenever this happens.
Fix some pdfs returning KeyError: 'N' Certain pdfs were throwing this error when being opened by pdfminer. Added a wrapper function for pdfminer that allows these documents to be partitioned.
Fix mis-splits on Table chunks. Remedies repeated appearance of full .text_as_html on metadata of each TableChunk split from a Table element too large to fit in the chunking window.
Import tables_agent from inference so that we don't have to initialize a global table agent in unstructured OCR again
Fix empty table is identified as bulleted-table. A table with no text content was mistakenly identified as a bulleted-table and processed by the wrong branch of the initial HTML partitioner.
Fix partition_html() emits empty (no text) tables. A table with cells nested below a <thead> or <tfoot> element was emitted as a table element having no text and unparseable HTML in element.metadata.text_as_html. Do not emit empty tables to the element stream.
Fix HTML element.metadata.text_as_html contains spurious
elements in invalid locations. The HTML generated for the text_as_html metadata for HTML tables contained <br> elements invalid locations like between <table> and <tr>. Change the HTML generator such that these do not appear.
Fix HTML table cells enclosed in and elements are dropped. HTML table cells nested in a <thead> or <tfoot> element were not detected and the text in those cells was omitted from the table element text and .text_as_html. Detect table rows regardless of the semantic tag they may be nested in.
Remove whitespace padding from .text_as_html. tabulate inserts padding spaces to achieve visual alignment of columns in HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as newlines ("\n") used for human readability.
Fix local connector with absolute input path When passed an absolute filepath for the input document path, the local connector incorrectly writes the output file to the input file directory. This fixes such that the output in this case is written to output-dir/input-filename.json

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

All reactions

0.10.30

10 Nov 19:36

cragwolfe

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.

GPG key ID: 4AEE18F83AFDEB23

Expired

Verified

Learn about vigilant mode.

Compare

Choose a tag to compare

Loading

0.10.30

0.10.30

Enhancements

Support nested DOCX tables. In DOCX, like HTML, a table cell can itself contain a table. In this case, create nested HTML tables to reflect that structure and create a plain-text table with captures all the text in nested tables, formatting it as a reasonable facsimile of a table.
Add connection check to ingest connectors Each source and destination connector now support a check_connection() method which makes sure a valid connection can be established with the source/destination given any authentication credentials in a lightweight request.

Features

Add functionality to do a second OCR on cropped table images. Changes to the values for scaling ENVs affect entire page OCR output(OCR regression) so we now do a second OCR for tables.
Adds ability to pass timeout for a request when partitioning via a url. partition now accepts a new optional parameter request_timeout which if set will prevent any requests.get from hanging indefinitely and instead will raise a timeout error. This is useful when partitioning a url that may be slow to respond or may not respond at all.

Fixes

Fix logic that determines pdf auto strategy. Previously, _determine_pdf_auto_strategy returned hi_res strategy only if infer_table_structure was true. It now returns the hi_res strategy if either infer_table_structure or extract_images_in_pdf is true.
Fix invalid coordinates when parsing tesseract ocr data. Previously, when parsing tesseract ocr data, the ocr data had invalid bboxes if zoom was set to 0. A logical check is now added to avoid such error.
Fix ingest partition parameters not being passed to the api. When using the --partition-by-api flag via unstructured-ingest, none of the partition arguments are forwarded, meaning that these options are disregarded. With this change, we now pass through all of the relevant partition arguments to the api. This allows a user to specify all of the same partition arguments they would locally and have them respected when specifying --partition-by-api.
Support tables in section-less DOCX. Generalize solution for MS Chat Transcripts exported as DOCX by including tables in the partitioned output when present.
Support tables that contain only numbers when partitioning via ocr_only Tables that contain only numbers are returned as floats in a pandas.DataFrame when the image is converted from .image_to_data(). An AttributeError was raised downstream when trying to .strip() the floats.
Improve DOCX page-break detection. DOCX page breaks are reliably indicated by w:lastRenderedPageBreak elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page-breaks inserted by the author and when present are redundant to a w:lastRenderedPageBreak element so cause over-counting if used. Use rendered page-breaks only.

Assets 2

Loading

Uh oh!

There was an error while loading. Please reload this page.

danielbichuetti reacted with hooray emoji

danielbichuetti reacted with heart emoji

danielbichuetti reacted with rocket emoji

All reactions

🎉 1 reaction
❤️ 1 reaction
🚀 1 reaction

1 person reacted

Previous 1 2 … 6 7 8 9 10 … 18 19 Next

Footer

© 2025 GitHub, Inc.

Footer navigation

Terms
Privacy
Security
Status
Docs
Contact

You can’t perform that action at this time.