0.11.1

pravin-unstructured released this 29 Nov 21:48

341f0f4

Enhancements

Use pikepdf to repair invalid PDF structure for PDFMiner. The repair is applied when a PSSyntaxError is raised while PDFMiner opens the document and creates its pages object, or while it processes a single PDF page.
Batch Source Connector support. For cases where it is more efficient to read content from a source connector in batches, a new batch ingest doc is added, which creates multiple ingest docs after reading them in batches per process.
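
The repair-and-retry flow described in the pikepdf enhancement above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `repair_pdf_bytes`, `parse_with_repair`, and the injected `parse` callable are hypothetical names, and the error types to catch are passed in by the caller.

```python
import io
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def repair_pdf_bytes(data: bytes) -> bytes:
    """Round-trip a PDF through pikepdf, which rewrites its structure on save."""
    import pikepdf  # third-party; imported lazily so the sketch loads without it

    with pikepdf.Pdf.open(io.BytesIO(data)) as pdf:
        repaired = io.BytesIO()
        pdf.save(repaired)
    return repaired.getvalue()


def parse_with_repair(
    data: bytes,
    parse: Callable[[bytes], T],
    error_types: Tuple[Type[BaseException], ...],
    repair: Callable[[bytes], bytes] = repair_pdf_bytes,
) -> T:
    """Try `parse`; on one of `error_types`, repair the bytes and retry once."""
    try:
        return parse(data)
    except error_types:
        # Hypothetical policy: a single retry on the repaired bytes.
        # A second failure propagates to the caller.
        return parse(repair(data))
```

With PDFMiner, `error_types` would presumably be a tuple containing its `PSSyntaxError`, and `parse` would be whatever function builds the pages object or processes a page.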

Features

Staging Brick for COCO Format. Adds a staging brick that converts a list of Elements into COCO format.
Adds HubSpot connector. Adds a connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot.
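
To give a sense of what converting elements into COCO format involves, here is a small standalone sketch. It does not use the staging brick itself: `elements_to_coco` and the input dict layout (`type`, `page_number`, `bbox` as corner coordinates) are assumptions made for illustration. Only the output keys (`images`, `annotations`, `categories`, and the `[x, y, width, height]` box layout) follow the COCO annotation convention.

```python
from typing import Any, Dict, List


def elements_to_coco(elements: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert hypothetical element dicts into a COCO-style dictionary.

    Each input dict is assumed to have "type", "page_number", and
    "bbox" given as (x_min, y_min, x_max, y_max).
    """
    # One category per distinct element type, with stable 1-based ids.
    type_names = sorted({el["type"] for el in elements})
    category_ids = {name: i for i, name in enumerate(type_names, start=1)}

    # One "image" entry per page that contains at least one element.
    pages = sorted({el["page_number"] for el in elements})

    annotations = []
    for ann_id, el in enumerate(elements, start=1):
        x_min, y_min, x_max, y_max = el["bbox"]
        width, height = x_max - x_min, y_max - y_min
        annotations.append(
            {
                "id": ann_id,
                "image_id": el["page_number"],
                "category_id": category_ids[el["type"]],
                # COCO boxes are [x, y, width, height], not corner pairs.
                "bbox": [x_min, y_min, width, height],
                "area": width * height,
                "iscrowd": 0,
            }
        )

    return {
        # Image width, height, and file name are omitted for brevity.
        "images": [{"id": page} for page in pages],
        "annotations": annotations,
        "categories": [{"id": cid, "name": name} for name, cid in category_ids.items()],
    }
```

Treating each page as one COCO image is a design choice of this sketch; the actual brick may map pages and coordinates differently.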

Fixes

Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. The text node of a <style> element is no longer treated as textual content.
Fix DOCX merged table cell repeating cell text. Text for a merged cell is now included once, not once for each underlying cell spanned by the merge.
Fix tables not extracted from DOCX headers/footers. Previously, tables defined in DOCX headers and footers, which are commonly used for layout/alignment purposes, were skipped. Text from these tables is now extracted as a string and included in the Header and Footer document elements.
Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.
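
The output filepath fix above amounts to computing each file's path relative to the connector's base directory before placing it under the output directory. A minimal sketch of that idea, assuming POSIX-style remote paths; `output_path_for` and the `.json` suffix are illustrative assumptions, not the connector's actual code:

```python
from pathlib import PurePosixPath


def output_path_for(remote_path: str, base_dir: str, output_dir: str) -> str:
    """Mirror `remote_path` under `output_dir`, dropping the `base_dir` prefix."""
    # relative_to raises ValueError if remote_path is not under base_dir.
    relative = PurePosixPath(remote_path).relative_to(PurePosixPath(base_dir))
    # Assumed naming: append ".json" so "report.pdf" becomes "report.pdf.json".
    return str(PurePosixPath(output_dir) / f"{relative}.json")
```

For example, `output_path_for("bucket/data/sub/report.pdf", "bucket/data", "out")` yields `out/sub/report.pdf.json`, whereas including the base directory would have produced a path nested under `out/bucket/data/`.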
