Skip to content

0.11.1

Compare
Choose a tag to compare
@pravin-unstructured pravin-unstructured released this 29 Nov 21:48
· 670 commits to main since this release
341f0f4

0.11.1

Enhancements

  • Use pikepdf to repair invalid PDF structure for PDFminer when we see error PSSyntaxError when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.

  • Batch Source Connector support For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which created multiple ingest docs after reading them in in batches per process.

Features

  • Staging Brick for Coco Format Staging brick which converts a list of Elements into Coco Format.
  • Adds HubSpot connector Adds connector to retrieve call, communications, emails, notes, products and tickets from HubSpot

Fixes

  • Do not extract text of <style> tags in HTML. <style> tags containing CSS in invalid positions previously contributed to element text. Do not consider text node of a <style> element as textual content.
  • Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
  • Fix tables not extracted from DOCX header/footers. Headers and footers in DOCX documents skip tables defined in the header and commonly used for layout/alignment purposes. Extract text from tables as a string and include in the Header and Footer document elements.
  • Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.