0.11.1
Enhancements
- Use `pikepdf` to repair invalid PDF structure for PDFminer when we see the error `PSSyntaxError` while PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
- Batch Source Connector support. For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added, which creates multiple ingest docs after reading them in batches per process.
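The repair-and-retry behaviour described in the first enhancement can be sketched roughly as follows. This is a minimal illustration, not the library's actual code: `parse_with_repair` and `repair_with_pikepdf` are hypothetical names, the parser is passed in as a callable, and the exception to retry on is a parameter (in practice it would be PDFminer's `PSSyntaxError`). It also assumes the parser fully consumes the file before returning, since the repaired copy is deleted afterwards.

```python
import os
import tempfile
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def repair_with_pikepdf(src: str, dst: str) -> None:
    """Rewrite the PDF with pikepdf; saving regenerates the file structure."""
    import pikepdf  # third-party; imported lazily so the sketch loads without it

    with pikepdf.open(src) as pdf:
        pdf.save(dst)


def parse_with_repair(
    path: str,
    parse: Callable[[str], T],
    repair: Callable[[str, str], None] = repair_with_pikepdf,
    retry_on: Type[Exception] = Exception,  # assumption: pass PSSyntaxError here
) -> T:
    """Try to parse; on a syntax error, repair into a temp file and retry once."""
    try:
        return parse(path)
    except retry_on:
        fd, repaired = tempfile.mkstemp(suffix=".pdf")
        os.close(fd)
        try:
            repair(path, repaired)
            return parse(repaired)
        finally:
            # The parser must have finished reading by now.
            os.remove(repaired)
```

Retrying only once keeps a file that is still unparseable after repair from looping; the second failure simply propagates to the caller.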
Features
- Staging Brick for Coco Format. Staging brick which converts a list of Elements into Coco Format.
- Adds HubSpot connector. Adds connector to retrieve calls, communications, emails, notes, products and tickets from HubSpot.
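To give a feel for what converting a list of elements into COCO format involves, here is a small sketch. It does not use the library's staging brick: the `Element` dataclass and `elements_to_coco` are hypothetical stand-ins, the bounding box is assumed to be given as corner coordinates, and the `text` key is an extra field beyond the standard COCO annotation keys. A full COCO `images` entry would also carry `width`, `height` and `file_name`, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Element:
    """Hypothetical stand-in for a document element with layout metadata."""
    category: str                            # e.g. "Title", "NarrativeText"
    text: str
    page: int                                # 1-based page number
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1) corners


def elements_to_coco(elements: List[Element]) -> Dict[str, list]:
    """Build a COCO-style dict with images, categories and annotations."""
    names = sorted({e.category for e in elements})
    cat_ids = {name: i + 1 for i, name in enumerate(names)}
    pages = sorted({e.page for e in elements})

    annotations = []
    for ann_id, e in enumerate(elements, start=1):
        x0, y0, x1, y1 = e.box
        width, height = x1 - x0, y1 - y0
        annotations.append({
            "id": ann_id,
            "image_id": e.page,              # one "image" per page
            "category_id": cat_ids[e.category],
            "bbox": [x0, y0, width, height], # COCO uses [x, y, width, height]
            "area": width * height,
            "iscrowd": 0,
            "text": e.text,                  # non-standard extra field
        })

    return {
        "images": [{"id": p} for p in pages],
        "categories": [{"id": cid, "name": name} for name, cid in cat_ids.items()],
        "annotations": annotations,
    }
```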
Fixes
- Do not extract text of `<style>` tags in HTML. `<style>` tags containing CSS in invalid positions previously contributed to element text. Do not consider the text node of a `<style>` element as textual content.
- Fix DOCX merged table cell repeats cell text. Only include text for a merged cell, not for each underlying cell spanned by the merge.
- Fix tables not extracted from DOCX header/footers. Header and footer extraction in DOCX documents previously skipped tables defined in the header or footer, which are commonly used for layout/alignment purposes. Extract text from tables as a string and include it in the `Header` and `Footer` document elements.
- Fix output filepath for fsspec-based source connectors. Previously the base directory was being included in the output filepath unnecessarily.
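The `<style>` fix above amounts to skipping the text node of any `<style>` element while collecting text. The sketch below shows that idea using only the standard library's `html.parser`; it is an illustration under that assumption, not the parser the library itself uses, and `TextWithoutStyle` and `visible_text` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class TextWithoutStyle(HTMLParser):
    """Collect text nodes, ignoring anything inside <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._style_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None:
        if tag == "style":
            self._style_depth += 1

    def handle_endtag(self, tag: str) -> None:
        if tag == "style" and self._style_depth:
            self._style_depth -= 1

    def handle_data(self, data: str) -> None:
        # CSS inside <style> arrives here as data; drop it.
        if self._style_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = TextWithoutStyle()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A <style> tag in an invalid position (inside a paragraph) no longer
# contributes its CSS to the extracted text.
print(visible_text("<p>Hello <style>p { color: red; }</style>world</p>"))
```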