0.10.16

@amanda103 amanda103 released this 20 Sep 02:30
e359afa

Enhancements

  • Adds data source properties to the Airtable, Confluence, Discord, Elasticsearch, Google Drive, and Wikipedia connectors. These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping each element to information about the source document it derives from. This lets downstream applications point back to the source document, e.g. with a link to a GDrive doc, Salesforce record, etc.
  • DOCX partitioner refactored in preparation for enhancement. Behavior should be unchanged except in multi-section documents containing different headers/footers for different sections. These will now emit all distinct headers and footers encountered instead of just those for the last section.
  • Adds a function to map between Tesseract and standard language codes. This allows users to pass language information to the languages param as any Tesseract-supported langcode or any ISO 639 standard language code.
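
The language-code mapping described in the last item can be pictured with a small lookup. This is only a sketch under stated assumptions: the table ISO_TO_TESSERACT and the function to_tesseract_codes are hypothetical names for illustration, not the function or data shipped in this release, and the table covers just a handful of languages.

```python
# Hypothetical lookup table for illustration only: a few ISO 639 codes
# (two-letter and three-letter forms) mapped to Tesseract langcodes.
# Tesseract langcodes are included as keys so they pass through unchanged.
ISO_TO_TESSERACT = {
    "en": "eng", "eng": "eng",
    "de": "deu", "deu": "deu", "ger": "deu",
    "fr": "fra", "fra": "fra", "fre": "fra",
    "es": "spa", "spa": "spa",
}


def to_tesseract_codes(languages):
    """Convert a list of ISO 639 or Tesseract codes to Tesseract langcodes.

    Duplicates are dropped while preserving first-seen order.
    """
    converted = []
    for lang in languages:
        key = lang.strip().lower()
        if key not in ISO_TO_TESSERACT:
            raise ValueError(f"Unrecognized language code: {lang!r}")
        code = ISO_TO_TESSERACT[key]
        if code not in converted:
            converted.append(code)
    return converted


# Tesseract accepts several languages joined with "+".
print("+".join(to_tesseract_codes(["en", "ger", "fra"])))  # prints eng+deu+fra
```

Accepting both forms as keys in one table keeps the caller's input format irrelevant, at the cost of maintaining the aliases by hand.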

Features

Fixes

  • Fixes an issue that caused a partition error for some PDFs. Fixes GH Issue 1460 by bypassing a coordinate check if an element has invalid coordinates.
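
The idea of bypassing a coordinate check can be sketched as follows. This is a minimal illustration, not the code from the fix: the names has_valid_coordinates and is_contained are hypothetical, and treating "invalid" as "a missing value or an inverted box" is an assumption, since the release note does not define it.

```python
from typing import Optional, Tuple

# A bounding box as (x1, y1, x2, y2); any value may be missing (None).
Box = Tuple[Optional[float], Optional[float], Optional[float], Optional[float]]


def has_valid_coordinates(box: Box) -> bool:
    """Assumed definition of validity: no missing values and a non-inverted box."""
    if any(value is None for value in box):
        return False
    x1, y1, x2, y2 = box
    return x1 <= x2 and y1 <= y2


def is_contained(inner: Box, outer: Box) -> Optional[bool]:
    """Check whether inner lies within outer.

    Returns None (check bypassed) instead of raising when either box
    has invalid coordinates.
    """
    if not (has_valid_coordinates(inner) and has_valid_coordinates(outer)):
        return None
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


print(is_contained((1, 1, 2, 2), (0, 0, 3, 3)))     # prints True
print(is_contained((None, 1, 2, 2), (0, 0, 3, 3)))  # prints None
```

Returning None rather than False keeps "check skipped" distinguishable from "check failed", so a caller can decide how to treat elements whose position is unknown.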