0.10.16
Enhancements
Adds data source properties to the Airtable, Confluence, Discord, Elasticsearch, Google Drive, and Wikipedia connectors. These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping each element to information about the source document from which it derives. This lets downstream applications surface details of the source document, e.g. a link to a GDrive doc, Salesforce record, etc.
DOCX partitioner refactored in preparation for enhancement. Behavior should be unchanged except in multi-section documents containing different headers/footers for different sections. These will now emit all distinct headers and footers encountered instead of just those for the last section.
Adds a function to map between Tesseract and standard language codes. This allows users to pass language information to the languages param as either a Tesseract-supported langcode or an ISO 639 standard language code.
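As a rough illustration of the language-code mapping described above, the sketch below normalizes a mixed list of ISO 639 and Tesseract codes into a single Tesseract language string. The table `ISO_TO_TESSERACT` and the function `to_tesseract_langs` are hypothetical names chosen for this example, not the library's actual implementation, and the table covers only a few languages. Mapping `zh` to `chi_sim` rather than `chi_tra` is also an assumption.

```python
# Hypothetical lookup table (illustration only): ISO 639-1, 639-2 and 639-3
# codes mapped to Tesseract langcodes. A real table would cover far more
# languages.
ISO_TO_TESSERACT = {
    "en": "eng", "eng": "eng",
    "de": "deu", "deu": "deu", "ger": "deu",
    "fr": "fra", "fra": "fra", "fre": "fra",
    "es": "spa", "spa": "spa",
    # Assumption: plain "zh" is treated as Simplified Chinese.
    "zh": "chi_sim", "zho": "chi_sim", "chi": "chi_sim",
}

# Codes that are already valid Tesseract langcodes in this small table.
TESSERACT_CODES = set(ISO_TO_TESSERACT.values())


def to_tesseract_langs(languages):
    """Convert a list of ISO 639 or Tesseract codes to a Tesseract string.

    Tesseract accepts several languages joined with "+", e.g. "eng+deu".
    Duplicates are dropped while the original order is preserved.
    """
    resolved = []
    for code in languages:
        key = code.strip().lower()
        if key in TESSERACT_CODES:
            tess = key  # already a Tesseract langcode
        elif key in ISO_TO_TESSERACT:
            tess = ISO_TO_TESSERACT[key]
        else:
            raise ValueError(f"Unrecognized language code: {code!r}")
        if tess not in resolved:
            resolved.append(tess)
    return "+".join(resolved)
```

With a helper like this, a caller can mix code styles freely: `["en", "ger"]` and `["eng", "deu"]` both resolve to the same Tesseract string.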
Features
Fixes
Fixes an issue that caused a partition error for some PDFs. Fixes GH Issue 1460 by bypassing a coordinate check if an element has invalid coordinates.
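The general pattern behind this kind of fix can be sketched as follows. This is a minimal illustration, not the library's code: the function names are hypothetical, the page-bounds test is a stand-in for whatever coordinate check the partitioner actually performs, and "invalid" is assumed here to mean missing, malformed, or non-finite points.

```python
import math


def has_valid_coordinates(points):
    """Return True if points is a non-empty sequence of finite (x, y) pairs.

    Assumption: "invalid" means missing, malformed, or non-finite values.
    """
    if not points:
        return False
    for point in points:
        if point is None or len(point) != 2:
            return False
        if any(v is None or not math.isfinite(v) for v in point):
            return False
    return True


def is_inside_page(points, page_width, page_height):
    """Hypothetical stand-in for the coordinate check: all points on the page."""
    return all(
        0 <= x <= page_width and 0 <= y <= page_height for x, y in points
    )


def keep_element(points, page_width, page_height):
    """Run the coordinate check only when the coordinates are usable.

    Elements with invalid coordinates bypass the check instead of raising,
    so one bad element does not abort partitioning of the whole document.
    """
    if not has_valid_coordinates(points):
        return True  # bypass: nothing meaningful to check
    return is_inside_page(points, page_width, page_height)
```

The design choice is to treat invalid coordinates as "unknown" rather than as a failure, which keeps the element in the output at the cost of skipping one validation step for it.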