Releases: datalab-to/pdftext
Fix rotation issue
- Minor rotations in a PDF would sometimes cause overly aggressive line breaks. This release fixes that, so text no longer gets strangely spaced out. The issue affected fewer than 1% of PDFs.
- Minor speed optimizations.
Break spans more aggressively
Breaks spans on newlines in addition to hyphens.
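To illustrate the idea, here is a minimal sketch of breaking a run of text into spans after either a hyphen or a newline. The function name `break_spans` and its `break_chars` parameter are hypothetical and are not taken from pdftext's code; the library's actual logic works on character boxes and may differ.

```python
def break_spans(text: str, break_chars: frozenset = frozenset({"-", "\n"})) -> list:
    """Split text into spans, ending a span right after any break character.

    Hypothetical helper for illustration; not pdftext's API.
    """
    spans = []
    current = []
    for ch in text:
        current.append(ch)
        if ch in break_chars:
            # The break character stays attached to the span it terminates.
            spans.append("".join(current))
            current = []
    if current:
        # Flush whatever remains after the last break character.
        spans.append("".join(current))
    return spans


print(break_spans("well-known\nfact"))  # ['well-', 'known\n', 'fact']
```

Passing `frozenset({"-"})` as `break_chars` would mimic the hyphen-only behaviour that this release extends.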
What's Changed
- Dev by @VikParuchuri in #41
Full Changelog: v0.6.1...v0.6.2
Add subscript/superscript detection
Detect whether a character is a superscript or a subscript.
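One plausible way to make this decision, sketched below, is to compare a character's box with the box of its line: a glyph that is noticeably shorter than the line and sits in the upper half is treated as a superscript, one in the lower half as a subscript. This is an assumption made for illustration, not a description of pdftext's method; the `Char` class, `classify_script`, and the `size_ratio` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character box; y grows downward, as in image coordinates."""
    text: str
    top: float
    bottom: float


def classify_script(char: Char, line_top: float, line_bottom: float,
                    size_ratio: float = 0.8) -> str:
    """Return 'superscript', 'subscript', or 'normal' for a character."""
    line_height = line_bottom - line_top
    char_height = char.bottom - char.top
    # Full-height glyphs (or degenerate lines) are treated as normal text.
    if line_height <= 0 or char_height >= size_ratio * line_height:
        return "normal"
    line_mid = (line_top + line_bottom) / 2
    char_mid = (char.top + char.bottom) / 2
    # A small glyph centred above the line's midpoint is raised, else lowered.
    return "superscript" if char_mid < line_mid else "subscript"


print(classify_script(Char("2", top=0, bottom=5), line_top=0, line_bottom=10))
```

A real implementation would likely also need to handle short glyphs such as commas and periods, which this sketch would misclassify.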
What's Changed
- Superscripts by @VikParuchuri in #40
Full Changelog: v0.6.0...v0.6.1
Misc bugfixes
Deduplicate characters, fix encoding.
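As a rough sketch of what character deduplication can look like, the snippet below drops a character when an identical one has already been kept at nearly the same position (some PDFs draw the same glyph twice, for example to fake bold text). The tuple layout, the function name `dedupe_chars`, and the `tol` tolerance are assumptions for illustration and do not reflect pdftext's internals.

```python
def dedupe_chars(chars: list, tol: float = 0.5) -> list:
    """Drop repeated characters that sit at (almost) the same position.

    chars: list of (text, x, y) tuples. Hypothetical layout for illustration.
    """
    kept = []
    for text, x, y in chars:
        is_duplicate = any(
            text == k_text and abs(x - k_x) <= tol and abs(y - k_y) <= tol
            for k_text, k_x, k_y in kept
        )
        if not is_duplicate:
            kept.append((text, x, y))
    return kept


chars = [("A", 10.0, 5.0), ("A", 10.2, 5.1), ("B", 16.0, 5.0)]
print(dedupe_chars(chars))  # [('A', 10.0, 5.0), ('B', 16.0, 5.0)]
```

The quadratic scan keeps the sketch short; a spatial index or a comparison against only the most recent characters would scale better on large pages.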
What's Changed
- Add word level deduplication by @iammosespaulr in #35
- Dev by @VikParuchuri in #36
Full Changelog: v0.5.1...v0.6.0
Fix links to be in same span
What's Changed
- Misc bugfixes and improvements by @iammosespaulr in #32
- Bump version by @VikParuchuri in #33
- Dev by @VikParuchuri in #34
Full Changelog: v0.5.0...v0.5.1
Table and link extraction support
Summary
- Add table extraction support
- Add link support for references and external links
- Bugfixes
What's Changed
- fix: bbox sorting error by @simjak in #27
- Add table extraction by @VikParuchuri in #25
- Add support for PDF links and references by @iammosespaulr in #28
- Improved References by @iammosespaulr in #30
- Link support by @VikParuchuri in #29
Full Changelog: v0.4.1...v0.5.0
Pin pypdfium2
There's a bug in pypdfium2 4.30.1 that affects text extraction, so this release pins to the previous version.
Improved Segmentation with Heuristic-Based Approach
We’ve removed pdftext's reliance on the decision tree for segmenting spans, lines, and blocks. It now uses simpler heuristics, which make segmentation more efficient and more accurate.
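The release note does not spell out the heuristics, so the following is only an assumed example of the kind of rule that can replace a learned model: characters are grouped into the same line when their vertical extents overlap enough. The function `group_into_lines`, the tuple layout, and the `overlap_ratio` threshold are hypothetical.

```python
def group_into_lines(chars: list, overlap_ratio: float = 0.5) -> list:
    """Group characters (in reading order) into lines by vertical overlap.

    chars: list of (text, top, bottom) tuples with y growing downward.
    Hypothetical heuristic for illustration; not pdftext's actual rule.
    """
    lines = []
    for text, top, bottom in chars:
        if lines:
            _, last_top, last_bottom = lines[-1][-1]
            overlap = min(bottom, last_bottom) - max(top, last_top)
            height = min(bottom - top, last_bottom - last_top)
            # Stay on the current line if the boxes share enough vertical extent.
            if height > 0 and overlap >= overlap_ratio * height:
                lines[-1].append((text, top, bottom))
                continue
        # Otherwise this character starts a new line.
        lines.append([(text, top, bottom)])
    return lines


page = [("a", 0, 10), ("b", 1, 11), ("c", 20, 30)]
print(["".join(t for t, _, _ in line) for line in group_into_lines(page)])  # ['ab', 'c']
```

A rule like this has a single tunable threshold and no model to load, which is one way a heuristic can end up both faster and easier to debug than a decision tree.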
Fix loose charbox for quotes
Special chars don't work well with the loose charbox. We'll remove loose entirely soon, but this is an intermediate fix for an annoying issue with misplaced quotes.
Fix memory leak warnings
Close the PDF documents properly to avoid warnings and memory leaks.
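The general pattern can be sketched with the standard library's `contextlib.closing`, which guarantees that `close()` is called even if processing raises. `StubDocument` and `count_pages` below are hypothetical stand-ins so the example runs without any PDF library; in real code the handle would be whatever document object the PDF backend returns, assuming it exposes a `close()` method.

```python
from contextlib import closing


class StubDocument:
    """Hypothetical stand-in for a PDF document handle with a close() method."""

    def __init__(self, path: str):
        self.path = path
        self.closed = False

    def page_count(self) -> int:
        if self.closed:
            raise ValueError("document is closed")
        return 1

    def close(self) -> None:
        self.closed = True


def count_pages(path: str, opener=StubDocument) -> int:
    # closing() calls doc.close() on exit, whether or not an exception occurred.
    with closing(opener(path)) as doc:
        return doc.page_count()


print(count_pages("example.pdf"))  # 1
```

Scoping the handle to a `with` block means callers cannot forget the cleanup step, which is usually what resource-leak warnings at interpreter exit point to.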