Releases: datalab-to/pdftext

Fix rotation issue

11 Jun 14:41
d76a575

  • Slightly rotated text in a PDF would sometimes trigger overly aggressive line breaks, leaving the text strangely spaced out. This release fixes that. The issue affects less than 1% of PDFs.
  • Minor speed optimizations.

Break spans more aggressively

28 Feb 23:04
4021f6e

Breaks spans on newlines in addition to hyphens.
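
As a rough illustration of the idea, the sketch below closes a span after every newline or hyphen. The function name `break_spans` and the `break_chars` parameter are hypothetical and are not taken from pdftext's code; the library's real logic works on character metadata and may differ.

```python
def break_spans(text, break_chars=("\n", "-")):
    """Split text into spans, ending a span after each break character.

    Hypothetical helper for illustration only, not pdftext's implementation.
    """
    spans = []
    current = []
    for ch in text:
        current.append(ch)
        if ch in break_chars:
            # The break character stays at the end of the span it closes.
            spans.append("".join(current))
            current = []
    if current:
        # Flush whatever is left after the last break character.
        spans.append("".join(current))
    return spans


spans = break_spans("well-known\nterm")
```

Here `spans` holds three pieces: one ending in the hyphen, one ending in the newline, and the remaining text.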

Full Changelog: v0.6.1...v0.6.2

Add subscript/superscript detection

26 Feb 14:44
a002c7f

Detects whether a character is a superscript or a subscript.
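
One common way to make this kind of decision is to compare a character's bounding box with the box of its line. The sketch below is only an assumption about how such a heuristic could look: the `Box` class, the `classify_script` function, and both thresholds are hypothetical and not taken from pdftext.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Assumes image-style coordinates: y grows downward, so bottom > top.
    top: float
    bottom: float

    @property
    def height(self):
        return self.bottom - self.top

    @property
    def center(self):
        return (self.top + self.bottom) / 2


def classify_script(char, line, max_height_ratio=0.8, min_offset_ratio=0.1):
    """Return "superscript", "subscript", or "normal" (illustrative heuristic)."""
    # A character nearly as tall as the line is treated as normal text.
    if char.height >= line.height * max_height_ratio:
        return "normal"
    offset = char.center - line.center
    tolerance = line.height * min_offset_ratio
    if offset < -tolerance:
        return "superscript"  # small and shifted toward the top of the line
    if offset > tolerance:
        return "subscript"  # small and shifted toward the bottom of the line
    return "normal"


line = Box(top=0, bottom=10)
kind = classify_script(Box(top=0, bottom=5), line)
```

The thresholds are guesses chosen for readability; real documents would likely need tuning per font and layout.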

Full Changelog: v0.6.0...v0.6.1

Misc bugfixes

18 Feb 18:47
c10283f

Deduplicate characters, fix encoding.
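
PDFs sometimes draw the same glyph more than once at almost the same position, which shows up as repeated characters in extracted text. A minimal sketch of deduplication, assuming each character is a `(text, x, y)` tuple: the function `dedupe_chars` and its `tolerance` parameter are hypothetical and not pdftext's actual code.

```python
def dedupe_chars(chars, tolerance=0.5):
    """Drop characters that repeat an earlier one at about the same position.

    chars is an iterable of (text, x, y) tuples. Positions are snapped to a
    grid of size `tolerance`, so two near-identical positions that straddle
    a grid boundary can still be kept as separate characters.
    """
    seen = set()
    unique = []
    for text, x, y in chars:
        key = (text, round(x / tolerance), round(y / tolerance))
        if key in seen:
            continue  # same glyph at the same grid cell: treat as duplicate
        seen.add(key)
        unique.append((text, x, y))
    return unique


chars = [("A", 10.0, 20.0), ("A", 10.1, 20.0), ("B", 15.0, 20.0)]
unique = dedupe_chars(chars)
```

In this example the second "A" falls into the same grid cell as the first and is dropped, while "B" is kept.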

Full Changelog: v0.5.1...v0.6.0

Fix links to be in same span

28 Jan 17:10
92fd696

Full Changelog: v0.5.0...v0.5.1

Table and link extraction support

22 Jan 17:58
0a4f33c

Summary

  • Add table extraction support
  • Add link support for references and external links
  • Bugfixes

Full Changelog: v0.4.1...v0.5.0

Pin pypdfium2

30 Dec 20:44
ea2e9b5

pypdfium2 4.30.1 has a bug that affects text extraction, so pypdfium2 is pinned to the previous version.

Improved Segmentation with Heuristic-Based Approach

12 Dec 16:12
cd9d41d

We’ve removed pdftext's reliance on the decision tree for segmenting spans, lines, and blocks. It now uses simpler heuristics, which make segmentation more efficient and more accurate.

Fix loose charbox for quotes

03 Dec 20:39
f26428a

Special chars don't work well with the loose charbox. We'll remove loose entirely soon, but this is an intermediate fix for an annoying issue with misplaced quotes.

Fix memory leak warnings

19 Nov 18:32
c065ac0

Closes PDF documents properly to avoid warnings and memory leaks.
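
The general pattern behind a fix like this is to guarantee that `close()` runs even when processing fails. The sketch below uses `contextlib.closing` from the standard library with a stand-in class. `FakeDocument` is hypothetical and only mimics an object with a `close()` method; it is not pdftext's or pypdfium2's code.

```python
from contextlib import closing


class FakeDocument:
    """Hypothetical stand-in for a PDF document handle with a close() method."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def page_count(self):
        return 1

    def close(self):
        # A real handle would release its native resources here.
        self.closed = True


doc = FakeDocument("example.pdf")
with closing(doc):
    # close() is called when the block exits, even if an exception is raised.
    pages = doc.page_count()
```

After the `with` block, `doc.closed` is true whether or not the body raised an error.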