PLAYA-PDF 0.5.0: Breaking all the APIs again
There was a lot of rot and a lot of bugs in various APIs, especially the text- and font-related ones, and since ZeroVer and Reasons, it seemed like a good idea to get rid of that nonsense.
Changes from CHANGELOG.md
- Remove use of `object` in type annotations
- Add support for role map and standard structure types
- Refactor page.py as it was getting really unwieldy
- Add missing `ctm` to content objects in metadata API
- Somewhat improve untagged text extraction where the CTM is exotic
- Correct character and word spacing to apply after all glyphs
- Correct vertical writing to fully support glyph-specific position vectors, even totally absurd ones
- Correct horizontal scaling to apply to vertical writing, including the position vector
- Add `bbox` and `contents` to structure elements
- Add `origin` and `displacement` to glyphs
- Add `size` to glyphs and texts to get effective font size (still not entirely accurate when there is rotation or skewing)
- Support PDF 2.0 `Length` attribute on inline images
- Add `font` property to documents and pages
- BREAKING: `find` and `find_all` in structure search by standard structure types (roles)
- BREAKING: `parent_tree` moved to `playa.structure.Tree`
- BREAKING: `Point`, `Rect`, `Matrix` and `PDFObject` moved to `playa.pdftypes`
- BREAKING: `PathObject` no longer contains "subpaths", it is safe to recursively descend it now
- BREAKING: Content objects moved to `playa.content` and interpreter to `playa.interp`
- BREAKING: Text state no longer exists in the public API, text objects have immutable line matrix and glyph offset now, and everything else is in the graphic state
- BREAKING: `text_space_` properties are removed since what they returned was not actually text space (and maybe not useful either)
- BREAKING: `glyph_offset` is removed from glyphs and made private in text objects, as it is not in a well defined space.
- BREAKING: Glyph `bbox` now has a precise definition, which isn't exactly the glyph bounding box but is a lot closer. This means notably that adjacent glyphs may overlap or may not touch, which is why you should never use the `bbox` to detect word boundaries. Use `origin` and `displacement` instead, please!
- BREAKING: `cid2unicode` attribute of fonts is removed as it doesn't make any sense for Type3 or CID fonts.
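
To make the advice about `origin` and `displacement` a bit more concrete, here is a rough sketch of word splitting that ignores `bbox` entirely. It does not use PLAYA itself: the `Glyph` dataclass, the `split_words` function and the `tolerance` parameter are all hypothetical stand-ins, and the sketch assumes that `origin` and `displacement` are (x, y) pairs in the same coordinate space and that `displacement` is the full advance to where the next glyph would normally start. Check the actual documentation before relying on any of that.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Glyph:
    """Hypothetical stand-in for a glyph (NOT PLAYA's actual class)."""

    text: str
    origin: Point  # assumed: where the glyph is drawn from
    displacement: Point  # assumed: advance to the next glyph's origin


def split_words(glyphs: List[Glyph], tolerance: float = 1.0) -> List[str]:
    """Start a new word whenever a glyph's origin is farther than
    `tolerance` from where the previous glyph's displacement ended."""
    words: List[str] = []
    current = ""
    expected: Optional[Point] = None
    for glyph in glyphs:
        if expected is not None:
            gap = math.hypot(
                glyph.origin[0] - expected[0],
                glyph.origin[1] - expected[1],
            )
            if gap > tolerance and current:
                words.append(current)
                current = ""
        current += glyph.text
        # Where the next glyph should start if it continues the same word.
        expected = (
            glyph.origin[0] + glyph.displacement[0],
            glyph.origin[1] + glyph.displacement[1],
        )
    if current:
        words.append(current)
    return words
```

Because the comparison uses both coordinates, the same logic should carry over to vertical writing, where the displacement points down the page rather than across it. A sensible `tolerance` depends on the font size and the space the coordinates are in, so the default here is only a placeholder.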
What's Changed
- fix!: make type annotations much stricter by @dhdaines in #95
- feat!: Add support for role map and standard structure types by @dhdaines in #98
- Don't split PathObject into subpaths by @lambdalemon in #85
- XObjects inherit graphic state from surrounding by @lambdalemon in #96
- fix: correct ascent/descent for Type3 fonts by @dhdaines in #99
- refactor!: split playa.page into three modules by @dhdaines in #100
- refactor!: most of text state is just graphics state by @dhdaines in #101
- refactor!: drown text state in the bathtub by @dhdaines in #102
- Correct documentation and metadata for font, text, and glyph objects by @dhdaines in #105
- Fix text rendering matrix for GlyphObject by @lambdalemon in #107
- Correct glyph and text bboxes in vertical writing mode by @dhdaines in #110 (thanks @lambdalemon for a different version of this PR)
- Make benchmarks more useful by @dhdaines in #111
- feat!: Improve text extraction and add useful glyph and text properties by @dhdaines in #112
- Correct the handling of character and word spacing parameters by @dhdaines in #113
- feat: support PDF 2.0 inline images by @dhdaines in #115
Full Changelog: v0.4.3...v0.5.0