PLAYA-PDF 0.5.0: Breaking all the APIs again

@dhdaines released this 14 May 15:59
· 110 commits to main since this release
1c2a73a

There was a lot of rot and a lot of bugs in various APIs, especially the text- and font-related ones, and since ZeroVer and Reasons, it seemed like a good idea to get rid of that nonsense.

Changes from CHANGELOG.md

  • Remove use of object in type annotations
  • Add support for role map and standard structure types
  • Refactor page.py as it was getting really unwieldy
  • Add missing ctm to content objects in metadata API
  • Somewhat improve untagged text extraction where the CTM is exotic
  • Correct character and word spacing to apply after all glyphs
  • Correct vertical writing to fully support glyph-specific position
    vectors, even totally absurd ones
  • Correct horizontal scaling to apply to vertical writing, including
    the position vector
  • Add bbox and contents to structure elements
  • Add origin and displacement to glyphs
  • Add size to glyphs and texts to get effective font size (still not
    entirely accurate when there is rotation or skewing)
  • Support PDF 2.0 Length attribute on inline images
  • Add font property to documents and pages
  • BREAKING: find and find_all in structure search by standard
    structure types (roles)
  • BREAKING: parent_tree moved to playa.structure.Tree
  • BREAKING: Point, Rect, Matrix and PDFObject moved to
    playa.pdftypes
  • BREAKING: PathObject no longer contains "subpaths", so it is safe
    to descend it recursively now
  • BREAKING: Content objects moved to playa.content and interpreter
    to playa.interp
  • BREAKING: Text state no longer exists in the public API, text
    objects have immutable line matrix and glyph offset now, and
    everything else is in the graphic state
  • BREAKING: text_space_ properties are removed since what they
    returned was not actually text space (and maybe not useful either)
  • BREAKING: glyph_offset is removed from glyphs and made private in
    text objects, as it is not in a well defined space.
  • BREAKING: Glyph bbox now has a precise definition, which isn't
    exactly the glyph bounding box but is a lot closer. This means
    notably that adjacent glyphs may overlap or may not touch, which is
    why you should never use the bbox to detect word boundaries.
    Use origin and displacement instead, please!
  • BREAKING: cid2unicode attribute of fonts is removed as it doesn't
    make any sense for Type3 or CID fonts.
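
To illustrate the advice above about using origin and displacement rather than the bbox for word boundaries, here is a minimal sketch. It does not use PLAYA itself: `Glyph` is a hypothetical stand-in with just `text`, `origin` and `displacement` fields, and `split_words` and its `tolerance` parameter are invented for this example. It assumes that a glyph's origin plus its displacement gives the point where the next glyph would start if nothing separated them, and that all coordinates are in the same space.

```python
from typing import List, NamedTuple, Optional, Tuple

Point = Tuple[float, float]


class Glyph(NamedTuple):
    """Hypothetical stand-in for a glyph; NOT the PLAYA class."""

    text: str
    origin: Point        # where the glyph is placed
    displacement: Point  # assumed: vector from this origin to the next expected origin


def split_words(glyphs: List[Glyph], tolerance: float = 1.0) -> List[str]:
    """Start a new word when a glyph's origin is farther than `tolerance`
    from where the previous glyph's displacement said it would be."""
    words: List[str] = []
    current = ""
    expected: Optional[Point] = None
    for glyph in glyphs:
        if expected is not None:
            dx = glyph.origin[0] - expected[0]
            dy = glyph.origin[1] - expected[1]
            # Distance between the actual and the expected origin
            if (dx * dx + dy * dy) ** 0.5 > tolerance:
                words.append(current)
                current = ""
        current += glyph.text
        expected = (
            glyph.origin[0] + glyph.displacement[0],
            glyph.origin[1] + glyph.displacement[1],
        )
    if current:
        words.append(current)
    return words


# Made-up data: "y" starts 4 units past where "i" ends, so a new word begins there.
glyphs = [
    Glyph("H", (0.0, 0.0), (7.0, 0.0)),
    Glyph("i", (7.0, 0.0), (3.0, 0.0)),
    Glyph("y", (14.0, 0.0), (5.0, 0.0)),
    Glyph("o", (19.0, 0.0), (5.0, 0.0)),
]
words = split_words(glyphs)
```

Because the comparison is between points rather than boxes, overlapping or non-touching glyph bboxes do not affect the result. The sketch ignores explicit space glyphs, and a sensible `tolerance` would depend on the font size and the coordinate space, so treat the default as a placeholder.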

What's Changed

  • fix!: make type annotations much stricter by @dhdaines in #95
  • feat!: Add support for role map and standard structure types by @dhdaines in #98
  • Don't split PathObject into subpaths by @lambdalemon in #85
  • XObjects inherit graphic state from surrounding by @lambdalemon in #96
  • fix: correct ascent/descent for Type3 fonts by @dhdaines in #99
  • refactor!: split playa.page into three modules by @dhdaines in #100
  • refactor!: most of text state is just graphics state by @dhdaines in #101
  • refactor!: drown text state in the bathtub by @dhdaines in #102
  • Correct documentation and metadata for font, text, and glyph objects by @dhdaines in #105
  • Fix text rendering matrix for GlyphObject by @lambdalemon in #107
  • Correct glyph and text bboxes in vertical writing mode by @dhdaines in #110 (thanks @lambdalemon for a different version of this PR)
  • Make benchmarks more useful by @dhdaines in #111
  • feat!: Improve text extraction and add useful glyph and text properties by @dhdaines in #112
  • Correct the handling of character and word spacing parameters by @dhdaines in #113
  • feat: support PDF 2.0 inline images by @dhdaines in #115

Full Changelog: v0.4.3...v0.5.0