
Yeah, that’s a fair concern. The move to pypdf wasn’t just cosmetic: it was mainly about stability and long-term maintenance, since PyPDF2 is deprecated and had become increasingly flaky with newer PDFs. That said, no PDF library fully “solves” broken text order or bad encodings, because those problems live inside the PDFs themselves. What this project does better now is fail more predictably and give cleaner extraction in common cases, instead of silently mangling text. For truly messy PDFs, the limitation is acknowledged rather than hidden, and the modular setup makes it easier to improve or swap out the PDF layer again if something better comes along.
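
To make the "swap out the PDF layer" point a bit more concrete, here is a rough sketch of how such a layer could be isolated. This is not the project's actual code: `PdfBackend`, `PypdfBackend`, `ExtractionError`, and `extract_text` are hypothetical names made up for illustration. The only real library calls are pypdf's `PdfReader`, `page.extract_text()`, and `pypdf.errors.PdfReadError`.

```python
from typing import Protocol


class ExtractionError(Exception):
    """Raised when a PDF cannot be parsed or yields no usable text."""


class PdfBackend(Protocol):
    """Hypothetical interface: anything that returns one string per page."""

    def extract_pages(self, path: str) -> list[str]: ...


class PypdfBackend:
    """Backend built on pypdf (imported lazily so other backends don't need it)."""

    def extract_pages(self, path: str) -> list[str]:
        from pypdf import PdfReader
        from pypdf.errors import PdfReadError

        try:
            reader = PdfReader(path)
            # extract_text() may return an empty string for image-only pages.
            return [page.extract_text() or "" for page in reader.pages]
        except PdfReadError as exc:
            # Turn a library-specific error into one the caller can rely on.
            raise ExtractionError(f"could not parse {path}: {exc}") from exc


def extract_text(path: str, backend: PdfBackend) -> str:
    """Join non-empty pages; raise instead of silently returning nothing."""
    pages = backend.extract_pages(path)
    text = "\n".join(p.strip() for p in pages if p.strip())
    if not text:
        raise ExtractionError(f"no extractable text in {path}")
    return text
```

With something like this, trying another library would mean writing one more class with an `extract_pages` method and passing it to `extract_text`; the rest of the pipeline would not need to change. It still won't fix scrambled reading order inside a PDF, it just makes the failure explicit.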

Answer selected by zoraniy
Category
Q&A
2 participants