Hey, I was digging through the repo and noticed the recent switch from PyPDF2 to pypdf. I’m curious: did this actually fix real-world PDF parsing issues, or did it just trade one set of edge-case bugs for another? Specifically, how does the project handle messy PDFs with broken text order or weird encodings now?
Replies: 1 comment
Yeah, that’s a fair concern. The move to pypdf wasn’t just cosmetic: it was mainly about stability and long-term maintenance, since PyPDF2 is deprecated and was increasingly flaky with newer PDFs. That said, no PDF library fully “solves” broken text order or bad encodings; those problems live inside the PDFs themselves. What this project does better now is fail more predictably and give cleaner extraction in common cases, instead of silently mangling text. For truly messy PDFs, the limitation is acknowledged rather than hidden, and the modular setup makes it easier to improve or swap out the PDF layer again if something better comes along.
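
To make "fail predictably" and "swap out the PDF layer" a bit more concrete, here is a rough sketch of the idea, not the project's actual code. The names `ExtractionResult`, `extract_text`, and `pypdf_backend` are made up for illustration; the only real library calls are pypdf's `PdfReader`, `reader.pages`, and `page.extract_text()`.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

# Hypothetical contract: a backend takes a file path and yields
# one string (or None) per page.
PageExtractor = Callable[[str], Iterable[Optional[str]]]


@dataclass
class ExtractionResult:
    pages: List[str] = field(default_factory=list)
    warnings: List[str] = field(default_factory=list)


def pypdf_backend(path: str) -> Iterable[Optional[str]]:
    # Imported lazily so other backends work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    for page in reader.pages:
        yield page.extract_text()


def extract_text(path: str, backend: PageExtractor = pypdf_backend) -> ExtractionResult:
    """Run a backend and record problems instead of hiding them."""
    result = ExtractionResult()
    try:
        for number, text in enumerate(backend(path), start=1):
            text = (text or "").strip()
            if not text:
                # Scanned or image-only pages typically end up here.
                result.warnings.append(f"page {number}: no extractable text")
            result.pages.append(text)
    except Exception as exc:
        # Broad on purpose for this sketch: keep the pages read so far
        # and report why extraction stopped.
        result.warnings.append(f"extraction stopped: {exc}")
    return result
```

Because the backend is just a callable, trying a different library means writing another small function with the same shape and passing it as `backend=...`. The caller always gets back the pages that could be read plus a list of warnings, so an empty or partial result is visible rather than silent.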