Added pytesseract method to use OCR on flat pdfs. #5
jokerale wants to merge 3 commits into Wazzabeee:main
Buonasera 🇮🇹 Thanks for adding this! Could you rebase your PR onto the latest commits of the repo? I added some code-quality checks and reformatting, which should not cause conflicts with your code. Also, before merging this PR, it would be nice to add a PDF that contains only scanned text, so that the example supports and works with scanned text. If you know how, it would be perfect if you could add one or more tests for your changes. I created this project a long time ago, so I know the current code is not tested properly, but I will gradually take the time to add tests for all my functions. Thanks in advance!
feat: add pre commit to repo
fix: remove init
fix: scripts structure
Bump black from 23.11.0 to 24.3.0
    Bumps [black](https://github.com/psf/black) from 23.11.0 to 24.3.0.
    - [Release notes](https://github.com/psf/black/releases)
    - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
    - [Commits](psf/black@23.11.0...24.3.0)
    ---
    updated-dependencies:
    - dependency-name: black
      dependency-type: direct:production
    ...
    Signed-off-by: dependabot[bot] <support@github.com>
Bump nltk from 3.6.3 to 3.6.6
    Bumps [nltk](https://github.com/nltk/nltk) from 3.6.3 to 3.6.6.
    - [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
    - [Commits](nltk/nltk@3.6.3...3.6.6)
    ---
    updated-dependencies:
    - dependency-name: nltk
      dependency-type: direct:production
    ...
    Signed-off-by: dependabot[bot] <support@github.com>
fix: readme & saving path
feat: add setup changelog and version (Wazzabeee#8)
    First release
fix: rename package for pypi (Wazzabeee#9)
    rename package from plagiarism-checker to plagiarism-detector
fix: rename pypi package (Wazzabeee#10)
fix: rename files with copy-spotter name
feat: add tags and automatic versioning
Bonsoir 🇫🇷 I've tried to rebase the PR onto the latest commits. I'll add some tests for the OCR function, using the added PDF, in a future PR. Best
Added pytesseract support to scan flat PDFs (those whose pages are images) and retrieve the text inside them. Also added a small check that triggers the OCR function when zero lines of text are found in a PDF.
Also added the libraries used to the requirements files.
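
For readers who want to picture the fallback described above, here is a minimal sketch rather than the code from this PR. The function names `ocr_pdf` and `extract_with_ocr_fallback` are hypothetical, and rendering pages with `pdf2image` (which needs Poppler installed) is an assumption; the description only names pytesseract. The regular text extractor is passed in as a callable so the sketch does not guess which one the project uses.

```python
from typing import Callable


def ocr_pdf(path: str) -> str:
    """OCR every page of a PDF and return the concatenated text.

    Assumption: pages are rendered with pdf2image (requires Poppler) and
    read with pytesseract (requires the Tesseract binary).
    """
    # Imported lazily so the fallback logic works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_with_ocr_fallback(
    path: str,
    extract_text: Callable[[str], str],
    ocr: Callable[[str], str] = ocr_pdf,
) -> str:
    """Return extracted text, falling back to OCR when no lines are found."""
    text = extract_text(path)
    # Zero non-empty lines suggests a flat (image-only) PDF.
    if not any(line.strip() for line in text.splitlines()):
        return ocr(path)
    return text
```

Passing the extractor and the OCR step as arguments also makes the check easy to test with stand-in functions, without needing Tesseract or a sample PDF on the test machine.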