Added pytesseract method to use OCR on flat pdfs. by jokerale · Pull Request #5 · Wazzabeee/copy-spotter

jokerale · 2024-04-18T12:57:05Z

Added pytesseract support so that flat PDFs (those whose pages are images) can be scanned and their text retrieved. Also added a small check that triggers the OCR function when zero lines of text are found in a PDF.
Also added the libraries used to the requirements files.
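
For readers following along, here is a minimal sketch of the fallback described above. It is not the code from this PR: the function names `ocr_pdf_lines` and `extract_lines_with_ocr_fallback` are hypothetical, and the use of `pdf2image` to render pages is an assumption (the description only mentions pytesseract). The regular text extractor is passed in as a callable so the sketch does not guess how the project reads PDFs.

```python
from typing import Callable, List


def ocr_pdf_lines(pdf_path: str) -> List[str]:
    """OCR every page of a PDF and return its non-empty lines of text."""
    # Imported lazily so text-based PDFs never need the OCR dependencies.
    # Assumption: pdf2image (which needs Poppler) renders the pages;
    # pytesseract needs the Tesseract binary installed on the system.
    from pdf2image import convert_from_path
    import pytesseract

    lines: List[str] = []
    for page_image in convert_from_path(pdf_path):
        page_text = pytesseract.image_to_string(page_image)
        # Keep only lines that contain something other than whitespace.
        lines.extend(s.strip() for s in page_text.splitlines() if s.strip())
    return lines


def extract_lines_with_ocr_fallback(
    pdf_path: str,
    extract_lines: Callable[[str], List[str]],
    ocr_lines: Callable[[str], List[str]] = ocr_pdf_lines,
) -> List[str]:
    """Try the normal text extractor first; use OCR only if it finds nothing."""
    lines = extract_lines(pdf_path)
    if len(lines) == 0:
        # Zero lines of text suggests a flat (scanned) PDF, so fall back to OCR.
        lines = ocr_lines(pdf_path)
    return lines
```

Passing the extractors in as arguments also makes the check easy to test without Tesseract installed, since both callables can be replaced with stubs.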

Wazzabeee · 2024-04-20T21:29:14Z

Buonasera 🇮🇹

Thanks for adding this! Could you rebase your PR onto the latest commits of the repo? I added some code quality and formatting checks; they should not cause conflicts with your code. Also, before merging this PR, it would be nice to add a PDF that contains only scanned text so that the example supports and works with scanned text.

If you know how to, it would be perfect if you could add one or more tests to cover your changes.

I created this project a long time ago so I know the current code is not tested properly, but I will gradually take the time to add tests for all my functions.

Thanks in advance!

feat: add pre commit to repo

fix: remove init

fix: scripts structure

Bump black from 23.11.0 to 24.3.0
Bumps [black](https://github.com/psf/black) from 23.11.0 to 24.3.0.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](psf/black@23.11.0...24.3.0)
---
updated-dependencies:
- dependency-name: black
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>

Bump nltk from 3.6.3 to 3.6.6
Bumps [nltk](https://github.com/nltk/nltk) from 3.6.3 to 3.6.6.
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](nltk/nltk@3.6.3...3.6.6)
---
updated-dependencies:
- dependency-name: nltk
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>

fix: readme & saving path

feat: add setup changelog and version (Wazzabeee#8)
First release

fix: rename package for pypi (Wazzabeee#9)
rename package from plagiarism-checker to plagiarism-detector

fix: rename pypi package (Wazzabeee#10)

fix: rename files with copy-spotter name

feat: add tags and automatic versioning

jokerale · 2024-05-08T14:34:23Z

Bonsoir 🇫🇷

I've tried to rebase the PR onto the latest commits.
Please let me know if this is the right way to do it.

I'll add some tests for the OCR function, using the added PDF, in a future PR.

Best

jokerale and others added 2 commits May 8, 2024 16:26

Added flat pdf.

4001070

jokerale force-pushed the OCR-BRANCH branch from d1e031e to 4001070 Compare May 8, 2024 14:29

Merge branch 'main' into OCR-BRANCH

5401514

jokerale wants to merge 3 commits into Wazzabeee:main from jokerale:OCR-BRANCH