file-processing-ocr

The file-processing-ocr library extends the file-processing library with Optical Character Recognition (OCR). It is built as a decorator that wraps File objects for relevant file types (e.g., images and PDFs) and extracts their text content when OCR is required.

Features

OCR Text Extraction: Extracts text from images and PDF files, storing the results as part of the file’s metadata.
Decorator Pattern: Designed as a decorator to seamlessly add OCR functionality to the base File class in file-processing.
Lazy Import: Loaded by file-processing only when needed, ensuring lightweight usage when OCR is not required.
Error Handling: Includes custom error handling for missing dependencies and OCR processing issues.
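
The Lazy Import feature can be pictured with a small helper that imports an optional module only at the point of use. This is a minimal sketch of the general technique, not the code file-processing actually uses; the helper name load_optional and its error message are assumptions made for illustration.

```python
import importlib


def load_optional(module_name, attribute):
    """Import module_name on first use and return one attribute from it.

    Hypothetical helper: it only illustrates deferring an import until
    the optional feature (here, OCR) is actually requested.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with a hint so the caller knows which extra is missing.
        raise ImportError(
            f"Optional dependency '{module_name}' is not installed"
        ) from exc
    return getattr(module, attribute)
```

Under this sketch, a call such as load_optional("file_processing_ocr.ocr_decorator", "OCRDecorator") would only touch the OCR package when OCR is requested, so users who never need OCR never pay the import cost.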

Installation

To install the file-processing-ocr library from GitHub, use the following command:

pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing-ocr.git

Ensure that Tesseract OCR is installed and correctly configured in your system path, as it’s required for OCR processing.
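
One way to confirm that Tesseract is reachable before running OCR is to look it up on the system path with the standard library. The sketch below assumes the executable is named tesseract, which is the usual name but may differ on your system; the function name find_tesseract is illustrative.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if absent.

    The default command name is an assumption; pass a different name or
    an absolute path if your installation uses one.
    """
    return shutil.which(command)


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("Tesseract was not found on the system path")
    else:
        print(f"Tesseract found at {location}")
```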

Quick Start

To begin using file-processing-ocr with file-processing, initialize a File object and apply OCR using the OCRDecorator class:

from file_processing import File
from file_processing_ocr.ocr_decorator import OCRDecorator

# Initialize a File object
file = File('path/to/your/image_or_pdf_file.pdf')

# Wrap the file processor with OCR capabilities
ocr_file = OCRDecorator(file)

# Process the file and extract OCR text
ocr_file.process()

# Access the OCR text
print(ocr_file.metadata.get('ocr_text', 'No OCR text extracted'))

Architecture

The file-processing-ocr library applies the Decorator Pattern by wrapping an existing File processor to add OCR functionality. It leverages the Tesseract OCR engine to extract text content from images and PDFs and stores this text within the file metadata.
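
The wrapping described above can be illustrated with a self-contained toy. PlainFile and OCRWrapper are hypothetical stand-ins, not the library's File and OCRDecorator classes, and the OCR step is injected as a plain function so the example runs without Tesseract.

```python
class PlainFile:
    """Hypothetical stand-in for a file processor with a metadata dict."""

    def __init__(self, path):
        self.path = path
        self.metadata = {}

    def process(self):
        # The base processor records its own metadata.
        self.metadata["processed"] = True


class OCRWrapper:
    """Toy decorator: same interface as the wrapped object, plus OCR text."""

    def __init__(self, wrapped, ocr_func):
        self._wrapped = wrapped
        self._ocr_func = ocr_func

    def process(self):
        # Run the wrapped processor first, then add the OCR result.
        self._wrapped.process()
        self._wrapped.metadata["ocr_text"] = self._ocr_func(self._wrapped.path)

    def __getattr__(self, name):
        # Delegate everything else (path, metadata, ...) to the wrapped object.
        return getattr(self._wrapped, name)


wrapped = OCRWrapper(PlainFile("scan.png"), lambda path: f"text from {path}")
wrapped.process()
print(wrapped.metadata)
```

The point of the design is that callers keep using process() and metadata exactly as before; only the construction step changes.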

How It Works

Image Files: Directly applies OCR to image file formats (e.g., .png, .jpg) using pytesseract.
PDF Files: Applies OCR to embedded images within PDF pages via pypdf and pytesseract.
Fallback Handling: If Tesseract is not installed, the decorator raises a custom TesseractNotFound error.
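
The two paths above can be sketched roughly as follows. This is an illustration built on the public pytesseract and pypdf APIs, not the library's internal code; the function names, the set of image suffixes, and the dispatch on file extension are all assumptions. Third-party imports are placed inside the functions so the module loads even when they are not installed.

```python
from pathlib import Path

# Assumed set of image suffixes, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_image_file(path):
    """OCR a single image file with pytesseract (requires Pillow)."""
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def ocr_pdf_file(path):
    """OCR every image embedded in the pages of a PDF."""
    import pytesseract
    from pypdf import PdfReader

    texts = []
    for page in PdfReader(path).pages:
        for embedded in page.images:
            # embedded.image is a PIL image decoded by pypdf.
            texts.append(pytesseract.image_to_string(embedded.image))
    return "\n".join(texts)


def extract_ocr_text(path, image_ocr=ocr_image_file, pdf_ocr=ocr_pdf_file):
    """Pick the OCR path based on the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return image_ocr(path)
    if suffix == ".pdf":
        return pdf_ocr(path)
    raise ValueError(f"Unsupported file type for OCR: {suffix!r}")
```

Passing the OCR callables as parameters keeps the dispatch logic testable without Tesseract installed.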

Error Handling

OCRProcessingError: Raised if an issue occurs during OCR processing.
TesseractNotFound: Raised if Tesseract OCR is not found in the expected system path.
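
To show how low-level failures might map onto these two errors, the sketch below defines local stand-in exception classes with the same names and translates errors raised while launching the Tesseract binary. The real classes live in file_processing_ocr.errors; how the library detects and raises them internally is not documented here, so treat this mapping as an assumption.

```python
import subprocess


class TesseractNotFound(Exception):
    """Local stand-in for the library's error of the same name."""


class OCRProcessingError(Exception):
    """Local stand-in for the library's error of the same name."""


def tesseract_version(command="tesseract"):
    """Return the version banner printed by the Tesseract executable."""
    try:
        result = subprocess.run(
            [command, "--version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except FileNotFoundError as exc:
        # The executable could not be located on the system path.
        raise TesseractNotFound(
            f"'{command}' was not found on the system path"
        ) from exc
    except subprocess.CalledProcessError as exc:
        # The executable ran but reported a failure.
        raise OCRProcessingError(
            f"'{command}' exited with status {exc.returncode}"
        ) from exc
    # Some Tesseract versions print the banner to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()
```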

Custom Error Example

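from file_processing import File
from file_processing_ocr.ocr_decorator import OCRDecorator
from file_processing_ocr.errors import TesseractNotFound, OCRProcessingError

# Wrap a File with OCR capabilities, as in the Quick Start
ocr_file = OCRDecorator(File('path/to/your/image_or_pdf_file.pdf'))

try:
    ocr_file.process()
except TesseractNotFound as e:
    # Tesseract is not installed or not on the system path
    print(f"Tesseract OCR not found: {e}")
except OCRProcessingError as e:
    # Tesseract was found, but OCR failed while processing the file
    print(f"Error during OCR processing: {e}")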

Contributing

Contributions are welcome! Please follow these steps:

Fork the Repository: Create your fork on GitHub.
Create a Feature Branch: Work on your feature in a separate branch.
Write Tests: Ensure any changes are covered by tests.
Submit a Pull Request: When ready, submit a PR for review.

License

This project is licensed under the MIT License.

Contact

For questions or support, please contact:

Email: ocdo-bdpd@hc-sc.gc.ca

Explore our file-processing suite and extend its capabilities with OCR!
