The file-processing-ocr library is an extension of the file-processing library, designed to add Optical Character Recognition (OCR) functionality to the core file processing capabilities. This library is built as a decorator, allowing it to wrap around relevant file types (e.g., images and PDFs) to extract text content when OCR is required.
- OCR Text Extraction: Extracts text from images and PDF files, storing the results as part of the file’s metadata.
- Decorator Pattern: Designed as a decorator to seamlessly add OCR functionality to the base File class in file-processing.
- Lazy Import: Loaded by file-processing only when needed, ensuring lightweight usage when OCR is not required.
- Error Handling: Includes custom error handling for missing dependencies and OCR processing issues.
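The lazy-import idea can be illustrated with a minimal sketch. This is not the actual loading code in file-processing; `load_optional` is a hypothetical helper, and the only assumption is that the optional module is imported by name at the point of first use.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module on first use; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package ends up here instead of crashing the caller.
        return None


# Nothing OCR-related is imported until this call runs.
ocr_module = load_optional("file_processing_ocr.ocr_decorator")
if ocr_module is None:
    print("OCR extension not installed; continuing without OCR.")
```

Deferring the import this way keeps start-up cost and dependencies out of code paths that never request OCR.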
To install the file-processing-ocr library from GitHub, use the following command:
pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing-ocr.git
Ensure that Tesseract OCR is installed and correctly configured in your system path, as it’s required for OCR processing.
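One quick way to confirm that Tesseract is reachable is to look it up on the system path from Python. The sketch below uses only the standard library; `find_tesseract` is a hypothetical helper, and it assumes the executable is named `tesseract`.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if absent."""
    return shutil.which(executable)


path = find_tesseract()
if path:
    print(f"Tesseract found at {path}")
else:
    print("Tesseract not found on the system path")
```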
To begin using file-processing-ocr with file-processing, initialize a File object and apply OCR using the OCRDecorator class:
from file_processing import File
from file_processing_ocr.ocr_decorator import OCRDecorator
# Initialize a File object
file = File('path/to/your/image_or_pdf_file.pdf')
# Wrap the file processor with OCR capabilities
ocr_file = OCRDecorator(file)
# Process the file and extract OCR text
ocr_file.process()
# Access the OCR text
print(ocr_file.metadata.get('ocr_text', 'No OCR text extracted'))
The file-processing-ocr library applies the Decorator Pattern by wrapping an existing File processor to add OCR functionality. It leverages the Tesseract OCR engine to extract text content from images and PDFs and stores this text within the file metadata.
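To make the wrapping idea concrete, here is a minimal, dependency-free sketch of the Decorator Pattern. `SimpleFile`, `SimpleOCRDecorator`, and `fake_ocr` are hypothetical stand-ins written for illustration; they are not the library's `File` or `OCRDecorator` classes, and the real implementation may differ.

```python
from typing import Callable, Dict


class SimpleFile:
    """Hypothetical stand-in for a file processor (not the real File class)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.metadata: Dict[str, str] = {}

    def process(self) -> None:
        # Base behaviour: record basic information about the file.
        self.metadata["path"] = self.path


class SimpleOCRDecorator:
    """Wraps a processor and adds an 'ocr_text' entry to its metadata."""

    def __init__(self, wrapped: SimpleFile, ocr: Callable[[str], str]) -> None:
        self._wrapped = wrapped
        self._ocr = ocr

    def process(self) -> None:
        # Run the wrapped processor first, then layer the OCR result on top.
        self._wrapped.process()
        self._wrapped.metadata["ocr_text"] = self._ocr(self._wrapped.path)

    @property
    def metadata(self) -> Dict[str, str]:
        # Expose the wrapped object's metadata unchanged.
        return self._wrapped.metadata


def fake_ocr(path: str) -> str:
    """Placeholder for a real OCR call, so the sketch runs without Tesseract."""
    return f"text recognised in {path}"


decorated = SimpleOCRDecorator(SimpleFile("scan.png"), fake_ocr)
decorated.process()
print(decorated.metadata["ocr_text"])
```

Because the decorator exposes the same `process()` and `metadata` surface as the object it wraps, calling code does not need to know whether OCR was applied.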
- Image Files: Directly applies OCR to image file formats (e.g., .png, .jpg) using pytesseract.
- PDF Files: Applies OCR to embedded images within PDF pages via pypdf and pytesseract.
- Fallback Handling: If Tesseract is not installed, the decorator raises a custom TesseractNotFound error.
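The per-format handling above can be sketched as a small dispatcher. This is an illustration under stated assumptions, not the library's internal code: `ocr_image`, `ocr_pdf`, and `ocr_any` are hypothetical names, the list of image suffixes is an example, and Pillow, pytesseract, and pypdf are imported inside the functions so that they are needed only when OCR actually runs.

```python
from pathlib import Path

# Example set of image suffixes; the library may support others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def ocr_image(path: str) -> str:
    """Run Tesseract on a single image file."""
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def ocr_pdf(path: str) -> str:
    """Run Tesseract on every image embedded in the pages of a PDF."""
    import pytesseract
    from pypdf import PdfReader

    chunks = []
    for page in PdfReader(path).pages:
        for embedded in page.images:
            # embedded.image is the embedded picture as a Pillow image.
            chunks.append(pytesseract.image_to_string(embedded.image))
    return "\n".join(chunks)


def ocr_any(path: str) -> str:
    """Choose an OCR routine from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return ocr_image(path)
    if suffix == ".pdf":
        return ocr_pdf(path)
    raise ValueError(f"Unsupported file type for OCR: {suffix or 'no extension'}")
```

When the Tesseract binary is missing, pytesseract itself raises `pytesseract.TesseractNotFoundError`; a wrapper could catch that and re-raise a library-specific error, which is presumably what the custom TesseractNotFound error corresponds to.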
- OCRProcessingError: Raised if an issue occurs during OCR processing.
- TesseractNotFound: Raised if Tesseract OCR is not found in the expected system path.
from file_processing_ocr.errors import TesseractNotFound, OCRProcessingError
try:
    ocr_file.process()
except TesseractNotFound as e:
    print(f"Tesseract OCR not found: {e}")
except OCRProcessingError as e:
    print(f"Error during OCR processing: {e}")
Contributions are welcome! Please follow these steps:
- Fork the Repository: Create your fork on GitHub.
- Create a Feature Branch: Work on your feature in a separate branch.
- Write Tests: Ensure any changes are covered by tests.
- Submit a Pull Request: When ready, submit a PR for review.
This project is licensed under the MIT License.
For questions or support, please contact:
- Email: ocdo-bdpd@hc-sc.gc.ca
Explore our file-processing suite and extend its capabilities with OCR!