A Python tool for searching handwritten Kannada text in scanned PDF documents. It converts each PDF page to an image, preprocesses the image, runs Tesseract OCR (Optical Character Recognition) to extract the Kannada text, and highlights the searched phrases on the page. It is built on pdf2image, OpenCV, and pytesseract, among other libraries.
- Upload Kannada handwritten documents in PDF format.
- Extract Kannada text from scanned document images.
- Search for phrases in the extracted text.
- Highlight the found phrases on the scanned document pages.
- Display the processed images with highlighted search results.
- Clone this repository:
    git clone https://github.com/YourUsername/KannadaHandwrittenTextSearch.git
    cd KannadaHandwrittenTextSearch
- Install the required dependencies (the apt-get step may need sudo on a local machine):
    pip install pdf2image opencv-python-headless numpy scipy pytesseract
    apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kan
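After installing, it can help to confirm that the system tools are actually on the PATH, since pdf2image and pytesseract both shell out to external binaries. The following is a minimal sketch, not part of the project; `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and it assumes the binaries are called `tesseract` and `pdftoppm` (the Poppler renderer).

```python
import shutil

# Assumed binary names: `tesseract` (OCR engine) and `pdftoppm` (Poppler renderer).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("All system tools found.")
```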
- Open Google Colab.
- Upload this code to Colab, or paste it directly into a notebook cell.
- Install the required libraries in Colab:
    !pip install pdf2image opencv-python-headless numpy scipy pytesseract
    !apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kan
- Run the notebook to:
    - Upload a PDF with Kannada handwritten text.
    - Index the document by extracting text from each page.
    - Search for specific phrases in the document and highlight them on the images.
To run locally instead:
- Install the dependencies as described above.
- Run the Python script:
    python kannada_handwritten_text_search.py
- Follow the prompts to upload a Kannada handwritten document and search it.
- Upload a Kannada handwritten document (PDF format).
- Index the document, which extracts text from each page.
- Search for specific phrases in the document.
- View the matches highlighted on the page images.
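The index-then-search steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual code: `normalize`, `build_index`, and `search_index` are hypothetical helpers, and the page strings stand in for OCR output. Unicode NFC normalization is applied because the same Kannada syllable can be encoded in more than one way, which would otherwise break exact matching.

```python
import unicodedata


def normalize(text):
    """Apply Unicode NFC and collapse whitespace so OCR text and queries compare equal."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def build_index(page_texts):
    """Map 1-based page numbers to normalized page text (one string per page)."""
    return {number: normalize(text) for number, text in enumerate(page_texts, start=1)}


def search_index(index, phrase):
    """Return the page numbers whose text contains the phrase."""
    needle = normalize(phrase)
    if not needle:
        return []
    return [number for number, text in index.items() if needle in text]


# Hypothetical OCR output for a three-page document.
pages = ["ನಮಸ್ಕಾರ  ಕರ್ನಾಟಕ", "ಕನ್ನಡ ಭಾಷೆ\nಕನ್ನಡ ನಾಡು", "ಬೆಂಗಳೂರು ನಗರ"]
index = build_index(pages)
print(search_index(index, "ಕನ್ನಡ ನಾಡು"))
```

Substring matching of this kind only finds exact phrases; handwriting OCR often misreads characters, so a real search may need fuzzy matching on top.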
- pdf2image: Converts PDF pages into images.
- opencv-python-headless: For image processing.
- numpy: For numerical computations.
- scipy: For spatial distance computations (optional).
- pytesseract: Python wrapper for Tesseract OCR.
- poppler-utils: Provides the Poppler command-line tools that pdf2image uses to render PDF pages.
- tesseract-ocr-kan: Kannada language data for Tesseract.
The default OCR language is set to Kannada (kan). To work with another language, change the OCR_LANGUAGE variable and make sure the matching Tesseract language data is installed:
    OCR_LANGUAGE = 'kan'  # Kannada language code
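Before changing OCR_LANGUAGE, it is worth checking that Tesseract has the language data for it. The sketch below is an assumption-laden illustration: `parse_language_list`, `installed_languages`, and `missing_languages` are hypothetical helpers, and the parser assumes that `tesseract --list-langs` prints one header line followed by one language code per line. Tesseract accepts several languages joined with `+` (for example `kan+eng`), so each part is checked separately.

```python
import subprocess

OCR_LANGUAGE = "kan"  # Kannada language code, as in the project configuration


def parse_language_list(output):
    """Parse `tesseract --list-langs` output: a header line, then one code per line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def installed_languages():
    """Ask the local Tesseract binary which languages it can load."""
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some older versions write the list to stderr instead of stdout.
    return parse_language_list(result.stdout or result.stderr)


def missing_languages(setting, available):
    """Return the codes in a `+`-joined setting that are not installed."""
    return [code for code in setting.split("+") if code not in available]


# Sample output for illustration; call installed_languages() on a real machine.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nkan\nosd\n'
available = parse_language_list(sample)
print(missing_languages(OCR_LANGUAGE, available))
```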
- Preprocess Image: Convert PDF pages into grayscale and binarized images for better OCR results.
- Extract Text: Use Tesseract to extract handwritten Kannada text from each page.
- Search for Phrase: Search through the extracted text to locate the specific phrase.
- Highlight Phrase: Draw bounding boxes around the found phrases in the document image and display them.
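The four steps above can be sketched roughly as follows. This is a minimal sketch rather than the project's implementation: all function names are hypothetical, Otsu thresholding is one assumed choice of binarization, and a phrase is matched only as a run of consecutive OCR words. Library imports are placed inside the functions so the pure-Python matching can be tried without OpenCV or Tesseract installed.

```python
def load_pages(pdf_path, dpi=300):
    """Render each PDF page to a BGR NumPy array (the dpi value is an assumption)."""
    import cv2
    import numpy as np
    from pdf2image import convert_from_path

    return [cv2.cvtColor(np.array(page), cv2.COLOR_RGB2BGR)
            for page in convert_from_path(pdf_path, dpi=dpi)]


def preprocess(page_bgr):
    """Convert a page to grayscale, then binarize it with Otsu's threshold."""
    import cv2

    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def words_from_data(data):
    """Turn Tesseract's per-word dictionary into (word, (x, y, w, h)) pairs."""
    words = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # skip the empty entries Tesseract emits for blocks and lines
            box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            words.append((text.strip(), box))
    return words


def ocr_words(binary, lang="kan"):
    """Run Tesseract on a binarized page and return words with bounding boxes."""
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(binary, lang=lang, output_type=Output.DICT)
    return words_from_data(data)


def find_phrase_boxes(words, phrase):
    """Return one enclosing box (x, y, w, h) per occurrence of the phrase."""
    target = phrase.split()
    if not target:
        return []
    hits = []
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [word for word, _ in window] == target:
            x0 = min(x for _, (x, y, w, h) in window)
            y0 = min(y for _, (x, y, w, h) in window)
            x1 = max(x + w for _, (x, y, w, h) in window)
            y1 = max(y + h for _, (x, y, w, h) in window)
            hits.append((x0, y0, x1 - x0, y1 - y0))
    return hits


def highlight(page_bgr, boxes):
    """Draw a red rectangle around each box on a copy of the page."""
    import cv2

    out = page_bgr.copy()
    for x, y, w, h in boxes:
        cv2.rectangle(out, (x, y), (x + w, y + h), (0, 0, 255), 2)
    return out


# Hypothetical Tesseract output for one line of text, used to try the matching step.
sample_data = {
    "text": ["", "ಕನ್ನಡ", "ಭಾಷೆ", "ನಾಡು"],
    "left": [0, 10, 70, 120],
    "top": [0, 20, 22, 20],
    "width": [200, 50, 40, 45],
    "height": [100, 30, 28, 30],
}
sample_words = words_from_data(sample_data)
print(find_phrase_boxes(sample_words, "ಕನ್ನಡ ಭಾಷೆ"))
```

On a real document the pieces would chain as `highlight(page, find_phrase_boxes(ocr_words(preprocess(page)), phrase))` for each page from `load_pages`. A phrase that wraps across two lines would produce one large enclosing box here, so a fuller version might group words by line first.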
- Accuracy of text extraction may vary depending on the quality of the scanned document and handwriting clarity.
- Only Kannada is supported out of the box; other languages require changing OCR_LANGUAGE and installing the matching Tesseract language data.
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Feel free to open an issue or submit a pull request.
- Tesseract OCR: An open-source OCR engine.
- Google Colab: For providing free cloud-based environments.