KannadaTextRecognizer

A Python-based tool for searching handwritten Kannada text in scanned PDF documents. It combines image preprocessing with Tesseract OCR (Optical Character Recognition) to extract text and highlight searched phrases in the document. The project is built on pdf2image, OpenCV, pytesseract, NumPy, and SciPy.

Kannada Handwritten Text Search

The program converts each page of the PDF to an image, extracts the Kannada text with OCR, and highlights the searched phrases on the page images.

Features

Upload Kannada handwritten documents in PDF format.
Extract Kannada text from scanned document images.
Search for phrases in the extracted text.
Highlight the found phrases on the scanned document pages.
Display the processed images with highlighted search results.

Installation

Clone this repository:

git clone https://github.com/YourUsername/KannadaHandwrittenTextSearch.git
cd KannadaHandwrittenTextSearch

Install the required dependencies (the apt-get command assumes a Debian- or Ubuntu-based system and may need sudo):

pip install pdf2image opencv-python-headless numpy scipy pytesseract
apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kan

Usage

In Google Colab

Open Google Colab.
Upload the notebook to Colab, or paste the code into a new notebook.

Install the required libraries in Colab:

!pip install pdf2image opencv-python-headless numpy scipy pytesseract
!apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-kan

Run the notebook to:
- Upload a PDF with Kannada handwritten text.
- Index the document by extracting text from each page.
- Search for specific phrases in the document and highlight them on the images.
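The indexing step can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: `index_pages` and `default_ocr` are hypothetical names, and the page images are assumed to come from pdf2image.

```python
from typing import Any, Callable, Dict, Iterable

OCR_LANGUAGE = "kan"  # Kannada language code


def default_ocr(image: Any) -> str:
    """Run Tesseract on one page image (assumes pytesseract is installed)."""
    import pytesseract  # imported lazily so this module loads without it

    return pytesseract.image_to_string(image, lang=OCR_LANGUAGE)


def index_pages(
    pages: Iterable[Any], ocr: Callable[[Any], str] = default_ocr
) -> Dict[int, str]:
    """Map 1-based page numbers to the text extracted from each page."""
    return {number: ocr(page) for number, page in enumerate(pages, start=1)}


# Typical use (the file name is a placeholder):
#   from pdf2image import convert_from_path
#   pages = convert_from_path("document.pdf", dpi=300)
#   index = index_pages(pages)
```

Passing the OCR function as a parameter keeps the indexing logic testable without Tesseract installed.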

Locally (with Python)

Install the dependencies as mentioned above.

Run the Python script:

python kannada_handwritten_text_search.py

Follow the prompts to upload a Kannada handwritten document and search it.

Example Workflow

Upload a Kannada handwritten document (PDF format).
Index the document, which extracts text from each page.
Search for specific phrases in the document.
View the highlighted results on the scanned page images.
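The search step of this workflow might look like the sketch below. It assumes the index is a dictionary mapping page numbers to extracted text; `normalize` and `search_index` are illustrative names, and the Unicode NFC normalization is an added precaution, not something the project is known to do.

```python
import unicodedata
from typing import Dict, List


def normalize(text: str) -> str:
    """Apply NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def search_index(index: Dict[int, str], phrase: str) -> List[int]:
    """Return the page numbers whose extracted text contains the phrase."""
    needle = normalize(phrase)
    if not needle:
        return []
    return sorted(page for page, text in index.items() if needle in normalize(text))


index = {1: "ಕನ್ನಡ ಭಾಷೆ", 2: "ಬೆಂಗಳೂರು ನಗರ", 3: "ನಮ್ಮ  ಕನ್ನಡ\nನಾಡು"}
matches = search_index(index, "ಕನ್ನಡ")
print(matches)  # pages 1 and 3 contain the phrase
```

Collapsing whitespace lets a phrase match even when OCR splits it across a line break.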

Dependencies

pdf2image: Converts PDF pages into images.
opencv-python-headless: For image processing.
numpy: For numerical computations.
scipy: For spatial distance computations (optional).
pytesseract: Python wrapper for Tesseract OCR.
poppler-utils: Required for converting PDFs to images.
tesseract-ocr: The Tesseract OCR engine.
tesseract-ocr-kan: Kannada language data for Tesseract.
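Before running the tool, it can help to confirm that these dependencies are actually available. The check below is a small sketch using only the standard library; `check_dependencies` is a hypothetical helper, and the executable names (`pdftoppm` from poppler-utils, `tesseract` from tesseract-ocr) are the ones those packages normally install.

```python
import importlib.util
import shutil
from typing import Dict

# Import names can differ from pip package names (opencv-python-headless -> cv2).
PYTHON_MODULES = ["pdf2image", "cv2", "numpy", "scipy", "pytesseract"]
# Executables expected on PATH.
SYSTEM_TOOLS = ["pdftoppm", "tesseract"]


def check_dependencies() -> Dict[str, bool]:
    """Report which Python modules and system tools can be found."""
    status = {
        name: importlib.util.find_spec(name) is not None for name in PYTHON_MODULES
    }
    status.update({tool: shutil.which(tool) is not None for tool in SYSTEM_TOOLS})
    return status


for name, available in check_dependencies().items():
    print(f"{name:12} {'ok' if available else 'MISSING'}")
```

To confirm that the Kannada language data is installed, `tesseract --list-langs` should include `kan` in its output.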

OCR Language

The default OCR language is Kannada (kan). To work with other languages, adjust the OCR_LANGUAGE variable accordingly:

OCR_LANGUAGE = 'kan'  # Kannada language code
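As a sketch of how the language setting reaches Tesseract: pytesseract accepts a `lang` argument, and Tesseract allows several codes joined with `+`. The helper names `join_languages` and `extract_text` are illustrative and not taken from the project; every code used must have its language data installed.

```python
from typing import Any, Sequence

OCR_LANGUAGE = "kan"  # Kannada language code


def join_languages(codes: Sequence[str]) -> str:
    """Build a Tesseract language string such as 'kan+eng'."""
    cleaned = [code.strip() for code in codes if code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def extract_text(image: Any, lang: str = OCR_LANGUAGE) -> str:
    """Run OCR on one image with the given language string."""
    import pytesseract  # imported lazily so this module loads without it

    return pytesseract.image_to_string(image, lang=lang)


print(join_languages(["kan", "eng"]))  # language string for mixed pages
```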

How It Works

Preprocess Image: Convert PDF pages into grayscale and binarized images for better OCR results.
Extract Text: Use Tesseract to extract handwritten Kannada text from each page.
Search for Phrase: Search through the extracted text to locate the specific phrase.
Highlight Phrase: Draw bounding boxes around the found phrases in the document image and display them.
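The steps above can be sketched as follows. This is one possible arrangement, not the project's confirmed implementation: Otsu thresholding is an assumed choice of binarization, the function names are hypothetical, and `data` is expected to have the shape returned by `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom in pixels


def preprocess(page_bgr: Any) -> Any:
    """Convert to grayscale, then binarize with Otsu's threshold."""
    import cv2  # imported lazily so this module loads without OpenCV

    gray = cv2.cvtColor(page_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def find_phrase_boxes(data: Dict[str, List[Any]], phrase: str) -> List[Box]:
    """Locate a phrase in word-level OCR output; return one box per hit."""
    target = phrase.split()
    boxes: List[Box] = []
    if not target:
        return boxes
    # Keep only entries that carry text; empty strings mark layout rows.
    words = [i for i, text in enumerate(data["text"]) if str(text).strip()]
    for start in range(len(words) - len(target) + 1):
        window = words[start:start + len(target)]
        if [str(data["text"][i]).strip() for i in window] != target:
            continue
        # Merge the word boxes of the match into one enclosing rectangle.
        left = min(data["left"][i] for i in window)
        top = min(data["top"][i] for i in window)
        right = max(data["left"][i] + data["width"][i] for i in window)
        bottom = max(data["top"][i] + data["height"][i] for i in window)
        boxes.append((left, top, right, bottom))
    return boxes


def highlight(page_bgr: Any, boxes: List[Box]) -> Any:
    """Draw a red rectangle around each box on a copy of the page."""
    import cv2

    out = page_bgr.copy()
    for left, top, right, bottom in boxes:
        cv2.rectangle(out, (left, top), (right, bottom), (0, 0, 255), 2)
    return out
```

pdf2image returns PIL images, so a page would first need converting to an OpenCV array, for example with `cv2.cvtColor(numpy.array(page), cv2.COLOR_RGB2BGR)`. A phrase that wraps across two lines produces a single large box in this sketch.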

Limitations

Accuracy of text extraction may vary depending on the quality of the scanned document and handwriting clarity.
The current version supports only the Kannada language for OCR by default.
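One way to spot pages where extraction is likely unreliable is to average Tesseract's per-word confidence. The sketch below assumes word-level output in the shape of `pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT)`; `mean_confidence` is a hypothetical helper, and any cut-off applied to its result would need tuning on real documents.

```python
from typing import Any, Dict, List, Optional


def mean_confidence(data: Dict[str, List[Any]]) -> Optional[float]:
    """Average word confidence (0-100), or None if no words were read."""
    scores = []
    for text, conf in zip(data["text"], data["conf"]):
        value = float(conf)  # conf may arrive as int, float, or string
        # Tesseract reports -1 for non-word layout entries; skip those.
        if value >= 0 and str(text).strip():
            scores.append(value)
    return sum(scores) / len(scores) if scores else None


sample = {"text": ["", "ಕನ್ನಡ", "ನಾಡು"], "conf": ["-1", 80, "60.0"]}
print(mean_confidence(sample))  # average over the two recognized words
```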

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request.

Acknowledgments

Tesseract OCR - An open-source OCR engine.

Google Colab - For providing free cloud-based environments.
