OCR Document Parser (Tesseract + Streamlit)

This project performs Optical Character Recognition (OCR) on uploaded documents such as PAN cards, resumes, and handwritten notes using Tesseract OCR.
It automatically detects the document type and extracts key fields such as name, date of birth, PAN number, and email.
A simple Streamlit web UI is provided for uploading documents and searching the extracted fields.

The system is small and has four parts:

Backend: Tesseract (via pytesseract)
Parser: llm_parser.py (regex-based extraction + simple heuristics)
UI: Streamlit app ui_app.py
Batch runner: main.py (processes sample_docs/ and writes JSON to outputs/)
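
As a rough illustration of what "regex-based extraction + simple heuristics" can look like, here is a minimal sketch. The function names, keyword checks, and the sample PAN value are assumptions made for this example and are not taken from llm_parser.py; the only external fact used is the standard PAN layout of five letters, four digits, and one letter.

```python
import re

# Illustrative patterns; the actual expressions in llm_parser.py may differ.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")        # e.g. ABCDE1234F
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DOB_RE = re.compile(r"\b\d{2}[/-]\d{2}[/-]\d{4}\b")      # DD/MM/YYYY or DD-MM-YYYY


def detect_doc_type(text: str) -> str:
    """Guess the document type from simple cues (hypothetical heuristic)."""
    upper = text.upper()
    if PAN_RE.search(text) or "INCOME TAX DEPARTMENT" in upper:
        return "pan_card"
    if EMAIL_RE.search(text) or "EXPERIENCE" in upper or "EDUCATION" in upper:
        return "resume"
    # Anything without PAN or resume cues is treated as a handwritten note.
    return "handwritten"


def extract_fields(text: str) -> dict:
    """Return the detected type plus whichever fields the patterns find."""
    fields = {"doc_type": detect_doc_type(text)}
    for key, pattern in (("pan_number", PAN_RE), ("dob", DOB_RE), ("email", EMAIL_RE)):
        match = pattern.search(text)
        if match:
            fields[key] = match.group(0)
    return fields
```

Real OCR output is noisy, so a production parser would likely need to normalise whitespace and tolerate common misreads (such as `O` for `0`) before matching.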

Yeah! Let's begin

Project Structure

ocr-document-parser/
├── llm_parser.py # Logic to clean and parse extracted text
├── main.py # Batch script to run OCR and save structured outputs as JSON
├── ocr_engine.py # Handles image-to-text extraction using Tesseract OCR
├── ui_app.py # Streamlit web app for uploading and searching documents
├── requirements.txt # Project dependencies
├── README.md # Project overview and setup instructions
├── LICENSE # MIT License
├── .gitignore # Files and folders to ignore in Git
├── sample_docs/ # Example input images for testing
│   ├── handwritten.png
│   ├── pan_card.jpg
│   └── resume.jpg
├── outputs/ # JSON files generated after running OCR
│   ├── handwritten_result.json
│   ├── pan_card_result.json
│   └── resume_result.json
└── .venv/ # Virtual environment (ignored by Git)

Install Python 3.8+ and Tesseract OCR.

Clone or download this repo:

git clone https://github.com/Bharathyalagi/OCR-Document-parser.git

Install the Python dependencies:

pip install -r requirements.txt

Then install Tesseract for your platform.

Ubuntu/Linux
```
sudo apt install tesseract-ocr
```

Windows: download and run the installer from

https://github.com/UB-Mannheim/tesseract/wiki

Run CLI Batch
```
python main.py
```
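
For orientation, the batch step (read every image in sample_docs/, write one `<name>_result.json` per image to outputs/) can be sketched as below. This is not the code in main.py: `run_batch`, `IMAGE_SUFFIXES`, and the record keys are hypothetical names for this example, and the OCR call is passed in as a function so the loop can be tried without Tesseract installed.

```python
import json
from pathlib import Path
from typing import Callable, List

# Assumed set of image types; adjust to whatever the OCR backend accepts.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def run_batch(input_dir: str, output_dir: str,
              extract_text: Callable[[Path], str]) -> List[Path]:
    """OCR every image in input_dir and write <stem>_result.json to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(Path(input_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip anything that is not an image
        record = {"file": image_path.name, "raw_text": extract_text(image_path)}
        target = out / f"{image_path.stem}_result.json"
        target.write_text(json.dumps(record, indent=2, ensure_ascii=False),
                          encoding="utf-8")
        written.append(target)
    return written
```

With Tesseract and the Python dependencies installed, `extract_text` could be something like `lambda p: pytesseract.image_to_string(Image.open(p))`, with `Image` imported from Pillow.
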
Run web UI
```
streamlit run ui_app.py
```
Stop Streamlit server when done
```
CTRL + C
```

Note: We save parsed outputs as JSON because JSON stores structured key/value pairs (like "Name": "RAVI KUMAR"), is human-readable, and is easily consumed by other tools and APIs.
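
To make that concrete, the sketch below serialises a parsed record, reads it back, and runs a simple case-insensitive search over its keys and values. The `search_fields` helper and the `doc_type` key are assumptions for illustration; the search in ui_app.py may work differently.

```python
import json


def search_fields(record: dict, query: str) -> dict:
    """Return the key/value pairs whose key or value contains the query."""
    needle = query.lower()
    return {
        key: value
        for key, value in record.items()
        if needle in key.lower() or needle in str(value).lower()
    }


# Round-trip a record through JSON text, as a saved result file would be.
saved = json.dumps({"Name": "RAVI KUMAR", "doc_type": "pan_card"}, indent=2)
loaded = json.loads(saved)
print(search_fields(loaded, "name"))  # prints {'Name': 'RAVI KUMAR'}
```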
