Text Extraction Application

Overview

A Streamlit application that uses Optical Character Recognition (OCR) to extract text from images and PDF files. It applies custom image preprocessing to improve OCR accuracy and can process multiple uploaded files in one session.

Key Features

- Multi-file upload support (PNG, JPG, JPEG, PDF)
- Advanced image preprocessing techniques
- Configurable OCR options
- Real-time text extraction
- Downloadable extracted text files

Technologies

- Python 3.7+
- Streamlit
- OpenCV
- Pytesseract
- PyMuPDF
- NumPy
- Pillow

OCR Processing Techniques

Image Preprocessing

- Deskewing
- Binarization
- Noise removal
- Contrast enhancement
- Page segmentation mode selection
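
To illustrate the binarization step, here is a minimal, dependency-free sketch of Otsu's method, which picks the grayscale threshold that best separates dark text from a light background. This is an illustration only and not necessarily how this app implements binarization; the function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be a flat sequence of 0-255 grayscale values. With OpenCV, the same idea is usually expressed as `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the 0-255 threshold that maximizes between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]      # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg      # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * histogram[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map every pixel above the threshold to white (255), the rest to black (0)."""
    return [255 if value > threshold else 0 for value in pixels]
```

For example, `binarize(pixels, otsu_threshold(pixels))` turns a grayscale scan into a pure black-and-white image, which generally gives Tesseract cleaner character edges to work with.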

Pytesseract Configuration

- LSTM neural network mode
- Uniform text block assumption
- Interword space preservation
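
The three options above correspond to standard Tesseract settings: OCR engine mode 1 (LSTM only), page segmentation mode 6 (assume a single uniform block of text), and the `preserve_interword_spaces` config variable. The sketch below shows one way to assemble them into a Pytesseract config string. The helper names `build_tesseract_config` and `extract_text` are hypothetical, and the default values are assumptions rather than the exact settings this app ships with.

```python
def build_tesseract_config(oem: int = 1, psm: int = 6,
                           preserve_spaces: bool = True) -> str:
    """Build a Tesseract config string from the selected OCR options."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if preserve_spaces:
        # Keep runs of spaces between words instead of collapsing them.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def extract_text(image, config: str) -> str:
    """Run OCR on a PIL image or NumPy array with the given config string."""
    # Imported lazily so build_tesseract_config works without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(image, config=config)
```

Calling `extract_text(image, build_tesseract_config())` would then run Tesseract with all three options enabled.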

Installation

Clone the repository:

git clone https://github.com/PhoenixAlpha23/Pytesseract-Webapp.git
cd Pytesseract-Webapp

Install dependencies:
```
pip install -r requirements.txt
```
Install Tesseract for OCR functionalities:
- Ubuntu: sudo apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: Download from Tesseract GitHub
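
After installing, it can help to confirm that the `tesseract` binary is actually on your `PATH`, since Pytesseract only wraps the command-line tool. The following is a small sketch using the standard library; the function name `find_tesseract` is hypothetical and not part of this repository.

```python
import shutil
from typing import Optional


def find_tesseract(binary_name: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract executable, or None if not found."""
    return shutil.which(binary_name)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it before running the app.")
    else:
        print(f"Tesseract found at: {path}")
```

If the binary is installed in a non-standard location (common on Windows), Pytesseract can be pointed at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path.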

Usage

Run the Streamlit app:
```
streamlit run app.py
```
- Open http://localhost:8501 in your web browser
- Upload images or PDF files
- Configure OCR options in the sidebar
- Download extracted text files
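
As a rough illustration of the upload and download steps, the sketch below shows how uploaded files might be routed by extension and how a download file name could be derived. It is an assumption about the workflow, not the code in `app.py`; `classify_upload` and `output_name` are hypothetical helpers, and the `_extracted.txt` suffix is chosen only for this example.

```python
from pathlib import Path

# Supported upload types listed under Key Features.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}
PDF_EXTENSIONS = {".pdf"}


def classify_upload(filename: str) -> str:
    """Return 'image', 'pdf', or 'unsupported' based on the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    return "unsupported"


def output_name(filename: str) -> str:
    """Derive a name for the downloadable text file from the upload name."""
    return f"{Path(filename).stem}_extracted.txt"
```

In a flow like this, images would go straight to preprocessing and OCR, while PDF pages would first be rendered to images (for example with PyMuPDF) before following the same path.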

Deployment

Deploy using Streamlit Cloud:

- Push code to GitHub
- Connect Streamlit Cloud to the repository
- Configure build settings

Contributing

Contributions are welcome! Please submit pull requests to the main repository.

License

MIT License - see LICENSE file for details.
