The Website
A Streamlit application that uses Optical Character Recognition (OCR) to extract text from images and PDF files. The app applies custom image preprocessing to improve OCR accuracy and supports text extraction from multiple files in one session.
Features:
- Multi-file upload support (PNG, JPG, JPEG, PDF)
- Advanced image preprocessing techniques
- Configurable OCR options
- Real-time text extraction
- Downloadable extracted text files
Requirements:
- Python 3.7+
- Streamlit
- OpenCV
- Pytesseract
- PyMuPDF
- NumPy
- Pillow
Preprocessing techniques:
- Deskewing
- Binarization
- Noise removal
- Contrast enhancement
OCR options:
- Page segmentation mode selection
- LSTM neural network mode
- Uniform text block assumption
- Interword space preservation
- Clone the repository:
  git clone https://github.com/PhoenixAlpha23/Pytesseract-Webapp/main
  cd main
- Install dependencies:
  pip install -r requirements.txt
- Install Tesseract for OCR functionalities:
  - Ubuntu: sudo apt-get install tesseract-ocr
  - macOS: brew install tesseract
  - Windows: Download from Tesseract GitHub
- Run the Streamlit app:
  streamlit run app.py
- Open http://localhost:8501 in your web browser
- Upload images or PDF files
- Configure OCR options in the sidebar
- Download extracted text files
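The OCR options configured in the sidebar (page segmentation mode, LSTM engine, interword space preservation) correspond to standard Tesseract command-line flags. The sketch below shows one way such choices could be turned into a config string; `build_tesseract_config` and its parameter names are hypothetical and are not taken from the app's source.

```python
def build_tesseract_config(
    psm: int = 3,
    use_lstm: bool = True,
    preserve_spaces: bool = False,
) -> str:
    """Build a Tesseract config string from UI choices (hypothetical helper)."""
    # Tesseract defines page segmentation modes 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")

    parts = []
    if use_lstm:
        # --oem 1 selects the LSTM neural network engine only.
        parts.append("--oem 1")
    # --psm 6 is the mode that assumes a single uniform block of text.
    parts.append(f"--psm {psm}")
    if preserve_spaces:
        # Keeps runs of spaces between words in the output.
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


config = build_tesseract_config(psm=6, use_lstm=True, preserve_spaces=True)
# The string can be passed to Pytesseract, for example:
#   pytesseract.image_to_string(image, config=config)
```

Keeping the flag assembly in a plain function makes the option handling easy to check without running Tesseract itself.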
Deploy using Streamlit Cloud:
- Push code to GitHub
- Connect Streamlit Cloud to the repository
- Configure build settings
Contributions are welcome! Please submit pull requests to the main repository.
MIT License - see LICENSE file for details.