This is a self-hosted REST API for extracting text from images and PDFs using Tesseract OCR and OCRmyPDF.
It's powered by FastAPI with built-in Swagger docs and packaged as a Docker container.
- 🖼️ OCR for image files (.jpg, .png, etc.)
- 📄 OCR for PDF files (.pdf) using ocrmypdf
- 🌍 Language support: English and Bengali (Bangla)
- 📚 OpenAPI & Swagger UI for testing and docs
- ⚙️ REST API — easy to integrate anywhere
- 🐳 Fully Dockerized (no local dependencies beyond Docker)
.
├── app/
│ ├── main.py # FastAPI server with image + PDF OCR
│ └── requirements.txt # Python dependencies
├── Dockerfile # Builds image with Tesseract + OCRmyPDF
├── docker-compose.yml # Runs API container
└── README.md # Project documentation
Build and start the API:
docker-compose up --build
This will:
- Build the image
- Install Tesseract with Bengali support
- Install OCRmyPDF and run the API server
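Because the first build installs several packages, a script that depends on the API may need to wait until the server responds. The following is a minimal sketch, assuming the container publishes port 8000 on localhost and serves the docs page at /docs; the function names `is_up` and `wait_until_ready` are hypothetical helpers, not part of this project.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=2.0):
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(url="http://localhost:8000/docs", attempts=30, delay=2.0,
                     probe=is_up, sleep=time.sleep):
    """Poll `url` up to `attempts` times, pausing `delay` seconds between tries.

    `probe` and `sleep` are injectable so the loop can be tested offline.
    """
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay)
    return False
```

Calling `wait_until_ready()` before sending OCR requests returns `True` once the server answers, or `False` if it never does within the allotted attempts.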
To run it in the background (detached mode):
docker-compose up -d
To stop it:
docker-compose down
Extract text from an image file by sending a POST request to /ocr.
Request (multipart/form-data):
- file: image file (.jpg, .png, etc.)
- lang: language code (optional, default is eng)
Response:
{
"text": "Extracted text here..."
}
Example cURL:
curl -X POST http://localhost:8000/ocr \
-F "file=@<image-file>" \
-F "lang=ben"
Extract text from a PDF using ocrmypdf by sending a POST request to /ocr/pdf.
Request (multipart/form-data):
- file: PDF file
- lang: language code (optional, default is eng)
Response:
{
"text": "Text extracted from the PDF file..."
}
Example cURL:
curl -X POST http://localhost:8000/ocr/pdf \
-F "file=@<pdf-file>" \
-F "lang=eng"
After running the container, open the Swagger UI in your browser:
http://localhost:8000/docs
You'll see interactive, auto-generated documentation with a "Try it out" feature.
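The two endpoints can also be called from Python without extra packages. The sketch below mirrors the cURL examples: it encodes the `file` and `lang` form fields as multipart/form-data and reads the `text` key from the JSON response. It assumes the API is reachable at http://localhost:8000; `build_multipart` and `ocr` are hypothetical helper names, and the server's handling of the guessed content type is not confirmed by this README.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_BASE = "http://localhost:8000"  # assumption: host and port from the cURL examples


def build_multipart(file_name, file_bytes, lang="eng"):
    """Encode the `lang` and `file` fields; return (body, content_type_header)."""
    boundary = uuid.uuid4().hex
    mime = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="lang"\r\n\r\n'
        f"{lang}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        f"Content-Type: {mime}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + file_bytes + tail, f"multipart/form-data; boundary={boundary}"


def ocr(path, lang="eng", endpoint="/ocr"):
    """POST a file to `endpoint` (/ocr or /ocr/pdf) and return the extracted text."""
    path = Path(path)
    body, content_type = build_multipart(path.name, path.read_bytes(), lang)
    request = urllib.request.Request(
        API_BASE + endpoint,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["text"]
```

For example, `ocr("scan.png", lang="ben")` would target the image endpoint and `ocr("report.pdf", endpoint="/ocr/pdf")` the PDF endpoint; both file names are placeholders.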
Currently installed:
- eng: English
- ben: Bengali (Bangla)
Tesseract supports 100+ languages.
To add more languages, edit the Dockerfile and install the desired language packs.
To add Arabic (ara) and Hindi (hin), update this section:
RUN apt-get update && apt-get install -y \
tesseract-ocr \
tesseract-ocr-ben \
tesseract-ocr-ara \
tesseract-ocr-hin \
...
Then rebuild the image:
docker-compose up --build
You can find the full list of available languages here:
🔗 https://github.com/tesseract-ocr/tessdata
Use the language code as the value for the lang parameter in the API.
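The Dockerfile snippet above follows the pattern `tesseract-ocr-<code>` for language packages. As a rough sketch of that mapping, the hypothetical helper below turns plain language codes into package names. It assumes a Debian or Ubuntu base image, and the underscore-to-hyphen rule (for codes such as `chi_sim`) reflects Debian's package naming rather than anything stated in this README; script-based packs are not handled.

```python
def apt_package_for(lang_code):
    """Map a Tesseract language code to its assumed Debian package name."""
    code = lang_code.strip().lower().replace("_", "-")
    if not code or not code.replace("-", "").isalnum():
        raise ValueError(f"unexpected language code: {lang_code!r}")
    return f"tesseract-ocr-{code}"


def packages_for(lang_codes):
    """Return the package names to add to the apt-get install line."""
    return [apt_package_for(code) for code in lang_codes]


print(packages_for(["ben", "ara", "hin"]))
# → ['tesseract-ocr-ben', 'tesseract-ocr-ara', 'tesseract-ocr-hin']
```

It is worth checking that each generated package actually exists in the base image's repositories before rebuilding.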
This project is MIT licensed — feel free to use and modify it for personal or commercial use.
Pull requests are welcome! You can help by:
- Adding more language support
- Improving PDF/image processing
- Adding auth, usage limits, or queue support