VGU_OpticalCharacterRecognition is an application designed for advanced Optical Character Recognition (OCR) and document processing. It integrates a Python FastAPI backend with a React/Vite frontend to provide a robust platform for text extraction, analysis, and finding related documents.
The application allows users to upload images, extract text using various OCR engines, identify key terms with Natural Language Processing (NLP), discover related articles through integrated search functionalities, and generate concise summaries of the content.
We provide two versions of the application as pre-built Docker images from GitHub Container Registry (GHCR):
- latest-full: An image with all models pre-installed for immediate, full functionality. (~2.7 GB download size)
- latest-slim: A lightweight image that does not pre-install the Stanford NLP, VietOCR, EasyOCR, and Gemma models. These are downloaded on first use, which may cause an initial delay. (~1 GB download size)
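The first-use delay of the slim image is characteristic of lazy loading: a model is fetched the first time it is requested and reused afterwards. The sketch below only illustrates that pattern. `load_model` and `LOAD_COUNT` are hypothetical names, and nothing here is taken from the backend's actual loading code.

```python
from functools import lru_cache

# Counts how many times the (simulated) expensive load actually runs.
LOAD_COUNT = {}


@lru_cache(maxsize=None)
def load_model(name: str) -> dict:
    """Hypothetical loader: pretend to download and initialise a model once."""
    LOAD_COUNT[name] = LOAD_COUNT.get(name, 0) + 1
    # A real loader would download model weights here, which is the slow part.
    return {"name": name, "ready": True}


first = load_model("easyocr")   # slow path: triggers the simulated download
second = load_model("easyocr")  # fast path: served from the in-memory cache
print(first is second, LOAD_COUNT["easyocr"])  # prints: True 1
```

With the full image the equivalent of the slow path has already happened at build time, which is why it is larger but responds immediately.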
Ensure you have Docker Desktop installed and running on your system.
This application uses the Google Gemini API for its advanced OCR, NLP, and summarization features. You can obtain a free API key from Google AI Studio.
- Navigate to Google AI Studio: Open your web browser and go to https://aistudio.google.com/app/apikey.
- Create an API Key: You may be asked to sign in with your Google account. Once you are on the "API keys" page, click the "Create API key" button.
- Copy Your Key: A new API key will be generated for you. Copy this key and keep it secure. You will use this key in the next step to run the application.
Choose the version you want to use and pull the image from GHCR.
For the full version:
docker pull ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-full
For the slim version:
docker pull ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-slim
Run the application using the following command, replacing your_gemini_api_key_here with the actual Gemini API key you obtained in Step 1.
For the full version:
docker run -d -p 5173:80 -p 8000:8000 -e GEMINI_API_KEY="your_gemini_api_key_here" --name vgu_ocr_app ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-full
For the slim version:
docker run -d -p 5173:80 -p 8000:8000 -e GEMINI_API_KEY="your_gemini_api_key_here" --name vgu_ocr_app ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-slim
Optional: to enable Google Search, add the following flags to the docker run command, before the image name (you will need your own Google API key and Custom Search Engine ID):
-e GOOGLE_API_KEY="google_api_key" -e GOOGLE_CSE_ID="google_cse_key"
Once the container is running, the application will be accessible.
- Frontend Application: Open your web browser and navigate to http://localhost:5173.
- Backend API Documentation: The backend API documentation is available at http://localhost:8000/docs.
Note: If the engines fail to load, refresh the page.
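For scripted setups, the run command and the "is it up yet?" check can be automated. The following is a minimal sketch using only the Python standard library. The function names `build_run_command` and `wait_until_ready` are hypothetical, and the readiness check assumes the backend answers HTTP 200 at http://localhost:8000/docs once it has started.

```python
import shlex
import time
import urllib.request

IMAGE = "ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition"


def build_run_command(tag: str, env: dict) -> list:
    """Assemble the `docker run` arguments shown above for the given image tag."""
    cmd = ["docker", "run", "-d", "-p", "5173:80", "-p", "8000:8000",
           "-e", f"GEMINI_API_KEY={env['GEMINI_API_KEY']}"]
    # The Google Search flags are optional; add them only if both are set.
    if env.get("GOOGLE_API_KEY") and env.get("GOOGLE_CSE_ID"):
        cmd += ["-e", f"GOOGLE_API_KEY={env['GOOGLE_API_KEY']}",
                "-e", f"GOOGLE_CSE_ID={env['GOOGLE_CSE_ID']}"]
    # Options must come before the image name.
    cmd += ["--name", "vgu_ocr_app", f"{IMAGE}:{tag}"]
    return cmd


def wait_until_ready(url: str, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused, timeout, or HTTP error: retry shortly
        time.sleep(interval)
    return False


if __name__ == "__main__":
    example_env = {"GEMINI_API_KEY": "your_gemini_api_key_here"}
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(build_run_command("latest-slim", example_env)))
```

To start the container from the script, the returned list could be passed to `subprocess.run(cmd, check=True)`, followed by `wait_until_ready("http://localhost:8000/docs")` before opening the frontend.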
To stop the application, run:
docker stop vgu_ocr_app
To remove the container, run:
docker rm vgu_ocr_app
To delete the downloaded Docker image, use the docker rmi command with the image name.
For the full version:
docker rmi ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-full
For the slim version:
docker rmi ghcr.io/phamxuankhoa/vgu_opticalcharacterrecognition:latest-slim
- Image Upload and Processing: Interface for uploading images for OCR processing.
- Dynamic Engine Selection: Flexibility to choose from a wide array of engines for each processing step.
- Text Extraction: High-accuracy text extraction from image-based documents.
- Keyword Identification: Automated identification of key terms and phrases using NLP.
- Document Linking: Retrieval of relevant documents and articles based on extracted keywords.
- Content Summarization: Generation of automated summaries for extracted text.
- Containerized Deployment: Simplified setup and deployment using Docker for a consistent environment.
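The features above form a pipeline in which each step can be served by a different, user-selected engine. The sketch below shows one way such a design could be organised, using name-to-function registries. Every name in it (`fake_ocr`, `frequency_keywords`, `first_sentence_summary`, `run_pipeline`, and the registry keys) is a hypothetical stand-in rather than the project's actual code, and the document-linking step is omitted because it needs network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-in engines; real ones would call Tesseract, Gemini, etc.
def fake_ocr(image_bytes: bytes) -> str:
    """Pretend the 'image' is just UTF-8 text."""
    return image_bytes.decode("utf-8")


def frequency_keywords(text: str, top_n: int = 3) -> List[str]:
    """Rank words longer than three letters by how often they occur."""
    counts: Dict[str, int] = {}
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if len(word) > 3:
            counts[word] = counts.get(word, 0) + 1
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts, key=lambda w: (-counts[w], w))[:top_n]


def first_sentence_summary(text: str) -> str:
    """Use the first sentence as a naive summary."""
    return text.split(".")[0].strip() + "."


# One registry per processing step: engine name -> callable.
OCR_ENGINES: Dict[str, Callable[[bytes], str]] = {"fake": fake_ocr}
NLP_ENGINES: Dict[str, Callable[[str], List[str]]] = {"frequency": frequency_keywords}
SUMMARY_ENGINES: Dict[str, Callable[[str], str]] = {"first_sentence": first_sentence_summary}


@dataclass
class Result:
    text: str
    keywords: List[str]
    summary: str


def run_pipeline(image_bytes: bytes, ocr: str = "fake", nlp: str = "frequency",
                 summary: str = "first_sentence") -> Result:
    """Run OCR, keyword extraction, and summarization with the chosen engines."""
    text = OCR_ENGINES[ocr](image_bytes)
    return Result(text, NLP_ENGINES[nlp](text), SUMMARY_ENGINES[summary](text))


result = run_pipeline(b"Docker runs containers. Containers isolate apps. Docker is popular.")
print(result.keywords)  # ['containers', 'docker', 'apps']
print(result.summary)   # Docker runs containers.
```

Keeping the engines behind plain dictionaries means a new engine only has to be registered under a name to become selectable for its step.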
The application supports a variety of engines for different tasks, allowing users to select the best tool for their specific needs.
- Gemini: Powerful, multimodal OCR from Google's Gemini.
- VietOCR: Specialized engine for Vietnamese. (https://github.com/pbcquoc/vietocr)
- Pytesseract: A popular OCR engine based on Google's Tesseract.
- EasyOCR: A versatile OCR library supporting numerous languages.
- Gemini: Advanced NLP capabilities from Google's Gemini.
- Gemma: Keyword extraction using Google's Gemma models. (https://huggingface.co/google/gemma-3-270m-it)
- SpaCy: Industrial-strength NLP with pre-trained models.
- Stanza: A comprehensive NLP toolkit from Stanford University.
- Underthesea: A robust NLP toolkit specifically for Vietnamese.
- Pyvi: A simple NLP toolkit for Vietnamese language processing.
- DuckDuckGo: Provides the top 5 links from DuckDuckGo.
- DuckDuckGo Long: Provides more links for each keyword, up to 20.
- DuckDuckGo Edu: Filters DuckDuckGo search results for educational domains.
- Google Search: Uses the Google search engine (requires a Google API key and Custom Search Engine ID).
- Arxiv Search: Searches for academic papers and preprints on ArXiv.
- Gemini: High-quality text summarization using Google's Gemini.
- Gemma: Text summarization using Google's Gemma models; can be slow and inaccurate on long text. (https://huggingface.co/google/gemma-3-270m-it)
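The DuckDuckGo variants listed above differ mainly in how many links they keep and whether they restrict results to educational domains. The sketch below illustrates that kind of post-filtering on a list of result URLs. `EDU_SUFFIXES`, `is_educational`, and `filter_results` are hypothetical, and the set of suffixes counted as "educational" is an assumption; the project's real rule may differ.

```python
from typing import List
from urllib.parse import urlparse

# Assumed host suffixes treated as educational (illustrative only).
EDU_SUFFIXES = (".edu", ".edu.vn", ".ac.uk")


def is_educational(url: str) -> bool:
    """Return True if the URL's host name ends with an educational suffix."""
    host = (urlparse(url).hostname or "").lower()
    return host.endswith(EDU_SUFFIXES)


def filter_results(urls: List[str], limit: int = 5, edu_only: bool = False) -> List[str]:
    """Keep at most `limit` URLs, optionally only those on educational hosts."""
    kept = [u for u in urls if not edu_only or is_educational(u)]
    return kept[:limit]


sample = [
    "https://cs.example.edu/paper",
    "https://example.com/blog",
    "https://lib.example.edu.vn/",
    "https://education.example.org/",
]
print(filter_results(sample, edu_only=True))
# ['https://cs.example.edu/paper', 'https://lib.example.edu.vn/']
```

Under these assumptions, `limit=5` would correspond to the default DuckDuckGo engine and `limit=20` to DuckDuckGo Long. Matching on the parsed host name rather than the raw string avoids false positives such as `education.example.org`.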
- Python 3.10: Core language for backend development.
- FastAPI: High-performance web framework for building APIs.
- Uvicorn: ASGI server for running FastAPI applications.
- Requests & BeautifulSoup4: Libraries for web scraping and search functionalities.
- React: JavaScript library for building user interfaces.
- Vite: Modern frontend build tool for faster development.
- TypeScript: Typed superset of JavaScript for enhanced code quality.
- Tailwind CSS / Shadcn UI: Frameworks for designing and building the user interface and components.
- Fetch API: Standard API for making requests to the backend.
- Docker: Platform for developing, shipping, and running applications in containers.
This project was developed through the collaborative efforts of:
- @minhle-120: Backend development, including the FastAPI implementation and integration of engines.
- @PhamXuanKhoa: Frontend development, including UI/UX design and the creation of React components.