A simple OCR script to convert PDF files to DOCX or TXT formats using Tesseract OCR. This script is designed to extract text without preserving the original formatting of the document.
Clone the repository to your local machine:
git clone https://github.com/lancer1911/pdf2text.git
cd pdf2text
Alternatively, you can download the ZIP file and extract it to your desired location.
Open your terminal and create a new Conda environment:
conda create -n ocr_env python=3.9
Activate the newly created environment:
conda activate ocr_env
Install the required libraries using the requirements.txt file:
pip install -r requirements.txt
On Windows:
- Download and install Tesseract OCR from Tesseract at UB Mannheim.
- Download and install Poppler from Poppler for Windows.
- Add Tesseract and Poppler to your system PATH.
- Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal (setx takes effect in terminals opened afterwards, not in the current one):
setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata"
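To confirm that the PATH step worked, a short check along these lines may help on any platform. It is a minimal sketch, not part of the repository: the function name find_missing_tools is hypothetical, and it assumes the executables to look for are tesseract and Poppler's pdftoppm.

```python
import shutil


def find_missing_tools(tools=("tesseract", "pdftoppm"), which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the
    tools being installed; by default it is shutil.which.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("Tesseract and Poppler executables found on PATH.")
```

Run it in a newly opened terminal so that any recent PATH changes are picked up.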
On macOS, install Tesseract OCR and Poppler using Homebrew:
brew install tesseract
brew install tesseract-lang
brew install poppler
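Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal. The path below uses Homebrew's prefix on Intel Macs; on Apple Silicon, Homebrew installs under /opt/homebrew instead of /usr/local: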
export TESSDATA_PREFIX=/usr/local/share/
On Linux, install Tesseract OCR and Poppler using your package manager. For example, on Ubuntu:
sudo apt update
sudo apt install tesseract-ocr poppler-utils
To install additional language packs for Tesseract, you can use the following command:
sudo apt install tesseract-ocr-<language-code>
Replace <language-code> with the code for the language you want to install. Here are some examples:
- eng for English
- chi-sim for Simplified Chinese
- chi-tra for Traditional Chinese
- deu for German
- jpn for Japanese
- fra for French
To install multiple languages at once, you can use:
sudo apt install tesseract-ocr-eng tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-deu tesseract-ocr-jpn tesseract-ocr-fra
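Note that Tesseract itself names these languages with an underscore (for example chi_sim), while the Ubuntu package names above use a hyphen. The sketch below builds the install command from a list of Tesseract codes; the helper names apt_packages and apt_install_command are hypothetical, and the naming pattern is inferred from the commands above rather than from any packaging specification.

```python
def apt_packages(lang_codes):
    """Map Tesseract language codes (e.g. "chi_sim") to Ubuntu package names.

    Assumes the "tesseract-ocr-<code>" pattern shown above, with
    underscores replaced by hyphens.
    """
    return ["tesseract-ocr-" + code.lower().replace("_", "-") for code in lang_codes]


def apt_install_command(lang_codes):
    """Build a single apt command that installs all requested language packs."""
    return "sudo apt install " + " ".join(apt_packages(lang_codes))


if __name__ == "__main__":
    print(apt_install_command(["eng", "chi_sim", "chi_tra"]))
```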
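Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal. The path below contains a version directory (4.00), so adjust it if your installation stores its language data elsewhere: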
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
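A quick way to verify the variable is to list the language files it exposes. The following is a minimal sketch under one assumption: that TESSDATA_PREFIX points directly at the folder holding the .traineddata files, as in the Ubuntu path above. The function names are hypothetical and are not part of pdf2text.py.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def check_tessdata_prefix(env=os.environ):
    """Describe the state of TESSDATA_PREFIX in a one-line message."""
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return "TESSDATA_PREFIX is not set"
    if not Path(prefix).is_dir():
        return f"TESSDATA_PREFIX is not a directory: {prefix}"
    langs = installed_languages(prefix)
    if not langs:
        return f"No .traineddata files found in {prefix}"
    return "Languages found: " + ", ".join(langs)


if __name__ == "__main__":
    print(check_tessdata_prefix())
```

If a language you installed is missing from the output, the variable probably points at a different directory from the one your package manager populated.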
In the directory containing pdf2text.py, use the following commands to run the script:
- To perform OCR and output as DOCX format with Chinese and English recognition, specifying the PDF file through the command line:
python pdf2text.py -f path_to_file.pdf -o docx -l ce
- To perform OCR and output as TXT format with Chinese recognition, specifying the PDF file through the command line:
python pdf2text.py -f path_to_file.pdf -o txt -l c
- To perform OCR and output as DOCX format with default English recognition, using a GUI to select the PDF file:
python pdf2text.py
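The options used above could be parsed roughly as follows. This is only an illustrative sketch of how the -f, -o and -l flags and their defaults fit together, not the script's actual code. The LANG_CODES mapping is an assumption: only the short codes c and ce appear in the commands above, while the e code and the Tesseract language strings on the right-hand side are guesses.

```python
import argparse

# Assumed mapping from the short -l codes to Tesseract language strings.
# "c" and "ce" come from the commands above; "e" and all values are guesses.
LANG_CODES = {"e": "eng", "c": "chi_sim", "ce": "chi_sim+eng"}


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv."""
    parser = argparse.ArgumentParser(description="OCR a PDF to DOCX or TXT")
    parser.add_argument("-f", dest="file", default=None,
                        help="path to the PDF; omit to choose one in a file dialog")
    parser.add_argument("-o", dest="output", choices=["docx", "txt"], default="docx",
                        help="output format (default: docx)")
    parser.add_argument("-l", dest="lang", choices=sorted(LANG_CODES), default="e",
                        help="recognition language code (default: English)")
    args = parser.parse_args(argv)
    # Resolve the short code to the string that would be passed to Tesseract.
    args.tesseract_lang = LANG_CODES[args.lang]
    return args


if __name__ == "__main__":
    print(parse_args())
```

With no arguments, this sketch yields no file (so a file dialog would be needed), DOCX output, and English recognition, matching the defaults described above.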
The script pdf2text.py performs the following steps:
- Converts the PDF to images.
- Applies OCR on each image using Tesseract.
- Outputs the extracted text to either a DOCX or TXT file.
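The three steps above can be sketched as a small pipeline. This is a hedged illustration, not the contents of pdf2text.py: it assumes the pdf2image, pytesseract and python-docx packages (the actual requirements.txt may list different ones), and every function name here is hypothetical. The third-party imports are placed inside the functions so that each step can be swapped out or tested on its own.

```python
from pathlib import Path


def pdf_to_images(pdf_path):
    """Step 1: render each PDF page to an image (pdf2image relies on Poppler)."""
    from pdf2image import convert_from_path
    return convert_from_path(str(pdf_path))


def ocr_image(image, lang="eng"):
    """Step 2: run Tesseract on one page image and return its text."""
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)


def write_output(page_texts, out_path):
    """Step 3: write the page texts to a .docx file, or to plain text otherwise."""
    out_path = Path(out_path)
    if out_path.suffix.lower() == ".docx":
        from docx import Document
        doc = Document()
        for text in page_texts:
            doc.add_paragraph(text)  # one plain paragraph per page, no layout
        doc.save(str(out_path))
    else:
        out_path.write_text("\n\n".join(page_texts), encoding="utf-8")
    return out_path


def pdf_to_text(pdf_path, out_path, lang="eng",
                to_images=pdf_to_images, ocr=ocr_image):
    """Run the three steps in order and return the output path."""
    images = to_images(pdf_path)
    # strip() drops surrounding whitespace, including the form feed that
    # Tesseract output often ends with and that is not valid in a DOCX.
    texts = [ocr(image, lang).strip() for image in images]
    return write_output(texts, out_path)
```

Because each page becomes a single plain paragraph or text chunk, columns, tables, fonts and images from the PDF are not reproduced in the output.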
This script is intended for basic text extraction and does not retain the original formatting of the PDF.
This project is licensed under the MIT License.