PDF2TEXT (Simple Text Extractor)

A simple OCR script that converts PDF files to DOCX or TXT using Tesseract OCR. It extracts the text only and does not preserve the original formatting of the document.

Getting Started

1. Clone the Repository

Clone the repository to your local machine:

git clone https://github.com/lancer1911/pdf2text.git
cd pdf2text

Alternatively, you can download the ZIP file and extract it to your desired location.

2. Create Conda Environment

Open your terminal and create a new Conda environment:

conda create -n ocr_env python=3.9

Activate the newly created environment:

conda activate ocr_env

3. Install Dependencies

Install the required libraries using the requirements.txt file:

pip install -r requirements.txt
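
To confirm that the install succeeded, requirements.txt can be compared against the active environment. The following is a minimal sketch and not part of the repository: the helper names `requirement_names` and `missing_requirements` are hypothetical, and the parser only handles plain `name`, `name==version` and similar lines (not URLs or editable installs).

```python
import re
from importlib import metadata
from pathlib import Path


def requirement_names(text):
    """Return the distribution names listed in requirements.txt-style text."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        # The name is everything before a version specifier, extra, or marker.
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_requirements(text):
    """Return the listed distributions that are not installed."""
    missing = []
    for name in requirement_names(text):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    if req_file.exists():
        print("Missing:", missing_requirements(req_file.read_text()) or "none")
```

Run it from the repository root inside the activated ocr_env environment; an empty result means every listed package was found.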

4. Install Tesseract OCR and Poppler

Windows

Download and install Tesseract OCR from Tesseract at UB Mannheim.
Download and install Poppler from Poppler for Windows.
Add Tesseract and Poppler to your system PATH.
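Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal (the path below assumes the default install location; setx only affects terminals opened afterwards):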

setx TESSDATA_PREFIX "C:\Program Files\Tesseract-OCR\tessdata"

macOS

Install Tesseract OCR and Poppler using Homebrew:

brew install tesseract
brew install tesseract-lang
brew install poppler

Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal (on Apple Silicon Macs, Homebrew installs under /opt/homebrew instead of /usr/local):

export TESSDATA_PREFIX=/usr/local/share/tessdata

Linux

Install Tesseract OCR and Poppler using your package manager. For example, on Ubuntu:

sudo apt update
sudo apt install tesseract-ocr poppler-utils

To install additional language packs for Tesseract, you can use the following command:

sudo apt install tesseract-ocr-<language-code>

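Replace <language-code> with the code for the language you want to install. Package names use hyphens (chi-sim), while Tesseract itself names the same language with an underscore (chi_sim). Here are some examples: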

eng for English
chi-sim for Simplified Chinese
chi-tra for Traditional Chinese
deu for German
jpn for Japanese
fra for French

To install multiple languages at once, you can use:

sudo apt install tesseract-ocr-eng tesseract-ocr-chi-sim tesseract-ocr-chi-tra tesseract-ocr-deu tesseract-ocr-jpn tesseract-ocr-fra
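
To see which language packs Tesseract can actually find, its `--list-langs` option prints the installed codes. The sketch below wraps that call; the function names are hypothetical, and the header format it skips is an assumption based on typical Tesseract output.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # The first line is a header such as "List of available languages (3):".
        if not line or line.lower().startswith("list of"):
            continue
        langs.append(line)
    return langs


def installed_languages():
    """Return installed language codes, or None if tesseract is not on PATH."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "--list-langs"], capture_output=True, text=True)
    # Prefer stdout; fall back to stderr in case a build prints the list there.
    return parse_list_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_languages())
```

If chi_sim or another expected code is missing from the result, the matching tesseract-ocr-<language-code> package is probably not installed or TESSDATA_PREFIX points elsewhere.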

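Ensure the TESSDATA_PREFIX environment variable points to the Tesseract language data directory. You can set it in your terminal (the path below matches Tesseract 4 packages; adjust the version directory if your distribution ships a different release):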

export TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
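
Whatever the platform, the setup in this step can be sanity-checked before running the script. The following is a rough sketch under stated assumptions: `check_environment` is a hypothetical helper, it looks for `pdftoppm` as a representative Poppler tool, and it accepts TESSDATA_PREFIX pointing either at the tessdata directory or at its parent.

```python
import os
import shutil
from pathlib import Path


def check_environment(env=None):
    """Report whether the external tools from step 4 look usable."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    report = {
        "tesseract": shutil.which("tesseract"),  # None if not on PATH
        "pdftoppm": shutil.which("pdftoppm"),    # Poppler utility, None if missing
        "TESSDATA_PREFIX": prefix,
        "traineddata_found": False,
    }
    if prefix:
        base = Path(prefix)
        # Assumption: accept the tessdata directory itself or its parent.
        candidates = [base, base / "tessdata"]
        report["traineddata_found"] = any(
            c.is_dir() and any(c.glob("*.traineddata")) for c in candidates
        )
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `None` for tesseract or pdftoppm means the tool is not on PATH; `traineddata_found: False` suggests TESSDATA_PREFIX is unset or points at a directory without language files.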

5. Run the Script

In the directory containing pdf2text.py, use the following commands to run the script:

To run OCR on a PDF given on the command line, with Chinese and English recognition and DOCX output:
```
python pdf2text.py -f path_to_file.pdf -o docx -l ce
```
To run OCR on a PDF given on the command line, with Chinese recognition and TXT output:
```
python pdf2text.py -f path_to_file.pdf -o txt -l c
```
To select the PDF in a GUI file dialog and use the defaults (English recognition, DOCX output):
```
python pdf2text.py
```
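
The commands above imply a small command-line interface. One way such an interface could be built with argparse is sketched below. This is an illustration, not the code in pdf2text.py: the long option names, the `e` code, the Tesseract strings in `LANG_CODES`, and the tkinter dialog are all assumptions; only `-f`, `-o`, `-l`, the values `c` and `ce`, and the defaults come from the commands above.

```python
import argparse

# Assumed mapping from the short codes to Tesseract language strings.
# "c" and "ce" appear in the README commands; "e" and the values are guesses.
LANG_CODES = {"e": "eng", "c": "chi_sim", "ce": "chi_sim+eng"}


def build_parser():
    """Build a parser for the -f, -o and -l options."""
    parser = argparse.ArgumentParser(description="OCR a PDF to DOCX or TXT")
    parser.add_argument("-f", "--file", help="PDF to process; omit to pick one in a dialog")
    parser.add_argument("-o", "--output", choices=["docx", "txt"], default="docx")
    parser.add_argument("-l", "--lang", choices=sorted(LANG_CODES), default="e")
    return parser


def tesseract_lang(code):
    """Translate a short code such as "ce" into a Tesseract -l value."""
    return LANG_CODES[code]


def choose_pdf():
    """Ask for a PDF in a file dialog (assumes tkinter is available)."""
    import tkinter as tk
    from tkinter import filedialog

    root = tk.Tk()
    root.withdraw()  # hide the empty main window
    path = filedialog.askopenfilename(filetypes=[("PDF files", "*.pdf")])
    root.destroy()
    return path or None


if __name__ == "__main__":
    args = build_parser().parse_args()
    pdf = args.file or choose_pdf()
    print(pdf, args.output, tesseract_lang(args.lang))
```

Deferring the tkinter import to `choose_pdf` keeps the command-line path usable on machines without a display.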

Script Overview

The script pdf2text.py performs the following steps:

Converts the PDF to images.
Applies OCR on each image using Tesseract.
Outputs the extracted text to either a DOCX or TXT file.

This script is intended for basic text extraction and does not retain the original formatting of the PDF.
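
The three steps above can be sketched roughly as follows. This is a hedged illustration rather than the actual contents of pdf2text.py: it assumes the pdf2image, pytesseract and python-docx packages, and every function name here is hypothetical. The third-party imports are deferred so that the pipeline logic can be exercised without them.

```python
from pathlib import Path


def pdf_to_pages(pdf_path, dpi=300):
    """Render each PDF page to an image (assumes pdf2image and Poppler)."""
    from pdf2image import convert_from_path
    return convert_from_path(str(pdf_path), dpi=dpi)


def page_to_text(image, lang="eng"):
    """Run Tesseract on one page image (assumes pytesseract)."""
    import pytesseract
    return pytesseract.image_to_string(image, lang=lang)


def extract_text(pdf_path, lang="eng", render=pdf_to_pages, recognize=page_to_text):
    """Return the OCR text of every page, in page order."""
    pages = []
    for image in render(pdf_path):
        text = recognize(image, lang)
        # Drop form feeds, which may mark page ends and are not valid in DOCX XML.
        pages.append(text.replace("\f", "").strip())
    return pages


def write_txt(pages, out_path):
    """Write pages to a UTF-8 text file, separated by blank lines."""
    Path(out_path).write_text("\n\n".join(pages), encoding="utf-8")


def write_docx(pages, out_path):
    """Write one paragraph per text line (assumes python-docx)."""
    from docx import Document
    doc = Document()
    for text in pages:
        for line in text.splitlines():
            doc.add_paragraph(line)
    doc.save(str(out_path))
```

A call such as `write_txt(extract_text("input.pdf", lang="chi_sim+eng"), "input.txt")` would then cover the TXT case; layout, fonts and images from the PDF are not carried over, which matches the formatting caveat above.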

License

This project is licensed under the MIT License.
