PDF Image Extractor

A Python application that extracts all embedded images from PDF documents and saves them as JPEG or PNG files. Available as both a command-line script and a user-friendly GUI application.

Prerequisites

Make sure you have Python 3.6+ installed on your system.

Installation

Navigate to the project directory:
```
cd ~/Desktop/pdf_image_extractor
```
Install the required Python packages:
```
pip install -r requirements.txt
```
Or install individually:
```
pip install PyMuPDF Pillow
```
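
To confirm the installation worked, you can check that both packages are importable before running the extractor. The following is a minimal sketch, not the contents of the repository's test_setup.py; the function name `missing_packages` and the `REQUIRED` table are illustrative assumptions. It relies on the fact that PyMuPDF is imported as `fitz` and Pillow as `PIL`.

```python
import importlib.util

# Import name -> pip package name (hypothetical table for this sketch).
REQUIRED = {"fitz": "PyMuPDF", "PIL": "Pillow"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages. Try: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```

`find_spec` only looks the module up; it does not import it, so the check is quick and has no side effects.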

Usage

Option 1: GUI Application (Recommended)

The easiest way to use the extractor is with the graphical interface:

Launch the GUI:
- Double-click: run_pdf_extractor.command (macOS)
- Or run: python3 pdf_image_extractor_gui.py
- Or run: python3 launch_gui.py
Use the interface:
- Click "Browse..." to select your PDF file
- Choose output format (JPEG or PNG)
- Click "Extract Images"
- Watch the progress and log output
- When extraction finishes, choose whether to open the output folder

Option 2: Command Line Script

Update the script: Open extract_pdf_images.py and modify these variables:
- pdf_input_path: Path to your PDF file
- output_folder: Directory where extracted images will be saved
Run the script:
```
python3 extract_pdf_images.py
```

Example

```
# In extract_pdf_images.py, modify these lines:
pdf_input_path = "/path/to/your/document.pdf"
output_folder = "my_extracted_images"
```

Then run:

```
python3 extract_pdf_images.py
```

Output

- Images are saved as JPEG files (the GUI also offers PNG)
- Naming convention: image_page{page_number}_{image_number}.jpeg
- Example: image_page1_1.jpeg, image_page2_1.jpeg, etc.
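
The naming convention can be expressed as a small helper, together with a sort key that orders files by page and then by image number. This is an illustrative sketch: `image_filename` and `page_and_image_key` are hypothetical names, not functions from the project, and the `extension` parameter is an assumption to cover the PNG option. The sort key is useful because plain alphabetical sorting places page 10 before page 2.

```python
import os
import re

# Matches names such as image_page3_2.jpeg and captures the two numbers.
_NAME_PATTERN = re.compile(r"image_page(\d+)_(\d+)\.\w+")


def image_filename(page_number, image_number, extension="jpeg"):
    """Build a file name following image_page{page_number}_{image_number}."""
    return f"image_page{page_number}_{image_number}.{extension}"


def page_and_image_key(path):
    """Return (page_number, image_number) so files sort in document order."""
    match = _NAME_PATTERN.fullmatch(os.path.basename(path))
    if match is None:
        raise ValueError(f"Not an extracted image name: {path}")
    return int(match.group(1)), int(match.group(2))


names = [image_filename(10, 1), image_filename(2, 1), image_filename(2, 3)]
ordered = sorted(names, key=page_and_image_key)
print(ordered)
```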

Libraries Used

- PyMuPDF (fitz): For PDF manipulation and image extraction
- Pillow (PIL): For image processing and saving

Notes

- The script creates the output directory automatically if it doesn't exist
- Each image is numbered according to its page and position on that page
- Error handling is included for corrupted images or PDF files
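
The behaviour described in these notes can be sketched roughly as follows. The project's actual code is not reproduced here, so treat this as an assumption about how such a loop might look, not as the script's implementation; `extract_images` is a hypothetical name. It uses PyMuPDF's `Page.get_images` and `Document.extract_image` together with Pillow to re-encode each image as JPEG. In this sketch only per-image failures are caught; an unreadable PDF would still raise from `fitz.open`.

```python
import io
import os


def extract_images(pdf_path, output_folder):
    """Save each embedded image as a JPEG and return how many were saved."""
    import fitz  # PyMuPDF (third-party)
    from PIL import Image  # Pillow (third-party)

    # Create the output directory if it does not exist yet.
    os.makedirs(output_folder, exist_ok=True)

    saved = 0
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            images = page.get_images(full=True)
            for image_number, info in enumerate(images, start=1):
                xref = info[0]  # first field is the image's object number
                try:
                    raw = doc.extract_image(xref)["image"]
                    image = Image.open(io.BytesIO(raw)).convert("RGB")
                    name = f"image_page{page_number}_{image_number}.jpeg"
                    image.save(os.path.join(output_folder, name), "JPEG")
                    saved += 1
                except Exception as exc:
                    # Skip a corrupted image and continue with the rest.
                    print(f"Skipped image {image_number} on page {page_number}: {exc}")
    return saved
```

Converting to RGB before saving is a deliberate choice in this sketch: JPEG cannot store an alpha channel or a palette, so images in those modes would otherwise fail to save.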

Important: Fragmented Images

Some extracted images may appear fragmented or in separate parts. This is a common occurrence in PDF image extraction and is typically due to how the original PDF was created or flattened:

- PDF structure: a picture that looks like one image on the page may be stored in the file as several separate image objects, such as strips or tiles
- Flattening: when a PDF is flattened (its layers merged into one), images can be split into multiple segments
- Original authoring: whether fragmentation occurs depends on how the original author created or processed the document
- No automatic solution: there is no universally reliable method to reassemble fragmented images automatically

This fragmentation reflects how the PDF was produced and stored, not a defect in the extraction tool. If you encounter fragmented images, you may need to combine them manually using image editing software.
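
In the simple case where the fragments are full-width horizontal strips and you already know their top-to-bottom order, Pillow can join them. The sketch below is not part of the project and does not detect fragments or their order automatically; `stack_layout` and `stack_vertically` are hypothetical names, and the strip layout is an assumption that will not hold for tiled or overlapping fragments.

```python
def stack_layout(sizes):
    """Return (canvas_size, offsets) for stacking images top to bottom.

    sizes is a list of (width, height) tuples in the desired order.
    """
    canvas_width = max(width for width, _ in sizes)
    offsets = []
    y = 0
    for _, height in sizes:
        offsets.append((0, y))  # each strip starts where the previous ended
        y += height
    return (canvas_width, y), offsets


def stack_vertically(fragment_paths, output_path):
    """Paste the given image files onto one canvas, top to bottom."""
    from PIL import Image  # Pillow (third-party)

    fragments = [Image.open(path).convert("RGB") for path in fragment_paths]
    canvas_size, offsets = stack_layout([f.size for f in fragments])
    canvas = Image.new("RGB", canvas_size, "white")
    for fragment, offset in zip(fragments, offsets):
        canvas.paste(fragment, offset)
    canvas.save(output_path)
```

For example, `stack_vertically(["image_page1_1.jpeg", "image_page1_2.jpeg"], "page1_combined.jpeg")` would join two strips from page 1, provided they really are consecutive strips of the same picture.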
