A Python application that extracts all embedded images from PDF documents and saves them as JPEG or PNG files. Available as both a command-line script and a user-friendly GUI application.
Make sure you have Python 3.6+ installed on your system.
- Navigate to the project directory:
  cd ~/Desktop/pdf_image_extractor
- Install the required Python packages:
  pip install -r requirements.txt
  Or install individually:
  pip install PyMuPDF Pillow
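Before launching the tool, it can help to confirm that the interpreter and both packages are in place. The following is a minimal sketch, not part of the project: the helper name `missing_packages` is hypothetical, and it assumes PyMuPDF is importable as `fitz` and Pillow as `PIL`.

```python
import importlib.util
import sys

# Assumption: PyMuPDF is imported as "fitz" and Pillow as "PIL".
REQUIRED_MODULES = {"PyMuPDF": "fitz", "Pillow": "PIL"}


def missing_packages(required=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    if sys.version_info < (3, 6):
        print("Python 3.6 or newer is required")
    missing = missing_packages()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages are installed")
```

If the check reports a missing package, rerunning the pip command above should resolve it.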
The easiest way to use the extractor is with the graphical interface:
- Launch the GUI:
  - Double-click run_pdf_extractor.command (macOS)
  - Or run: python3 pdf_image_extractor_gui.py
  - Or run: python3 launch_gui.py
- Use the interface:
  - Click "Browse..." to select your PDF file
  - Choose output format (JPEG or PNG)
  - Click "Extract Images"
  - Watch the progress and log output
  - Choose to open the output folder when complete
To use the command-line script instead:
- Update the script: Open extract_pdf_images.py and modify these variables:
  - pdf_input_path: Path to your PDF file
  - output_folder: Directory where extracted images will be saved
- Run the script:
  python3 extract_pdf_images.py
# In extract_pdf_images.py, modify these lines:
pdf_input_path = "/path/to/your/document.pdf"
output_folder = "my_extracted_images"
Then run:
python3 extract_pdf_images.py
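As a rough illustration of what a script configured this way might do internally, here is a sketch of an extraction loop built on PyMuPDF and Pillow. It is not the project's actual code: the function name `extract_images` is hypothetical, its two parameters simply mirror the variables above, and converting every image to RGB before saving as JPEG is an assumption.

```python
import io
from pathlib import Path


def extract_images(pdf_input_path, output_folder):
    """Save every embedded image of a PDF as a JPEG and return the saved paths."""
    # Third-party imports are deferred so that a missing package only
    # surfaces when extraction actually runs.
    import fitz  # PyMuPDF
    from PIL import Image

    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []

    with fitz.open(pdf_input_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            # get_images() lists the images referenced by the page;
            # the first field of each entry is the image's xref.
            for image_number, item in enumerate(page.get_images(full=True), start=1):
                try:
                    raw = doc.extract_image(item[0])["image"]
                    image = Image.open(io.BytesIO(raw)).convert("RGB")
                    target = out_dir / f"image_page{page_number}_{image_number}.jpeg"
                    image.save(target, "JPEG")
                    saved.append(target)
                except Exception as exc:
                    # Skip images that cannot be decoded and keep going.
                    print(f"Skipping image {image_number} on page {page_number}: {exc}")
    return saved
```

Calling `extract_images("/path/to/your/document.pdf", "my_extracted_images")` would then correspond to the example configuration.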
- Images are saved as JPEG files
- Naming convention: image_page{page_number}_{image_number}.jpeg
- Example: image_page1_1.jpeg, image_page2_1.jpeg, etc.
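The naming convention can be expressed as a pair of small helpers, which is handy when sorting the output numerically (a plain alphabetical sort would place page 10 before page 2). This is a sketch only: `image_filename` and `page_and_index` are hypothetical names, and accepting a `png` extension is an assumption based on the GUI's format option.

```python
import re

# Matches names such as image_page3_2.jpeg (png accepted as an assumption).
NAME_PATTERN = re.compile(r"image_page(\d+)_(\d+)\.(?:jpeg|png)$")


def image_filename(page_number, image_number, extension="jpeg"):
    """Build a file name following the image_page{page}_{image} convention."""
    return f"image_page{page_number}_{image_number}.{extension}"


def page_and_index(filename):
    """Return (page_number, image_number) parsed from a file name."""
    match = NAME_PATTERN.search(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return int(match.group(1)), int(match.group(2))


names = [image_filename(10, 1), image_filename(2, 1), image_filename(1, 1)]
print(sorted(names, key=page_and_index))
# ['image_page1_1.jpeg', 'image_page2_1.jpeg', 'image_page10_1.jpeg']
```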
- PyMuPDF (fitz): For PDF manipulation and image extraction
- Pillow (PIL): For image processing and saving
- The script creates the output directory automatically if it doesn't exist
- Each image is numbered according to its page and position on that page
- Error handling is included for corrupted images or PDF files
Some extracted images may appear fragmented or in separate parts. This is a common occurrence in PDF image extraction and is typically due to how the original PDF was created or flattened:
- PDF Structure: PDFs often store images in fragmented pieces during the document creation process
- Flattening Process: When PDFs are flattened (merged into single layers), images can be split into multiple segments
- Original Authoring: The fragmentation depends on how the original PDF author created or processed the document
- No Automatic Solution: There is no universally reliable method to automatically reassemble fragmented images
This fragmentation reflects how the PDF stores its images, not a defect in the extraction tool. If you encounter fragmented images, you may need to combine them manually using image editing software.
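When the fragments happen to be horizontal strips of one picture, recombining them can sometimes be scripted with Pillow instead of done by hand. The sketch below is one possible approach under strong assumptions: the fragments are supplied in top-to-bottom order, they share a left edge, and the helper name `stack_vertically` is hypothetical. It will not help with tiles arranged in a grid or with fragments whose order is unknown.

```python
from PIL import Image


def stack_vertically(fragments):
    """Paste images top to bottom onto one canvas, in the order given."""
    width = max(fragment.width for fragment in fragments)
    height = sum(fragment.height for fragment in fragments)
    # White background fills any area a narrower fragment leaves uncovered.
    canvas = Image.new("RGB", (width, height), "white")
    offset = 0
    for fragment in fragments:
        canvas.paste(fragment.convert("RGB"), (0, offset))
        offset += fragment.height
    return canvas
```

For example, `stack_vertically([Image.open(p) for p in paths]).save("combined.jpeg")` would merge the files listed in `paths`, where both `paths` and the output name are placeholders.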