adibzailan/pdf-image-extractor

PDF Image Extractor

A Python application that extracts all embedded images from PDF documents and saves them as JPEG or PNG files. Available as both a command-line script and a user-friendly GUI application.

Prerequisites

Make sure you have Python 3.6+ installed on your system.

Installation

  1. Navigate to the project directory (adjust the path to wherever you saved the project):

    cd ~/Desktop/pdf_image_extractor
  2. Install the required Python packages:

    pip install -r requirements.txt

    Or install individually:

    pip install PyMuPDF Pillow
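To confirm the installation worked, a quick check like the one below can help. It is only a sketch: the helper name `missing_packages` is made up for this example, and it relies on the import names mentioned under "Libraries Used" (`fitz` for PyMuPDF, `PIL` for Pillow).

```python
import importlib.util

# pip package name -> module name used in "import ..." statements
REQUIRED = {"PyMuPDF": "fitz", "Pillow": "PIL"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages. Try: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```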

Usage

Option 1: GUI Application (Recommended)

The easiest way to use the extractor is with the graphical interface:

  1. Launch the GUI:

    • Double-click: run_pdf_extractor.command (macOS)
    • Or run: python3 pdf_image_extractor_gui.py
    • Or run: python3 launch_gui.py
  2. Use the interface:

    • Click "Browse..." to select your PDF file
    • Choose output format (JPEG or PNG)
    • Click "Extract Images"
    • Watch the progress and log output
    • Choose to open the output folder when complete

Option 2: Command Line Script

  1. Update the script: Open extract_pdf_images.py and modify these variables:

    • pdf_input_path: Path to your PDF file
    • output_folder: Directory where extracted images will be saved
  2. Run the script:

    python3 extract_pdf_images.py

Example

# In extract_pdf_images.py, modify these lines:
pdf_input_path = "/path/to/your/document.pdf"
output_folder = "my_extracted_images"

Then run:

python3 extract_pdf_images.py

Output

  • Images are saved as JPEG files, or as PNG files if you select PNG in the GUI
  • Naming convention: image_page{page_number}_{image_number}.jpeg
  • Example: image_page1_1.jpeg, image_page2_1.jpeg, etc.
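Because the page and image numbers are not zero-padded, a plain alphabetical sort puts image_page10_1.jpeg before image_page2_1.jpeg. If you need the files in document order, a small sort key along these lines may help. The helper name `page_order_key` is hypothetical and not part of the project; it assumes the .jpeg naming convention listed above.

```python
import re

# Matches the naming convention image_page{page_number}_{image_number}.jpeg
_NAME_PATTERN = re.compile(r"image_page(\d+)_(\d+)\.jpeg$")


def page_order_key(filename):
    """Sort key that orders files by page number, then image number."""
    match = _NAME_PATTERN.search(filename)
    if match is None:
        # Names that do not follow the convention sort after all others
        return (float("inf"), float("inf"), filename)
    return (int(match.group(1)), int(match.group(2)), filename)


names = ["image_page10_1.jpeg", "image_page2_2.jpeg", "image_page2_1.jpeg"]
print(sorted(names, key=page_order_key))
```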

Libraries Used

  • PyMuPDF (fitz): For PDF manipulation and image extraction
  • Pillow (PIL): For image processing and saving

Notes

  • The script creates the output directory automatically if it doesn't exist
  • Each image is numbered according to its page and position on that page
  • Error handling is included for corrupted images or PDF files
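The behavior described in these notes can be sketched roughly as follows. This is not the repository's actual script, just a minimal illustration under stated assumptions: the function names `iter_image_refs` and `extract_images` are invented here, a reasonably recent PyMuPDF is assumed (one that provides `page.get_images` and `doc.extract_image`), and every image is converted to RGB before saving as JPEG, which drops any transparency.

```python
import io
import os


def iter_image_refs(doc):
    """Yield (page_number, image_number, xref) for each image, both 1-based."""
    for page_number, page in enumerate(doc, start=1):
        # get_images(full=True) returns one tuple per image; item 0 is the xref
        for image_number, info in enumerate(page.get_images(full=True), start=1):
            yield page_number, image_number, info[0]


def extract_images(pdf_input_path, output_folder):
    """Save every embedded image as a JPEG and return the list of saved paths."""
    # Third-party imports are local so iter_image_refs works without them
    import fitz  # PyMuPDF
    from PIL import Image  # Pillow

    # Create the output directory if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)
    saved = []
    with fitz.open(pdf_input_path) as doc:
        for page_number, image_number, xref in iter_image_refs(doc):
            try:
                raw = doc.extract_image(xref)["image"]
                image = Image.open(io.BytesIO(raw)).convert("RGB")
            except Exception as exc:
                # Corrupted or unsupported image data: report it and move on
                print(f"Skipping image {image_number} on page {page_number}: {exc}")
                continue
            name = f"image_page{page_number}_{image_number}.jpeg"
            out_path = os.path.join(output_folder, name)
            image.save(out_path, "JPEG")
            saved.append(out_path)
    return saved
```

Calling `extract_images("/path/to/your/document.pdf", "my_extracted_images")` would then write files such as image_page1_1.jpeg into the output folder.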

Important: Fragmented Images

Some extracted images may come out as several separate pieces instead of one complete picture. This is common in PDF image extraction and usually reflects how the original PDF was created or flattened:

  • PDF Structure: The software that creates a PDF may store a single picture as several separate image objects, such as strips or tiles
  • Flattening Process: When a PDF is flattened (its layers merged into one), images can be split into multiple segments
  • Original Authoring: Whether and how images are fragmented depends on how the author created or processed the document
  • No Automatic Solution: There is no universally reliable method to automatically reassemble fragmented images

This fragmentation comes from how the PDF stores its images, not from the extraction tool. If you encounter fragmented images, you may need to combine them manually using image editing software.
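For the simple case where the pieces are horizontal strips of one picture, combining them can also be scripted with Pillow. The sketch below assumes you have already checked by eye which files belong together and pass them in top-to-bottom order; it does not detect fragments on its own. The function names `stack_layout` and `stitch_vertically` are hypothetical.

```python
def stack_layout(sizes):
    """Given (width, height) pairs, return (canvas_width, canvas_height, y_offsets)
    for stacking the pieces top to bottom in the given order."""
    canvas_width = max(width for width, _ in sizes)
    y_offsets = []
    y = 0
    for _, height in sizes:
        y_offsets.append(y)
        y += height
    return canvas_width, y, y_offsets


def stitch_vertically(paths, out_path):
    """Paste the images at `paths` below one another and save the result."""
    from PIL import Image  # Pillow, already a dependency of this project

    images = [Image.open(path).convert("RGB") for path in paths]
    width, height, offsets = stack_layout([image.size for image in images])
    # Narrower pieces are left-aligned; the unused area stays white
    canvas = Image.new("RGB", (width, height), "white")
    for image, y in zip(images, offsets):
        canvas.paste(image, (0, y))
    canvas.save(out_path)
    return out_path
```

For example, `stitch_vertically(["image_page3_1.jpeg", "image_page3_2.jpeg"], "page3_combined.jpeg")` would stack two strips from page 3. Tiled or side-by-side fragments would need a different layout.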

About

just a really quick way to extract them images from those research papers
