The PDF_CATEGORIZER system is a two-phase pipeline for classifying and segmenting large PDF collections.
- Phase 1: Classification: The system first analyzes each PDF to determine its structural complexity. It uses a combination of metadata analysis and layout inspection to classify books into categories, such as a "Simple Linear Monograph" (Level 1) or a "Standard Hierarchical Textbook" (Level 2). This crucial step identifies which books have reliable, machine-readable data.
- Phase 2: Segmentation: For books identified as having high-quality metadata, the system uses the Gemini AI to intelligently generate precise command-line instructions. It then executes these commands to automatically split the PDF into its constituent parts, such as the table of contents, individual chapters, and appendices.
This powerful workflow allows you to deconstruct an entire library of books into well-named, chapter-level files, making the content ready for targeted analysis, Retrieval-Augmented Generation (RAG), or ingestion into other AI models with limited context windows.
This system relies on Python and several external command-line tools. Ensure you have the following installed:
- Python 3: Download from python.org.
- Google Gemini API Key: Obtain one from Google AI for Developers.
- pdftk: A powerful command-line tool for manipulating PDFs. It must be installed and accessible from your system's PATH.
- Ghostscript: Required for handling certain types of PDF processing and password removal. It must be installed and in your system's PATH.
- qpdf: A command-line tool for PDF transformation, used here as a fallback for password removal. It must be installed and in your system's PATH.
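Before going further, you can confirm these tools are actually reachable from Python with a quick standard-library check. This snippet is not part of the project; it also assumes Ghostscript's executable is named `gs`, which is typical on Linux/macOS (on Windows it is usually `gswin64c`).

```python
# Optional pre-flight check (not part of PDF_CATEGORIZER): verify the external
# tools are on PATH. The Ghostscript binary name `gs` is an assumption; on
# Windows it is typically `gswin64c`.
import shutil

for tool in ("pdftk", "gs", "qpdf"):
    status = "found" if shutil.which(tool) else "MISSING from PATH"
    print(f"{tool}: {status}")
```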
- Download the Project: Unzip or clone the `PDF_CATEGORIZER` project to your local machine.
- Open a Terminal: Navigate to the root directory of the `PDF_CATEGORIZER` project.
- Install Dependencies: It is highly recommended to use a Python virtual environment. Install the required packages using the following command:

```bash
pip install google-generativeai python-dotenv pypdf PyMuPDF pandas
```

- Create Environment File: Create a new file named `.env` in the project's root directory.
- Set API Key: Open the `.env` file and add your Gemini API key (a quick verification sketch follows this list):

```
GEMINI_API_KEY=YOUR_API_KEY
```
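To confirm the key is picked up correctly, a minimal sketch like the following can be run from the project root. It uses only the packages installed above and is not part of the project's own code.

```python
# Minimal sketch (not project code): confirm the .env key loads and the Gemini
# client accepts it. Assumes the packages installed in the previous step.
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads GEMINI_API_KEY from the .env file in the current directory
api_key = os.environ.get("GEMINI_API_KEY")
if not api_key:
    raise SystemExit("GEMINI_API_KEY is missing from .env")
genai.configure(api_key=api_key)
print("Gemini API key loaded and configured.")
```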
Step 1: Place Your PDFs
Place the PDF books you wish to process into the BOOKS/ directory. The system will recursively scan any subdirectories as well.
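The recursive scan can be pictured in a few lines of standard-library Python. This is an illustration of the behavior described above, not the discovery code `pipe.py` actually uses.

```python
# Illustration only: recursive PDF discovery as described above. pipe.py's
# actual discovery logic may differ.
from pathlib import Path

pdf_paths = sorted(Path("BOOKS").rglob("*.pdf"))  # walks all subdirectories
print(f"Found {len(pdf_paths)} PDF(s) to classify")
```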
Step 2: Run the Classification Pipeline
The first step is to classify every book in your collection. Open your terminal in the project root and run:
```bash
python pipe.py
```

This script will:
- Iterate through every PDF file it finds.
- Use `metadata_checker.py` to check for a valid table of contents (bookmarks); a sketch of such a check follows this list.
- If no metadata is found, fall back to `layout_analyzer.py` to inspect the visual structure.
- Send the collected evidence to the Gemini API, which assigns a structural complexity level to the book.
- Save all results to a file named `book_classifications.jsonl`. This file is a detailed log of your collection and the input for the next phase.
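For intuition, here is a hypothetical version of the bookmark check that `metadata_checker.py` performs, written against pypdf (one of the dependencies installed earlier). The function name and the `min_entries` threshold are illustrative, not the project's actual API.

```python
# Hypothetical sketch of a bookmark (table-of-contents) check using pypdf.
# has_usable_toc and min_entries are illustrative names, not project API.
from pypdf import PdfReader

def has_usable_toc(pdf_path: str, min_entries: int = 3) -> bool:
    """Return True if the PDF carries enough bookmarks to be worth segmenting."""
    try:
        outline = PdfReader(pdf_path).outline  # nested list of Destination objects
    except Exception:
        return False  # unreadable or malformed outline counts as "no metadata"

    def count(entries) -> int:
        total = 0
        for item in entries:
            total += count(item) if isinstance(item, list) else 1
        return total

    return count(outline) >= min_entries
```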
You can also explore the graphs/ folder, which contains scripts to visualize the classification results and gain insights into your corpus.
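Because the file is JSON Lines, you can also inspect it directly with pandas (already installed as a dependency). The columns depend on the pipeline's output schema, so this sketch just prints whatever is there rather than assuming field names.

```python
# Quick manual inspection of the classification log; column names are whatever
# the pipeline wrote, so we only print the frame rather than assume a schema.
import pandas as pd

df = pd.read_json("book_classifications.jsonl", lines=True)
print(df.head())
print(f"{len(df)} books classified")
```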
Step 3: Run the Segmentation Pipeline
After classification is complete, you can segment the eligible books. In the same terminal, run:
```bash
python segmentation_pipe.py
```

This script performs the following critical actions:
- It reads the `book_classifications.jsonl` file.
- It filters for "safe" books: those that were classified using `metadata_check`, as this confirms they have a reliable table of contents suitable for segmentation.
- For each safe book, it sends the bookmark data to the Gemini AI, instructing it to act as an expert and return a list of `pdftk` commands to extract each chapter.
- It receives the JSON response and executes each command, effectively splitting the PDF (a sketch of this execution step follows this list).
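As a sketch of that execution step: `pdftk <input> cat <pages> output <file>` is the tool's standard page-extraction syntax, and the returned commands could be run with the standard library as below. The guard, the function name, and the example page ranges are illustrative; `segmentation_pipe.py`'s actual handling may differ.

```python
# Hypothetical sketch of executing model-generated pdftk commands. The
# `pdftk <in> cat <pages> output <out>` form is standard pdftk syntax;
# everything else here (names, guard, example ranges) is illustrative.
import shlex
import subprocess

def run_pdftk_commands(commands: list[str]) -> None:
    for cmd in commands:
        args = shlex.split(cmd)
        if not args or args[0] != "pdftk":
            raise ValueError(f"refusing to run non-pdftk command: {cmd!r}")
        subprocess.run(args, check=True)  # raises CalledProcessError on failure

run_pdftk_commands([
    "pdftk book.pdf cat 1-4 output 00_Title_Page.pdf",
    "pdftk book.pdf cat 5-28 output Chapter_01_Introduction.pdf",
])
```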
After running the pipeline, your project directory will contain the following:
- `segmented_output/`: This directory contains the final result. It mirrors the structure of your original `BOOKS/` folder, but inside each book's subfolder you will find the cleanly separated PDF files (e.g., `00_Title_Page.pdf`, `Chapter_01_Introduction.pdf`).
- `book_classifications.jsonl`: The JSONL file containing the detailed classification results for every book. You can learn more about the classification levels by reading `StructuralComplexityClassificationSystem.md`.
- `logs/`:
  - `segmentation.jsonl`: A log detailing the success or failure of the segmentation process for each book (see the sketch after this list for a quick way to inspect it).
  - `skipped_books.txt`: A list of books that were not segmented, typically because they lacked the necessary metadata (i.e., they were classified as Level 5).
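The segmentation log is also JSON Lines, so it can be tallied the same way as the classification results. The record fields are whatever the pipeline wrote, so this sketch only counts entries rather than assuming a schema.

```python
# Illustrative only: count logged segmentation attempts without assuming the
# log's field names.
import json
from pathlib import Path

lines = Path("logs/segmentation.jsonl").read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines if line.strip()]
print(f"{len(records)} segmentation attempt(s) logged")
```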