📄 PDF Search Engine – Advanced keyword-based PDF search with logical operators, graph-based ranking, autocomplete, and highlighted exports.

MilanSazdov/search-engine-pdf

📄 PDF Search Engine

🔍 PDF Search Engine is a Python-based search engine that processes a single PDF document, builds optimized data structures, and enables efficient keyword-based search. Users can enter textual queries, and the system ranks and displays relevant search results.

📖 The project uses the book Data Structures and Algorithms in Python as a test document for search queries. This ensures that the search engine is tested on real-world technical content.


🚀 Features

✔️ Pre-indexing: The system processes the PDF upon startup, extracting and structuring the text for fast retrieval.
✔️ Ranked Search Results: Results are ranked based on keyword occurrences and additional ranking heuristics.
✔️ Multi-Word Queries: Users can enter one or more words separated by spaces, and results will be ranked accordingly.
✔️ Logical Operators: Supports AND, OR, and NOT for complex queries (e.g., python AND algorithm NOT dictionary).
✔️ Pagination: Displays a limited number of results per page, with options to view more.
✔️ Graph-Based Ranking: Page references (See page X) improve the ranking of linked pages.
✔️ Trie-Based Indexing: Efficient word search using a trie data structure.
✔️ Auto-Complete & Suggestions: Provides query completion and "Did you mean?" suggestions for misspelled words.
✔️ PDF Export & Highlighting: Saves search results as a separate PDF file with highlighted keywords.
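
The "Did you mean?" suggestions listed above can be approximated with the standard library's difflib, which the project lists among its dependencies. The following is only a minimal sketch: the `suggest` helper, the sample vocabulary, and the cutoff value are assumptions for illustration, not the project's actual code.

```python
from difflib import get_close_matches

def suggest(word, vocabulary, limit=3):
    """Return up to `limit` indexed words that look like `word`.

    `vocabulary` is a hypothetical collection of all words found in the PDF.
    An empty list means the word is already known or nothing is similar.
    """
    word = word.lower()
    if word in vocabulary:
        return []
    # cutoff=0.6 is difflib's default similarity threshold (0.0 to 1.0).
    return get_close_matches(word, vocabulary, n=limit, cutoff=0.6)

vocabulary = ["algorithm", "graph", "python"]
suggestions = suggest("algoritm", vocabulary)
```

Raising the cutoff makes suggestions stricter; lowering it returns more distant matches.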


🛠 Technologies & Dependencies

| Library | Purpose |
|---|---|
| pdfminer.six | Extract text from PDF files |
| PyPDF2 | Read and write PDF documents |
| networkx | Build a graph of page references |
| collections | Optimized data structures (e.g., defaultdict) |
| difflib | Find similar words (for auto-correction) |
| re | Process regular expressions for query parsing |
| pickle | Serialize and load pre-indexed search structures |

📥 Install dependencies

pip install pdfminer.six PyPDF2 networkx

📖 How It Works

  1. Preprocessing:

    • The script parses the PDF file, extracts text from each page, and constructs data structures.
    • A graph is built from references like "See page 45", improving search rankings.
    • A trie is constructed for fast word searching.
    • Indexed data is serialized for faster subsequent searches.
  2. Search Execution:

    • Users input a search query (single or multiple words).
    • The program processes the query using logical operators if provided.
    • Results are ranked based on occurrences, graph connectivity, and additional heuristics.
  3. Displaying Results:

    • The top results are shown with page numbers and contextual snippets where the word appears.
    • Pagination allows users to navigate through results.
    • Users can save results as a PDF with highlighted keywords.
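
The reference graph from the preprocessing step can be sketched as follows. The project lists networkx for this; to stay dependency-free, this sketch swaps in a plain dictionary of sets. The `pages` mapping, the function names, and the exact regular expression are assumptions for illustration.

```python
import re
from collections import defaultdict

# Assumed reference format: "see page 45", matched case-insensitively.
PAGE_REF = re.compile(r"\bsee\s+page\s+(\d+)", re.IGNORECASE)

def build_reference_graph(pages):
    """Map each page number to the set of pages it references.

    `pages` is a hypothetical dict of {page_number: extracted_text}.
    """
    graph = defaultdict(set)
    for number, text in pages.items():
        for match in PAGE_REF.finditer(text):
            target = int(match.group(1))
            # Skip self-references and targets outside the document.
            if target != number and target in pages:
                graph[number].add(target)
    return graph

def incoming_links(graph):
    """Count how many distinct pages point at each page."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

pages = {
    1: "Tries are covered later. See page 3 for details.",
    2: "For hashing, see page 3; for sorting, see page 1.",
    3: "A trie stores words by prefix.",
}
graph = build_reference_graph(pages)
links = incoming_links(graph)
```

In this sample, page 3 is referenced by two pages and page 1 by one, so page 3 would receive the larger link bonus during ranking.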

🔧 Installation & Usage

1️⃣ Clone the Repository

git clone https://github.com/MilanSazdov/search-engine-pdf.git
cd search-engine-pdf

2️⃣ Install Dependencies Manually

Run the following command to install required Python libraries:

pip install pdfminer.six PyPDF2 networkx

3️⃣ Run the Search Engine

To start the search engine, run the following command:

python main.py

4️⃣ Enter Your Search Query

Example queries:

data structures
algorithm OR graph
python NOT dictionary

📌 Example Search Result

When you start the PDF Search Engine, the following menu appears, guiding you through different search options:

🛠 Available Search Options:

  • Basic search: Type a single word or multiple words separated by spaces to search for occurrences in the document.
  • Exit: Type exit in the search query to close the program.
  • Phrase search: Use double quotes ("word1 word2 word3") to search for an exact phrase in the document.
  • Logical search: Use NOT, OR, and AND to perform advanced queries (e.g., python AND algorithm NOT dictionary).
  • Autocomplete: Add * at the end of a word to get suggestions (e.g., fun* → function, functionality).

Below is an example of how the search menu looks when you start the application:

[Screenshot: Search Menu]


🔍 Base Search

After entering a search query, the system scans the document and displays ranked results.

How ranking works:

  • Keyword appearances: More occurrences = higher rank.
  • Distinct keywords bonus: More unique keywords = better ranking.
  • Page references (links bonus): If another page references the current one, it increases rank.
  • Referring keywords bonus: If a keyword appears on multiple linked pages, it adds extra points.
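
The four ranking factors above can be combined into a single score, sketched below. The weights, the function names, and the reading of the "referring keywords bonus" as keyword hits on pages that link to the current one are all assumptions; the project's actual formula may differ.

```python
import re
from collections import Counter

# Illustrative weights only; the project's real values are not documented here.
W_OCCURRENCE, W_DISTINCT, W_LINK, W_REFERRING = 1, 5, 2, 1

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_page(page, keywords, pages, incoming):
    """Score one page. `incoming` maps a page to the set of pages that reference it."""
    keywords = {k.lower() for k in keywords}
    counts = Counter(tokenize(pages[page]))
    occurrences = sum(counts[k] for k in keywords)
    if occurrences == 0:
        return 0  # a page without any keyword is not a result
    distinct = sum(1 for k in keywords if counts[k])
    referrers = incoming.get(page, set())
    referring_hits = 0
    for referrer in referrers:
        ref_counts = Counter(tokenize(pages[referrer]))
        referring_hits += sum(ref_counts[k] for k in keywords)
    return (W_OCCURRENCE * occurrences + W_DISTINCT * distinct
            + W_LINK * len(referrers) + W_REFERRING * referring_hits)

def rank(keywords, pages, incoming):
    """Return matching page numbers, best score first (ties by page number)."""
    scores = {p: score_page(p, keywords, pages, incoming) for p in pages}
    return sorted((p for p in pages if scores[p] > 0),
                  key=lambda p: (-scores[p], p))

pages = {
    1: "graph algorithms use a graph",
    2: "see page 1 for graph basics",
    3: "sorting only",
}
incoming = {1: {2}}  # page 2 references page 1
ranked = rank(["graph", "algorithms"], pages, incoming)
```

With these sample pages, page 1 ranks first because it contains both keywords and is referenced by page 2, which itself mentions "graph".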

Below is an example of a basic search query:

[Screenshot: Base Search]


🔎 Phrase Search

Phrase search allows users to look for an exact sequence of words by enclosing them in double quotes (").

Example:

If you search for:

"data structures"

The system will only return results where "data structures" appears exactly as written, rather than separate occurrences of "data" and "structures" on the same page. This ensures that the search retrieves only results where the words appear together in the correct order.
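
One way to implement this kind of matching is to compare consecutive word sequences on each page, as in the sketch below. The helper names and sample pages are hypothetical. The sketch also assumes matching is case-insensitive and ignores punctuation, which is looser than a strict character-for-character comparison.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of `phrase` appear consecutively and in order in `text`."""
    words = tokenize(text)
    target = tokenize(phrase)
    if not target:
        return False
    size = len(target)
    return any(words[i:i + size] == target
               for i in range(len(words) - size + 1))

def phrase_search(pages, query):
    """Return page numbers containing the quoted phrase, in page order."""
    phrase = query.strip().strip('"')
    return [number for number, text in sorted(pages.items())
            if contains_phrase(text, phrase)]

pages = {
    1: "Data structures organize data.",
    2: "The data in these structures is sorted.",
    3: "Structures, data",
}
matches = phrase_search(pages, '"data structures"')
```

Pages 2 and 3 contain both words but not as an adjacent, ordered pair, so only page 1 matches.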

Below is an example of phrase search results:

[Screenshot: Phrase Search]

🔎 Logical Search

Logical search allows users to refine their queries using logical operators:

  • AND – Returns results that contain both words.
  • OR – Returns results that contain at least one of the words.
  • NOT – Excludes pages containing the specified word.

Example Query:

python AND algorithm NOT dictionary
  • This query will return results that contain both "python" and "algorithm", but exclude any pages that mention "dictionary".
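
The operators map naturally onto set operations over page numbers. The sketch below evaluates a query strictly left to right, without operator precedence or parentheses; that evaluation order, the uppercase-only operators, and the helper names are assumptions rather than the project's documented behavior.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}

def build_index(pages):
    """Map each lowercase word to the set of page numbers containing it."""
    index = {}
    for number, text in pages.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index.setdefault(word, set()).add(number)
    return index

def evaluate(query, index):
    """Apply AND / OR / NOT from left to right and return matching pages."""
    all_pages = set().union(*index.values())
    result = None
    operator = "OR"  # used when two terms appear with no operator between them
    for token in query.split():
        if token in OPERATORS:
            operator = token
            continue
        matches = index.get(token.lower(), set())
        if result is None:
            # The first term seeds the result; a leading NOT starts from all pages.
            result = all_pages - matches if operator == "NOT" else set(matches)
        elif operator == "AND":
            result &= matches   # intersection
        elif operator == "OR":
            result |= matches   # union
        else:
            result -= matches   # NOT: difference
    return result if result is not None else set()

pages = {
    1: "Python algorithm basics",
    2: "A Python dictionary and an algorithm",
    3: "Graph algorithm",
}
index = build_index(pages)
found = evaluate("python AND algorithm NOT dictionary", index)
```

Here pages 1 and 2 contain both "python" and "algorithm", and the NOT clause then removes page 2 because it mentions "dictionary".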

Below is an example of a logical search result:

[Screenshot: Logical Search]

🔎 Autocomplete

The search engine provides autocomplete suggestions when a user types a word followed by *. This helps in quickly finding relevant terms without typing the full word.

How It Works:

  • Type the beginning of a word followed by * (e.g., fun*).
  • The system will display a list of possible completions.
  • You can select an option or continue typing your query.

Example Query:

fun*

The system suggests words like:

fun
func
function
functional
functionality
functions
fund
funda
fundamental
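
Prefix completion of this kind is the classic use case for a trie. The sketch below is a minimal illustration, not the project's implementation: the class names, the `autocomplete` helper, and the sample word list are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next node
        self.is_word = False  # True if a word ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def complete(self, prefix):
        """Return all stored words starting with `prefix`, sorted alphabetically."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack:  # iterative depth-first walk below the prefix node
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            for ch, child in current.children.items():
                stack.append((child, word + ch))
        return sorted(results)

def autocomplete(query, trie):
    """Strip the trailing * from a query such as 'fun*' and look up completions."""
    return trie.complete(query.rstrip("*").lower())

trie = Trie()
for word in ["fun", "function", "functional", "fund", "graph"]:
    trie.insert(word)
completions = autocomplete("fun*", trie)
```

Lookup cost depends on the prefix length and the number of completions, not on the total number of indexed words.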

Below is an example of the autocomplete feature in action:

[Screenshot: Autocomplete]


📂 Search Results & Pagination

  • The program ranks and displays the top 20 search results based on relevance.
  • Search results are automatically saved as a PDF file, named according to the query (e.g., search_results_python_20250225_230419.pdf).
  • In the generated PDF, keywords are automatically highlighted, making it easier to spot relevant matches.
  • After viewing 20 results, the user is prompted with three options:
    • next → View the next 20 results.
    • all → Display all remaining results at once.
    • done → Exit the search.
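
The next / all / done prompt can be sketched as a small loop. The `browse` function and its `read` and `show` parameters are hypothetical; they default to `input` and `print`, and the demo passes scripted answers so the example runs without a terminal.

```python
def browse(results, page_size=20, read=input, show=print):
    """Show results one page at a time, prompting after each full page."""
    position = 0
    while position < len(results):
        for item in results[position:position + page_size]:
            show(item)
        position += page_size
        if position >= len(results):
            break  # nothing left, so no prompt is needed
        choice = ""
        while choice not in ("next", "all", "done"):
            choice = read("next / all / done: ").strip().lower()
        if choice == "all":
            for item in results[position:]:
                show(item)
            break
        if choice == "done":
            break
        # "next" falls through to the following page

# Demo with 45 fake results and scripted answers instead of keyboard input.
results = list(range(1, 46))
shown = []
answers = iter(["next", "done"])
browse(results, read=lambda prompt: next(answers), show=shown.append)
```

After the scripted "next" and "done" answers, the first 40 results have been shown and the last 5 are skipped.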

⚠️ Potential Issues and Troubleshooting

| Issue | Solution |
|---|---|
| ModuleNotFoundError: No module named 'pdfminer' | Run pip install pdfminer.six to install the missing library. |
| ModuleNotFoundError: No module named 'PyPDF2' | Run pip install PyPDF2 to install the missing library. |
| ModuleNotFoundError: No module named 'networkx' | Run pip install networkx to install the missing library. |
| PDF file not found | Make sure to update the path to your PDF file in the script before running the program. |
| Slow search performance | Try running with pre-indexed data using pickle serialization. |
| UnicodeDecodeError when processing PDF | Ensure the PDF file is properly encoded and not corrupted. |
| Graph ranking does not work as expected | Verify that the script correctly extracts page references (e.g., "See page X"). |
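
For the slow-performance case, a common pattern is to build the index once and reload it from a pickle file on later runs. The sketch below assumes a hypothetical `load_or_build_index` helper and a stand-in build function; the project's real index layout and cache file name are not shown here.

```python
import os
import pickle
import tempfile

def load_or_build_index(cache_path, build):
    """Load a pickled index if it exists; otherwise build it and save it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    index = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(index, handle)
    return index

build_calls = []

def build_demo_index():
    """Stand-in for the expensive PDF parsing step."""
    build_calls.append(1)
    return {"words": ["fun", "function"], "links": {3: 2}}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "index.pkl")
    first = load_or_build_index(path, build_demo_index)   # builds and saves
    second = load_or_build_index(path, build_demo_index)  # loads from disk
```

Two caveats apply: only unpickle files you created yourself, since pickle can execute code on load, and delete the cache whenever the source PDF changes so the index does not go stale.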

🔹 Required Dependencies:
Before running the script, make sure you have installed all required third-party dependencies (collections, difflib, re, and pickle ship with Python and need no installation):

pip install pdfminer.six PyPDF2 networkx

📜 License

This project is licensed under the MIT License.
See the LICENSE file for more details.


📬 Contact

📧 Email: [email protected]
🐙 GitHub: MilanSazdov

