This project implements text preprocessing and query processing pipelines for Persian text data, particularly for an Information Retrieval system. It is organized into two notebooks: Phase 1 and Phase 2.
The two phases cover:
- Phase 1: Preprocessing of Persian text documents and creation of an inverted index.
- Phase 2: Implementation of a query processor supporting Boolean retrieval, phrase searching, and ranked retrieval with TF-IDF weighting and Jaccard similarity.
Purpose (Phase 1):
- Load and preprocess Persian text documents.
- Generate an inverted index and store term frequencies.
Key Steps:
- Load a dataset (IR_data_news_5k.json).
- Use Hazm and Parsivar for:
  - Normalization
  - Tokenization
  - Stopword removal
  - Stemming
- Build an inverted index with positional indexing.
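The preprocessing order listed above (normalize, tokenize, remove stopwords, stem) can be sketched as a small function that takes each step as a callable. This is a minimal illustration, not the notebook's code: the `preprocess` function and the `toy_*` helpers are hypothetical stand-ins, and in the notebooks the corresponding steps come from Hazm and Parsivar.

```python
from typing import Callable, List, Set


def preprocess(
    text: str,
    normalize: Callable[[str], str],
    tokenize: Callable[[str], List[str]],
    stem: Callable[[str], str],
    stopwords: Set[str],
) -> List[str]:
    """Normalize, tokenize, drop stopwords, then stem the remaining tokens."""
    normalized = normalize(text)
    tokens = tokenize(normalized)
    kept = [tok for tok in tokens if tok not in stopwords]
    return [stem(tok) for tok in kept]


# Stand-in steps for illustration only (NOT the Hazm/Parsivar implementations).
def toy_normalize(text: str) -> str:
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def toy_tokenize(text: str) -> List[str]:
    return text.split(" ") if text else []


def toy_stem(token: str) -> str:
    # Hypothetical rule: strip a trailing "ها" (a Persian plural suffix).
    return token[:-2] if token.endswith("ها") and len(token) > 2 else token


TOY_STOPWORDS = {"و", "در", "به", "از"}

tokens = preprocess(
    "خبرها  و ورزش   در ایران",
    toy_normalize,
    toy_tokenize,
    toy_stem,
    TOY_STOPWORDS,
)
print(tokens)
```

Passing the steps in as callables keeps the sketch independent of any particular library version; the real normalizer, tokenizer, and stemmer objects can be wrapped in the same way.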
Core Features:
- Preprocessing pipeline with stopword removal and stemming.
- Positional indexing for terms.
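As an illustration of positional indexing, the sketch below maps each term to the documents it occurs in and, per document, the token positions where it appears. It assumes documents are already preprocessed into token lists; the function names and the nested-dict layout are assumptions for this example and may differ from the structure used in phase1.ipynb.

```python
from collections import defaultdict
from typing import Dict, List

PositionalIndex = Dict[str, Dict[int, List[int]]]


def build_positional_index(docs: Dict[int, List[str]]) -> PositionalIndex:
    """Map term -> {doc_id -> [positions]} for already-tokenized documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, doc_tokens in docs.items():
        for position, term in enumerate(doc_tokens):
            index[term][doc_id].append(position)
    # Convert nested defaultdicts to plain dicts so lookups never add keys.
    return {term: dict(postings) for term, postings in index.items()}


def term_frequency(index: PositionalIndex, term: str, doc_id: int) -> int:
    """Term frequency is the number of recorded positions in that document."""
    return len(index.get(term, {}).get(doc_id, []))


example_docs = {
    0: ["olympic", "quota", "olympic"],
    1: ["quota", "football"],
}
example_index = build_positional_index(example_docs)
print(example_index["olympic"])
print(term_frequency(example_index, "quota", 1))
```

Storing positions rather than bare counts means term frequencies come for free (the length of each position list) and phrase queries can be answered later without rescanning the documents.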
Purpose (Phase 2):
- Enhance retrieval with ranking and phrase queries.
- Support Boolean operations, TF-IDF weighting, and Jaccard similarity for query scoring.
Key Steps:
- Reuse preprocessed documents from Phase 1.
- Build champion lists for terms (top documents by importance).
- Implement retrieval techniques:
  - Boolean Retrieval: AND, OR, AND-NOT operations.
  - Ranked Retrieval: Cosine similarity with TF-IDF weights.
  - Jaccard Similarity: Measure similarity between query and documents.
- Return top-ranked documents with metadata (title, URL).
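A champion list keeps, for each term, only a small number of its most important documents so that ranking can skip the rest of the postings. The sketch below is one possible construction: it assumes "importance" means term frequency and that the index has the shape term -> {doc_id -> [positions]}. Both are assumptions for illustration, and `build_champion_lists` and its parameter `r` are hypothetical names.

```python
import heapq
from typing import Dict, List


def build_champion_lists(
    index: Dict[str, Dict[int, List[int]]], r: int
) -> Dict[str, List[int]]:
    """Keep the r documents with the highest term frequency for each term."""
    champions = {}
    for term, postings in index.items():
        # Rank by descending tf (number of positions); break ties by doc id.
        top = heapq.nsmallest(
            r, postings.items(), key=lambda item: (-len(item[1]), item[0])
        )
        champions[term] = [doc_id for doc_id, _ in top]
    return champions


sample_index = {
    "olympic": {0: [0, 2], 1: [1], 2: [0, 1, 2]},
    "quota": {1: [4]},
}
sample_champions = build_champion_lists(sample_index, r=2)
print(sample_champions)
```

A query would then be scored only against the union of the champion lists of its terms, trading some recall for speed.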
Core Features:
- Support for Boolean queries.
- Phrase query processing with positional indexes.
- TF-IDF weighting and Cosine similarity for ranking.
- Jaccard similarity for query-document matching.
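The two scoring features above can be sketched in a few lines. The README does not state the exact weighting variant, so this example assumes a common one: (1 + log10 tf) × log10(N / df) for both documents and queries, with cosine similarity between the resulting sparse vectors. All function names here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def tfidf_vectors(docs: Dict[int, List[str]]) -> Tuple[Dict[int, Vector], Vector]:
    """Return per-document TF-IDF vectors and the idf table."""
    n_docs = len(docs)
    df = Counter(term for doc_tokens in docs.values() for term in set(doc_tokens))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    vectors = {}
    for doc_id, doc_tokens in docs.items():
        tf = Counter(doc_tokens)
        vectors[doc_id] = {t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()}
    return vectors, idf


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens: List[str], docs: Dict[int, List[str]], k: int = 10):
    """Score every document against the query and return the top k."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(t for t in query_tokens if t in idf)  # ignore unseen terms
    q_vec = {t: (1 + math.log10(c)) * idf[t] for t, c in q_tf.items()}
    scores = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in vectors.items()]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


def jaccard(query_tokens: List[str], doc_tokens: List[str]) -> float:
    """Size of the shared vocabulary divided by the size of the combined one."""
    q, d = set(query_tokens), set(doc_tokens)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


corpus = {
    0: ["olympic", "quota", "olympic"],
    1: ["football", "league"],
    2: ["olympic", "football"],
}
print(rank(["quota"], corpus, k=3))
print(jaccard(["olympic", "quota"], corpus[2]))
```

Cosine similarity rewards documents where the query terms carry high weight, while Jaccard only looks at set overlap and ignores frequency, which is why the two can rank the same documents differently.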
Make sure the following are installed:
- Python >= 3.7
- Jupyter Notebook
- Libraries:
  pip install numpy pandas hazm parsivar
- Clone the repository:
  git clone <repository-url>
  cd <repository-folder>
- Run the notebooks:
  - Start Jupyter Notebook:
    jupyter notebook
  - Open phase1.ipynb and run all cells.
  - Then open phase2.ipynb to process queries.
- Input queries in phase2.ipynb:
  - Boolean queries (e.g., مایکل ! جردن)
  - Phrase queries (e.g., "سهمیه المپیک")
  - Ranked retrieval using TF-IDF.
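To make the Boolean query form concrete, here is a rough sketch of how such a query could be evaluated with set operations over postings. It assumes that `!` excludes the term that follows it and that the remaining terms are combined with AND; that reading of the syntax is an assumption, as are the function names and the index shape (term -> {doc_id -> [positions]}). Query terms would also need the same preprocessing as the documents, which is omitted here.

```python
from typing import Dict, List, Set

PositionalIndex = Dict[str, Dict[int, List[int]]]


def postings(index: PositionalIndex, term: str) -> Set[int]:
    """Return a fresh set of document ids containing the term."""
    return set(index.get(term, {}))


def boolean_and(index: PositionalIndex, a: str, b: str) -> Set[int]:
    return postings(index, a) & postings(index, b)


def boolean_or(index: PositionalIndex, a: str, b: str) -> Set[int]:
    return postings(index, a) | postings(index, b)


def boolean_and_not(index: PositionalIndex, a: str, b: str) -> Set[int]:
    return postings(index, a) - postings(index, b)


def evaluate(query: str, index: PositionalIndex) -> List[int]:
    """Assumed syntax: terms are ANDed; a term preceded by '!' is excluded."""
    required, excluded = [], []
    negate = False
    for token in query.split():
        if token == "!":
            negate = True
            continue
        (excluded if negate else required).append(token)
        negate = False
    if not required:
        return []
    result = postings(index, required[0])
    for term in required[1:]:
        result &= postings(index, term)
    for term in excluded:
        result -= postings(index, term)
    return sorted(result)


demo_index = {
    "michael": {0: [0], 1: [3], 2: [1]},
    "jordan": {1: [4]},
}
print(evaluate("michael ! jordan", demo_index))
print(evaluate("michael jordan", demo_index))
```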
- Phase 1:
  - Preprocessing pipeline:
    - Normalization using Normalizer.
    - Tokenization using Tokenizer.
    - Stemming with FindStems.
  - Build inverted index with positional indexes.
- Phase 2:
  - Query Processing:
    - Boolean Retrieval: Support for AND, OR, NOT operations.
    - Phrase Search: Matching terms' positions for exact phrases.
    - Ranked Retrieval:
      - TF-IDF weights combined with cosine similarity.
      - Jaccard similarity for query-document comparison.
    - Champion lists are implemented to optimize retrieval.
- Example Results: Top-ranked documents are displayed with:
  - Document frequency
  - Title and URL of documents
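The phrase search listed under Phase 2 relies on the positional index: a document matches when the phrase terms occur at consecutive positions. The sketch below shows one way to check that, assuming an index of the form term -> {doc_id -> [positions]} and a query already split into preprocessed terms; `phrase_search` is a hypothetical name, not necessarily the one used in phase2.ipynb.

```python
from typing import Dict, List

PositionalIndex = Dict[str, Dict[int, List[int]]]


def phrase_search(index: PositionalIndex, phrase_terms: List[str]) -> List[int]:
    """Return ids of documents containing the terms at consecutive positions."""
    if not phrase_terms:
        return []
    postings_lists = [index.get(term, {}) for term in phrase_terms]
    if any(not postings for postings in postings_lists):
        return []  # at least one term never occurs in the collection

    # Only documents containing every term can contain the phrase.
    candidates = set(postings_lists[0])
    for postings in postings_lists[1:]:
        candidates &= set(postings)

    matches = []
    for doc_id in sorted(candidates):
        later = [set(postings[doc_id]) for postings in postings_lists[1:]]
        for start in postings_lists[0][doc_id]:
            # The i-th following term must sit exactly i positions after start.
            if all(
                start + offset in positions
                for offset, positions in enumerate(later, start=1)
            ):
                matches.append(doc_id)
                break
    return matches


phrase_index = {
    "olympic": {0: [0], 1: [1], 2: [0]},
    "quota": {0: [1], 1: [0], 2: [2]},
}
print(phrase_search(phrase_index, ["olympic", "quota"]))
```

In this sample index, documents 1 and 2 contain both terms but not adjacent and in order, so a plain Boolean AND would return them while the phrase check does not.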