This README provides a comprehensive guide to the programs developed for MSCI 541 assignments, ranging from HW1 to HW5. The primary focus is on search.py, which is the main search program in HW5. Additionally, key details about the index_engine.py, evaluator.py, and booleanAND.py programs are included.
- Ensure Python is installed. To verify, run
python --versionorpython3 --version.
- Clone the GitHub repository:
git clone <HTTPS Code>. - Ensure you are inside the repository after cloning.
- Create a virtual environment:
python -m venv myenv. - Activate the environment:
source myenv/bin/activate(MacOS/Linux) or.\myenv\Scripts\activate(Windows).
- Install necessary libraries:
pip install -r requirements.txt.
- Generates and stores document metadata from the LA Times Data file.
- Accepts the path of the source data file and the destination directory for metadata storage.
- Example command:
python index_engine.py <source path> <destination path> <Porter Stemming Boolean>. - The directory structure follows YYYY/MM/DD/<DOCNO>.txt.
- Computes effectiveness measures (e.g., average precision, NDCG) for a results file.
- Requires the absolute path of the QRELS file and the results file.
- Example command:
python evaluator.py <qrels file path> <results file path>.
- Retrieves documents using the BooleanAND algorithm.
- Requires the path of the index directory, query topics file, and desired results file path.
- Example command:
python booleanAND.py <index path> <topics file path> <results file path>.
- Interactive program using the BM25 algorithm with customizable parameters for document retrieval.
- Requires the absolute path of the index directory.
- Users can input queries, view document content, or start a new search.
- Example command:
python search.py <index directory path>.
- The LA Times data used in these assignments is protected under a course license and not included in the repository. Please use the provided test collection for the running and testing of the search engine.