This project comprises three parts: Indexer, Tokenizer, and QueryProcessor. It also uses a helper module named FileWorker to load the dataset and to save checkpoints for the indexer and the tokenizer.
The QueryProcessor uses the TF-IDF algorithm to process each user query. To measure the similarity between the user query and each document's representation, it uses cosine similarity in the vector space.
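As an illustration of the idea, TF-IDF weighting followed by cosine-similarity ranking could be sketched as below. This is not the project's actual code: the function names (`tf_idf_vectors`, `vectorize_query`, `cosine_similarity`, `rank`), the sparse-dictionary representation, and the specific weighting (relative term frequency times `log(N / df)`) are all assumptions chosen for brevity, and the documents are assumed to be tokenized already.

```python
import math
from collections import Counter


def tf_idf_vectors(tokenized_docs):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    # Assumed IDF variant: log(N / df), with no smoothing.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def vectorize_query(tokens, idf):
    """Weight query terms with the collection's IDF; unknown terms are dropped."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf[t] for t, c in tf.items() if t in idf}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, tokenized_docs):
    """Return document indices ordered from most to least similar."""
    vectors, idf = tf_idf_vectors(tokenized_docs)
    query = vectorize_query(query_tokens, idf)
    scores = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda pair: -pair[0])]
```

For example, `rank(["cat"], [["cat", "sat"], ["dog", "ran"], ["cat", "dog", "cat"]])` places the third document first, because "cat" accounts for a larger share of its vector than of the first document's.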
NOTE: The data preprocessing and augmentation parts of this project are designed for the Persian language.
To run the search engine, run the main file. First, the tokenizer and indexer instances are created. Then, after the fileWorker instance is initialized, the dataset can be loaded with either the fileIndex or the labeledFileIndex function of the fileWorker class.
Finally, after some preprocessing, the queryProcessor instance is created by passing the indexer and the tokenizer to its constructor. Calling the startListening function then lets us type queries in the terminal.
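The order of these steps can be sketched with small stand-in classes. Only the names Tokenizer, Indexer, FileWorker, QueryProcessor, fileIndex, and startListening come from the description above; every signature, attribute, and method body below is a hypothetical placeholder, and the real classes in this project may differ. In particular, the stand-in fileIndex takes the documents as a list instead of reading files, the `search` helper is invented for the sketch, and the stand-in startListening reads from an iterable instead of the terminal.

```python
# Stand-in classes: all signatures and bodies are assumptions for illustration.
class Tokenizer:
    def tokenize(self, text):
        return text.lower().split()


class Indexer:
    def __init__(self):
        self.postings = {}  # term -> set of document ids

    def add(self, doc_id, tokens):
        for token in tokens:
            self.postings.setdefault(token, set()).add(doc_id)


class FileWorker:
    def fileIndex(self, documents, tokenizer, indexer):
        # Assumption: the real method loads the dataset from disk.
        for doc_id, text in enumerate(documents):
            indexer.add(doc_id, tokenizer.tokenize(text))


class QueryProcessor:
    def __init__(self, indexer, tokenizer):
        self.indexer = indexer
        self.tokenizer = tokenizer

    def search(self, query):
        # Hypothetical helper: plain term matching, not TF-IDF ranking.
        hits = set()
        for token in self.tokenizer.tokenize(query):
            hits |= self.indexer.postings.get(token, set())
        return sorted(hits)

    def startListening(self, lines):
        # Assumption: the real method reads queries from the terminal.
        for line in lines:
            print(self.search(line))


# Wiring, in the order described above.
tokenizer = Tokenizer()
indexer = Indexer()
file_worker = FileWorker()
file_worker.fileIndex(["the cat sat", "a dog ran", "cat and dog"], tokenizer, indexer)
query_processor = QueryProcessor(indexer, tokenizer)
```

With this wiring, `query_processor.startListening(["cat"])` would print the ids of the documents that contain "cat".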