CAPP30122 2018 Group Project By WXD
Data downloaded from https://www.kaggle.com/hugodarwood/epirecipes/data
In the cookbook folder:
- Vector_space.py, a Python file that preprocesses the data and builds the vector space model
- BM25.py, a Python file that preprocesses the data and builds the probabilistic BM25 model
- Scrapeimage.py, a Python file that scrapes images from Google
Note: the data preprocessing code is already included in Vector_space.py and BM25.py
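
To illustrate the two ranking approaches named above, here is a minimal sketch of TF-IDF cosine scoring and Okapi BM25 scoring over an inverted index. It is not the code in Vector_space.py or BM25.py: the function names, the parameter values k1=1.5 and b=0.75, and the specific IDF formulas are assumptions chosen for illustration, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Build (inverted_index, doc_length) from a list of token lists (hypothetical helper)."""
    inverted_index = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_length = []
    for doc_id, tokens in enumerate(docs):
        doc_length.append(len(tokens))
        for term, tf in Counter(tokens).items():
            inverted_index[term][doc_id] = tf
    return inverted_index, doc_length


def bm25_scores(query, inverted_index, doc_length, k1=1.5, b=0.75):
    """Okapi BM25 score for every document containing at least one query term."""
    n_docs = len(doc_length)
    avg_len = sum(doc_length) / n_docs
    scores = defaultdict(float)
    for term in query:
        postings = inverted_index.get(term, {})
        if not postings:
            continue
        df = len(postings)
        # Smoothed IDF that stays non-negative (an assumed variant).
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalization: longer documents are penalized via b.
            norm = k1 * (1 - b + b * doc_length[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)
    return dict(scores)


def tfidf_cosine_scores(query, inverted_index, n_docs):
    """Cosine similarity between TF-IDF vectors of the query and each matching document."""
    idf = {t: math.log(n_docs / len(p)) for t, p in inverted_index.items()}
    # Squared Euclidean norm of each document's TF-IDF vector.
    doc_norm_sq = defaultdict(float)
    for term, postings in inverted_index.items():
        for doc_id, tf in postings.items():
            doc_norm_sq[doc_id] += (tf * idf[term]) ** 2
    query_tf = Counter(t for t in query if t in inverted_index)
    query_norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in query_tf.items()))
    dot = defaultdict(float)
    for term, q_tf in query_tf.items():
        for doc_id, tf in inverted_index[term].items():
            dot[doc_id] += (q_tf * idf[term]) * (tf * idf[term])
    return {
        d: s / (math.sqrt(doc_norm_sq[d]) * query_norm)
        for d, s in dot.items()
        if doc_norm_sq[d] > 0 and query_norm > 0
    }


# Toy corpus standing in for preprocessed recipe text.
docs = [["chicken", "soup", "chicken"], ["beef", "stew"], ["chicken", "salad"]]
index, lengths = build_index(docs)
bm25 = bm25_scores(["chicken", "soup"], index, lengths)
tfidf = tfidf_cosine_scores(["chicken", "soup"], index, len(docs))
```

Both functions return a mapping from document id to score, so sorting the items by score in descending order gives the ranked result list. Documents that share no term with the query are simply absent from the mapping.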
In the evaluation folder:
- project.py, a Python file that evaluates the vector space model and the BM25 model using the normalized discounted cumulative gain (NDCG) score.
For Vector_space.py and BM25.py, the data files listed below were originally generated by our save_func function. We removed the save_func steps and load the saved files directly to shorten the running time.
- For Vector_space.py, doc_length, documents, index_in_json, inverted_index, word_set are saved
- For BM25.py, doc_length_BM25, documents_BM25, index_in_json_BM25, inverted_index_BM25, word_set_BM25 are saved
- The following files were also generated by the save_func function:
doc_length_BM25_removed_stop_words, doc_length_tfdf_removed_stop_words, documents_BM25_removed_stop_words, documents_tfdf_removed_stop_words, index_in_json_BM25_removed_stop_words, index_in_json_tfidf_removed_stop_words, inverted_index_BM25_removed_stop_words, inverted_index_tfidf_removed_stop_words, word_set_BM25_removed_stop_words, word_set_tfidf_removed_stop_words
- evaluation.xlsx contains our evaluation results
- Install pip in your terminal (if you have not already done so)
- To run the project with the Django user interface, install Django: $ pip install Django
- Go to the “cookbook” folder and run: $ python3 manage.py runserver
- Open 127.0.0.1:8000/search/ in your browser
- To use a different model, open cookbook/search/views.py and change "from BM25 import * " to "from Vector_space import * "
- Change directory into the "evaluation" folder.
- Run: $ python project.py
- The scoring process is printed in the terminal, followed by the scores of the ten queries we selected, for both the TF-IDF model and the BM25 model.
- Open the evaluation.xlsx file to see the histogram comparing the NDCG values of the two models.
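
For readers unfamiliar with the metric, the following is a minimal sketch of how an NDCG score can be computed for one query. It is not taken from project.py: the function names, the linear-gain form of DCG, and the example relevance grades are assumptions made for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the ranking as returned, divided by the DCG of the ideal ordering.

    `relevances` holds the graded relevance of each returned result, in the
    order the model ranked them. `k` optionally truncates the list (None keeps all).
    """
    ranked = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # A query with no relevant results gets a score of 0 by convention here.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance grades (0 to 3) for the top results of one query.
perfect_order = ndcg([3, 2, 1])
reversed_order = ndcg([1, 2, 3])
no_relevant = ndcg([0, 0, 0])
```

A score of 1.0 means the model returned the results in the best possible order; lower values mean that relevant results were ranked below less relevant ones.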
- Data Preprocessing: Wenxi Xiao
- Build models and algorithms: Lerong Wang, Wenxi Xiao, Yangyang Dai
- Django user interface: Lerong Wang, Yangyang Dai
- Model Evaluation: Wenxi Xiao
- "Direct copy" ~ Learned from online sources, or generated by an installed package (Django or other) with few edits made
- "Modified" ~ Generated by an installed package (Django or other) with meaningful edits made, OR heavily based on template(s) provided by tutorial sessions (TA- or Django-generated)
- "Original" ~ Original code, or a heavily modified given structure