|
1 | 1 | # DSCI_575_project_barafat2_moham136 |
2 | 2 |
|
3 | | -## Clone the Repository |
4 | | -```bash |
5 | | -git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git |
6 | | -``` |
| 3 | +## Dataset description |
7 | 4 |
|
8 | | -Then navigate into the project folder: |
9 | | -```bash |
10 | | -cd DSCI_575_project_barafat2_moham136 |
11 | | -``` |
| 5 | +This project uses the **Amazon Reviews 2023** dataset hosted on Hugging Face: **`McAuley-Lab/Amazon-Reviews-2023`**. |
12 | 6 |
|
13 | | -## Install the environment and activate it |
14 | | -```bash |
15 | | -conda env create -f environment.yml |
16 | | -conda activate dsci-575-project |
17 | | -``` |
| 7 | +We specifically pull the **All Beauty** subset using the following configurations: |
| 8 | +- Reviews: `raw_review_All_Beauty` |
| 9 | +- Product metadata: `raw_meta_All_Beauty` |
18 | 10 |
|
19 | | -## Install Make to run the Makefile |
20 | | -```bash |
21 | | -conda install -c conda-forge make |
22 | | -``` |
| 11 | +The pipeline downloads both components and joins them to support a product-review search experience. |
| 12 | + |
| 13 | +### What the data contains (high level) |
| 14 | +- **Review data** includes fields such as review rating, review title, review text, and whether the purchase was verified. |
| 15 | +- **Metadata** includes fields such as product title, average rating, price, description, store, and product details. |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +## Data processing |
| 20 | + |
| 21 | +### Download + caching |
| 22 | +Data is downloaded from Hugging Face via the `datasets` library. The pipeline: |
| 23 | +1. Downloads the review and metadata splits (if not already present). |
| 24 | +2. Saves them as parquet files locally. |
| 25 | +3. Builds a merged parquet file used by downstream model-building scripts. |
| 26 | + |
| 27 | +Key output files (created by `make data` / `src/download_data.py`): |
| 28 | +- `data/processed/reviews.parquet` |
| 29 | +- `data/processed/meta.parquet` |
| 30 | +- `data/processed/merged.parquet` |
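The "download only if not already present" behaviour in step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `src/download_data.py`: `ensure_cached` and the `fetch` callback are hypothetical names, and raw bytes stand in for the parquet content.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Write the result of fetch() to path unless the file already exists.

    Returns True if a download happened, False if the cached file was reused.
    `fetch` is a hypothetical stand-in for the Hugging Face download step.
    """
    if path.exists():
        return False  # cached copy found, so skip the download
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(fetch())
    return True
```

Calling `ensure_cached` twice with the same path triggers `fetch` only once, which is why a second `make data` should be much faster than the first.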
| 31 | + |
| 32 | +### Fields used |
| 33 | +When merging reviews with product metadata, we keep the following columns: |
| 34 | + |
| 35 | +From **reviews**: |
| 36 | +- `rating` |
| 37 | +- `title` (review title) |
| 38 | +- `text` (review body) |
| 39 | +- `verified_purchase` |
| 40 | + |
| 41 | +From **metadata**: |
| 42 | +- `product_title` |
| 43 | +- `average_rating` |
| 44 | +- `price` |
| 45 | +- `description` |
| 46 | +- `store` |
| 47 | +- `details` |
| 48 | + |
| 49 | +The join is performed on the product identifier: |
| 50 | +- `parent_asin` |
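The join above can be sketched with plain dictionaries. This is an illustration only: `merge_reviews_with_meta` is a hypothetical name, the sketch assumes the raw metadata stores the product name under `title` and renames it to `product_title` to avoid clashing with the review `title`, and it assumes a left join (reviews without matching metadata are kept). The actual pipeline may differ on each of these points.

```python
def merge_reviews_with_meta(reviews, meta):
    """Left-join review rows to product metadata rows on `parent_asin`."""
    # Index metadata by product identifier for constant-time lookups.
    meta_by_asin = {row["parent_asin"]: row for row in meta}

    merged = []
    for review in reviews:
        product = meta_by_asin.get(review["parent_asin"], {})
        row = dict(review)
        for key, value in product.items():
            if key == "parent_asin":
                continue  # join key is already present on the review row
            # Assumed rename: metadata `title` becomes `product_title`.
            row["product_title" if key == "title" else key] = value
        merged.append(row)
    return merged
```

Each merged row then carries both the review fields (`rating`, `title`, `text`, `verified_purchase`) and the product fields listed above.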
| 51 | + |
| 52 | +### Preprocessing for retrieval |
| 53 | +Two retrieval approaches are supported, and each uses slightly different preprocessing. |
23 | 54 |
|
24 | | -## Run the Makefile |
| 55 | +**BM25 preprocessing (lexical retrieval)** |
| 56 | +- Text is lowercased |
| 57 | +- Punctuation is removed (non-alphanumeric replaced with whitespace) |
| 58 | +- Tokenization is done by whitespace splitting |
| 59 | +- English stopwords are removed (NLTK stopwords) |
| 60 | +- A combined text field is built from: |
| 61 | + - review `title` + review `text` + `product_title` |
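The BM25 preprocessing rules above can be sketched as below. This is a minimal version under stated assumptions: `combine_fields` and `preprocess` are hypothetical names, and the small `STOPWORDS` set is a stand-in for the NLTK English stopword list that the project uses.

```python
import re

# Tiny stand-in list; the project uses NLTK's English stopwords instead.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "for", "of", "to", "this"}


def combine_fields(title, text, product_title):
    """Join review title, review text and product title, skipping empty parts."""
    return " ".join(part for part in (title, text, product_title) if part)


def preprocess(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stopwords."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    return [token for token in cleaned.split() if token not in STOPWORDS]


print(preprocess("Great for dry skin, the BEST moisturizer!"))
# ['great', 'dry', 'skin', 'best', 'moisturizer']
```

Applying the same `preprocess` function to both documents and queries keeps the two vocabularies aligned, which BM25 relies on for exact term matching.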
| 62 | + |
| 63 | +Artifacts created by `src/build_bm25.py`: |
| 64 | +- `data/processed/documents.parquet` (tabular documents used for displaying results) |
| 65 | +- `data/processed/tokenized_corpus.pkl` (pre-tokenized corpus) |
| 66 | +- `models/bm25_model.pkl` (serialized BM25 model) |
| 67 | + |
| 68 | +**Semantic preprocessing (embedding retrieval)** |
| 69 | +- A combined text field is built from: |
| 70 | + - `product_title` + review `text` |
| 71 | +- Missing values are filled with empty strings |
| 72 | +- SentenceTransformer embeddings are computed and stored on disk |
| 73 | + |
| 74 | +Artifacts created by `src/build_semantic.py`: |
| 75 | +- `data/processed/documents.pkl` (list of combined texts) |
| 76 | +- `data/processed/embeddings.npy` (dense embeddings) |
| 77 | +- `data/processed/faiss_index/index.faiss` (FAISS index) |
| 78 | + |
| 79 | +--- |
| 80 | + |
| 81 | +## Retrieval workflows |
| 82 | + |
| 83 | +The Shiny app supports multiple retrieval methods (selected in the UI): |
| 84 | +- **BM25** |
| 85 | +- **Semantic** |
| 86 | +- **Hybrid** (available in the UI; combines signals from both approaches) |
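This README does not state how the hybrid mode combines the two signals. One common option is reciprocal rank fusion (RRF), sketched below purely as an illustration; it is an assumption, not a description of what the app does. The function name and the constant `k=60` (a conventional default for RRF) are choices made for this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    and documents are returned in order of decreasing total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


bm25_ranking = ["d1", "d2", "d3"]
semantic_ranking = ["d2", "d3", "d1"]
print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
# ['d2', 'd1', 'd3']
```

RRF only uses rank positions, so it avoids having to put BM25 scores and embedding distances on a common scale.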
| 87 | + |
| 88 | +### BM25 workflow (lexical) |
| 89 | +1. Load cached artifacts: |
| 90 | + - `data/processed/documents.parquet` |
| 91 | + - `models/bm25_model.pkl` |
| 92 | +2. Preprocess the user query using the same tokenization rules as the corpus. |
| 93 | +3. Score documents using BM25. |
| 94 | +4. Return the top *k* results with a BM25 `score`. |
| 95 | + |
| 96 | +In the app, BM25 results are returned with: |
| 97 | +- `product_title`, `text` (truncated for display), `score`, `rating` |
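To make steps 3 and 4 concrete, here is a self-contained Okapi BM25 scorer over a pre-tokenized corpus. It is a sketch for illustration, not the model stored in `models/bm25_model.pkl`: the function names, the defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant used here are assumptions, and the library behind the serialized model may use a different formula or defaults.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Return one BM25 score per document for a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The indices returned by `top_k` would then be used to look up display fields such as `product_title`, `text`, and `rating` in the documents table.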
| 98 | + |
| 99 | +### Semantic workflow (dense retrieval) |
| 100 | +1. Load the combined-text documents and the FAISS index. |
| 101 | +2. Embed the user query using a SentenceTransformer model (`all-MiniLM-L6-v2`). |
| 102 | +3. Retrieve nearest neighbors from FAISS (L2 distance on embeddings). |
| 103 | +4. Return the top *k* results, ranked by L2 distance to the query embedding (a smaller distance means a closer match). |
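Step 3 amounts to a nearest-neighbor search under L2 distance. The brute-force sketch below shows that computation on toy vectors; it is a pure-Python stand-in for illustration, not the SentenceTransformer and FAISS code used by the app, and the function names are hypothetical. It ranks by squared distance, which gives the same ordering as L2 distance (FAISS's exact flat L2 index, for instance, also reports squared distances).

```python
def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbors(query_vec, doc_vecs, k):
    """Return (doc_index, squared_distance) pairs for the k closest documents."""
    distances = sorted((l2_squared(query_vec, vec), i) for i, vec in enumerate(doc_vecs))
    return [(i, dist) for dist, i in distances[:k]]


doc_vecs = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
print(nearest_neighbors([1.0, 2.0], doc_vecs, k=2))
# [(1, 1.0), (0, 5.0)]
```

Because smaller distances mean closer matches, any score shown next to semantic results should be read in the opposite direction from a BM25 score.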
| 104 | + |
| 105 | +--- |
| 106 | + |
| 107 | +## Run the app locally |
| 108 | + |
| 109 | +### Option A (recommended): use the Makefile |
| 110 | +Run the following steps from a terminal: |
| 111 | + |
| 112 | +1. Clone the repository: |
25 | 113 | ```bash |
26 | | -make all |
| 114 | +git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git |
27 | 115 | ``` |
28 | | -This will run the following commands in order: |
29 | 116 |
|
30 | | -1. `make data` - This will download the data from the specified URL and save it in the `data/raw` folder. |
31 | | -2. `make build` - This will run all the scripts in the `src` folder to process the data and build all the models. The processed data will be saved in the `data/processed` folder and the model will be saved in the `models` folder. |
32 | | -3. `make app` - This will run the Shiny app located in the `app` folder. The app will be available at the URL specified in terminal after running this command. |
| 117 | +Then navigate into the project folder with `cd DSCI_575_project_barafat2_moham136`. |
| 118 | + |
| 119 | +2. Create and activate the conda environment: |
| 120 | + ```bash |
| 121 | + conda env create -f environment.yml |
| 122 | + conda activate dsci-575-project |
| 123 | + ``` |
33 | 124 |
|
34 | | -### This will take considerable time to run the first time. |
| 125 | +3. Ensure `make` is available: |
| 126 | + ```bash |
| 127 | + conda install -c conda-forge make |
| 128 | + ``` |
35 | 129 |
|
36 | | -## Subsequently, running; |
| 130 | +4. Build everything and launch the app: |
| 131 | + ```bash |
| 132 | + make all |
| 133 | + ``` |
37 | 134 |
|
| 135 | +`make all` runs the following commands in order: |
| 136 | +- `python src/download_data.py` |
| 137 | +- `python src/build_bm25.py` |
| 138 | +- `python src/build_semantic.py` |
| 139 | +- `shiny run app/app.py` |
| 140 | + |
| 141 | +After the first full build, you can run only the app: |
38 | 142 | ```bash |
39 | 143 | make app |
40 | 144 | ``` |
41 | 145 |
|
42 | | -Should load the shiny app without recreating the files |
| 146 | +### Option B: run steps manually (no Makefile) |
| 147 | +From the repository root (with your environment activated): |
| 148 | + |
| 149 | +```bash |
| 150 | +python src/download_data.py |
| 151 | +python src/build_bm25.py |
| 152 | +python src/build_semantic.py |
| 153 | +shiny run app/app.py |
| 154 | +``` |
43 | 155 |
|
| 156 | +Open the URL printed in the terminal to use the application. |