Commit a3369b3

Merge branch 'main' of github.com:UBC-MDS/DSCI_575_project_barafat2_moham136
got the new README file
2 parents 573f202 + 8c5c65d commit a3369b3

1 file changed

Lines changed: 139 additions & 26 deletions

README.md

@@ -1,43 +1,156 @@
11
# DSCI_575_project_barafat2_moham136
22

3-
## Clone the Repository
4-
```bash
5-
git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git
6-
```
3+
## Dataset description
74

8-
Then navigate into the project folder:
9-
```bash
10-
cd DSCI_575_project_barafat2_moham136
11-
```
5+
This project uses the **Amazon Reviews 2023** dataset hosted on Hugging Face: **`McAuley-Lab/Amazon-Reviews-2023`**.
126

13-
## Install the environment and activate it
14-
```bash
15-
conda env create -f environment.yml
16-
conda activate dsci-575-project
17-
```
7+
We specifically pull the **All Beauty** subset using the following configurations:
8+
- Reviews: `raw_review_All_Beauty`
9+
- Product metadata: `raw_meta_All_Beauty`
1810

19-
## Install Make to run the Makefile
20-
```bash
21-
conda install -c conda-forge make
22-
```
11+
The pipeline downloads both components and joins them to support a product-review search experience.
12+
13+
### What the data contains (high level)
14+
- **Review data** includes fields such as review rating, review title, review text, and whether the purchase was verified.
15+
- **Metadata** includes fields such as product title, average rating, price, description, store, and product details.
16+
17+
---
18+
19+
## Data processing
20+
21+
### Download + caching
22+
Data is downloaded from Hugging Face via the `datasets` library. The pipeline:
23+
1. Downloads the review and metadata splits (if not already present).
24+
2. Saves them as parquet files locally.
25+
3. Builds a merged parquet file used by downstream model-building scripts.
26+
27+
Key output files (created by `make data` / `src/download_data.py`):
28+
- `data/processed/reviews.parquet`
29+
- `data/processed/meta.parquet`
30+
- `data/processed/merged.parquet`
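
The "download only if not already present" step could be sketched roughly as follows. This is a minimal illustration, not the contents of `src/download_data.py`: `ensure_cached` and its `download` callback are hypothetical names, and in the real pipeline the callback would presumably fetch one configuration through the `datasets` library and write it to parquet.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(path: Path, download: Callable[[Path], None]) -> bool:
    """Run `download(path)` only when `path` does not exist yet.

    Returns True if a download happened, False if the cached file was reused.
    (Hypothetical helper; the name and signature are assumptions.)
    """
    if path.exists():
        # Cached copy found: skip the expensive download.
        return False
    # Create data/processed/ (or any parent folder) on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    download(path)
    return True
```

Calling it once per output file (`reviews.parquet`, `meta.parquet`) would make repeated `make data` runs cheap, since only missing files trigger a download.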
31+
32+
### Fields used
33+
When merging reviews with product metadata, we keep the following columns:
34+
35+
From **reviews**:
36+
- `rating`
37+
- `title` (review title)
38+
- `text` (review body)
39+
- `verified_purchase`
40+
41+
From **metadata**:
42+
- `product_title`
43+
- `average_rating`
44+
- `price`
45+
- `description`
46+
- `store`
47+
- `details`
48+
49+
The join is performed on the product identifier:
50+
- `parent_asin`
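
As a rough sketch of the join described above, assuming both tables are already loaded as pandas DataFrames and the metadata already carries a `product_title` column. The function name `merge_reviews_with_meta` and the choice of a left join (which keeps reviews that have no matching metadata) are assumptions, not confirmed details of the project.

```python
import pandas as pd

# Columns kept from each side; `parent_asin` is included as the join key.
REVIEW_COLS = ["parent_asin", "rating", "title", "text", "verified_purchase"]
META_COLS = [
    "parent_asin", "product_title", "average_rating",
    "price", "description", "store", "details",
]


def merge_reviews_with_meta(reviews: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Join reviews to product metadata on `parent_asin`.

    Assumption: a left join, so every review row survives even when
    its product is missing from the metadata table.
    """
    return reviews[REVIEW_COLS].merge(meta[META_COLS], on="parent_asin", how="left")
```

With a left join, reviews without metadata end up with missing values in the metadata columns, which is one reason to fill missing values before building text fields.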
51+
52+
### Preprocessing for retrieval
53+
Two retrieval approaches are supported, and each uses slightly different preprocessing.
2354

24-
## Run the Makefile
55+
**BM25 preprocessing (lexical retrieval)**
56+
- Text is lowercased
57+
- Punctuation is removed (non-alphanumeric characters are replaced with whitespace)
58+
- Tokenization is done by whitespace splitting
59+
- English stopwords are removed (NLTK stopwords)
60+
- A combined text field is built from:
61+
- review `title` + review `text` + `product_title`
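
The BM25 preprocessing rules above can be sketched as below. This is an illustration under stated assumptions rather than the code in `src/build_bm25.py`: the function names are hypothetical, and `STOPWORDS` is a tiny stand-in for the NLTK English stopword list the project uses.

```python
import re
from typing import Optional

# Tiny illustrative stopword set (assumption); the project uses NLTK's English list.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "for", "of", "to", "my"}

# Any run of characters that is not a lowercase letter or digit.
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def preprocess(text: str) -> list[str]:
    """Lowercase, replace non-alphanumerics with spaces, split, drop stopwords."""
    cleaned = _NON_ALNUM.sub(" ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


def combined_text(title: Optional[str], text: Optional[str],
                  product_title: Optional[str]) -> str:
    """Build the combined field: review title + review text + product title."""
    return " ".join(part for part in (title, text, product_title) if part)
```

For example, `preprocess("Great shampoo, it's the BEST!")` yields `["great", "shampoo", "s", "best"]` with this stopword set; the stray `s` comes from the apostrophe being replaced by whitespace.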
62+
63+
Artifacts created by `src/build_bm25.py`:
64+
- `data/processed/documents.parquet` (tabular documents used for displaying results)
65+
- `data/processed/tokenized_corpus.pkl` (pre-tokenized corpus)
66+
- `models/bm25_model.pkl` (serialized BM25 model)
67+
68+
**Semantic preprocessing (embedding retrieval)**
69+
- A combined text field is built from:
70+
- `product_title` + review `text`
71+
- Missing values are filled with empty strings
72+
- SentenceTransformer embeddings are computed and stored on disk
73+
74+
Artifacts created by `src/build_semantic.py`:
75+
- `data/processed/documents.pkl` (list of combined texts)
76+
- `data/processed/embeddings.npy` (dense embeddings)
77+
- `data/processed/faiss_index/index.faiss` (FAISS index)
78+
79+
---
80+
81+
## Retrieval workflows
82+
83+
The Shiny app supports multiple retrieval methods (selected in the UI):
84+
- **BM25**
85+
- **Semantic**
86+
- **Hybrid** (available in the UI; combines signals from both approaches)
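
The README does not say how the Hybrid option combines the two signals. One common technique for merging ranked lists from different retrievers is reciprocal rank fusion (RRF), sketched below purely as an illustration; the function name and the constant `k=60` are assumptions and may not match what the app does.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Illustrative only; not the project's confirmed method.
    """
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)
```

RRF uses only rank positions, so it sidesteps the fact that BM25 scores and L2 distances live on different scales.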
87+
88+
### BM25 workflow (lexical)
89+
1. Load cached artifacts:
90+
- `data/processed/documents.parquet`
91+
- `models/bm25_model.pkl`
92+
2. Preprocess the user query using the same tokenization rules as the corpus.
93+
3. Score documents using BM25.
94+
4. Return the top *k* results with a BM25 `score`.
95+
96+
In the app, BM25 results are returned with:
97+
- `product_title`, `text` (truncated for display), `score`, `rating`
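
To make the scoring step concrete, here is a small self-contained Okapi BM25 scorer. It is a stand-in for whatever is serialized in `models/bm25_model.pkl`, not the project's implementation: the function names, the defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant `ln((N - n + 0.5) / (n + 0.5) + 1)` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens: list[str], corpus: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every tokenized document in `corpus` against the query tokens."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The query must go through the same preprocessing as the corpus, otherwise its tokens will not match the stored ones.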
98+
99+
### Semantic workflow (dense retrieval)
100+
1. Load the combined-text documents and the FAISS index.
101+
2. Embed the user query using a SentenceTransformer model (`all-MiniLM-L6-v2`).
102+
3. Retrieve nearest neighbors from FAISS (L2 distance on embeddings).
103+
4. Return the top *k* results together with their L2 distance to the query (a smaller distance means a closer match).
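
The nearest-neighbor step can be illustrated without FAISS or a real embedding model. The sketch below does a brute-force search with NumPy over toy vectors; `l2_top_k` is a hypothetical name, and the vectors stand in for the SentenceTransformer embeddings stored in `embeddings.npy`. It returns squared L2 distances, which is also what a flat L2 FAISS index reports.

```python
import numpy as np


def l2_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int):
    """Return (indices, squared L2 distances) of the k nearest documents.

    Brute-force stand-in for a FAISS lookup; assumes `doc_vecs` has shape
    (n_docs, dim) and `query_vec` has shape (dim,).
    """
    diffs = doc_vecs - query_vec
    # Row-wise dot product gives the squared L2 distance per document.
    dists = np.einsum("ij,ij->i", diffs, diffs)
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

Because a smaller distance means a better match, results are sorted ascending, the opposite of BM25 scores.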
104+
105+
---
106+
107+
## Run the app locally
108+
109+
### Option A (recommended): use the Makefile
110+
Run the following steps in a terminal (steps 2 to 4 from the repository root):
111+
112+
1. Clone the repository:
25113
```bash
26-
make all
114+
git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git
27115
```
28-
This will run the following commands in order:
29116

30-
1. `make data` - This will download the data from the specified URL and save it in the `data/raw` folder.
31-
2. `make build` - This will run all the scripts in the `src` folder to process the data and build all the models. The processed data will be saved in the `data/processed` folder and the model will be saved in the `models` folder.
32-
3. `make app` - This will run the Shiny app located in the `app` folder. The app will be available at the URL specified in terminal after running this command.
117+
Then navigate into the project folder: `cd DSCI_575_project_barafat2_moham136`
118+
119+
2. Create and activate the conda environment:
120+
```bash
121+
conda env create -f environment.yml
122+
conda activate dsci-575-project
123+
```
33124

34-
### This will take considerable time to run the first time.
125+
3. Ensure `make` is available:
126+
```bash
127+
conda install -c conda-forge make
128+
```
35129

36-
## Subsequently, running;
130+
4. Build everything and launch the app:
131+
```bash
132+
make all
133+
```
37134

135+
This runs:
136+
- `python src/download_data.py`
137+
- `python src/build_bm25.py`
138+
- `python src/build_semantic.py`
139+
- `shiny run app/app.py`
140+
141+
After the first full build, you can run only the app:
38142
```bash
39143
make app
40144
```
41145

42-
Should load the shiny app without recreating the files
146+
### Option B: run steps manually (no Makefile)
147+
From the repository root (with your environment activated):
148+
149+
```bash
150+
python src/download_data.py
151+
python src/build_bm25.py
152+
python src/build_semantic.py
153+
shiny run app/app.py
154+
```
43155

156+
Open the URL printed in the terminal to use the application.
