This project uses the Amazon Reviews 2023 dataset hosted on Hugging Face: McAuley-Lab/Amazon-Reviews-2023.
We specifically pull the All Beauty subset using the following configurations:
- Reviews: raw_review_All_Beauty
- Product metadata: raw_meta_All_Beauty
The pipeline downloads both components and joins them to support a product-review search experience.
- Review data includes fields such as review rating, review title, review text, and whether the purchase was verified.
- Metadata includes fields such as product title, average rating, price, description, store, and product details.
Data is downloaded from Hugging Face via the datasets library. The pipeline:
- Downloads the review and metadata splits (if not already present).
- Saves them as parquet files locally.
- Builds a merged parquet file used by downstream model-building scripts.
Key output files (created by make data / src/download_data.py):
- data/processed/reviews.parquet
- data/processed/meta.parquet
- data/processed/merged.parquet
When merging reviews with product metadata, we keep the following columns:
From reviews:
- rating
- title (review title)
- text (review body)
- verified_purchase
From metadata:
- product_title
- average_rating
- price
- description
- store
- details
The join is performed on the product identifier, parent_asin.
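The column selection and join described above can be sketched with pandas. This is a minimal illustration, not the code in src/download_data.py: the function name `merge_reviews_with_meta`, the inner join, and the toy rows are all assumptions made for the example.

```python
import pandas as pd

# Columns kept from each side, as listed above (plus the join key).
REVIEW_COLS = ["parent_asin", "rating", "title", "text", "verified_purchase"]
META_COLS = ["parent_asin", "product_title", "average_rating", "price",
             "description", "store", "details"]


def merge_reviews_with_meta(reviews: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Join reviews to product metadata on parent_asin.

    An inner join is an assumption here; the project may use a different join type.
    """
    return reviews[REVIEW_COLS].merge(meta[META_COLS], on="parent_asin", how="inner")


# Toy rows, invented purely for illustration.
reviews = pd.DataFrame({
    "parent_asin": ["A1", "A1", "A2"],
    "rating": [5.0, 4.0, 2.0],
    "title": ["Great", "Good", "Meh"],
    "text": ["Works well", "Nice scent", "Too oily"],
    "verified_purchase": [True, True, False],
})
meta = pd.DataFrame({
    "parent_asin": ["A1", "A3"],
    "product_title": ["Curl Cream", "Face Wash"],
    "average_rating": [4.5, 4.0],
    "price": [12.99, 8.50],
    "description": ["For curly hair", "Gentle cleanser"],
    "store": ["StoreA", "StoreB"],
    "details": ["{}", "{}"],
})

merged = merge_reviews_with_meta(reviews, meta)
print(merged[["parent_asin", "title", "product_title"]])
```

With an inner join, reviews whose parent_asin has no metadata row (A2 above) are dropped; a left join would keep them with missing metadata fields instead.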
Two retrieval approaches are supported, and each uses slightly different preprocessing.
BM25 preprocessing (lexical retrieval)
- Text is lowercased
- Punctuation is removed (non-alphanumeric replaced with whitespace)
- Tokenization is done by whitespace splitting
- English stopwords are removed (NLTK stopwords)
- A combined text field is built from: review title + review text + product_title
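The BM25 preprocessing rules above can be sketched in a few lines of standard-library Python. The names `combine_fields` and `preprocess` are hypothetical, the space separator between fields is an assumption, and the small `STOPWORDS` set is a stand-in for the NLTK English stopword list so that the example runs without downloading NLTK data.

```python
import re

# Stand-in for the NLTK English stopword list (assumption, for a self-contained example).
STOPWORDS = {"a", "an", "and", "the", "is", "this", "it", "of", "on", "for", "to"}


def combine_fields(review_title: str, review_text: str, product_title: str) -> str:
    """Build the combined text field; joining with a space is an assumption."""
    return " ".join([review_title, review_text, product_title])


def preprocess(text: str) -> list[str]:
    """Lowercase, replace non-alphanumerics with whitespace, split, drop stopwords."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9]+", " ", lowered)
    return [tok for tok in cleaned.split() if tok not in STOPWORDS]


tokens = preprocess("This is the BEST shampoo, ever!")
print(tokens)
```

Applying the same `preprocess` function to user queries keeps query tokens and corpus tokens comparable.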
Artifacts created by src/build_bm25.py:
- data/processed/documents.parquet (tabular documents used for displaying results)
- data/processed/tokenized_corpus.pkl (pre-tokenized corpus)
- models/bm25_model.pkl (serialized BM25 model)
Semantic preprocessing (embedding retrieval)
- A combined text field is built from: product_title + review text
- Missing values are filled with empty strings
- SentenceTransformer embeddings are computed and stored on disk
Artifacts created by src/build_semantic.py:
- data/processed/documents.pkl (list of combined texts)
- data/processed/embeddings.npy (dense embeddings)
- data/processed/faiss_index/index.faiss (FAISS index)
The Shiny app supports multiple retrieval methods (selected in the UI):
- BM25
- Semantic
- Hybrid (available in the UI; combines signals from both approaches)
- Load cached artifacts: data/processed/documents.parquet and models/bm25_model.pkl.
- Preprocess the user query using the same tokenization rules as the corpus.
- Score documents using BM25.
- Return the top k results with a BM25 score.
In the app, BM25 results are returned with:
product_title, text (truncated for display), score, rating
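To make the scoring step concrete, here is a small pure-Python sketch of Okapi BM25 over a pre-tokenized corpus. It is not the serialized model in models/bm25_model.pkl; the functions `bm25_scores` and `top_k`, the parameter defaults `k1=1.5` and `b=0.75`, and the toy corpus are assumptions for illustration. The IDF uses the common smoothed form that stays non-negative.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_tokens`."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    """Return indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy pre-tokenized corpus, invented for illustration.
corpus = [["curl", "cream"], ["face", "wash"], ["curl", "shampoo", "curl"]]
scores = bm25_scores(["curl"], corpus)
best = top_k(scores, k=2)
print(best, [round(scores[i], 3) for i in best])
```

The returned indices can then be used to look up display fields such as product_title, text, and rating in the documents table.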
- Load the combined-text documents and the FAISS index.
- Embed the user query using a SentenceTransformer model (all-MiniLM-L6-v2).
- Retrieve nearest neighbors from FAISS (L2 distance on embeddings).
- Return the top k results with a distance-based similarity signal.
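The nearest-neighbor step can be illustrated without FAISS or a model download by running an exact L2 search in NumPy. This is a stand-in under stated assumptions: `l2_search` is a hypothetical helper, and the hand-written 2-dimensional vectors replace real all-MiniLM-L6-v2 embeddings. Like a flat L2 index, it ranks by squared Euclidean distance, where smaller means more similar.

```python
import numpy as np


def l2_search(index_vectors: np.ndarray, query_vector: np.ndarray, k: int):
    """Exact (brute-force) nearest-neighbor search by squared L2 distance."""
    diffs = index_vectors - query_vector
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norms
    order = np.argsort(sq_dists)[:k]                # smallest distance first
    return sq_dists[order], order


# Toy "embeddings", invented for illustration.
doc_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], dtype=np.float32)
query_embedding = np.array([1.0, 0.0], dtype=np.float32)

distances, indices = l2_search(doc_embeddings, query_embedding, k=2)
print(indices, distances)
```

Because the signal is a distance, results are sorted ascending; converting it to a similarity for display (for example, by negating it) is a presentation choice rather than something the retrieval step requires.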
We implemented a Retrieval-Augmented Generation (RAG) pipeline with two retrieval modes:
- Semantic RAG: uses dense embeddings + FAISS to retrieve relevant context.
- Hybrid RAG: combines BM25 (lexical) and semantic retrieval signals (bm25 + semantic) before constructing the context for generation.
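One common way to combine lexical and semantic signals is to min-max normalize each score list and take a weighted sum. The sketch below shows that idea only; it is not necessarily what hybrid_search.py does. The function names, the weight `alpha`, and the toy scores are assumptions, and the semantic scores are assumed to be similarities where higher is better.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(bm25, semantic, alpha=0.5):
    """Weighted sum of normalized BM25 and semantic scores (alpha weights BM25)."""
    if len(bm25) != len(semantic):
        raise ValueError("score lists must cover the same documents")
    return [alpha * b + (1 - alpha) * s
            for b, s in zip(min_max(bm25), min_max(semantic))]


# Toy per-document scores, invented for illustration.
bm25 = [2.0, 0.0, 1.0]
semantic = [0.2, 0.9, 0.6]

combined = hybrid_scores(bm25, semantic)
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking, [round(c, 3) for c in combined])
```

In this toy case the document that is second-best under both signals ranks first overall, which is the kind of agreement a hybrid score is meant to reward.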
New scripts added:
- RAG_pipeline.py: end-to-end semantic RAG pipeline.
- hybrid_search.py: hybrid retrieval (BM25 + semantic) utility.
- hybrid_RAG.py: end-to-end hybrid RAG pipeline.
flowchart TD
A([User query]) --> B[Encode query\nMiniLM-L6-v2]
B --> C[FAISS semantic search\nIndexFlatIP · cosine similarity]
D[(products.parquet)] --> C
C --> E[Top-k products retrieved\ntitle · description · score · rating]
E --> F[Format context\nretrieve_context]
G[System instruction] --> H
F --> H[Prompt template\ncontext + question injected]
H --> I([LLM · Llama-3-8B\nGrounded recommendation])
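The "format context" and "prompt template" stages of the flowchart can be sketched as plain string formatting. The helper names `format_context` and `build_prompt`, the template wording, the system instruction, and the example score and description are assumptions for illustration, not the contents of RAG_pipeline.py; the LLM call itself is left out.

```python
SYSTEM_INSTRUCTION = (
    "You are a shopping assistant. Answer using only the products in the context."
)

PROMPT_TEMPLATE = "{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def format_context(products):
    """Render retrieved products (title, description, score, rating) as numbered entries."""
    entries = []
    for i, p in enumerate(products, start=1):
        header = f"[{i}] {p['title']} (rating: {p['rating']}, score: {p['score']:.2f})"
        entries.append(f"{header}\n{p['description']}")
    return "\n\n".join(entries)


def build_prompt(question, products, system=SYSTEM_INSTRUCTION):
    """Inject the formatted context and the user question into the template."""
    return PROMPT_TEMPLATE.format(
        system=system, context=format_context(products), question=question
    )


# Example retrieved product; the score and description are placeholder values.
retrieved = [{
    "title": "Curl Defining Cream Activator for Soft Beautiful Curls",
    "description": "Suitable for curly hair.",
    "score": 0.87,
    "rating": 5.0,
}]

prompt = build_prompt("highly rated hair product that works for both men and women", retrieved)
print(prompt)
```

Keeping the template separate from the retrieval code means the same prompt builder can serve both the semantic and the hybrid retrieval modes.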
We implemented a quantitative evaluation of the final product (the app) using precision and recall metrics.
The final document is located at ../results/final_discussion.md.
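For reference, precision and recall for a ranked result list are often computed at a cutoff k. The sketch below assumes that formulation; the function name `precision_recall_at_k`, the cutoff, and the product identifiers are hypothetical, and the project's evaluation may define relevance or the metrics differently.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for a ranked list and a set of relevant ids."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = len(set(retrieved[:k]) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked results and relevance judgments.
retrieved_ids = ["B01", "B02", "B03", "B04"]
relevant_ids = {"B02", "B04", "B09"}

precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

Precision measures how many of the top k results are relevant, while recall measures how many of the relevant items appear in the top k.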
Using the RAG tab:
QUERY: 'highly rated hair product that works for both men and women'
Expected response: Based on the provided information, the highly rated hair product that works for both men and women is the Curl Defining Cream Activator for Soft Beautiful Curls by Osensia. It is rated 5.0 and is described as suitable for curly hair, which can apply to both men and women.
From the repository root:
- Clone the repository, then navigate into the project folder:
git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git
- Create and activate the conda environment:
conda env create -f environment.yml
conda activate dsci-575-project
- Ensure make is available:
conda install -c conda-forge make
- Build everything and launch the app:
make all
This runs:
python src/download_data.py
python src/build_bm25.py
python src/build_semantic.py
python src/RAG_pipeline.py
python src/hybrid_RAG.py
shiny run app/app.py
After the first full build, you can run only the app:
make app
To run each step manually, from the repository root (with your environment activated):
python src/download_data.py
python src/build_bm25.py
python src/build_semantic.py
python src/RAG_pipeline.py
python src/hybrid_RAG.py
shiny run app/app.py
Open the URL printed in the terminal to use the application.