DSCI_575_project_barafat2_moham136

Dataset description

This project uses the Amazon Reviews 2023 dataset hosted on Hugging Face: McAuley-Lab/Amazon-Reviews-2023.

We specifically pull the All Beauty subset using the following configurations:

Reviews: raw_review_All_Beauty
Product metadata: raw_meta_All_Beauty

The pipeline downloads both components and joins them to support a product-review search experience.

What the data contains (high level)

Review data includes fields such as review rating, review title, review text, and whether the purchase was verified.
Metadata includes fields such as product title, average rating, price, description, store, and product details.

Data processing

Download + caching

Data is downloaded from Hugging Face via the datasets library. The pipeline:

Downloads the review and metadata splits (if not already present).
Saves them as parquet files locally.
Builds a merged parquet file used by downstream model-building scripts.
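
The download-and-cache step above could look roughly like the sketch below. It assumes the data is fetched with `datasets.load_dataset` and written with `Dataset.to_parquet`. The function name `fetch_config` is hypothetical, and the `split="full"` and `trust_remote_code=True` arguments follow the dataset card rather than this repository's code; whether they are accepted depends on the installed `datasets` version.

```python
from pathlib import Path

REPO_ID = "McAuley-Lab/Amazon-Reviews-2023"


def fetch_config(config: str, out_path: Path) -> bool:
    """Download one dataset configuration to parquet unless it is already cached.

    Returns True if a download happened and False on a cache hit.
    """
    if out_path.exists():
        return False  # cache hit: nothing to download

    # Imported lazily so that cache hits do not need the library at all.
    from datasets import load_dataset

    # Assumption: the configuration exposes a single "full" split.
    ds = load_dataset(REPO_ID, config, split="full", trust_remote_code=True)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    ds.to_parquet(str(out_path))
    return True


if __name__ == "__main__":
    fetch_config("raw_review_All_Beauty", Path("data/processed/reviews.parquet"))
    fetch_config("raw_meta_All_Beauty", Path("data/processed/meta.parquet"))
```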

Key output files (created by make data / src/download_data.py):

data/processed/reviews.parquet
data/processed/meta.parquet
data/processed/merged.parquet

Fields used

When merging reviews with product metadata, we keep the following columns:

From reviews:

rating
title (review title)
text (review body)
verified_purchase

From metadata:

product_title
average_rating
price
description
store
details

The join is performed on the product identifier:

parent_asin
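
As an illustration of the join, here is a minimal pandas sketch. The function name, the left join, and the rename of the metadata `title` column to `product_title` are assumptions made for the example; the source only states the kept columns and the join key.

```python
import pandas as pd

REVIEW_COLS = ["parent_asin", "rating", "title", "text", "verified_purchase"]
META_COLS = [
    "parent_asin", "product_title", "average_rating",
    "price", "description", "store", "details",
]


def merge_reviews_with_meta(reviews: pd.DataFrame, meta: pd.DataFrame) -> pd.DataFrame:
    """Attach product metadata to every review via parent_asin."""
    # Assumption: metadata stores the product name as "title", which would
    # clash with the review title, so it is renamed before merging.
    meta = meta.rename(columns={"title": "product_title"})
    # Assumption: a left join, so reviews without metadata are kept.
    return reviews[REVIEW_COLS].merge(meta[META_COLS], on="parent_asin", how="left")
```

With a left join, reviews whose product has no metadata row keep missing values in the metadata columns, which is why the semantic preprocessing step fills missing text with empty strings.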

Preprocessing for retrieval

Two retrieval approaches are supported, and each uses slightly different preprocessing.

BM25 preprocessing (lexical retrieval)

Text is lowercased
Punctuation is removed (non-alphanumeric replaced with whitespace)
Tokenization is done by whitespace splitting
English stopwords are removed (NLTK stopwords)
A combined text field is built from:
- review title + review text + product_title
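
The BM25 preprocessing rules above can be sketched in plain Python as follows. The function names are hypothetical, and the small `STOPWORDS` set is only a stand-in for the NLTK English stopword list that the pipeline uses.

```python
import re

# Stand-in for nltk.corpus.stopwords.words("english"); deliberately tiny.
STOPWORDS = frozenset(
    {"a", "an", "and", "the", "is", "it", "for", "of", "to", "in", "on", "this", "that", "with"}
)
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def build_bm25_text(review_title, review_text, product_title):
    """Concatenate the three fields, skipping any that are missing or empty."""
    return " ".join(part for part in (review_title, review_text, product_title) if part)


def preprocess(text):
    """Lowercase, replace non-alphanumerics with spaces, split, drop stopwords."""
    cleaned = _NON_ALNUM.sub(" ", text.lower())
    return [token for token in cleaned.split() if token not in STOPWORDS]
```

The same `preprocess` function would be applied to user queries at search time so that query tokens match corpus tokens.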

Artifacts created by src/build_bm25.py:

data/processed/documents.parquet (tabular documents used for displaying results)
data/processed/tokenized_corpus.pkl (pre-tokenized corpus)
models/bm25_model.pkl (serialized BM25 model)

Semantic preprocessing (embedding retrieval)

A combined text field is built from:
- product_title + review text
Missing values are filled with empty strings
SentenceTransformer embeddings are computed and stored on disk
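
A rough sketch of these steps, assuming the `sentence-transformers`, `faiss`, and `numpy` packages. The function names are hypothetical, and `IndexFlatL2` is chosen here only because the semantic workflow mentions L2 distance; the repository's actual index type may differ.

```python
from pathlib import Path


def _clean(value):
    """Return a stripped string, or an empty string for missing values (None, NaN)."""
    return value.strip() if isinstance(value, str) else ""


def combine_text(product_title, review_text):
    """Build the combined field: product_title + review text."""
    parts = [_clean(product_title), _clean(review_text)]
    return " ".join(part for part in parts if part)


def build_index(texts, out_dir, model_name="all-MiniLM-L6-v2"):
    """Embed the texts and persist the embeddings and a FAISS index to disk."""
    # Heavy dependencies are imported lazily so combine_text stays lightweight.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, convert_to_numpy=True).astype("float32")

    # Assumption: exact (brute-force) L2 index.
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    out_dir = Path(out_dir)
    (out_dir / "faiss_index").mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "embeddings.npy", embeddings)
    faiss.write_index(index, str(out_dir / "faiss_index" / "index.faiss"))
    return index
```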

Artifacts created by src/build_semantic.py:

data/processed/documents.pkl (list of combined texts)
data/processed/embeddings.npy (dense embeddings)
data/processed/faiss_index/index.faiss (FAISS index)

Retrieval workflows

The Shiny app supports multiple retrieval methods (selected in the UI):

BM25
Semantic
Hybrid (available in the UI; combines signals from both approaches)

BM25 workflow (lexical)

Load cached artifacts:
- data/processed/documents.parquet
- models/bm25_model.pkl
Preprocess the user query using the same tokenization rules as the corpus.
Score documents using BM25.
Return the top k results with a BM25 score.

In the app, BM25 results are returned with:

product_title, text (truncated for display), score, rating
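
To make the scoring step concrete, here is a small self-contained Okapi BM25 scorer. It is an illustration only: the source serializes a BM25 model to `models/bm25_model.pkl` but does not name the library, and the IDF variant and the defaults `k1=1.5`, `b=0.75` are assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against the query tokens."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            n_t = doc_freq[term]
            # Non-negative IDF variant: ln((N - n + 0.5) / (n + 0.5) + 1).
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The indices returned by `top_k` would then be used to look up display fields such as `product_title`, `text`, and `rating` in the documents table.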

Semantic workflow (dense retrieval)

Load the combined-text documents and the FAISS index.
Embed the user query using a SentenceTransformer model (all-MiniLM-L6-v2).
Retrieve nearest neighbors from FAISS (L2 distance on embeddings).
Return the top k results along with their FAISS distances (smaller means more similar).
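
The nearest-neighbor step can be illustrated without FAISS by a brute-force NumPy search, which computes the same thing as an exact flat L2 index. The function name is hypothetical. Like FAISS flat L2 indexes, the sketch returns squared distances.

```python
import numpy as np


def l2_search(index_vectors, query, k):
    """Return (squared L2 distances, row ids) of the k vectors closest to `query`."""
    diffs = index_vectors - query
    dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norms
    order = np.argsort(dists)[:k]
    return dists[order], order


if __name__ == "__main__":
    vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
    query = np.array([1.0, 0.0], dtype="float32")
    print(l2_search(vectors, query, k=2))
```

For unit-length embeddings, squared L2 distance equals 2 - 2 * (inner product), so ranking by smallest L2 distance and ranking by largest cosine similarity produce the same order.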

RAG Pipeline Workflow

We implemented a Retrieval-Augmented Generation (RAG) pipeline with two retrieval modes:

Semantic RAG: uses dense embeddings + FAISS to retrieve relevant context.
Hybrid RAG: combines BM25 (lexical) and semantic retrieval signals before constructing the context for generation.

New scripts added:

RAG_pipeline.py: end-to-end semantic RAG pipeline.
hybrid_search.py: hybrid retrieval (BM25 + semantic) utility.
hybrid_RAG.py: end-to-end hybrid RAG pipeline.
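
The source does not state how `hybrid_search.py` combines the two signals. One common option, shown here purely as an assumption, is reciprocal rank fusion (RRF), which works on ranks and so sidesteps the fact that BM25 scores and embedding distances live on different scales. The function name and the constant `k=60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)[:top_n]


if __name__ == "__main__":
    bm25_ranking = [3, 1, 2]      # document ids, best first (made-up values)
    semantic_ranking = [1, 4, 3]
    print(reciprocal_rank_fusion([bm25_ranking, semantic_ranking]))
```

Documents that appear near the top of both lists end up first, while documents found by only one retriever still make it into the fused ranking.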

```mermaid
flowchart TD
    A([User query]) --> B[Encode query\nMiniLM-L6-v2]
    B --> C[FAISS semantic search\nIndexFlatIP · cosine similarity]
    D[(products.parquet)] --> C
    C --> E[Top-k products retrieved\ntitle · description · score · rating]
    E --> F[Format context\nretrieve_context]
    G[System instruction] --> H
    F --> H[Prompt template\ncontext + question injected]
    H --> I([LLM · Llama-3-8B\nGrounded recommendation])
```
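
The context-formatting and prompt-template steps in the flowchart might be sketched as follows. The system instruction text, the template wording, and the function names `format_context` and `build_prompt` are all hypothetical; only the overall shape (system instruction, retrieved context, injected question) comes from the diagram.

```python
SYSTEM_INSTRUCTION = (
    "You are a shopping assistant. Answer using only the products listed in the context."
)

PROMPT_TEMPLATE = "{system}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def format_context(products):
    """Render retrieved products as a numbered list for the prompt."""
    lines = []
    for i, product in enumerate(products, start=1):
        lines.append(
            f"{i}. {product['title']} (rating: {product['rating']}): {product['description']}"
        )
    return "\n".join(lines)


def build_prompt(question, products):
    """Inject the formatted context and the user question into the template."""
    return PROMPT_TEMPLATE.format(
        system=SYSTEM_INSTRUCTION,
        context=format_context(products),
        question=question,
    )
```

The resulting string would be passed to the LLM; keeping the instruction to answer only from the context is what grounds the recommendation in the retrieved products.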

Additional Feature

We implemented a quantitative evaluation feature that measures the performance of the final product (the app) using precision and recall. The results are written up in ../results/final_discussion.md
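
For reference, precision and recall over the top k retrieved results can be computed as in the sketch below. The function name and the idea of a hand-labeled set of relevant items per query are assumptions; the evaluation protocol actually used is described in the results document.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Compute precision@k and recall@k.

    retrieved: ranked list of item ids returned by the system.
    relevant: set of item ids judged relevant for the query.
    """
    top = retrieved[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Precision asks how many of the returned items are relevant, while recall asks how many of the relevant items were returned, so the two are usually reported together.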

Example

Using the RAG tab:

QUERY: 'highly rated hair product that works for both men and women'

Expected response: Based on the provided information, the highly rated hair product that works for both men and women is the Curl Defining Cream Activator for Soft Beautiful Curls by Osensia. It is rated 5.0 and is described as suitable for curly hair, which can apply to both men and women.

Run the app locally

Option A (recommended): use the Makefile

Clone the repository:

```
git clone git@github.com:UBC-MDS/DSCI_575_project_barafat2_moham136.git
```

Then navigate into the project folder and run the remaining commands from the repository root.

Create and activate the conda environment:

```
conda env create -f environment.yml
conda activate dsci-575-project
```

Ensure make is available:
```
conda install -c conda-forge make
```
Build everything and launch the app:
```
make all
```

This runs:

```
python src/download_data.py
python src/build_bm25.py
python src/build_semantic.py
python src/RAG_pipeline.py
python src/hybrid_RAG.py
shiny run app/app.py
```

After the first full build, you can run only the app:

```
make app
```

Option B: run steps manually (no Makefile)

From the repository root (with your environment activated):

```
python src/download_data.py
python src/build_bm25.py
python src/build_semantic.py
python src/RAG_pipeline.py
python src/hybrid_RAG.py
shiny run app/app.py
```

Open the URL printed in the terminal to use the application.
