A production-style information retrieval system built on Amazon Electronics review data. The system implements and compares three retrieval strategies: BM25 keyword search, semantic vector search, and a hybrid approach using Reciprocal Rank Fusion (RRF). The application surfaces results through a clean Streamlit UI modeled after the Amazon experience. Users can search, compare methods side-by-side, and provide thumbs up/down relevance feedback that gets persisted for evaluation.
- RAG Pipeline
- Streamlit UI
- Project Structure
- Dataset
- Data Processing
- Retrieval Workflows
- Environment Setup
- Reproducing the Results
- Usage Examples
- User Feedback & Evaluation
- Authors
The pipeline has three stages: Retrieval (semantic similarity search over FAISS), Context & Prompt Construction (formatting retrieved products into a structured prompt for the LLM), and Generation (grounded answer from qwen2.5 via Ollama).
flowchart TD
A[User Query] --> B[all-MiniLM-L6-v2<br/>Sentence Transformer]
B --> C[Normalized Query Embedding<br/>384-dim vector]
C --> D[FAISS HNSW Index<br/>Inner Product / Cosine Similarity]
D --> E[Top-K Product Indices<br/>+ Similarity Scores]
E --> F[Metadata Lookup<br/>metadata_rows dict]
F --> G[Top-K Retrieved Products<br/>with product metadata + similarity scores]
G --> H[build_context<br/>combine top k products into a string]
H --> I[build_prompt<br/>system prompt + product context + user query]
I --> J[qwen2.5 LLM model via Ollama<br/>temperature=0.0]
J --> K[Grounded Answer with ASIN Citations]
G --> L["Final Output Dict<br/>(Query + LLM Answer + Retrieved Docs + Prompt Version)"]
K --> L
L --> M[LLM Answer]
M --> N[User]
subgraph Retrieval [Retrieval Stage]
B
C
D
E
F
G
end
subgraph Context_Prompt [Context and Prompt Stage]
H
I
end
subgraph Generation [Generation Stage]
J
K
end
%% Force dark text on all nodes and subgraph titles for readability
classDef default fill:#ffffff,stroke:#333,color:#000
style A fill:#e1f5ff,stroke:#0288d1,color:#000
style N fill:#e1f5ff,stroke:#0288d1,color:#000
style M fill:#c8e6c9,stroke:#388e3c,color:#000
style L fill:#c8e6c9,stroke:#388e3c,color:#000
style B fill:#fff3e0,stroke:#f57c00,color:#000
style C fill:#fff3e0,stroke:#f57c00,color:#000
style D fill:#fff3e0,stroke:#f57c00,color:#000
style E fill:#fff3e0,stroke:#f57c00,color:#000
style F fill:#fff3e0,stroke:#f57c00,color:#000
style G fill:#fff3e0,stroke:#f57c00,color:#000
style H fill:#f3e5f5,stroke:#7b1fa2,color:#000
style I fill:#f3e5f5,stroke:#7b1fa2,color:#000
style J fill:#e8f5e9,stroke:#2e7d32,color:#000
style K fill:#e8f5e9,stroke:#2e7d32,color:#000
style Retrieval fill:#fff8e1,stroke:#f57c00,color:#000
style Context_Prompt fill:#fce4ec,stroke:#7b1fa2,color:#000
style Generation fill:#e0f2e9,stroke:#2e7d32,color:#000
Attribution: Claude Sonnet 4.6 for syntax help with the Mermaid code after prompting it with the sequential steps in the RAG pipeline.
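The Context and Prompt stage in the diagram can be sketched roughly as below. This is a minimal illustration, not the project's code: the function names `build_context` and `build_prompt` come from the diagram, but their signatures, the prompt wording, and the sample products are assumptions made for the example (the real templates live in `src/prompts.py`).

```python
from typing import Dict, List

# Hypothetical system prompt; the project's real templates are in src/prompts.py.
SYSTEM_PROMPT = (
    "Answer using only the products listed below. "
    "Cite each product you mention by its ASIN."
)


def build_context(products: List[Dict], max_snippet_chars: int = 200) -> str:
    """Combine the top-k retrieved products into one numbered context string."""
    blocks = []
    for i, p in enumerate(products, start=1):
        # Missing review text is treated as an empty snippet.
        snippet = (p.get("review_text_200") or "")[:max_snippet_chars]
        blocks.append(
            f"[{i}] ASIN: {p['parent_asin']}\n"
            f"Title: {p.get('product_title', '')}\n"
            f"Rating: {p.get('average_rating', 'n/a')}\n"
            f"Review: {snippet}"
        )
    return "\n\n".join(blocks)


def build_prompt(query: str, context: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Assemble system prompt + product context + user query into one string."""
    return f"{system_prompt}\n\nProducts:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up products purely for illustration.
products = [
    {"parent_asin": "B000TEST01", "product_title": "Example Headphones",
     "average_rating": 4.5, "review_text_200": "Comfortable and quiet."},
    {"parent_asin": "B000TEST02", "product_title": "Example Earbuds",
     "average_rating": 4.1, "review_text_200": None},
]
prompt = build_prompt("quiet headphones for the office", build_context(products))
```

Numbering each product and repeating its ASIN in the context is one way to make ASIN citations in the answer easy to trace back to a source card.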
The app (app/app.py) is a single-page Streamlit dashboard styled to match Amazon's dark colour palette. It initialises a cached RAGPipeline backed by a HybridRetriever on first load, and subsequent requests reuse the same in-memory indexes and model weights, so only the first launch incurs the full build/load cost.
The interface is divided into two tabs that operate on the same query simultaneously:
| Tab | What it shows |
|---|---|
| 🔍 Search Results | Raw retrieval output - the top-k product cards ranked by hybrid score, each showing the product title, star rating, and a 200-character review snippet. |
| 🤖 RAG Answer | A grounded natural-language answer generated by qwen2.5 via Ollama, displayed in a highlighted answer panel. Below the answer, the same top-k products used as LLM context are displayed as attributed source cards. |
- Search bar: accepts free-text queries (e.g. "Best noise cancelling headphones under $200").
- Context Documents slider: sets `top_k` (1–10, default 5), controlling how many retrieved products are passed to both the result cards and the LLM context window.
Every result is rendered as a styled card (render_result_card) that displays:
- A rank pill (e.g. `Rank #1` in search mode, `Source [1]` in RAG mode) and a retrieval score.
- The product title and average star rating.
- A review snippet: the first 200 characters of a representative customer review, displayed inside a dark inset box.
When the RAG tab is active, rag_pipe.invoke() is called with the query, top_k, and system prompt version V3. The response dictionary contains:
- `llm_answer`: the generated text, rendered as HTML inside the answer panel.
- `tool_used`: if the pipeline invoked the Tavily web search tool to supplement local context, a green `Used <tool>` badge is shown next to the "AI Response" header.
- `retrieved_docs`: the top-k documents used as LLM context, re-rendered below the answer as cited source cards.
get_pipeline() is decorated with @st.cache_resource, so the HybridRetriever (which loads the FAISS index, BM25 index, and sentence-transformer model) and the RAGPipeline (which connects to the local Ollama server) are constructed exactly once per Streamlit server process and shared across all browser sessions.
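The build-once behaviour can be illustrated without Streamlit. The sketch below uses `functools.lru_cache` from the standard library as a stand-in for `@st.cache_resource`; `FakePipeline` is a hypothetical placeholder for the real `RAGPipeline` and `HybridRetriever`, so this shows the caching idea only, not the app's code.

```python
from functools import lru_cache


class FakePipeline:
    """Hypothetical stand-in for the expensive RAGPipeline + HybridRetriever."""

    instances = 0  # counts how many times the constructor has run

    def __init__(self) -> None:
        FakePipeline.instances += 1


@lru_cache(maxsize=None)
def get_pipeline() -> FakePipeline:
    # In the app, this decorator would be @st.cache_resource instead.
    return FakePipeline()


first = get_pipeline()
second = get_pipeline()  # served from the cache; no second construction
```

Both calls return the same object, which is why only the first request pays the index and model loading cost.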
├── app/
│ └── app.py # Streamlit application
├── data/
│ ├── raw/ # Raw source data (gitignored)
│ └── processed/ # Parquet corpus + cached artifacts (gitignored)
├── feedback/
│ └── user_feedback.csv # Persisted user relevance feedback
├── notebooks/
│ ├── milestone1_exploration.ipynb # EDA and preprocessing notebook
│ └── milestone2_rag.ipynb # Semantic, BM25, and hybrid retriever experimentation notebook
├── results/
│ ├── milestone1_discussion.md # Qualitative evaluation write-up
│ ├── milestone2_discussion.md # Model selection and system prompt evaluation write-up
│ └── final_discussion.md # LLM comparison, tool use, scaling, and deployment write-up
├── src/
│ ├── bm25.py # BM25 index construction and search
│ ├── semantic.py # FAISS index construction and semantic search
│ ├── hybrid.py # RRF-based hybrid search
│ ├── rag_pipeline.py # RAG pipeline with hybrid retrieval and qwen2.5 generation
│ ├── prompts.py # System prompt templates for LLM grounding
│ ├── tools.py # Tavily web search tool integration
│ └── utils.py # Tokenization, I/O, and shared utilities
├── LICENSE # MIT License
├── README.md # Project overview and setup instructions
└── environment.yml # Conda environment specification
This project uses the meta_Electronics.jsonl and Electronics.jsonl datasets from Amazon Reviews 2023, which contain product metadata and customer reviews, respectively, for the Electronics category.
Raw dataset sizes:
- `Electronics.jsonl`: ~18.3 million customer reviews
- `meta_Electronics.jsonl`: ~1.6 million unique products
Subset used: Because embedding and indexing 1.6 M products is computationally prohibitive, the preprocessing notebook samples 200,000 products controlled by the constant MAX_PRODUCTS = 200_000 in notebooks/milestone1_exploration.ipynb. Products are selected by taking the first 200 k distinct parent_asin values from the metadata file (in file order). Only reviews whose parent_asin appears in that 200 k set are retained. This gives a 200,000-row retrieval corpus with one document per product.
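The sampling rule described above (first N distinct `parent_asin` values in file order, then keep only matching reviews) can be sketched in plain Python. The notebook itself uses Polars; the helper names and the tiny inline JSONL data here are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, List, Set


def first_n_distinct_asins(meta_lines: Iterable[str], n: int) -> List[str]:
    """Return the first n distinct parent_asin values, preserving file order."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for line in meta_lines:
        if len(ordered) >= n:
            break  # stop early once the cap is reached
        asin = json.loads(line).get("parent_asin")
        if asin is not None and asin not in seen:
            seen.add(asin)
            ordered.append(asin)
    return ordered


def keep_reviews(review_lines: Iterable[str], keep: Set[str]) -> List[Dict]:
    """Retain only reviews whose parent_asin is in the sampled set."""
    reviews = (json.loads(line) for line in review_lines)
    return [r for r in reviews if r.get("parent_asin") in keep]


MAX_PRODUCTS = 2  # tiny stand-in for MAX_PRODUCTS = 200_000
meta = ['{"parent_asin": "A"}', '{"parent_asin": "B"}',
        '{"parent_asin": "A"}', '{"parent_asin": "C"}']
reviews = ['{"parent_asin": "A", "text": "good"}',
           '{"parent_asin": "C", "text": "meh"}',
           '{"parent_asin": "B", "text": "fine"}']

sampled = first_n_distinct_asins(meta, MAX_PRODUCTS)
kept = keep_reviews(reviews, set(sampled))
```

Because selection follows file order rather than random sampling, the subset is reproducible but not necessarily representative of the full catalogue.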
Each document in the retrieval corpus is created by combining product metadata and review text into a single retrieval_text field. The following metadata columns are stored alongside each document for display:
| Field | Description |
|---|---|
| `parent_asin` | Unique product identifier |
| `product_title` | Full product name |
| `description` | Product description |
| `main_category` | Top-level product category |
| `store` | Brand/store name |
| `price` | Listed price |
| `average_rating` | Mean star rating |
| `rating_number` | Total number of ratings |
| `review_count` | Number of text reviews |
| `features` | Bullet-point product features |
| `categories` | Category hierarchy |
| `all_review_titles` | Concatenated review titles |
| `review_text_200` | First 200 characters of a representative review |
All preprocessing is handled in notebooks/milestone1_exploration.ipynb, which outputs a retrieval_corpus.parquet file to data/processed/.
Why Polars lazy evaluation? The raw review file has ~18.3 M rows. Loading it eagerly with pandas would cause out-of-memory errors on most laptops. Instead, the notebook uses pl.scan_ndjson() to build a lazy query plan and sink_parquet() to stream results directly to disk in chunks, so that only the subset of rows needed at each step is read into RAM. This makes the full pipeline runnable on a standard laptop without any cloud compute.
The key preprocessing steps are:
- Streaming reads: both JSONL files are opened lazily with `pl.scan_ndjson()` to avoid out-of-memory errors
- Sampling: 200,000 unique `parent_asin` values are selected with `.unique().limit(200_000)`; both the metadata and reviews frames are then filtered to that set via a semi-join
- Review aggregation: reviews are grouped by `parent_asin`; review titles are joined into `all_review_titles` and the first 200 characters of the combined review text are stored in `review_text_200`
- Joining: grouped reviews are left-joined onto product metadata so every product gets one corpus row even if it has no reviews
- `retrieval_text` construction: a single searchable string is built by concatenating the product title, main category, store, features, description, average rating, categories, all review titles, and the review text snippet, so that all of these fields are searchable by end users
- Persisting final corpus: the final frame is streamed to `data/processed/retrieval_corpus.parquet` via `sink_parquet()`
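The aggregation, left join, and `retrieval_text` steps above can be sketched in plain Python as follows. The notebook does this with Polars; the helper names are hypothetical, the review field names `title` and `text` are assumed, and only a subset of the concatenated fields is shown to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_reviews(reviews: List[Dict]) -> Dict[str, Dict[str, str]]:
    """Group reviews by parent_asin and build the two review-derived fields."""
    titles, texts = defaultdict(list), defaultdict(list)
    for r in reviews:
        titles[r["parent_asin"]].append(r.get("title") or "")
        texts[r["parent_asin"]].append(r.get("text") or "")
    return {
        asin: {
            "all_review_titles": " ".join(titles[asin]),
            "review_text_200": " ".join(texts[asin])[:200],
        }
        for asin in titles
    }


def build_corpus(products: List[Dict], reviews: List[Dict]) -> List[Dict]:
    """Left-join grouped reviews onto products and add retrieval_text."""
    grouped = aggregate_reviews(reviews)
    empty = {"all_review_titles": "", "review_text_200": ""}
    rows = []
    for p in products:
        # Products without reviews still get a row (left-join semantics).
        row = {**p, **grouped.get(p["parent_asin"], empty)}
        parts = [row.get("product_title"), row.get("main_category"),
                 row.get("store"), row["all_review_titles"], row["review_text_200"]]
        row["retrieval_text"] = " ".join(str(x) for x in parts if x)
        rows.append(row)
    return rows


products = [
    {"parent_asin": "P1", "product_title": "USB-C Hub", "store": "Acme"},
    {"parent_asin": "P2", "product_title": "HDMI Cable", "store": "Acme"},
]
reviews = [
    {"parent_asin": "P1", "title": "Great", "text": "Works with my laptop."},
    {"parent_asin": "P1", "title": "Okay", "text": "Runs a bit warm."},
]
corpus = build_corpus(products, reviews)
```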
utils.py provides two tokenization paths, one for queries at search time and one for the corpus at index build time:
- Python tokenizer (`tokenize()`): used for short user queries; lowercases, strips non-alphanumeric characters, and removes English stop words
- Polars vectorized tokenizer (`polars_tokenize_expr()`): used for bulk corpus tokenization at index build time, using Polars expressions for faster tokenization
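A minimal sketch of the query-side tokenizer, under stated assumptions: the stop-word set below is a small hypothetical sample rather than the full English list the project uses, and the regular expression is one plausible reading of "strips non-alphanumeric characters".

```python
import re
from typing import List

# Small illustrative stop-word set; the project uses a full English list.
STOP_WORDS = frozenset({"a", "an", "and", "the", "for", "of", "in", "on", "to", "with", "is"})

# Anything that is not a lowercase letter, digit, or whitespace becomes a space.
_NON_ALNUM = re.compile(r"[^a-z0-9\s]+")


def tokenize(text: str) -> List[str]:
    """Lowercase, strip non-alphanumeric characters, drop stop words."""
    cleaned = _NON_ALNUM.sub(" ", text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]


tokens = tokenize("The BEST noise-cancelling headphones for $200!")
# tokens == ["best", "noise", "cancelling", "headphones", "200"]
```

Whatever rules the query tokenizer applies, the corpus tokenizer has to apply the same ones, otherwise query terms will not match indexed terms.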
Implemented in src/bm25.py using the rank_bm25 library.
Build path: The corpus is loaded from Parquet in chunks, tokenized using the Polars vectorized expression, and used to construct a BM25Okapi index. Both the tokenized corpus and the BM25 index are persisted to disk as pickle files so they only need to be built once.
Fast path: On future runs, load_or_build_search_artifacts() checks for an existing BM25 index and metadata pickle. If both exist, they are loaded directly without re-reading the corpus, which skips tokenization and index construction entirely.
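The load-or-build pattern behind the fast path can be sketched generically. `load_or_build` below is a hypothetical helper, not the project's `load_or_build_search_artifacts()`, and the dictionary stands in for a real BM25 index.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(path: Path, build: Callable[[], Any]) -> Any:
    """Load a pickled artifact if it exists; otherwise build and persist it."""
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh)  # fast path: no rebuild
    artifact = build()  # slow path: runs only when the file is missing
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(artifact, fh)
    return artifact


calls = []  # records how many times the expensive build runs


def expensive_build() -> dict:
    calls.append(1)
    return {"vocab": ["usb", "hdmi"]}


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "processed" / "bm25_index.pkl"
    first = load_or_build(target, expensive_build)   # builds and writes
    second = load_or_build(target, expensive_build)  # loads from disk
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.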
Search: The user's query is tokenized with the Python tokenizer, scored against the BM25 index with get_scores(), and the top-k results are selected using np.argpartition (avoids a full sort for speed).
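The top-k selection step can be sketched with NumPy as below. The function name `top_k_indices` and the sample scores are assumptions; the point is that `np.argpartition` isolates the k best scores without sorting the whole array, and only those k candidates are then sorted.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first, without a full sort."""
    k = min(k, scores.shape[0])
    if k <= 0:
        return np.empty(0, dtype=np.int64)
    # Partition so the k largest scores occupy the first k slots (unordered).
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sort only those k candidates by descending score.
    return candidates[np.argsort(-scores[candidates])]


scores = np.array([0.1, 2.5, 0.0, 1.7, 0.9])
best = top_k_indices(scores, 3)
```

For a corpus of n documents, this costs roughly O(n + k log k) instead of the O(n log n) of a full sort.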
Implemented in src/semantic.py using sentence-transformers and FAISS.
Model: all-MiniLM-L6-v2: a lightweight sentence embedding model that produces 384-dimensional normalized embeddings well-suited for cosine similarity search.
Index: FAISS IndexHNSWFlat with inner product metric (equivalent to cosine similarity on normalized embeddings). HNSW was chosen over IndexFlatIP for its approximate nearest neighbour approach, which gives substantially faster query times on large corpora with minimal recall loss.
Build path: The corpus is processed in chunks. Each chunk is embedded and saved to disk as a .npy file. Once all chunks are processed, the embeddings are added to the FAISS index and persisted as a .index file. Metadata rows are saved as a parallel pickle.
Fast path: If the FAISS index and metadata pickle both exist, they are loaded directly. If only the index is missing but the embedding chunk .npy files exist, the index is rebuilt from those chunks without re-embedding the entire corpus.
Search: The query is embedded with the same model (with normalize_embeddings=True), and index.search() returns the top-k nearest neighbours by cosine similarity.
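The reason an inner-product index yields cosine similarity on normalized embeddings can be checked with a small worked example. This is plain Python with made-up 3-dimensional vectors, not FAISS or the real 384-dimensional embeddings.

```python
import math
from typing import List


def l2_normalize(v: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def inner_product(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


query = [3.0, 4.0, 0.0]  # norm 5
doc = [1.0, 2.0, 2.0]    # norm 3

ip_on_unit_vectors = inner_product(l2_normalize(query), l2_normalize(doc))
cos = cosine_similarity(query, doc)  # (3 + 8) / (5 * 3) = 11/15
```

Normalizing once at index and query time moves the division by the norms out of the search loop, so the index only needs a dot product per comparison.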
Why chunk and persist embedding partitions? Embedding 200,000 documents takes several minutes and can exhaust RAM if done in one pass. The build path splits the corpus into fixed-size chunks, embeds each one, and immediately writes the result to a numbered .npy file (embedding_chunks/chunk_*.npy). This gives the following benefits: (1) persistence resilience: if the process is interrupted for some reason, already-written chunks are not re-computed on the next run; (2) less memory per chunk: only one chunk of embeddings lives in RAM at a time during build; (3) fast index rebuilds: if the FAISS .index file is deleted but the .npy chunks still exist, the index is recreated from the saved chunks in seconds without re-embedding anything.
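The resume behaviour described above can be sketched as follows. The function names, the zero-padded chunk naming, and `fake_embed` (a stand-in for the sentence-transformer) are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence

import numpy as np


def embed_in_chunks(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], np.ndarray],
    chunk_dir: Path,
    chunk_size: int,
) -> List[Path]:
    """Embed texts chunk by chunk, writing each chunk to its own .npy file."""
    chunk_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, start in enumerate(range(0, len(texts), chunk_size)):
        path = chunk_dir / f"chunk_{n:05d}.npy"
        if not path.exists():  # resume: skip chunks written by an earlier run
            np.save(path, embed(texts[start:start + chunk_size]))
        paths.append(path)
    return paths


def load_chunks(paths: List[Path]) -> np.ndarray:
    """Stack saved chunks into one matrix, e.g. when rebuilding the index."""
    return np.vstack([np.load(p) for p in paths])


calls = []  # batch sizes passed to the embedder, to show what was computed


def fake_embed(batch: Sequence[str]) -> np.ndarray:
    calls.append(len(batch))
    return np.array([[float(len(t)), 1.0] for t in batch], dtype=np.float32)


texts = ["usb hub", "camera", "ssd", "headphones", "cable"]
with tempfile.TemporaryDirectory() as tmp:
    chunk_dir = Path(tmp) / "embedding_chunks"
    paths = embed_in_chunks(texts, fake_embed, chunk_dir, chunk_size=2)
    calls_after_first_run = list(calls)
    embed_in_chunks(texts, fake_embed, chunk_dir, chunk_size=2)  # nothing re-embedded
    matrix = load_chunks(paths)
```

This sketch assumes the corpus and chunk size stay the same between runs; if either changes, the saved chunks no longer line up with the texts and should be deleted first.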
Implemented in src/hybrid.py using Reciprocal Rank Fusion.
BM25 and semantic similarity scores live on entirely different numeric scales, so raw score fusion is not meaningful. Instead, hybrid search fuses the rank positions from each retrieval method using the RRF formula:
hybrid_score(doc) = 1 / (k + rank_bm25(doc)) + 1 / (k + rank_semantic(doc))
Where k = 60 is the standard RRF smoothing constant that dampens the influence of very high-ranked results and prevents any single method from dominating.
Each retrieval method is run with a candidate_multiplier (default 3×), so each retriever returns more candidates than the final top-k before the two ranked lists are fused. This ensures that documents ranked highly by one method but outside the naive top-k of the other are still captured.
Documents are then ranked by their combined hybrid score in descending order.
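A minimal sketch of the fusion step, using the formula above with k = 60. The function name `rrf_fuse` and the sample rankings are assumptions, as are two details the text does not specify: ranks start at 1, and a document missing from one list simply receives no contribution from that list.

```python
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(
    rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 5
) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_k]


bm25_ranking = ["A", "B", "C"]      # best first
semantic_ranking = ["C", "A", "D"]  # best first
fused = rrf_fuse([bm25_ranking, semantic_ranking], k=60, top_k=3)
# A: 1/61 + 1/62, C: 1/63 + 1/61, B: 1/62, D: 1/63
```

Here A and C appear in both lists and outrank B and D, which each appear in only one; A edges out C because its two ranks (1 and 2) are better than C's (3 and 1).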
This project uses a conda environment. To recreate it:
First clone the repo:
git clone https://github.com/UBC-MDS/DSCI-575-amazon-review-retrieval-jasjot-karan.git
cd DSCI-575-amazon-review-retrieval-jasjot-karan
Then create and activate the environment:
conda env create -f environment.yml
conda activate amazon-retrieval
The LLM generation step runs qwen2.5 locally via Ollama. Install it before launching the app:
macOS (Homebrew):
brew install ollama
# or download the installer from https://ollama.com/download
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: download the installer from https://ollama.com/download.
Then start the Ollama server and pull the model:
ollama serve # start the local server (leave this running)
ollama pull qwen2.5 # download the model (~4.7 GB, one-time)
ollama list # confirm qwen2.5 appears in the list
Copy .env.example to .env and fill in the values your local setup requires. You'll need a TAVILY_API_KEY, which you can create by signing up at tavily.com and generating a key from the dashboard.
cp .env.example .env
Key dependencies:
| Package | Version | Purpose |
|---|---|---|
| `python` | 3.12 | Core runtime |
| `polars` | 1.39.3 | Fast DataFrame processing for corpus loading |
| `faiss-cpu` | 1.8.* | Vector index for semantic search |
| `sentence-transformers` | 3.0.1 | Text embeddings |
| `rank-bm25` | 0.2.2 | BM25 implementation |
| `streamlit` | 1.36.0 | Web UI |
| `scikit-learn` | 1.5.0 | Stop words and utilities |
| `numpy` | 1.26.* | Numerical operations |
Download processed_data.zip from the repo release "Processed data artifacts v1.0.0" (tag data-1.0.0) and unzip it. The zip extracts to a folder; move the files inside that folder into data/processed/ in the project directory, so the final layout looks like:
data/processed/
├── retrieval_corpus.parquet
├── bm25_index.pkl
├── faiss_index.index
├── metadata_rows.pkl
└── embedding_chunks/
└── chunk_*.npy
Once those files are in place, you can skip the notebook and index-build steps entirely: run streamlit run app/app.py from the project directory, and the app will be served at http://localhost:8501
Place the raw Amazon Electronics dataset files into data/raw/:
- `Electronics.jsonl` (~18.3 M reviews)
- `meta_Electronics.jsonl` (~1.6 M products)
Then run the preprocessing notebook to generate the retrieval corpus:
jupyter notebook notebooks/milestone1_exploration.ipynb
This streams both files using Polars lazy evaluation (scan_ndjson) and writes data/processed/retrieval_corpus.parquet containing 200,000 product documents.
The indexes are built automatically on first launch of the app (or by running the module scripts directly). To pre-build manually:
# Build BM25 index
python src/bm25.py
# Build FAISS semantic index
python src/semantic.py
Both scripts persist their artifacts to data/processed/, so subsequent runs load them from disk instead of rebuilding.
streamlit run app/app.py
The app will be available at http://localhost:8501. On first launch, it builds and caches all search artifacts automatically, so later launches load them almost instantly.
Once the app is running, try the following example queries to see how each retrieval method performs:
| Query Type | Example | Expected Best Method |
|---|---|---|
| Brand + model number | `"Sony WH-1000XM5 noise cancelling"` | BM25 |
| Technical spec | `"USB 3.2 Gen 2 SSD 1TB"` | BM25 |
| Intent / use-case | `"headphones good for working from home"` | Semantic |
| Problem-based | `"camera for hiking and outdoor photography"` | Semantic |
| Multi-constraint | `"fast USB-C hub with HDMI for MacBook under $50"` | Hybrid |
Every result card in the Search Results tab includes a 👍 / 👎 button pair below it. Clicking either button immediately appends a record to feedback/user_feedback.csv with the following fields:
- `timestamp_utc`, `query`, `search_type`, `rank`, `score`, `feedback` (`"up"` or `"down"`)
- `parent_asin`, `product_title`, `main_category`, `average_rating`, `rating_number`, `review_count`, `review_text_200`
Each button has a unique key scoped to the search type, query, rank, and product ASIN, so feedback for different results is always recorded independently. The CSV is created automatically on first feedback submission and appended to on every subsequent click. This data can be used to compute precision@k and other IR metrics offline.
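The create-on-first-write, append-afterwards behaviour can be sketched with the standard library. `append_feedback` is a hypothetical helper, the field list is a shortened subset of the columns listed above, and the sample records are made up.

```python
import csv
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict

# Shortened, illustrative subset of the feedback columns.
FIELDS = ["timestamp_utc", "query", "search_type", "rank", "score",
          "feedback", "parent_asin", "product_title"]


def append_feedback(path: Path, record: Dict) -> None:
    """Append one feedback row, writing the header only for a new file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    row = {"timestamp_utc": datetime.now(timezone.utc).isoformat(), **record}
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
        if write_header:
            writer.writeheader()
        writer.writerow(row)


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "feedback" / "user_feedback.csv"
    append_feedback(csv_path, {"query": "usb-c hub", "search_type": "hybrid",
                               "rank": 1, "score": 0.0325, "feedback": "up",
                               "parent_asin": "B000TEST01",
                               "product_title": "Example Hub"})
    append_feedback(csv_path, {"query": "usb-c hub", "search_type": "hybrid",
                               "rank": 2, "score": 0.0323, "feedback": "down",
                               "parent_asin": "B000TEST02",
                               "product_title": "Example Dock"})
    with csv_path.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    raw_lines = csv_path.read_text(encoding="utf-8").splitlines()
```

Storing `search_type` and `rank` with every row is what makes it possible to compute precision@k per retrieval method later.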
A 30-query evaluation set (10 easy, 10 medium, and 10 difficult) and a detailed comparison of BM25 vs. semantic retrieval are documented in results/milestone1_discussion.md.
Key findings:
- BM25 excels on technical/factoid queries with exact measurements, model numbers, or brand names.
- Semantic search excels on intent-based natural language queries describing use cases or scenarios.
- Hybrid (RRF) captures the strengths of both by rewarding documents ranked highly by either method.
- Both BM25 and semantic search degrade on complex multi-constraint queries, motivating future reranking work.
| Name | GitHub |
|---|---|
| Jasjot Parmar | @jasjotp |
| Karan Bains | @karanbayns |