This project builds an information retrieval system over the Amazon Reviews 2023 dataset, focusing on the Grocery & Gourmet Food category. Given a natural language query, the system retrieves the most relevant products using three complementary approaches:
- BM25: a classical keyword-based retrieval method using term frequency and inverse document frequency. This project uses the LangChain BM25 retriever.
- Semantic Search: dense vector retrieval using sentence embeddings (all-MiniLM-L6-v2) and a FAISS index.
- Hybrid Search: combines BM25 and Semantic Search using Reciprocal Rank Fusion (RRF) to produce a more robust and balanced ranking of documents.
The app allows users to switch between BM25 and Semantic search modes, view the top-3 product results with reviews, and provide relevance feedback (thumbs up/down), which is recorded in a CSV file. An AI Assistant tab powered by a Hybrid RAG pipeline (BM25 + Semantic + Llama-3-8B) is also available for natural language product recommendations. Additionally, the system includes a web search tool (Tavily) that augments LLM responses with live external information (e.g. current pricing, availability) when the query requires it. Both the BM25 index (603,274 products) and the FAISS semantic index (100,500 products, smaller due to computing constraints) are hosted on HuggingFace and loaded automatically by the app.
Live App: Grocery & Gourmet Food Search
Source: McAuley-Lab/Amazon-Reviews-2023 on HuggingFace
Category: Grocery & Gourmet Food
| File | Rows | Description |
|---|---|---|
| raw_review_Grocery_and_Gourmet_Food | 14,318,520 | User reviews |
| raw_meta_Grocery_and_Gourmet_Food | 603,274 | Product metadata |
We combine both datasets so that search can draw on as much useful information as possible. Each product becomes a LangChain document built from key metadata fields (title, description, categories) together with the title and text of its user reviews. The parent_asin field is the join key between the two datasets. The main preprocessing steps are extracting the useful columns, creating the documents, and converting them to tokens for BM25 or to embeddings for FAISS. For more information, see the data preprocessing notebooks.
We use the HuggingFace datasets package, which lets us query the Arrow-backed data with SQL through DuckDB.
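The join on parent_asin described above can be sketched in plain Python. This is a minimal illustration, not the code from the preprocessing notebooks: the function name `build_documents`, the `max_reviews` cap, and the assumption that description and categories arrive as lists of strings are ours, and documents are plain dicts rather than LangChain objects.

```python
from collections import defaultdict


def build_documents(metadata_rows, review_rows, max_reviews=3):
    """Join reviews onto product metadata via parent_asin, one document per product."""
    # Group reviews by the shared key so each product can look up its own reviews.
    reviews_by_product = defaultdict(list)
    for review in review_rows:
        reviews_by_product[review["parent_asin"]].append(review)

    documents = []
    for product in metadata_rows:
        asin = product["parent_asin"]
        # Assumption: description and categories are lists of strings.
        parts = [
            product.get("title") or "",
            " ".join(product.get("description") or []),
            " ".join(product.get("categories") or []),
        ]
        # max_reviews is a hypothetical cap to keep documents short.
        for review in reviews_by_product[asin][:max_reviews]:
            parts.append(f"{review.get('title', '')} {review.get('text', '')}".strip())
        page_content = "\n".join(part for part in parts if part)
        documents.append({"page_content": page_content, "metadata": {"parent_asin": asin}})
    return documents


metadata_rows = [
    {"parent_asin": "A1", "title": "Organic Green Tea",
     "description": ["Loose leaf tea."], "categories": ["Grocery", "Tea"]},
    {"parent_asin": "A2", "title": "Dark Chocolate Bar",
     "description": [], "categories": ["Grocery", "Candy"]},
]
review_rows = [
    {"parent_asin": "A1", "title": "Great tea", "text": "Smooth and fragrant."},
    {"parent_asin": "A1", "title": "Good value", "text": "Would buy again."},
]
documents = build_documents(metadata_rows, review_rows)
```

Products without reviews still produce a document from their metadata alone, so the retrieval index can cover every product in the metadata file.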
We use a sparse method (BM25), a dense method (FAISS), and a hybrid combination of the two. For more details please check notebooks related to bm25 and faiss.
BM25 is a sparse, keyword-based retrieval method. At query time, the query is tokenized using the same tokenizer used during preprocessing. The BM25 index scores all documents based on term frequency and inverse document frequency (TF-IDF-like), returning the top-k most relevant documents ranked by score. We return and display this score in the app. Scores are computed via vectorizer.get_scores(), which returns a numpy array of BM25 scores across all documents in the index.
A sample corpus is provided in data/processed/tokenisation.
flowchart LR
reviews[("Reviews")] --> docs["Documents"]
metadata[("Metadata")] --> docs
docs --> tokens["Tokens\n(BM25 Index)"]
tokens --> scores["BM25 Scores\n(numpy array)"]
query(["Query"]) --> qtokens["Query\nTokens"]
qtokens --> scores
scores --> topk["Top-k\nDocuments"]
topk --> output["Output JSON\n(metadata)"]
output --> app["App\nHTML"]
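To make the scoring step in the flow above concrete, here is a small pure-Python BM25 scorer. It is a sketch under stated assumptions, not the project's retriever: the class `TinyBM25`, the whitespace tokenizer, the parameter values k1=1.5 and b=0.75, and the always-positive IDF variant are illustrative choices, and `get_scores` here returns a plain list rather than a numpy array.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the project may tokenize differently.
    return text.lower().split()


class TinyBM25:
    def __init__(self, corpus_tokens, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_term_counts = [Counter(doc) for doc in corpus_tokens]
        self.doc_lengths = [len(doc) for doc in corpus_tokens]
        self.avg_len = sum(self.doc_lengths) / len(corpus_tokens)
        n_docs = len(corpus_tokens)
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
        # IDF variant that is always non-negative.
        self.idf = {
            term: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def get_scores(self, query_tokens):
        """Return one BM25 score per document, in corpus order."""
        scores = []
        for counts, length in zip(self.doc_term_counts, self.doc_lengths):
            score = 0.0
            for term in query_tokens:
                tf = counts.get(term, 0)
                if tf == 0:
                    continue
                # Length normalization: longer documents need more matches.
                norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self.idf[term] * tf * (self.k1 + 1) / norm
            scores.append(score)
        return scores


def top_k(scores, k):
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


corpus = [
    "organic green tea loose leaf",
    "dark chocolate bar with sea salt",
    "green tea bags with mint",
]
index = TinyBM25([tokenize(doc) for doc in corpus])
scores = index.get_scores(tokenize("green tea"))
best = top_k(scores, k=2)
```

The key point is that the query must go through the same tokenizer as the corpus; otherwise query terms will not match the indexed terms and every score will be zero.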
FAISS (Facebook AI Similarity Search) is a dense, embedding-based retrieval method. At query time, the query is converted into a vector embedding using the same embedding model used during preprocessing. FAISS then performs an approximate nearest neighbour (ANN) search over the indexed document embeddings, returning the top-k most semantically similar documents by cosine similarity. We return and display this score in the app.
Sample embeddings are provided in data/processed/embeddings.
flowchart LR
reviews[("Reviews")] --> docs["Documents"]
metadata[("Metadata")] --> docs
docs --> embeddings["Embeddings\n(FAISS Index)"]
embeddings --> similar["Most Similar\nEmbeddings"]
query(["Query"]) --> qembed["Query\nEmbedding"]
qembed --> similar
similar --> output["Output JSON\n(metadata)"]
output --> app["App\nHTML"]
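The similarity step in the flow above can be illustrated with a brute-force cosine search in plain Python. This is an exact search over a toy list of vectors, used here only as a stand-in for the FAISS index; the function names and the three-dimensional vectors are made up for the example, whereas real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine_top_k(query_vector, doc_vectors, k):
    """Return the k (index, cosine similarity) pairs most similar to the query."""
    q = normalize(query_vector)
    scored = []
    for idx, vec in enumerate(doc_vectors):
        d = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for encoded product documents.
doc_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
results = cosine_top_k([1.0, 0.1, 0.0], doc_vectors, k=2)
```

Normalizing both sides first is what lets an inner-product search behave as a cosine-similarity search, which is a common way to set up a vector index for sentence embeddings.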
We merge BM25 and FAISS into a hybrid retriever using Reciprocal Rank Fusion (RRF), which combines semantic similarity with keyword-based relevance to produce a more robust and balanced ranking. RRF assigns each document a score based on its rank position in each retriever's results, then sums these scores with configurable weights, allowing us to control the trade-off between contextual understanding and exact term matching. The hybrid retriever is implemented in src/hybrid.py and is the retriever used by the AI Assistant tab.
flowchart LR
query["query"] --> sem["FAISS retriever"]
query --> bm["BM25 retriever"]
sem --> semtop["Top-k semantic"]
bm --> bmtop["Top-k BM25"]
semtop --->|50% weight| comb["RRF Combined Output docs"]
bmtop --->|50% weight| comb
comb --> output["Output JSON\n(metadata)"]
output --> app["App\nHTML"]
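The weighted fusion in the diagram above can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the contents of src/hybrid.py: the function name is ours, and the smoothing constant k=60 is a commonly used default that we assume here rather than a value taken from the project.

```python
def reciprocal_rank_fusion(rankings, weights, k=60):
    """Fuse ranked lists of document ids into one list of (doc_id, score) pairs.

    Each document receives weight / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    fused = {}
    for ranked_ids, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


semantic_top = ["B", "A", "C"]  # top-k ids from the FAISS retriever
bm25_top = ["A", "D", "B"]      # top-k ids from the BM25 retriever
fused = reciprocal_rank_fusion([semantic_top, bm25_top], weights=[0.5, 0.5])
```

Because only rank positions are used, the incompatible score scales of BM25 and cosine similarity never have to be reconciled. Documents found by both retrievers (A and B here) rise above those found by only one.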
- Clone the repository using HTTPS

      git clone https://github.com/UBC-MDS/DSCI_575_project_sbj1_rishadaz.git

  Or, SSH

      git clone git@github.com:UBC-MDS/DSCI_575_project_sbj1_rishadaz.git

  Navigate to the project root

      cd DSCI_575_project_sbj1_rishadaz

- Create and activate the environment

      conda env create -f environment.yml
      conda activate 575-proj

- Set up environment variables

      touch .env
      # Add your HuggingFace token to .env:
      HF_TOKEN=hf_xxxxxxxxxxxxxxxxxx
      # Add your Tavily API key for web search augmentation (free tier at tavily.com):
      TAVILY_API_KEY=tvly_xxxxxxxxxxxxxxxxxx
      # To run on sample pkl/faiss files already present in github repo
      DATA_SOURCE=local
      # To run on full generated files from HF Datasets (takes longer time to download)
      DATA_SOURCE=remote

- Build the indices (optional, only needed for the full local index)

  The app has the option to load the full indices from HuggingFace automatically. To build locally:

  BM25:

      # Open and run notebooks/milestone1_bm25.ipynb
      # This downloads the dataset and saves to data/processed/bm25_index.pkl

  Semantic:

      # Open and run notebooks/milestone1_semantic.ipynb
      # This saves to data/processed/embeddings/

- Run the app locally

      streamlit run app/app.py

  The app will automatically use the full local index if available; otherwise it falls back to the smaller subset files in data/processed/.
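As a rough illustration of how the DATA_SOURCE setting might be read, the sketch below parses .env-style text with the standard library only. The helper names `parse_env` and `resolve_data_source` and the "local" default are hypothetical; the app itself may load its configuration differently, for example through a dotenv library.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def resolve_data_source(env, default="local"):
    """Return 'local' or 'remote'; the default is an assumption of this sketch."""
    source = env.get("DATA_SOURCE", default).lower()
    if source not in {"local", "remote"}:
        raise ValueError(f"DATA_SOURCE must be 'local' or 'remote', got {source!r}")
    return source


sample_env = """
# To run on sample pkl/faiss files already present in github repo
DATA_SOURCE=local
"""
env = parse_env(sample_env)
data_source = resolve_data_source(env)
```

In this sketch a later DATA_SOURCE line overwrites an earlier one, so keeping only the line for the mode you want avoids ambiguity.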
Evaluation and exploration can be generated here and are summarized here. In some cases BM25 performed better, and in others FAISS did. The scores of the two methods are on different scales and cannot be compared directly, but we can still compare how each method prioritizes items. We have not implemented scoring based on other factors such as popularity or rating; products are ranked solely by their retrieval score.
RAG pipeline evaluation is generated here and summarized here.
We query hosted LLMs through the HuggingFace API and selected meta-llama/Meta-Llama-3-8B-Instruct with a max_token limit of 512 tokens. Because the app is small in scale, we expect the free-tier API calls to be sufficient, and the LLM performed well across the many iterations we tested.
The AI Assistant tab in the app exposes the full RAG pipeline to the user: enter a grocery query and receive AI-generated product recommendations along with recipe ideas and storage tips, grounded in the retrieved product reviews and metadata.
We tested two kinds of retrievers for the RAG pipeline: a fully semantic (FAISS) retriever and a hybrid (BM25 + semantic) retriever with equal weights. We found that the hybrid retriever worked better overall, and it is the sole RAG retriever used in the AI Assistant tab. In the future, we could add a slider to control the weight ratio in the hybrid retriever.
Both the semantic and hybrid retrievers can be explored in this notebook with different prompts and parameters. The rag_pipeline object returns a tuple whose second item is the context retrieved by the retriever, so the answer and the retrieval can be inspected together. Passing verbose=True also prints the entire context sent to the LLM after each step, for clearer exploration.
Here is the basic workflow:
flowchart LR
reviews["Retriever\n(Hybrid or Semantic)"] --> docs["Top k\nDocuments"]
docs --> embeddings["Create\nPage Context"]
embeddings --> similar["Prompt"]
sys_pro["SYSTEM Prompt"] --> similar
query(["User"]) --> qembed["Query"]
qembed --> reviews
qembed --> similar
similar --> response["LLM response"] --> output["Output JSON\n(content + metadata)"]
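The workflow above can be sketched with stand-in components. Everything here is illustrative: the system prompt text, `build_context`, the `top_k` parameter, and the stub retriever and LLM are assumptions made for the example, and the real rag_pipeline in the repository may be structured differently. The sketch only mirrors the documented behaviour of returning a tuple whose second item is the retrieved context.

```python
# Hypothetical system prompt; the project's actual prompt is not shown here.
SYSTEM_PROMPT = (
    "You are a grocery shopping assistant. "
    "Answer using only the provided product context."
)


def build_context(documents):
    """Turn retrieved documents into one numbered context string."""
    blocks = []
    for position, doc in enumerate(documents, start=1):
        asin = doc["metadata"].get("parent_asin", "unknown")
        blocks.append(f"[{position}] parent_asin={asin}\n{doc['page_content']}")
    return "\n\n".join(blocks)


def rag_pipeline(query, retriever, llm, top_k=3, verbose=False):
    """Retrieve, build the prompt, call the LLM; return (answer, documents)."""
    documents = retriever(query)[:top_k]
    context = build_context(documents)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {query}"
    if verbose:
        print(prompt)  # show exactly what is sent to the LLM
    return llm(prompt), documents


# Stubs so the sketch runs without any model or index.
def fake_retriever(query):
    return [{"page_content": "Organic green tea, loose leaf.",
             "metadata": {"parent_asin": "A1"}}]


def fake_llm(prompt):
    return f"(stub answer based on {prompt.count('parent_asin=')} product)"


answer, retrieved = rag_pipeline("best green tea", fake_retriever, fake_llm)
```

Passing the retriever and the LLM in as callables makes it straightforward to swap the semantic retriever for the hybrid one and compare their retrieved context on the same query.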
The RAG pipeline includes an optional web search tool powered by Tavily. When a query contains keywords suggesting current or external information is needed (e.g. price, current, gluten, organic, nutrition), the system fetches live web results and injects them into the LLM context. Source URLs are displayed as clickable links in the Web Sources section of the AI tab. The tool requires a TAVILY_API_KEY in .env (free tier available at tavily.com). If the key is missing or the call fails, the pipeline continues normally without web augmentation.
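The keyword trigger and the fall-through behaviour can be sketched like this. The keyword tuple lists only the examples named above, the function names are ours, and `search_fn` is a placeholder for whatever client call the project makes to Tavily; no real Tavily API is shown or assumed.

```python
# Example keywords from the description above; the project's list may be longer.
WEB_KEYWORDS = ("price", "current", "gluten", "organic", "nutrition")


def needs_web_search(query):
    """Return True when the query mentions any trigger keyword (substring match)."""
    lowered = query.lower()
    return any(keyword in lowered for keyword in WEB_KEYWORDS)


def fetch_web_context(query, search_fn, api_key):
    """Return web results, or an empty list so the pipeline can continue without them."""
    if not api_key or not needs_web_search(query):
        return []
    try:
        return search_fn(query, api_key)
    except Exception:
        # Any failure in the search call degrades to no web augmentation.
        return []


# Stubs standing in for a real search client.
def fake_search(query, api_key):
    return [{"url": "https://example.com/tea", "content": "Example snippet."}]


def failing_search(query, api_key):
    raise RuntimeError("network error")


hits = fetch_web_context("current price of green tea", fake_search, "demo-key")
no_key = fetch_web_context("current price of green tea", fake_search, "")
not_needed = fetch_web_context("best tea for mornings", fake_search, "demo-key")
failed = fetch_web_context("organic honey", failing_search, "demo-key")
```

Substring matching is deliberately loose, so "currently" also triggers a search; a stricter version could match whole words instead.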
As with the Search function, some metrics and exploration can be generated here and are summarized here. We found that although the LLM was slightly unpredictable, it performed well for most simple grocery queries. We tried to depend as little as possible on the output formatting so that the code does not break in edge cases, e.g. when the LLM does not return the parent_asin values.
Disclaimer: LLM-based pipelines may occasionally produce inaccurate or unexpected results. Since this application handles food and recipe-related queries, any guidance on cooking, storage, or handling should be independently verified before use. Prompting should be done carefully to avoid hallucinations.
The app is deployed on HuggingFace Spaces at the link provided in the description. The dataset containing both the BM25 index and the FAISS embeddings is also hosted as a HuggingFace Dataset. Link for detailed info.
- Sarisha Das
- Shrabanti Bala Joya
Generative AI tools (Google Gemini, OpenAI ChatGPT, Anthropic Claude and GitHub Copilot) were used to assist with code generation and documentation drafting. All generated content was reviewed and edited by the authors to ensure accuracy and quality.