Amazon Grocery & Gourmet Food Search

This project builds an information retrieval system over the Amazon Reviews 2023 dataset, focusing on the Grocery & Gourmet Food category. Given a natural language query, the system retrieves the most relevant products using two complementary approaches and a hybrid of the two:

  • BM25: a classical keyword-based retrieval method using term frequency and inverse document frequency. This project uses the LangChain BM25 retriever.
  • Semantic Search: dense vector retrieval using sentence embeddings (all-MiniLM-L6-v2) and a FAISS index.
  • Hybrid Search: combines BM25 and Semantic Search using Reciprocal Rank Fusion (RRF) to produce a more robust and balanced ranking of documents.

The app allows users to switch between BM25 and Semantic search modes, view top-3 product results with reviews, and provide relevance feedback (👍/👎), which is recorded in a CSV file. An AI Assistant tab powered by a Hybrid RAG pipeline (BM25 + Semantic + Llama-3-8B) is also available for natural language product recommendations. Additionally, the system includes a web search tool (Tavily) that augments LLM responses with live external information (e.g. current pricing, availability) when the query requires it. Both the BM25 index (603,274 products) and the FAISS semantic index (100,500 products, smaller due to computing constraints) are hosted on HuggingFace and loaded automatically by the app.

Live App: 🥕🧀 Grocery & Gourmet Food Search

Dataset

Source: McAuley-Lab/Amazon-Reviews-2023 on HuggingFace

Category: Grocery & Gourmet Food

| File | Rows | Description |
|---|---|---|
| raw_review_Grocery_and_Gourmet_Food | 14,318,520 | User reviews |
| raw_meta_Grocery_and_Gourmet_Food | 603,274 | Product metadata |

We combine both datasets so that search can draw on as much useful information as possible. We create LangChain documents from key metadata fields such as title, description, and categories, combined with the title and text fields of user reviews. The parent_asin field is the join key between the two datasets. For more information, see the data preprocessing notebooks. The major preprocessing steps are extracting useful columns from the data, creating documents, and converting them to tokens for BM25 or embeddings for FAISS.
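The join described above can be sketched as follows. This is a minimal illustration, not the code from the preprocessing notebooks: it uses plain dictionaries rather than LangChain documents, assumes that description and categories are lists of strings, and introduces a hypothetical max_reviews cap.

```python
from collections import defaultdict


def build_documents(metadata_rows, review_rows, max_reviews=3):
    """Join reviews to product metadata on parent_asin, one text document per product."""
    # Group reviews by the shared key so each product can look up its own reviews.
    reviews_by_asin = defaultdict(list)
    for review in review_rows:
        reviews_by_asin[review["parent_asin"]].append(review)

    documents = []
    for meta in metadata_rows:
        asin = meta["parent_asin"]
        parts = [meta.get("title") or ""]
        parts.extend(meta.get("description") or [])  # assumed: list of strings
        parts.extend(meta.get("categories") or [])   # assumed: list of strings
        # max_reviews is a hypothetical cap to keep documents short.
        for review in reviews_by_asin[asin][:max_reviews]:
            parts.append(f"{review.get('title', '')} {review.get('text', '')}".strip())
        documents.append({
            "page_content": " ".join(p for p in parts if p),
            "metadata": {"parent_asin": asin, "title": meta.get("title")},
        })
    return documents
```

Keeping parent_asin in the metadata means every retrieved document can be traced back to its product when results are displayed.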

We use the HuggingFace datasets package, which stores data in Arrow format and lets us run SQL queries over it with DuckDB.

Retrieval Workflow

We use a sparse method (BM25), a dense method (FAISS), and a hybrid combination of the two. For more details please check notebooks related to bm25 and faiss.

BM25

BM25 is a sparse, keyword-based retrieval method. At query time, the query is tokenized using the same tokenizer used during preprocessing. The BM25 index scores all documents based on term frequency and inverse document frequency (TF-IDF-like), returning the top-k most relevant documents ranked by score. We return and display this score in the app. Scores are computed via vectorizer.get_scores(), which returns a numpy array of BM25 scores across all documents in the index.
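To make the scoring step concrete, here is a minimal pure-Python sketch of Okapi BM25. It is not the LangChain or rank_bm25 implementation used in the project: the k1 and b values are common defaults chosen for illustration, and the IDF uses the non-negative Lucene-style form, which may differ from the variant in the library.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per tokenized document in corpus_tokens."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: long documents need more matches to score the same.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(scores, k=3):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

Documents that share no term with the query score exactly zero, which is why BM25 can miss paraphrased queries that a dense retriever would catch.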

Sample corpus is provided in data/processed/tokenisation

flowchart LR
    reviews[("Reviews")] --> docs["Documents"]
    metadata[("Metadata")] --> docs
    docs --> tokens["Tokens\n(BM25 Index)"]
    tokens --> scores["BM25 Scores\n(numpy array)"]
    query(["Query"]) --> qtokens["Query\nTokens"]
    qtokens --> scores
    scores --> topk["Top-k\nDocuments"]
    topk --> output["Output JSON\n(metadata)"]
    output --> app["App\nHTML"]

FAISS

FAISS (Facebook AI Similarity Search) is a dense, embedding-based retrieval method. At query time, the query is converted into a vector embedding using the same embedding model used during preprocessing. FAISS then performs an approximate nearest neighbour (ANN) search over the indexed document embeddings, returning the top-k most semantically similar documents by cosine similarity. We return and display this score in the app.
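The ranking criterion can be illustrated without FAISS. The sketch below is an exact brute-force search in pure Python, shown only to make cosine-similarity ranking explicit; FAISS performs the same kind of lookup far more efficiently, and the function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, doc_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In practice the query vector would come from the same sentence embedding model used to embed the documents, otherwise the similarities are not meaningful.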

Sample embeddings are provided in data/processed/embeddings

flowchart LR
    reviews[("Reviews")] --> docs["Documents"]
    metadata[("Metadata")] --> docs
    docs --> embeddings["Embeddings\n(FAISS Index)"]
    embeddings --> similar["Most Similar\nEmbeddings"]
    query(["Query"]) --> qembed["Query\nEmbedding"]
    qembed --> similar
    similar --> output["Output JSON\n(metadata)"]
    output --> app["App\nHTML"]

Hybrid

We merge BM25 and FAISS into a hybrid retriever using Reciprocal Rank Fusion (RRF), which combines semantic similarity with keyword-based relevance to produce a more robust and balanced ranking. RRF assigns each document a score based on its rank position in each retriever's results, then sums these scores with configurable weights, allowing us to control the trade-off between contextual understanding and exact term matching. The hybrid retriever is implemented in src/hybrid.py and is the retriever used by the AI Assistant tab.
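Weighted RRF as described above can be sketched in a few lines. This is an illustration rather than the contents of src/hybrid.py: the smoothing constant k=60 is a commonly used default and is an assumption here, as are the function and parameter names.

```python
def weighted_rrf(rankings, weights, k=60, top_n=3):
    """Fuse ranked lists of document ids with weighted Reciprocal Rank Fusion.

    rankings: one best-first list of document ids per retriever
    weights:  one weight per retriever (e.g. 0.5 and 0.5 for an equal split)
    k:        smoothing constant (assumed value, not taken from the project)
    """
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes weight / (k + rank) for every document it returns.
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

For example, with BM25 returning A, B, C and FAISS returning B, D, A at equal weights, B ranks first because it sits near the top of both lists. Because only rank positions are used, the incompatible score scales of BM25 and cosine similarity never need to be reconciled.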

flowchart LR
    query["query"] --> sem["FAISS retriever"]
    query --> bm["BM25 retriever"]
    sem --> semtop["Top-k semantic"]
    bm --> bmtop["Top-k BM25"]
    semtop --->|50% weight| comb["RRF Combined Output docs"]
    bmtop --->|50% weight| comb
    comb --> output["Output JSON\n(metadata)"]
    output --> app["App\nHTML"]

Setup

  1. Clone the repository using HTTPS
git clone https://github.com/UBC-MDS/DSCI_575_project_sbj1_rishadaz.git

Or, using SSH

git clone git@github.com:UBC-MDS/DSCI_575_project_sbj1_rishadaz.git

Navigate to the project root

cd DSCI_575_project_sbj1_rishadaz
  2. Create and activate the environment
conda env create -f environment.yml
conda activate 575-proj
  3. Set up environment variables
touch .env
# Add your HuggingFace token to .env:
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxx

# Add your Tavily API key for web search augmentation (free tier at tavily.com):
TAVILY_API_KEY=tvly_xxxxxxxxxxxxxxxxxx

# Set DATA_SOURCE to exactly one of the two values below.
# To run on the sample pkl/faiss files already present in the GitHub repo:
DATA_SOURCE=local
# To run on the full generated files from HF Datasets (takes longer to download):
# DATA_SOURCE=remote
  4. Build the indices (optional; only needed for a full local index)

The app can load the full indices from HuggingFace automatically. To build them locally:

BM25:

# Open and run notebooks/milestone1_bm25.ipynb
# This downloads the dataset and saves to data/processed/bm25_index.pkl

Semantic:

# Open and run notebooks/milestone1_semantic.ipynb
# This saves to data/processed/embeddings/
  5. Run the app locally
streamlit run app/app.py

The app automatically uses the full local index if it is available, and otherwise falls back to the smaller subset files in data/processed/.

Evaluation

Evaluation and exploration can be generated here and are summarized here. BM25 performed better on some queries and FAISS on others. Their scores cannot be compared directly because they are on different scales, but we can see how each method prioritizes items. We have not implemented scoring based on other factors such as popularity or rating, and rank products only by their retrieval score.

RAG pipeline evaluation is generated here and summarized here.

RAG and LLM Integration

We query hosted LLMs through the HuggingFace API and selected meta-llama/Meta-Llama-3-8B-Instruct with a max_token limit of 512 tokens. Since our app is small scale, we expect the free-tier API calls to be sufficient, and the LLM performed well across the many iterations we tested.

The AI Assistant tab in the app exposes the full RAG pipeline to the user: enter a grocery query and receive AI-generated product recommendations along with recipe ideas and storage tips, grounded in the retrieved product reviews and metadata.

We tested two kinds of retriever for RAG: a fully semantic (FAISS) retriever and a hybrid (BM25 + semantic) retriever with equal weights. We found the hybrid retriever to work better overall, and it is the sole RAG retriever used in the AI Assistant tab. In the future we can implement a slider to control the ratio of weights used in the hybrid retriever.

Both the semantic and hybrid retrievers can be explored in this notebook with different prompts and parameters. The rag_pipeline object returns a tuple whose second item is the context retrieved by the retriever, so the answer and the retrieved context can be inspected together. Passing verbose=True also prints the entire context sent to the LLM after each step, which makes exploration easier.
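The tuple return described above might look like the following sketch. It is not the project's rag_pipeline: the function names, the prompt layout, and the retriever and llm callables are all hypothetical stand-ins chosen so the example runs without any external service.

```python
def build_context(docs):
    """Format retrieved documents into a numbered context block for the prompt."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        asin = doc["metadata"].get("parent_asin", "unknown")
        lines.append(f"[{i}] (parent_asin={asin}) {doc['page_content']}")
    return "\n".join(lines)


def rag_sketch(query, retriever, llm, system_prompt, k=3, verbose=False):
    """Retrieve top-k documents, build the prompt, call the LLM, return (answer, context).

    retriever: callable taking a query and returning a best-first list of documents
    llm:       callable taking a prompt string and returning the generated text
    """
    docs = retriever(query)[:k]
    context = build_context(docs)
    prompt = f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {query}"
    if verbose:
        print(prompt)  # show exactly what is sent to the model
    answer = llm(prompt)
    return answer, context
```

Returning the context alongside the answer makes it easy to check whether a poor response came from retrieval or from generation.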

Here is the basic workflow:

flowchart LR
    reviews["Retriever\n(Hybrid or Semantic)"] --> docs["Top k\nDocuments"]
    docs --> embeddings["Create\nPage Context"]
    embeddings --> similar["Prompt"]
    sys_pro["SYSTEM Prompt"] --> similar
    query(["User"]) --> qembed["Query"]
    qembed --> reviews
    qembed --> similar
    similar --> response["LLM response"] --> output["Output JSON\n(content + metadata)"]

Web Search Tool

The RAG pipeline includes an optional web search tool powered by Tavily. When a query contains keywords suggesting current or external information is needed (e.g. price, current, gluten, organic, nutrition), the system fetches live web results and injects them into the LLM context. Source URLs are displayed as clickable links in the 🌐 Web Sources section of the AI tab. The tool requires a TAVILY_API_KEY in .env (free tier available at tavily.com). If the key is missing or the call fails, the pipeline continues normally without web augmentation.
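The trigger and fallback behaviour can be sketched as below. This is an assumption-laden illustration: the keyword set contains only the examples listed above, search_fn is a stand-in for the real Tavily call, and the "content" and "url" result keys are assumed rather than taken from the project code.

```python
# Only the example keywords from the text; the app's actual list may be longer.
TRIGGER_KEYWORDS = {"price", "current", "gluten", "organic", "nutrition"}


def needs_web_search(query, keywords=TRIGGER_KEYWORDS):
    """True when the query contains at least one trigger keyword."""
    tokens = {token.strip(".,!?").lower() for token in query.split()}
    return bool(tokens & set(keywords))


def web_context(query, search_fn, api_key):
    """Return (snippets, urls); both are empty when search is skipped or fails."""
    if not api_key or not needs_web_search(query):
        return [], []
    try:
        results = search_fn(query)  # stand-in for the web search client call
    except Exception:
        # Any failure degrades to plain RAG instead of breaking the pipeline.
        return [], []
    snippets = [r["content"] for r in results if r.get("content")]
    urls = [r["url"] for r in results if r.get("url")]
    return snippets, urls
```

Returning empty lists rather than raising keeps the caller simple: it can always append whatever snippets it receives to the LLM context.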

LLM Evaluation

Similar to the Search function, some metrics and exploration can be generated here and are summarized here. We found that while the LLM was slightly unpredictable, it performed well for most simple grocery queries. We tried to depend as little as possible on the output formatting so that the code does not break in edge cases, e.g. when the LLM does not return the parent_asin numbers.
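One way to stay robust to missing or malformed identifiers is to parse them defensively, as in the sketch below. It is a hypothetical helper, not the project's code, and it assumes that parent_asin values are 10 uppercase alphanumeric characters.

```python
import re

# Assumption: parent_asin values are 10 uppercase alphanumeric characters.
ASIN_PATTERN = re.compile(r"\b[A-Z0-9]{10}\b")


def extract_asins(llm_text, known_asins):
    """Return ASINs mentioned in the LLM output that also appear in the retrieved set.

    An empty list is a valid result, so callers never need to handle an exception
    when the model omits or mangles the identifiers.
    """
    found = []
    for match in ASIN_PATTERN.findall(llm_text or ""):
        # Ignore look-alike strings and duplicates; keep first-mention order.
        if match in known_asins and match not in found:
            found.append(match)
    return found
```

Filtering against the retrieved set also guards against the model inventing identifiers that were never in the context.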

Disclaimer: LLM-based pipelines may occasionally produce inaccurate or unexpected results. Since this application handles food and recipe-related queries, any guidance on cooking, storage, or handling should be independently verified before use. Prompting should be done carefully to avoid hallucinations.

App Deployment

The app is deployed on HuggingFace Spaces at the link provided in the description. The BM25 index and the FAISS embeddings are also hosted as a HuggingFace Dataset. Link for detailed info.

Authors

  • Sarisha Das
  • Shrabanti Bala Joya

Attribution

Generative AI tools (Google Gemini, OpenAI ChatGPT, Anthropic Claude and GitHub Copilot) were used to assist with code generation and documentation drafting. All generated content was reviewed and edited by the authors to ensure accuracy and quality.
