A Streamlit app that performs semantic search over a synthetic product catalog using a persistent Chroma vector database. Demonstrates the difference between exact keyword search and meaning-based retrieval — and shows how a real vector DB adds persistence and metadata filtering on top of raw similarity search.
Where PoC 1 used FAISS as an in-process library (great for prototypes, no persistence, no metadata), this PoC uses a proper vector database. Same MiniLM embeddings; very different ergonomics.
By reading the code and running the app, you will see, end-to-end:
- Why keyword search fails (`"cosy office sweater"` matches nothing literal) but vector search succeeds (it finds "Soft, breathable hoodie perfect for the office").
- How a persistent vector store survives restarts: embedding the catalog runs once, then every subsequent launch is instant.
- How metadata filtering (`where={"category": "Apparel"}`) combines structured filters with semantic similarity in a single query.
- How a synthetic dataset can be generated programmatically so the app works on first run without any external data.
- Why cosine similarity is the default metric for normalised text embeddings (and how Chroma exposes it).
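The last point can be checked numerically. The sketch below is plain Python and independent of the app's code; the toy vectors and helper names are illustrative assumptions. It shows that once vectors are L2-normalised, the dot product equals cosine similarity, and cosine distance is 1 minus that value.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (vectors of any length)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-d "embeddings"; real MiniLM vectors have 384 dimensions.
query = [3.0, 4.0, 0.0]
doc = [4.0, 3.0, 0.0]

sim_raw = cosine_similarity(query, doc)
sim_unit = dot(l2_normalise(query), l2_normalise(doc))
distance = 1.0 - sim_unit  # cosine distance

print(round(sim_raw, 4), round(sim_unit, 4), round(distance, 4))
```

Because the two similarity values agree, an index over normalised vectors can rank by dot product and still report cosine scores.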
This PoC was generated by pasting the following prompt into VS Code's Copilot Chat in Agent mode. It is reproduced verbatim from the lecture slides so you can paste it yourself and compare.
Build a single-file Streamlit app `app.py` that performs semantic search over a product catalog using Chroma as a persistent vector store. Create a helper `sample_data.py` with a function `get_products()` that generates ~2,000 synthetic products (id, title, description, category, price) using numpy/pandas, and writes them to `data/products.csv`. On startup, if the Chroma collection `products` (path `./db`) is empty, embed each product description with `sentence-transformers all-MiniLM-L6-v2` and add it with metadata `{category, price, title}`; otherwise reuse the persisted collection. The UI: a text input "Search" and a sidebar `st.selectbox` "Category" (with "All"). On submit, embed the query, run `collection.query(n_results=10, where={...})` with the category filter, and show a ranked table of title, category, price, and similarity score. Provide `requirements.txt` (streamlit, chromadb, sentence-transformers, pandas, numpy), `.gitignore` (`.venv/`, `db/`, `data/`), and a short README.
📝 Small additions in the committed code: a Top-k slider, a minimum similarity threshold, and a clean column display for prices and similarity scores. These are obvious next iterations the agent can add by request.
The app has two phases — indexing (runs once, then is cached on disk forever) and querying (runs on every search).
flowchart TB
subgraph IDX["⏱ Phase 1 — Indexing (once, persisted to ./db)"]
direction LR
SEED["sample_data.py<br/>get_products(n=2000)"] --> CSV["data/products.csv"]
CSV --> READ["pandas.read_csv()"]
READ --> EMB1["MiniLM-L6-v2<br/>encode(descriptions)<br/>(L2-normalised)"]
EMB1 --> ADD["collection.add(<br/>ids, documents,<br/>embeddings, metadatas<br/>)"]
ADD --> DB[("./db/<br/>Chroma<br/>PersistentClient")]
end
subgraph QRY["💬 Phase 2 — Query (per search)"]
direction LR
Q["🔎 Search box"] --> EMB2["MiniLM-L6-v2<br/>encode(query)"]
F["📂 Category<br/>(sidebar)"] --> WHERE["where={category: …}"]
EMB2 --> QUERY["collection.query(<br/>q_vec, n_results=k,<br/>where=WHERE<br/>)"]
WHERE --> QUERY
DB -.-> QUERY
QUERY --> RANK["Ranked results<br/>title · category · price ·<br/>similarity = 1 − distance"]
end
style CSV fill:#dbeafe,stroke:#1e40af
style Q fill:#dbeafe,stroke:#1e40af
style F fill:#dbeafe,stroke:#1e40af
style DB fill:#fef3c7,stroke:#b45309
style RANK fill:#bbf7d0,stroke:#15803d
Persistence is the headline feature. Restart the app and the embedding step is skipped: Chroma reloads the index from `./db/` in milliseconds.
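The "build once, reuse afterwards" check can be illustrated without Chroma. The sketch below is a stand-in that persists vectors to a JSON file; `load_or_build` and `fake_embed` are hypothetical names, and the fake embedding is not a real model. In the app, the equivalent decision is whether the persisted collection is empty.

```python
import json
import tempfile
from pathlib import Path

def fake_embed(text):
    """Placeholder for a real embedding model: returns a tiny fixed-size vector."""
    return [float(len(text)), float(text.count(" "))]

def load_or_build(store_path, descriptions):
    """Return (vectors, built). Embeds only when no persisted store exists."""
    if store_path.exists():
        # Reuse the persisted vectors and skip embedding entirely.
        return json.loads(store_path.read_text()), False
    vectors = [fake_embed(d) for d in descriptions]
    store_path.parent.mkdir(parents=True, exist_ok=True)
    store_path.write_text(json.dumps(vectors))
    return vectors, True

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "db" / "vectors.json"
    docs = ["Soft hoodie for the office", "GPS watch for runners"]
    first_vectors, first_built = load_or_build(path, docs)
    second_vectors, second_built = load_or_build(path, docs)

print(first_built, second_built)  # True on the first call, False on the second
```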
sample_data.py — synthetic catalog generator
Why generate data instead of bundling a CSV? To make the PoC work on
first run with no external download, and to let you regenerate with
different sizes (n) and seeds.
| Symbol | Responsibility |
|---|---|
| `CATEGORIES` | Five product categories with seed nouns (sweater, headphones, candle, …). |
| `ADJECTIVES`, `USE_CASES` | Vocabulary for plausible titles and descriptions. |
| `get_products(n=2000, seed=7)` | Sample one category + adjectives + use case per row, build a (id, title, description, category, price) DataFrame, write it to `data/products.csv`, return it. |
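The sampling idea in the table can be sketched as follows. This is not the repository's code: it uses the standard-library `random` module instead of numpy/pandas, the vocabulary and price range are invented for illustration, and `make_products` is a hypothetical name chosen to avoid confusion with the real `get_products()`.

```python
import random

# Illustrative vocabulary only; the real lists live in sample_data.py.
CATEGORIES = {
    "Apparel": ["sweater", "hoodie"],
    "Electronics": ["headphones", "speaker"],
    "Home": ["candle", "lamp"],
}
ADJECTIVES = ["soft", "lightweight", "quiet", "durable"]
USE_CASES = ["the office", "long runs", "the bedroom"]

def make_products(n=5, seed=7):
    """Return n synthetic product rows as dicts; the same seed gives the same rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(1, n + 1):
        category = rng.choice(sorted(CATEGORIES))
        noun = rng.choice(CATEGORIES[category])
        adjective = rng.choice(ADJECTIVES)
        use_case = rng.choice(USE_CASES)
        rows.append({
            "id": f"P{i:05d}",
            "title": f"{adjective.title()} {noun.title()}",
            "description": f"A {adjective} {noun} ideal for {use_case}.",
            "category": category,
            "price": float(rng.randint(10, 300)),
        })
    return rows

products = make_products(n=5, seed=7)
print(products[0]["id"], len(products))
```

Seeding a private `random.Random` instance keeps the output reproducible without touching global random state.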
app.py — Streamlit UI + Chroma integration
| Section | Functions | Responsibility |
|---|---|---|
| Cached resources | `load_embedder()`, `get_chroma_collection()`, `load_catalog()` | Create the MiniLM model, the `PersistentClient(path="./db")`, and the cached DataFrame once per session. |
| Indexing | `index_catalog()` | Embed descriptions in batches of 256, then `collection.add(...)` with ids, docs, embeddings, and metadata. Shows a Streamlit progress bar. |
| Streamlit UI | `main()` | Builds the sidebar (Category selectbox, Top-k slider, similarity threshold), the search box, and the results table with `st.dataframe` (price as currency, similarity as a ProgressColumn). |
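The batching step in the Indexing row can be sketched in isolation. The helper name `batched` is an assumption, and the embedding and `collection.add(...)` calls that would run once per batch are left out.

```python
def batched(items, size=256):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

descriptions = [f"description {i}" for i in range(2000)]
batch_sizes = [len(batch) for batch in batched(descriptions)]

print(len(batch_sizes), batch_sizes[-1])  # 8 batches; the last holds 208 items
```

Batching keeps peak memory bounded and gives the progress bar a natural unit to advance by.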
sequenceDiagram
actor User
participant ST as Streamlit UI
participant Coll as Chroma collection
participant Disk as ./db/ (persisted)
Note over ST: app boot
ST->>Coll: get_or_create_collection("products")
Coll->>Disk: load existing index (if any)
alt collection empty (first run)
ST->>ST: load_catalog() → DataFrame
loop batch of 256
ST->>ST: embed descriptions
ST->>Coll: collection.add(ids, docs, vecs, metas)
end
Coll->>Disk: persist
else cached
Note over ST,Coll: skip embedding entirely
end
User->>ST: type query + pick category
ST->>ST: embed query
ST->>Coll: query(q_vec, n_results=k, where={category})
Coll-->>ST: docs, metadatas, distances
ST-->>User: ranked dataframe
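The last two steps of the diagram (distances in, ranked rows out) might look like the sketch below. It assumes the query response is a dict of nested lists with one inner list per query, and that the collection uses cosine distance so that similarity = 1 − distance. `rows_from_result` is a hypothetical helper and the response values are invented.

```python
def rows_from_result(result):
    """Flatten a single-query result into rows with similarity = 1 - distance."""
    metadatas = result["metadatas"][0]
    distances = result["distances"][0]
    rows = []
    for meta, dist in zip(metadatas, distances):
        rows.append({
            "title": meta["title"],
            "category": meta["category"],
            "price": meta["price"],
            "similarity": round(1.0 - dist, 4),
        })
    # Sort defensively so the most similar row is always first.
    return sorted(rows, key=lambda r: r["similarity"], reverse=True)

# Hand-written stand-in for a query response (values are invented).
fake_result = {
    "ids": [["P00001", "P00003"]],
    "metadatas": [[
        {"title": "Soft Merino Hoodie", "category": "Apparel", "price": 89.0},
        {"title": "Linen Bedside Lamp", "category": "Home", "price": 59.0},
    ]],
    "distances": [[0.42, 0.71]],
}

rows = rows_from_result(fake_result)
print(rows[0]["title"], rows[0]["similarity"])
```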
cd poc2_vector_search
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

No API key needed. Both the embedding model and the vector database run locally. This makes PoC 2 a great offline demo.

streamlit run app.py

- The first launch writes `data/products.csv`, builds embeddings, and persists `./db/` (≈ 30–60 s).
- Subsequent launches reuse `./db/` and start in under a second.
To force a rebuild from scratch:
rm -rf db/ data/
streamlit run app.py

The catalog is synthesised on first run by `sample_data.py`: no download, no API key. Each row looks like this:
id,title,description,category,price
P00001,Soft Merino Hoodie,A breathable wool hoodie perfect for the office on chilly mornings.,Apparel,89.0
P00002,Trail Runner GPS Watch,Lightweight GPS watch built for marathon training and ultra runs.,Sports,249.0
P00003,Linen Bedside Lamp,A quiet warm-light lamp ideal for the bedroom.,Home,59.0
…

To inspect or tweak it before launching the app:
python -c "from sample_data import get_products; print(get_products(n=5))"
# regenerate with a different seed / size:
python -c "from sample_data import get_products; get_products(n=500, seed=42)"

Open the app and leave Category = All for the first three queries. Switch to Sports for query 4, then return to All for query 5.
| # | Query | Category filter | What you should see |
|---|---|---|---|
| 1 | `cosy office sweater` | All | Top results are hoodies and sweaters from Apparel; similarities ≈ 0.50–0.60. The word cosy may not appear at all in the descriptions; MiniLM has matched cosy ≈ soft / breathable. |
| 2 | `noise cancelling headphones` | All | Electronics dominate. Similarities ≈ 0.55–0.70 (keyword + semantics agree → highest scores in the demo). |
| 3 | `gift for a runner` | All | Cross-category mix: Sports (watches, shoes), Apparel (running tops), maybe Electronics (earbuds). |
| 4 | `gift for a runner` | Sports | Same query, but now every row has category = Sports. Similarities may drop slightly: that is the cost of restricting the candidate pool. |
| 5 | `xkcd asdf qwerty` | All | Garbage-in test. With Min similarity = 0.0 you'll still see results (low scores). Raise the slider to 0.4 and the table empties out; that is the explicit refusal pattern. |
💡 Look at the similarity column, not just the titles. A drop from 0.65 → 0.30 across the top-k tells you "the model is reaching" — often more useful than the rank itself.
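Row 4's observation, that filtering can lower the scores you see, follows from simple arithmetic on the candidate pool. The sketch below is a plain-Python stand-in with invented scores and a hypothetical `top_k` helper that supports only equality filters; it is not how Chroma evaluates `where` internally.

```python
def top_k(items, k, where=None):
    """Rank items by similarity, optionally keeping only those matching `where`."""
    if where:
        items = [it for it in items
                 if all(it.get(key) == value for key, value in where.items())]
    return sorted(items, key=lambda it: it["similarity"], reverse=True)[:k]

# Invented scores for the query "gift for a runner".
candidates = [
    {"title": "Running Top", "category": "Apparel", "similarity": 0.58},
    {"title": "GPS Watch", "category": "Sports", "similarity": 0.55},
    {"title": "Sport Earbuds", "category": "Electronics", "similarity": 0.51},
    {"title": "Trail Shoes", "category": "Sports", "similarity": 0.49},
    {"title": "Yoga Mat", "category": "Sports", "similarity": 0.31},
]

unfiltered = top_k(candidates, k=3)
sports_only = top_k(candidates, k=3, where={"category": "Sports"})

print([it["similarity"] for it in unfiltered])   # [0.58, 0.55, 0.51]
print([it["similarity"] for it in sports_only])  # [0.55, 0.49, 0.31]
```

With the filter on, the third slot has to be filled by a weaker Sports item because the stronger Apparel and Electronics items are no longer eligible.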
Below is a script you can follow to verify all four properties: vector search, persistence, metadata filtering, and similarity thresholding.
Type:
cosy office sweater
- Expected: the top results include items described as "soft", "breathable", "perfect for the office" — many of them with no occurrence of the word "cosy" or "sweater" in the description.
- Why this works: MiniLM has learned that cosy ≈ soft and sweater ≈ hoodie in vector space.
- Note the time it takes to start the first time (look at the "Building Chroma index" status box).
- Stop the app (`Ctrl-C`) and re-run `streamlit run app.py`.
- Expected: the status box does not appear, the sidebar shows "Indexed items: ~2000" immediately, and queries return instantly.
- Verify on disk: `ls db/` shows the persisted Chroma files.
- Type a generic query like `present for someone`.
- With Category = All, observe that results span multiple categories (Books, Home, Sports, …).
- Set Category = Sports in the sidebar and submit again.
- Expected: every row in the results table has `category = Sports`. The `similarity` column may drop slightly because we restricted the candidate pool; that is expected.
- Set Min similarity to `0.0` and run a vague query like `stuff`.
- Note that low-relevance items still appear with similarities around 0.1–0.2.
- Raise the threshold to `0.4`.
- Expected: the table shrinks, or empties out with a warning. This is the "explicit refusal" pattern in retrieval: better to show nothing than to show garbage.
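The refusal pattern from the last step can be sketched as a small filter. `apply_threshold` and its message text are hypothetical, and the scores are invented; in a Streamlit app the message could be passed to `st.warning`.

```python
def apply_threshold(rows, min_similarity):
    """Keep rows scoring at least min_similarity; return (rows, message)."""
    kept = [r for r in rows if r["similarity"] >= min_similarity]
    if not kept:
        return [], f"No results at or above similarity {min_similarity:.2f}."
    return kept, None

# Invented low scores, as a vague query such as "stuff" might produce.
rows = [
    {"title": "Scented Candle", "similarity": 0.18},
    {"title": "Paperback Novel", "similarity": 0.12},
]

shown, message = apply_threshold(rows, 0.0)
refused, warning = apply_threshold(rows, 0.4)

print(len(shown), message)
print(len(refused), warning)
```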
Try the four queries below and inspect what comes back. Each is designed to exercise a different aspect of semantic similarity.
| Query | What it tests |
|---|---|
| `something to listen to music on the go` | Synonym matching (no "music" in many descriptions). |
| `gift for a runner` | Cross-category surfacing (Sports ∪ Electronics ∪ Apparel). |
| `quiet bedroom decoration` | Category bleed (Home items dominate). |
| `noise cancelling headphones` | Where keyword and semantics agree. |
| Symptom | Likely cause | Fix |
|---|---|---|
| First launch is slow (~ 1 min) | Embedding 2,000 descriptions on CPU. | Wait. The progress bar shows you the rate. |
| `OperationalError: database is locked` | Two Streamlit processes opened the same `./db/`. | Stop one, or run each on its own `--server.port` with a different `DB_PATH`. |
| Sidebar still shows "Indexed items: 0" after a run | The CSV has more rows than the collection — re-indexing was triggered but failed half-way. | Delete db/ and try again. |
| Embeddings download stalls | First run pulls MiniLM (~ 80 MB) from HuggingFace. | Check your network; subsequent runs are cached. |
All exposed at the top of app.py:
EMBED_MODEL_NAME = "all-MiniLM-L6-v2" # try BAAI/bge-small-en-v1.5 for higher quality
COLLECTION_NAME = "products"
DB_PATH = "./db"
CSV_PATH = Path("data/products.csv")

And the runtime knobs (sidebar):
- Category filter (`All` or one of five).
- Top-k results.
- Min similarity threshold.
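One way the sidebar values could be turned into a `where` filter is sketched below. `build_where` is a hypothetical helper, and `max_price` is an extra knob that the app does not have. The sketch assumes Chroma expects several conditions to be wrapped in `$and`, so check the documentation for your version.

```python
def build_where(category, max_price=None):
    """Translate sidebar values into a `where` filter, or None for no filter."""
    clauses = []
    if category != "All":
        clauses.append({"category": category})
    if max_price is not None:  # hypothetical extra knob, not in the app
        clauses.append({"price": {"$lte": max_price}})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}

print(build_where("All"))
print(build_where("Sports"))
print(build_where("Apparel", max_price=100))
```

Returning `None` for "All" lets the caller skip the filter entirely rather than pass an empty dict.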
To change the distance metric, edit get_chroma_collection():
metadata={"hnsw:space": "cosine"}  # or "l2", "ip"

- Hybrid search. Combine vector results with BM25 keyword results (`rank_bm25`) and merge the two rankings. This combination is a common production choice.
- Domain-tuned embeddings. Swap MiniLM for a domain-specific model (legal, biomedical, code) and rebuild the index, then observe the change in recall and precision.
- Real catalog. Replace `sample_data.get_products()` with a loader for an actual e-commerce dataset (e.g. the Amazon Reviews 2023 metadata).
- Add price filtering. Chroma's `where={"price": {"$lte": 100}}` syntax already supports it; add a sidebar slider.
- Wire it into the agent. Expose this collection as a tool in PoC 3 so the agent can answer "find me a cosy sweater under $50".
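For the hybrid-search idea, one possible way to merge two rankings is reciprocal rank fusion (RRF). The text above does not prescribe a merge method, so treat this as an illustrative choice; the id lists are invented and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["P00001", "P00007", "P00003"]   # invented vector ranking
keyword_hits = ["P00007", "P00009", "P00001"]  # invented BM25 ranking

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['P00007', 'P00001', 'P00009', 'P00003']
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales.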
- `app.py`: the Streamlit UI and Chroma integration.
- `sample_data.py`: synthetic catalog generator.
- `requirements.txt`: `streamlit`, `chromadb`, `sentence-transformers`, `pandas`, `numpy`.
- `.gitignore`: excludes `.venv/`, `db/`, `data/`.