@vaclisinc vaclisinc commented Dec 3, 2025

Semantic Search Integration

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service built on BGE (BAAI General Embedding) models, along with backend proxy endpoints, a production-ready k8s deployment, automatic index building via the datapuller, and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env for local development!

💡 Production note: In k8s, the semantic search service runs as a separate deployment with persistent volume for FAISS indices. Index building is automatically triggered daily by the datapuller cronjob.


System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller (k8s CronJob)"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller -->|Daily 4:10 AM PST| Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: "Memory models in concurrent programming"

  • Returns courses like databases, operating systems, etc.
  • Doesn't return biology or psychology courses just because of the word "memory".

Input: "how to shot a hot vlog"


Implementation Details

Python Semantic Search Service (FastAPI)

  • FastAPI microservice (apps/semantic-search) that:

    • Uses BGE (BAAI/bge-base-en-v1.5) embedding model optimized for retrieval tasks
    • Builds term-specific embeddings + FAISS indices from GraphQL catalog data
    • Implements threshold-based filtering (returns all results above similarity threshold, not just top-k)
    • Searches top 500 candidates for performance, then filters by threshold (default: 0.45)
  • Key endpoints:

    • /health — readiness probe showing index status
    • /refresh — rebuild FAISS index for a given year/semester
    • /search — semantic query with threshold filtering
  • Model Architecture (sketched below):

    • Uses instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
    • Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
    • FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
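A minimal sketch of this scheme, assuming sentence-transformers and faiss-cpu are installed; the helper names (build_index, embed_query) and the course dict keys are illustrative, not the actual module API:

# Minimal sketch of the indexing/query scheme described above (illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def build_index(courses: list[dict]) -> faiss.IndexFlatIP:
    # One text blob per course, matching the format described above
    texts = [
        f"SUBJECT: {c['subject']} NUMBER: {c['courseNumber']}\n"
        f"TITLE: {c['title']}\nDESCRIPTION: {c['description']}"
        for c in courses
    ]
    embeddings = model.encode(texts, normalize_embeddings=True)  # L2-normalized
    index = faiss.IndexFlatIP(int(embeddings.shape[1]))  # inner product == cosine on unit vectors
    index.add(np.asarray(embeddings, dtype="float32"))
    return index

def embed_query(query: str) -> np.ndarray:
    # BGE expects the retrieval instruction prefix on the query side only
    vec = model.encode([QUERY_PREFIX + query], normalize_embeddings=True)
    return np.asarray(vec, dtype="float32")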

Example: manually refreshing an index (local dev)

curl -X POST http://localhost:8080/api/semantic-search/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8080/api/semantic-search/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}

Backend Integration (Node / Express)

  • Added semanticSearch.url configuration (nested structure for consistency)

  • Implemented lightweight proxy endpoint /api/semantic-search/courses:

    • Forwards requests to Python service
    • Returns only {subject, courseNumber, score} for efficient frontend filtering
    • Frontend maintains API response order (sorted by semantic similarity)
  • Updated GraphQL behavior:

    • Introduced hasCatalogData field for term filtering
    • Updated resolver to use terms(withCatalogData: true)

Frontend (Catalog UI)

  • AI Search toggle (✨ sparkle button) to activate semantic search mode
  • Semantic results preserve backend ordering (by similarity score)
  • Frontend maps semantic results to full course objects for display
  • Graceful fallback to fuzzy search when semantic search unavailable

Kubernetes Deployment (Production-Ready)

Infrastructure Components

Semantic Search Service (infra/app/templates/semantic-search.yaml)

  • Dedicated deployment with persistent volume (5Gi) for FAISS indices
  • ConfigMap for runtime configuration (log level, default term)
  • Service endpoint: bt-prod-semantic-search-svc:8000
  • Health check probe on /health endpoint

Backend Configuration (infra/app/templates/backend.yaml)

  • Environment variable: SEMANTIC_SEARCH_URL=http://bt-prod-semantic-search-svc:8000
  • Automatically injected via ConfigMap

Automatic Index Building (infra/app/values.yaml)

  • Daily cronjob at 4:10 AM PST (semantic-search-refresh)
  • Runs datapuller with --puller=semantic-search-refresh argument
  • Automatically rebuilds indices after catalog data updates (a sketch of the refresh call follows below)
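For reference, the refresh step essentially boils down to a POST against the service. This is an illustrative sketch, not the actual datapuller code:

# Illustrative sketch of what --puller=semantic-search-refresh triggers;
# the real datapuller's internals may differ.
import os
import requests

SEMANTIC_SEARCH_URL = os.environ.get(
    "SEMANTIC_SEARCH_URL", "http://bt-prod-semantic-search-svc:8000"
)

def trigger_refresh(year: int, semester: str) -> None:
    # Ask the semantic-search service to rebuild the FAISS index for one term
    resp = requests.post(
        f"{SEMANTIC_SEARCH_URL}/refresh",
        json={"year": year, "semester": semester},
        timeout=600,  # index builds can take a while
    )
    resp.raise_for_status()

if __name__ == "__main__":
    trigger_refresh(2026, "Spring")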

Deployment Flow

1. Deploy semantic-search service
   └─> PVC created for index storage
   └─> Service starts in "waiting" state (no index yet)

2. Datapuller cronjob runs daily
   └─> Fetches latest catalog data via GraphQL
   └─> Calls /refresh endpoint on semantic-search service
   └─> FAISS index built and persisted to PVC

3. Backend proxies requests
   └─> /api/semantic-search/* → semantic-search-svc:8000
   └─> Results returned to frontend

Technical Decisions

Why BGE over other models?

  • BGE (BAAI General Embedding) is specifically optimized for retrieval tasks and ranks near the top of retrieval benchmarks (e.g., MTEB)
  • Better semantic understanding than the general-purpose models used in Jacky's prototype (all-MiniLM, mpnet)

Why threshold instead of top-k?

  • Threshold-based filtering returns all relevant results, not arbitrary top-k
  • More flexible - can return 5 results for specific queries, 50 for broad queries

However, thresholding and returning every course in a catalog of thousands isn't practical, so the search first retrieves the top 500 candidates and then applies the threshold, ensuring the final set still meets the quality bar (see the sketch below).
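A minimal sketch of the top-500-then-threshold strategy, assuming the index and embed_query helper from the earlier sketch; names and defaults are illustrative:

# Keep every candidate above the threshold instead of a fixed top-k cutoff.
import numpy as np

TOP_K_CANDIDATES = 500

def threshold_search(index, query_vec: np.ndarray, threshold: float = 0.45):
    k = min(TOP_K_CANDIDATES, index.ntotal)
    scores, ids = index.search(query_vec, k)  # cosine scores, best first
    return [
        (int(i), float(s))
        for s, i in zip(scores[0], ids[0])
        if i != -1 and s >= threshold
    ]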

Model Options Available (hardcoded in main.py)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Migration Guide

For Production (k8s)

  1. Deploy semantic search service:

    helm upgrade bt-prod ./infra/app -n bt
  2. The index will be built automatically by the daily cronjob at 4:10 AM PST

  3. Verify service health:

    kubectl exec -n bt deployment/bt-prod-backend -- \
      curl http://bt-prod-semantic-search-svc:8000/health

Next Steps

  1. Datapuller Integration DONE!
    Automatic index refresh via daily cronjob

  2. Fine-tuning for Berkeley Courses
    Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search

  3. Query Expansion
    Handle abbreviations (NLP → Natural Language Processing) and synonyms; a toy sketch of this idea follows below
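Purely illustrative sketch of the query-expansion idea; the abbreviation map below is hypothetical and not part of this PR:

# Hypothetical abbreviation expansion applied before generating the query embedding.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
    "os": "operating systems",
}

def expand_query(query: str) -> str:
    # Replace known abbreviations token by token
    tokens = [ABBREVIATIONS.get(token.lower(), token) for token in query.split()]
    return " ".join(tokens)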


Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND
k8s deployment & auto-refresh: This PR

FastAPI service for semantic course search using FAISS and Sentence Transformers. Supports vector similarity search with index persistence to PVC storage.
Add backend routes to proxy semantic search service. Configure SEMANTIC_SEARCH_URL for service communication.
Add semantic-search-refresh puller to automatically rebuild FAISS indexes when course data updates.
Configure Kubernetes deployment with PVC for FAISS indexes, daily cronjob for auto-refresh, and docker-compose for local development.

Review discussion
threshold: float = 0.3,
allowed_subjects: Optional[Iterable[str]] = None,
) -> Tuple[List[Dict], TermIndex]:
entry = self._get_or_build_index(year, semester, allowed_subjects)

A reviewer (Contributor) asked, referring to the snippet above:

What's the motivation behind storing the index on disk? I would assume this wouldn't be very performant

@vaclisinc (author) replied:

The primary motivation for persisting the index is actually local development experience. When using docker compose locally, rebuilding the FAISS index from scratch takes >1 minute every time the stack restarts.

On the server side, this optimization is less critical since the service rarely stops. However, to keep behavior consistent between local and production, I think it's still fine to keep the persistence, since saving the index doesn't impact query performance.
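For context, the persistence is essentially a load-or-build pattern like the following sketch (paths and helper names are illustrative, not the actual code):

# Illustrative load-or-build pattern for the FAISS index.
import os
import faiss

def load_or_build_index(path: str, build_fn):
    if os.path.exists(path):
        return faiss.read_index(path)   # fast path: skip re-embedding on restart
    index = build_fn()                  # slow path: embed the catalog (>1 min locally)
    faiss.write_index(index, path)      # persist to the mounted volume / PVC
    return index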

