@vaclisinc vaclisinc commented Dec 3, 2025

Semantic Search Integration

Summary

Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service built on BGE (BAAI General Embedding) models, along with backend proxy endpoints, a production-ready k8s deployment, automatic index building via the datapuller, and updated catalog UX, enabling AI course search.

⚠️ IMPORTANT: Remember to add SEMANTIC_SEARCH_URL=http://semantic-search:8000 to your .env for local development!

💡 Production note: In k8s, the semantic search service runs as a separate deployment with persistent volume for FAISS indices. Index building is automatically triggered daily by the datapuller cronjob.


System Architecture

flowchart LR

    %% ---------- Frontend ----------
    subgraph Frontend
        FE_UI["Search Bar + AI Search Toggle"]
    end

    %% ---------- Node Backend ----------
    subgraph NodeBackend
        ProxyRouter["/api/semantic-search/*  (proxy router)"]
        CoursesAPI["/api/semantic-search/courses  (lightweight endpoint)"]
        GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
    end

    %% ---------- Python Semantic Service ----------
    subgraph SemanticService["Semantic Search Service (FastAPI)"]
        Health["/health"]
        Refresh["/refresh  (rebuild FAISS index)"]
        Search["/search  (threshold-based semantic query)"]
        BGE["BGE Embedding Model"]
        FAISS["FAISS Index (cosine similarity)"]
    end

    %% ---------- Catalog Data Puller ----------
    subgraph CatalogData
        DataPuller["GraphQL Catalog Datapuller (k8s CronJob)"]
    end

    %% ---------- Data Flow ----------
    FE_UI -->|Search Query| CoursesAPI

    CoursesAPI -->|Forward to Python| Search

    Search -->|Generate Query Embedding| BGE
    Search -->|Vector Similarity Search| FAISS
    FAISS -->|Threshold-filtered Results| Search

    Search --> CoursesAPI --> FE_UI

    %% Index refresh / data ingestion
    DataPuller -->|Daily 4:10 AM PST| Refresh
    Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
    Refresh -->|Generate Embeddings| BGE --> FAISS

Examples

Input: "Memory models in concurrent programming"

  • Returns courses like databases, operating systems, etc.
  • Doesn't return biology or psychology courses just because of the word "memory".

Input: "how to shot a hot vlog"


Implementation Details

Python Semantic Search Service (FastAPI)

  • FastAPI microservice (apps/semantic-search) that:

    • Uses BGE (BAAI/bge-base-en-v1.5) embedding model optimized for retrieval tasks
    • Builds term-specific embeddings + FAISS indices from GraphQL catalog data
    • Implements threshold-based filtering (returns all results above similarity threshold, not just top-k)
    • Searches top 500 candidates for performance, then filters by threshold (default: 0.45)
  • Key endpoints:

    • /health — readiness probe showing index status
    • /refresh — rebuild FAISS index for a given year/semester
    • /search — semantic query with threshold filtering
  • Model Architecture (sketched below):

    • Uses instruction prefix for queries: "Represent this sentence for searching relevant passages: {query}"
    • Course text format: SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}
    • FAISS IndexFlatIP with L2-normalized embeddings (cosine similarity)
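A minimal sketch of this scheme, assuming sentence-transformers and faiss-cpu are installed; the helper names (build_index, embed_query) and the course dict keys are illustrative, not the actual module API:

# Minimal sketch of the indexing/query scheme described above (illustrative).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def build_index(courses: list[dict]) -> faiss.IndexFlatIP:
    # One text blob per course, matching the format described above
    texts = [
        f"SUBJECT: {c['subject']} NUMBER: {c['courseNumber']}\n"
        f"TITLE: {c['title']}\nDESCRIPTION: {c['description']}"
        for c in courses
    ]
    embeddings = model.encode(texts, normalize_embeddings=True)  # L2-normalized
    index = faiss.IndexFlatIP(int(embeddings.shape[1]))  # inner product == cosine on unit vectors
    index.add(np.asarray(embeddings, dtype="float32"))
    return index

def embed_query(query: str) -> np.ndarray:
    # BGE expects the retrieval instruction prefix on the query side only
    vec = model.encode([QUERY_PREFIX + query], normalize_embeddings=True)
    return np.asarray(vec, dtype="float32")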

Example: manually refreshing an index (local dev)

curl -X POST http://localhost:8080/api/semantic-search/refresh \
     -H 'Content-Type: application/json' \
     -d '{"year": 2026, "semester": "Spring"}' | jq

Example: running a semantic search

# Threshold-based search (returns all courses with similarity > 0.45)
curl "http://localhost:8080/api/semantic-search/search?query=deep%20reinforcement%20learning&year=2026&semester=Spring&threshold=0.45" | jq

# Response includes similarity scores for ranking
{
  "query": "deep reinforcement learning",
  "threshold": 0.45,
  "count": 12,
  "results": [
    {
      "subject": "COMPSCI",
      "courseNumber": "285",
      "score": 0.713,
      "title": "Deep Reinforcement Learning, Decision Making, and Control"
    },
    ...
  ]
}

Backend Integration (Node / Express)

  • Added semanticSearch.url configuration (nested structure for consistency)

  • Implemented lightweight proxy endpoint /api/semantic-search/courses:

    • Forwards requests to Python service
    • Returns only {subject, courseNumber, score} for efficient frontend filtering
    • Frontend maintains API response order (sorted by semantic similarity)
  • Updated GraphQL behavior:

    • Introduced hasCatalogData field for term filtering
    • Updated resolver to use terms(withCatalogData: true)

Frontend (Catalog UI)

  • AI Search toggle (✨ sparkle button) to activate semantic search mode
  • Semantic results preserve backend ordering (by similarity score)
  • Frontend maps semantic results to full course objects for display
  • Graceful fallback to fuzzy search when semantic search unavailable

Kubernetes Deployment (Production-Ready)

Infrastructure Components

Semantic Search Service (infra/app/templates/semantic-search.yaml)

  • Dedicated deployment with persistent volume (5Gi) for FAISS indices
  • ConfigMap for runtime configuration (log level, default term)
  • Service endpoint: bt-prod-semantic-search-svc:8000
  • Health check probe on /health endpoint

Backend Configuration (infra/app/templates/backend.yaml)

  • Environment variable: SEMANTIC_SEARCH_URL=http://bt-prod-semantic-search-svc:8000
  • Automatically injected via ConfigMap

Automatic Index Building (infra/app/values.yaml)

  • Daily cronjob at 4:10 AM PST (semantic-search-refresh)
  • Runs datapuller with --puller=semantic-search-refresh argument
  • Automatically rebuilds indices after catalog data updates (a sketch of the refresh call follows below)
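For reference, the refresh step essentially boils down to a POST against the service. This is an illustrative sketch, not the actual datapuller code:

# Illustrative sketch of what --puller=semantic-search-refresh triggers;
# the real datapuller's internals may differ.
import os
import requests

SEMANTIC_SEARCH_URL = os.environ.get(
    "SEMANTIC_SEARCH_URL", "http://bt-prod-semantic-search-svc:8000"
)

def trigger_refresh(year: int, semester: str) -> None:
    # Ask the semantic-search service to rebuild the FAISS index for one term
    resp = requests.post(
        f"{SEMANTIC_SEARCH_URL}/refresh",
        json={"year": year, "semester": semester},
        timeout=600,  # index builds can take a while
    )
    resp.raise_for_status()

if __name__ == "__main__":
    trigger_refresh(2026, "Spring")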

Deployment Flow

1. Deploy semantic-search service
   └─> PVC created for index storage
   └─> Service starts in "waiting" state (no index yet)

2. Datapuller cronjob runs daily
   └─> Fetches latest catalog data via GraphQL
   └─> Calls /refresh endpoint on semantic-search service
   └─> FAISS index built and persisted to PVC

3. Backend proxies requests
   └─> /api/semantic-search/* → semantic-search-svc:8000
   └─> Results returned to frontend

Technical Decisions

Why BGE over other models?

  • BGE (BAAI General Embedding) is specifically optimized for retrieval tasks and ranks near the top of retrieval benchmarks (e.g., MTEB)
  • Better semantic understanding than the general-purpose models used in Jacky's prototype (all-MiniLM, mpnet)

Why threshold instead of top-k?

  • Threshold-based filtering returns all relevant results, not arbitrary top-k
  • More flexible - can return 5 results for specific queries, 50 for broad queries

However, thresholding and returning every course in a catalog of thousands isn't practical, so the search first retrieves the top 500 candidates and then applies the threshold, ensuring the final set still meets the quality bar (see the sketch below).
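A minimal sketch of the top-500-then-threshold strategy, assuming the index and embed_query helper from the earlier sketch; names and defaults are illustrative:

# Keep every candidate above the threshold instead of a fixed top-k cutoff.
import numpy as np

TOP_K_CANDIDATES = 500

def threshold_search(index, query_vec: np.ndarray, threshold: float = 0.45):
    k = min(TOP_K_CANDIDATES, index.ntotal)
    scores, ids = index.search(query_vec, k)  # cosine scores, best first
    return [
        (int(i), float(s))
        for s, i in zip(scores[0], ids[0])
        if i != -1 and s >= threshold
    ]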

Model Options Available (hardcoded in main.py)

# Current: BAAI/bge-base-en-v1.5 (best for retrieval)
# Alternatives:
#   BAAI/bge-small-en-v1.5       (faster, 33M params)
#   BAAI/bge-large-en-v1.5       (most accurate, 335M params)
#   all-mpnet-base-v2            (general purpose, 110M params)
#   all-MiniLM-L6-v2             (fastest, 22M params)

Migration Guide

For Production (k8s)

  1. Deploy semantic search service:

    helm upgrade bt-prod ./infra/app -n bt
  2. The index will be built automatically by the daily cronjob at 4:10 AM PST

  3. Verify service health:

    kubectl exec -n bt deployment/bt-prod-backend -- \
      curl http://bt-prod-semantic-search-svc:8000/health

Next Steps

  1. Datapuller Integration DONE!
    Automatic index refresh via daily cronjob

  2. Fine-tuning for Berkeley Courses
    Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search

  3. Query Expansion
    Handle abbreviations (NLP → Natural Language Processing) and synonyms; a toy sketch of this idea follows below
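Purely illustrative sketch of the query-expansion idea; the abbreviation map below is hypothetical and not part of this PR:

# Hypothetical abbreviation expansion applied before generating the query embedding.
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
    "os": "operating systems",
}

def expand_query(query: str) -> str:
    # Replace known abbreviations token by token
    tokens = [ABBREVIATIONS.get(token.lower(), token) for token in query.split()]
    return " ".join(tokens)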


Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND
k8s deployment & auto-refresh: This PR

FastAPI service for semantic course search using FAISS and Sentence Transformers. Supports vector similarity search with index persistence to PVC storage.
Add backend routes to proxy semantic search service. Configure SEMANTIC_SEARCH_URL for service communication.
Add semantic-search-refresh puller to automatically rebuild FAISS indexes when course data updates.
Configure Kubernetes deployment with PVC for FAISS indexes, daily cronjob for auto-refresh, and docker-compose for local development.

Review discussion
threshold: float = 0.3,
allowed_subjects: Optional[Iterable[str]] = None,
) -> Tuple[List[Dict], TermIndex]:
entry = self._get_or_build_index(year, semester, allowed_subjects)

A reviewer (Contributor) asked, referring to the snippet above:

What's the motivation behind storing the index on disk? I would assume this wouldn't be very performant

@vaclisinc (author) replied:

The primary motivation for persisting the index is actually local development experience. When using docker compose locally, rebuilding the FAISS index from scratch takes >1 minute every time the stack restarts.

On the server side, this optimization is less critical since the service rarely stops. However, to keep behavior consistent between local and production, I think it's still fine to keep the persistence, since saving the index doesn't impact query performance.
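For context, the persistence is essentially a load-or-build pattern like the following sketch (paths and helper names are illustrative, not the actual code):

# Illustrative load-or-build pattern for the FAISS index.
import os
import faiss

def load_or_build_index(path: str, build_fn):
    if os.path.exists(path):
        return faiss.read_index(path)   # fast path: skip re-embedding on restart
    index = build_fn()                  # slow path: embed the catalog (>1 min locally)
    faiss.write_index(index, path)      # persist to the mounted volume / PVC
    return index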

