AI Semantic course search #1015
FastAPI service for semantic course search using FAISS and Sentence Transformers. Supports vector similarity search with index persistence to PVC storage.
Add backend routes to proxy semantic search service. Configure SEMANTIC_SEARCH_URL for service communication.
Add semantic-search-refresh puller to automatically rebuild FAISS indexes when course data updates.
Configure Kubernetes deployment with PVC for FAISS indexes, daily cronjob for auto-refresh, and docker-compose for local development.
```python
    threshold: float = 0.3,
    allowed_subjects: Optional[Iterable[str]] = None,
) -> Tuple[List[Dict], TermIndex]:
    entry = self._get_or_build_index(year, semester, allowed_subjects)
```
What's the motivation behind storing the index on disk? I would assume this wouldn't be very performant
The primary motivation for persisting the index is actually local development experience. When using docker compose locally, rebuilding the FAISS index from scratch takes >1 minute every time the stack restarts.
On the server side, this optimization is less critical, since the service runs continuously. However, to keep behavior consistent between local and server environments, I think it's still fine to keep persistence, since saving the index to disk doesn't impact query performance.
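To make the trade-off concrete, here is a minimal numpy-only sketch of the persist-then-reload idea (the actual service stores a FAISS index, e.g. via `faiss.write_index`/`faiss.read_index`; the function names here are my own):

```python
import numpy as np

def save_index(embeddings: np.ndarray, path: str) -> None:
    # Persist normalized course embeddings so a restart can skip the
    # expensive rebuild (>1 minute locally, per the comment above)
    np.save(path, embeddings)

def load_index(path: str) -> np.ndarray:
    return np.load(path)

def search(index: np.ndarray, query_vec: np.ndarray, threshold: float = 0.3):
    # Cosine similarity reduces to an inner product on unit vectors;
    # query cost is the same whether the index was built or loaded from disk
    scores = index @ query_vec
    hits = [(i, float(s)) for i, s in enumerate(scores) if s >= threshold]
    return sorted(hits, key=lambda h: -h[1])
```

Loading from disk only affects startup time; every query still runs entirely against the in-memory array.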
Semantic Search Integration
Summary
Finally bringing semantic search to reality!
This PR integrates a FAISS-backed semantic search service using BGE (BAAI General Embedding) models, along with backend proxy endpoints, production-ready k8s deployment, automatic index building via datapuller, and updated catalog UX, enabling AI course search.
System Architecture
```mermaid
flowchart LR
  %% ---------- Frontend ----------
  subgraph Frontend
    FE_UI["Search Bar + AI Search Toggle"]
  end
  %% ---------- Node Backend ----------
  subgraph NodeBackend
    ProxyRouter["/api/semantic-search/* (proxy router)"]
    CoursesAPI["/api/semantic-search/courses (lightweight endpoint)"]
    GraphQLResolvers["GraphQL resolvers + hasCatalogData"]
  end
  %% ---------- Python Semantic Service ----------
  subgraph SemanticService["Semantic Search Service (FastAPI)"]
    Health["/health"]
    Refresh["/refresh (rebuild FAISS index)"]
    Search["/search (threshold-based semantic query)"]
    BGE["BGE Embedding Model"]
    FAISS["FAISS Index (cosine similarity)"]
  end
  %% ---------- Catalog Data Puller ----------
  subgraph CatalogData
    DataPuller["GraphQL Catalog Datapuller (k8s CronJob)"]
  end
  %% ---------- Data Flow ----------
  FE_UI -->|Search Query| CoursesAPI
  CoursesAPI -->|Forward to Python| Search
  Search -->|Generate Query Embedding| BGE
  Search -->|Vector Similarity Search| FAISS
  FAISS -->|Threshold-filtered Results| Search
  Search --> CoursesAPI --> FE_UI
  %% Index refresh / data ingestion
  DataPuller -->|Daily 4:10 AM PST| Refresh
  Refresh -->|Fetch Catalog via GraphQL| GraphQLResolvers
  Refresh -->|Generate Embeddings| BGE --> FAISS
```

Examples
Input: "Memory models in concurrent programming"
Input: "how to shot a hot vlog"
Implementation Details
Python Semantic Search Service (FastAPI)
FastAPI microservice (`apps/semantic-search`).

Key endpoints:

- `/health` — readiness probe showing index status
- `/refresh` — rebuild the FAISS index for a given year/semester
- `/search` — semantic query with threshold filtering

Model Architecture:

- Query prompt: `"Represent this sentence for searching relevant passages: {query}"`
- Document format: `SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}`

Example: manually refreshing an index (local dev)
```sh
curl -X POST http://localhost:8080/api/semantic-search/refresh \
  -H 'Content-Type: application/json' \
  -d '{"year": 2026, "semester": "Spring"}' | jq
```

Example: running a semantic search
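Before either call reaches the model, queries and documents are formatted differently (per the Model Architecture notes above). A minimal sketch of those text builders, with function names of my own:

```python
def build_query_text(query: str) -> str:
    # BGE retrieval models expect this instruction prefix on queries only
    return f"Represent this sentence for searching relevant passages: {query}"

def build_document_text(subj: str, num: str, title: str, desc: str) -> str:
    # Each course is embedded as structured plain text in the documented format
    return f"SUBJECT: {subj} NUMBER: {num}\nTITLE: {title}\nDESCRIPTION: {desc}"
```

The embedding call itself (BGE model + FAISS lookup) happens inside the service and is omitted here.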
Backend Integration (Node / Express)
- Added `semanticSearch.url` configuration (nested structure for consistency)
- Implemented lightweight proxy endpoint `/api/semantic-search/courses`: returns `{subject, courseNumber, score}` for efficient frontend filtering
- Updated GraphQL behavior: added a `hasCatalogData` field for term filtering, plus `terms(withCatalogData: true)`

Frontend (Catalog UI)
Kubernetes Deployment (Production-Ready)
Infrastructure Components
- Semantic Search Service (`infra/app/templates/semantic-search.yaml`): exposed internally at `bt-prod-semantic-search-svc:8000`, with a `/health` endpoint
- Backend Configuration (`infra/app/templates/backend.yaml`): `SEMANTIC_SEARCH_URL=http://bt-prod-semantic-search-svc:8000`
- Automatic Index Building (`infra/app/values.yaml`): a datapuller cronjob (`semantic-search-refresh`) run with the `--puller=semantic-search-refresh` argument

Deployment Flow
Technical Decisions
Why BGE over other models?
Why threshold instead of top-k?
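A quick sketch of the trade-off: top-k always returns k results, even for an off-topic query, while a similarity threshold (the service defaults to 0.3, per the diff above) can legitimately return nothing. Helper names and score values below are illustrative:

```python
from typing import List, Tuple

Scored = List[Tuple[str, float]]  # (course, cosine similarity) pairs

def top_k(results: Scored, k: int = 2) -> Scored:
    # Always returns k courses, even when none are actually relevant
    return sorted(results, key=lambda r: -r[1])[:k]

def above_threshold(results: Scored, threshold: float = 0.3) -> Scored:
    # Keeps only hits clearing the similarity bar; an off-topic query
    # (e.g. "how to shot a hot vlog") can return an empty list
    return [r for r in sorted(results, key=lambda r: -r[1]) if r[1] >= threshold]
```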
Model Options Available (hardcoded in `main.py`)

Migration Guide
For Production (k8s)
Deploy semantic search service:
Index will be automatically built by the daily cronjob at 4:10 AM PST
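For reference, the refresh CronJob could look roughly like this (a sketch only; the resource name, image, and exact schedule are assumptions, not copied from `infra/app/values.yaml`):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: semantic-search-refresh        # hypothetical resource name
spec:
  schedule: "10 12 * * *"              # 4:10 AM PST, assuming a UTC cluster clock
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: datapuller
              image: bt-datapuller:latest        # placeholder image
              args: ["--puller=semantic-search-refresh"]
```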
Verify service health:
```sh
kubectl exec -n bt deployment/bt-prod-backend -- \
  curl http://bt-prod-semantic-search-svc:8000/health
```

Next Steps
✅ Datapuller Integration (DONE!): automatic index refresh via daily cronjob
Fine-tuning for Berkeley Courses
Collect user feedback dataset (query + relevant/irrelevant courses) to fine-tune BGE specifically for Berkeley course search
Query Expansion
Handle abbreviations (NLP → Natural Language Processing) and synonyms
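One possible shape for this step (a sketch; the abbreviation table is hypothetical and would need curation for Berkeley courses):

```python
# Hypothetical abbreviation map for expanding queries before embedding
ABBREVIATIONS = {
    "nlp": "natural language processing",
    "ml": "machine learning",
    "os": "operating systems",
}

def expand_query(query: str) -> str:
    # Lowercase, then replace known abbreviations token by token
    tokens = query.lower().split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
```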
Based on: Initial prototype by Jacky (last semester)
Frontend integration: @PineND
k8s deployment & auto-refresh: This PR