This document provides a detailed breakdown of the ACL Anthology RAG system architecture, including its major components, design patterns, and data flow.
The system is designed as a modular Retrieval-Augmented Generation (RAG) pipeline. It decouples the offline ingestion of data from the online retrieval of information, allowing for scalable and efficient searching.
```mermaid
graph TD
    subgraph "Frontend"
        UI[React UI]
    end
    subgraph "Backend API"
        API[FastAPI Router]
        QP[Query Processor]
        Agg[Result Aggregator]
    end
    subgraph "Core Logic"
        Ref[LLM Reformulator]
        Emb[Embedding Service]
        Ret[Retrieval Pipeline]
    end
    subgraph "Data Storage"
        Qdrant[Qdrant Vector DB]
        Raw[Raw ACL Data]
    end
    UI <--> API
    API --> QP
    QP --> Ref
    Ref --> Emb
    Emb --> Ret
    Ret <--> Qdrant
    Ret --> Agg
    Agg --> API
```
- Tech Stack: React 19, Vite, TailwindCSS, shadcn/ui.
- Responsibilities:
- Accepts user input (Natural Language or Paper ID).
- Displays retrieved papers with metadata (Title, Abstract, Year, Authors).
- Visualizes the search process (loading states, query expansion results).
- Key Components:
- `SearchInterface`: Main input component handling mode switching.
- `ResultCard`: Displays individual paper details.
- Tech Stack: FastAPI, Python 3.12.
- Responsibilities:
- Exposes REST endpoints for search.
- Handles request validation and error management.
- Orchestrates the retrieval pipeline.
- Key Modules:
- `api/app.py`: FastAPI application entrypoint (Vercel-friendly).
- `api/src/api/routes.py`: Defines endpoints like `POST /api/search` (SSE) and `GET /api/paper/{paper_id}`.
- Role: Interprets the raw user input.
- Logic:
- If input is a Paper ID (e.g., `2023.acl-long.1`), the pipeline looks up the paper in the Qdrant payload and uses its title/abstract as the basis for generating search queries.
- If input is Natural Language, it is passed directly to the next stage.
- Design Pattern: Strategy Pattern (handles different input types uniformly).
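The strategy selection can be sketched as a small dispatcher that classifies the raw input. The regex below matches IDs shaped like `2023.acl-long.1`; both the pattern and the function name are illustrative assumptions:

```python
import re

# Anthology IDs look like "2023.acl-long.1" (year.venue-track.number);
# the exact pattern used by the real pipeline may differ.
PAPER_ID_RE = re.compile(r"^\d{4}\.[\w-]+\.\d+$")

def classify_input(user_input: str) -> str:
    """Return the strategy key: 'paper_id' or 'natural_language'."""
    return "paper_id" if PAPER_ID_RE.match(user_input.strip()) else "natural_language"
```

The router can then map each key to a handler (ID lookup vs. direct pass-through), which is what makes the two input types uniform downstream.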
- Role: Extracts structured filters (year/authors/etc.), a remaining semantic query, and performs permissive relevance gating.
- Notes:
- Runs before retrieval and can return early for clearly irrelevant queries.
- Role: Expands the semantic query into multiple search queries.
- Logic:
- Uses Groq LLM and returns a JSON list of query strings.
- Improves recall by covering different facets of the information need.
- Configuration: Adjustable number of generated queries (default: 3).
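The parsing side of the reformulator can be sketched as follows, assuming the Groq call returns a JSON array of strings as described above (function and variable names are illustrative):

```python
import json

def parse_query_expansion(llm_output: str, n_queries: int = 3) -> list[str]:
    """Parse the LLM's JSON response into at most n_queries search strings.

    Falls back to an empty list if the model returned malformed JSON.
    """
    try:
        queries = json.loads(llm_output)
    except json.JSONDecodeError:
        return []
    return [q for q in queries if isinstance(q, str)][:n_queries]

# A well-formed response is assumed to look like this:
raw = ('["Neural Machine Translation state of the art", '
       '"Low-resource language translation", '
       '"Transformer based translation models"]')
```

Capping the list at `n_queries` is how the "default: 3" configuration would be enforced even if the model over-generates.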
- Role: Converts text into dense vectors.
- Provider: Fireworks embeddings via LangChain.
- Model: Defaults to `nomic-ai/nomic-embed-text-v1.5`.
- Usage:
- Offline: Embeds all paper abstracts.
- Online: Embeds the reformulated search queries.
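Both the offline path (all abstracts) and the online path (a handful of reformulated queries) can share one batching helper before calling the embedding provider. The sketch below is illustrative; the batch size of 64 is an assumption, not a documented setting:

```python
def batched(texts: list[str], batch_size: int = 64) -> list[list[str]]:
    """Split texts into fixed-size batches before sending them to the embedding API."""
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
```

Reusing the same helper (and the same model) for indexing and querying is what keeps the two vector spaces consistent.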
- Role: Executes the search against the vector database.
- Logic:
- Performs nearest neighbor search for each of the reformulated queries.
- Retrieves `k` candidates for each query.
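In production the nearest neighbor search runs inside Qdrant; the pure-Python sketch below only illustrates the semantics of one top-`k` cosine search per reformulated query (all names are illustrative, and real abstracts have 768-dim vectors rather than the toy 2-dim ones shown):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the paper IDs of the k nearest neighbors by cosine similarity."""
    ranked = sorted(index, key=lambda pid: cosine(query_vec, index[pid]), reverse=True)
    return ranked[:k]
```

Running this once per reformulated query yields several candidate lists, which is exactly what the aggregator below consumes.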
- Role: Merges results from multiple queries into a single ranked list.
- Algorithm: Reciprocal Rank Fusion (RRF).
- Logic:
- Assigns scores based on the rank of a paper in each result set.
- Favors papers that appear in multiple result sets or rank highly in single sets.
- Deduplicates papers.
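RRF is compact enough to show in full. A minimal sketch, assuming the conventional constant `k = 60` (the constant actually used by the aggregator is not documented here):

```python
def rrf_fuse(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, paper_id in enumerate(results, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    # Sorting the score map by fused score both ranks and deduplicates papers.
    return sorted(scores, key=scores.get, reverse=True)
```

A paper appearing in several lists accumulates several reciprocal-rank terms, which is why cross-query agreement outranks a single high placement.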
- Role: Streams a markdown response grounded in the retrieved results.
- Notes:
- Uses bracketed numeric citations like `[1]`, `[2]` aligned to the ranked results.
- Tech Stack: Qdrant (Dockerized).
- Role: Stores abstract embeddings and metadata (Title, URL, Year).
- Schema:
- `vector`: 768-dim float array.
- `payload`: JSON object with paper metadata.
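A sketch of how a point for this schema might be assembled before upsert. Qdrant point IDs must be integers or UUIDs, so the Anthology ID is kept in the payload here; the `paper_id` key (and any payload key beyond Title, URL, Year) is an assumption:

```python
def make_point(point_id: int, paper_id: str, vector: list[float],
               title: str, url: str, year: int) -> dict:
    """Build a Qdrant-style point: a 768-dim vector plus a metadata payload."""
    assert len(vector) == 768, "vector must match the collection dimensionality"
    return {
        "id": point_id,
        "vector": vector,
        "payload": {"paper_id": paper_id, "title": title, "url": url, "year": year},
    }
```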
While RAG is typically used to generate a final answer, here the "Generation" step primarily drives Query Expansion, and the final output is the retrieved documents themselves. This pattern is often called RAG for Retrieval.
Used to combine results from the multiple generated queries without needing complex re-ranking models. It provides a robust way to fuse rankings.
- Ingestion is strictly offline.
- Retrieval is strictly read-only online.
- Shared Components (like Embedding) are reused to ensure consistency between indexing and querying.
- Download: The `acl-anthology` library fetches metadata.
- Clean: Text is normalized (Unicode normalization, whitespace stripping).
- Embed: Batches of abstracts are sent to the embedding model.
- Upsert: Vectors + Payload are pushed to Qdrant.
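The Clean step above can be sketched in a few lines; the choice of NFKC as the normalization form is an assumption (the source only says "unicode normalization"):

```python
import unicodedata

def clean_text(raw: str) -> str:
    """Normalize Unicode (NFKC) and collapse runs of whitespace to single spaces."""
    normalized = unicodedata.normalize("NFKC", raw)
    return " ".join(normalized.split())
```

NFKC is a sensible default for scraped abstracts because it folds compatibility characters (e.g. the `ﬁ` ligature) into their plain equivalents before embedding.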
- Input: User provides "Machine Translation".
- Parse filters: LLM extracts structured filters + semantic search intent.
- Reformulate: LLM generates:
- "Neural Machine Translation state of the art"
- "Low-resource language translation"
- "Transformer based translation models"
- Embed: All strings are embedded.
- Search: Qdrant runs multiple searches (one per query).
- Fuse: Hybrid RRF + score fusion combines the ranked lists.
- Return: Results + response are streamed to the UI via SSE.
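The "Hybrid RRF + score fusion" step is not specified in detail; one plausible sketch blends RRF rank evidence with per-list max-normalized similarity scores. The `alpha` weighting, the normalization scheme, and all names below are assumptions:

```python
def hybrid_fuse(result_lists: list[list[tuple[str, float]]],
                k: int = 60, alpha: float = 0.5) -> list[str]:
    """Blend RRF rank scores with normalized similarity scores.

    Each result list holds (paper_id, similarity) pairs, best first.
    """
    fused: dict[str, float] = {}
    for results in result_lists:
        max_score = max((s for _, s in results), default=0.0)
        if max_score <= 0.0:
            max_score = 1.0  # avoid division issues on degenerate score lists
        for rank, (pid, score) in enumerate(results, start=1):
            contribution = alpha * (1.0 / (k + rank)) + (1 - alpha) * (score / max_score)
            fused[pid] = fused.get(pid, 0.0) + contribution
    return sorted(fused, key=fused.get, reverse=True)
```

Setting `alpha = 1.0` recovers plain RRF, so the blend can be tuned without changing the pipeline shape.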
- New Embedding Models: Change `EMBEDDING_MODEL` in config.
- Different LLMs: Switch providers in `api/src/llm/`.
- Hybrid Search: Add keyword-based search (BM25) to Qdrant and fuse with dense vectors.