This document details the step-by-step processes for the system's two main phases: Offline Ingestion and Online Retrieval.
The goal of this pipeline is to prepare the ACL Anthology data for semantic search. This is a batch process run periodically (or once).
```mermaid
flowchart LR
    A[Start] --> B[Download Data]
    B --> C[Preprocess Text]
    C --> D[Generate Embeddings]
    D --> E[Index in Qdrant]
    E --> F[Ready for Search]
```
- Source: Uses the `acl-anthology` Python library or direct XML/gzip downloads.
- Action: Iterates through the anthology collection.
- Output: Raw metadata objects (Title, Abstract, Authors, Year, ID).
- Input: Raw metadata.
- Actions:
- Normalization: Unicode normalization (NFKC).
- Cleaning: Removal of LaTeX artifacts, excessive whitespace, and special characters.
- Filtering: Drops entries without abstracts (as they cannot be embedded).
- Output: Cleaned JSON dataset.
- Input: Cleaned abstracts.
- Model: `nomic-ai/nomic-embed-text-v1.5` (or configured model).
- Process:
- Batches text inputs.
- Computes 768-dimensional float vectors.
- Note: This is the most compute-intensive step.
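A sketch of the batching pattern, with the embedding model injected as a callable so the batching logic is independent of the actual model loading (which for this model typically goes through a sentence-embedding library; that detail is an assumption, not shown here):

```python
from typing import Callable

def embed_in_batches(
    texts: list[str],
    embed: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Feed texts to the embedding model in fixed-size batches.

    `embed` maps a batch of strings to one 768-dim vector per string.
    """
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed(texts[start:start + batch_size]))
    return vectors
```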
- Target: Qdrant Vector Database.
- Action: Upserts (Insert/Update) points into the collection.
- Payload: The vector is stored alongside the metadata (Title, ID, etc.) so we don't need a secondary database lookup.
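Conceptually, each upserted point bundles the vector with its payload. A hypothetical point builder (field names and the UUID scheme are assumptions; with the `qdrant-client` library this dict would map onto a point passed to the collection upsert call):

```python
import uuid

def make_point(paper: dict, vector: list[float]) -> dict:
    """Bundle a vector with its metadata payload so a search hit
    carries everything needed to render a result (no second lookup)."""
    return {
        # stable UUID derived from the anthology ID, so re-runs upsert
        # (update) rather than duplicate
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, paper["id"])),
        "vector": vector,
        "payload": {
            "acl_id": paper["id"],
            "title": paper["title"],
            "authors": paper.get("authors", []),
            "year": paper.get("year"),
        },
    }
```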
The goal of this pipeline is to return relevant papers for a user query in real-time.
```mermaid
sequenceDiagram
    participant U as User
    participant API as API Server
    participant FP as Filter Parser (LLM)
    participant LLM as LLM Service
    participant DB as Qdrant
    U->>API: Query: "machine translation"
    API->>FP: Extract filters + semantic query + relevance
    FP-->>API: parsed_filters, semantic_query, is_relevant
    API->>LLM: "Reformulate this query..."
    LLM-->>API: ["NMT", "seq2seq", "attention"]
    loop For each sub-query
        API->>DB: Search(Vector)
        DB-->>API: [Candidate Papers]
    end
    API->>API: Aggregate (RRF Fusion)
    API-->>U: SSE stream (metadata + response chunks)
```
- Parses the user query into:
  - `parsed_filters` (e.g., year range, authors, awards)
  - `semantic_query` (the remaining semantic search intent)
  - `is_relevant` (whether the system should proceed)
- If `is_relevant` is false, the pipeline returns early with a helpful message.
- Checks for an ACL Anthology paper ID either as the entire query or embedded in a longer query.
- If a paper ID is detected, the pipeline looks up that paper and uses its title/abstract to generate search queries.
- Prompt: "You are an AI research assistant. Generate 3 diverse search queries for..."
- Input: User query / Paper abstract.
- Output: A list of strings representing different semantic angles.
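One way to wrap this call, assuming the LLM is asked to reply with a JSON array (the prompt wording beyond the quoted fragment, the fallback behavior, and the `call_llm` callable are all assumptions):

```python
import json
from typing import Callable

REFORMULATE_PROMPT = (
    "You are an AI research assistant. Generate 3 diverse search queries "
    "for the following input. Reply with a JSON array of strings.\n\n{text}"
)

def reformulate(text: str, call_llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM for sub-queries and parse its JSON reply.

    Falls back to the original text if the reply is unparseable.
    """
    reply = call_llm(REFORMULATE_PROMPT.format(text=text))
    try:
        queries = json.loads(reply)
        if isinstance(queries, list) and all(isinstance(q, str) for q in queries):
            return queries
    except json.JSONDecodeError:
        pass
    return [text]
```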
- Each reformulated query string is embedded using the same model as ingestion.
- Searches are executed sequentially (embedding model calls are treated as non-thread-safe).
- Each search returns `k` candidates, where `k = top_k * SEARCH_K_MULTIPLIER`.
- Algorithm: Hybrid fusion combining Reciprocal Rank Fusion (RRF) with raw similarity scores.
- RRF contribution (per query): `rrf += 1 / (RRF_K + rank)`
- Final score: `score = w * avg_similarity + (1 - w) * normalized_rrf`
- Result: A deduplicated, re-ranked list of unique papers.
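The fusion formulas above can be put together as follows. The `RRF_K` and `w` defaults, and normalizing RRF by its maximum, are assumptions about the exact configuration:

```python
def fuse(
    results_per_query: list[list[tuple[str, float]]],
    rrf_k: int = 60,
    w: float = 0.5,
) -> list[tuple[str, float]]:
    """Hybrid fusion: combine RRF with average raw similarity.

    Each inner list holds (paper_id, similarity) pairs ranked best-first,
    one list per reformulated sub-query.
    """
    rrf: dict[str, float] = {}
    sims: dict[str, list[float]] = {}
    for ranked in results_per_query:
        for rank, (pid, sim) in enumerate(ranked, start=1):
            # per-query RRF contribution: 1 / (RRF_K + rank)
            rrf[pid] = rrf.get(pid, 0.0) + 1.0 / (rrf_k + rank)
            sims.setdefault(pid, []).append(sim)
    max_rrf = max(rrf.values(), default=1.0)
    scored = {
        # score = w * avg_similarity + (1 - w) * normalized_rrf
        pid: w * (sum(sims[pid]) / len(sims[pid])) + (1 - w) * (rrf[pid] / max_rrf)
        for pid in rrf
    }
    # deduplicated by construction; sort best-first
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
```

A paper that ranks highly under several sub-queries accumulates RRF mass from each, so consensus hits rise above one-off matches.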
- The server returns a Server-Sent Events (SSE) stream.
- The first event contains `metadata` (results, filters, reformulated queries, timestamps).
- The server then streams response `chunk` events produced by the LLM response synthesizer.