The ACL Anthology hosts tens of thousands of NLP research papers. Traditional keyword-based search often fails to capture the semantic nuance of research queries, struggling with synonyms, paraphrases, and conceptual similarity.
ACL Anthology RAG bridges this gap by implementing a semantic retrieval system. It moves beyond simple keyword matching to understand the meaning behind a query, allowing researchers to discover relevant work even when they don't know the exact terminology.
- Semantic Search: Uses dense vector embeddings to find conceptually similar papers.
- Query Reformulation: Leverages LLMs to expand user queries into multiple search vectors, improving recall.
- Dual Query Modes: Supports both natural language questions and "Paper as Query" (using a paper ID to find related work).
- Unified Pipeline: A consistent architecture for handling different types of inputs.
- Modern Stack: FastAPI, LangChain, Fireworks (embeddings), Groq (LLM), Qdrant, React.
- Interactive UI: Clean, responsive interface with real-time streaming responses and inline citations.
The system follows a Retrieval-Augmented Generation (RAG) pattern, though currently focused on the retrieval aspect.
Note: Fireworks is used only for embedding generation. LLM-based query reformulation uses Groq.
The web interface provides:
- Real-time streaming of search responses as they're generated
- Inline citations with hover previews showing paper details, authors, and PDF links
- Monitoring panel displaying query processing pipeline, filters, and timing information
- Clean, responsive design optimized for research workflows
Best for: Exploratory research, topic discovery.
- Input: "How do I improve low-resource translation?"
- Process: The system expands this into multiple semantic queries (e.g., "low-resource NMT techniques", "data augmentation for translation").
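The reformulation step above can be sketched as follows. This is a minimal illustration of the parsing half only: the actual pipeline prompts Groq for the expansions, which we simulate with a fixed string, and `parse_reformulations` is a hypothetical helper name, not the project's API.

```python
def parse_reformulations(llm_response: str, original: str, max_queries: int = 4) -> list[str]:
    """Parse an LLM's list-formatted query expansions, keeping the original query first."""
    queries = [original]
    for line in llm_response.splitlines():
        # Strip list markers ("1.", "-", "*") and surrounding quotes/whitespace.
        q = line.strip().strip("-*0123456789. ").strip('"')
        if q and q.lower() not in (s.lower() for s in queries):
            queries.append(q)
    return queries[:max_queries]


# Simulated LLM response to the query "How do I improve low-resource translation?"
response = """1. low-resource NMT techniques
2. data augmentation for machine translation
3. transfer learning for low-resource languages"""

print(parse_reformulations(response, "How do I improve low-resource translation?"))
```

Each returned query is then embedded and searched independently; the results are fused downstream.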
Best for: Finding related work, literature review.
- Input: `2023.acl-long.412`
- Process: The system fetches the abstract of the specified paper and uses it as a semantic proxy to find other papers in the same research neighborhood.
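A minimal sketch of how both input types can flow through one resolution step, which is what makes the unified pipeline possible. The in-memory `PAPERS` map and the `resolve_query` name are illustrative stand-ins, not the project's actual metadata store or API.

```python
# Toy metadata store; in the real system, metadata comes from the ACL Anthology.
PAPERS = {
    "2023.acl-long.412": {
        "title": "Example Paper",
        "abstract": "We study data augmentation for low-resource machine translation.",
    },
}

def resolve_query(user_input: str) -> str:
    """Return the text that gets embedded for retrieval.

    If the input matches a known Anthology ID, the paper's abstract serves as
    a semantic proxy; otherwise the input is treated as a free-text question.
    """
    candidate = user_input.strip()
    if candidate in PAPERS:
        return PAPERS[candidate]["abstract"]
    return candidate

print(resolve_query("2023.acl-long.412"))
print(resolve_query("How do I improve low-resource translation?"))
```

After this step, both modes look identical to the rest of the pipeline: a piece of text to embed and search.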
- Download: Metadata and abstracts are fetched from the ACL Anthology.
- Preprocess: Text is cleaned, normalized, and formatted.
- Embed: Dense vectors are generated for each abstract.
- Index: Vectors are stored in Qdrant for fast retrieval.
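The offline steps above can be mimicked end to end with a toy in-memory index. In production the embeddings come from the Fireworks API and the vectors live in Qdrant; here a bag-of-words "embedding" and a Python list stand in for both, purely to show the ingest/search shape.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in for a dense embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

index = []  # (paper_id, vector) pairs; Qdrant plays this role in the real system

def ingest(paper_id: str, abstract: str) -> None:
    index.append((paper_id, embed(abstract)))

def search(query: str, k: int = 3) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda p: cosine(qv, p[1]), reverse=True)
    return [pid for pid, _ in ranked[:k]]

ingest("P1", "neural machine translation for low-resource languages")
ingest("P2", "sentiment analysis of product reviews")
print(search("low-resource translation"))  # P1 ranks first
```

Swapping the toy `embed` for an API-backed model and the list for a Qdrant collection gives the real pipeline without changing this overall structure.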
- Receive: User input (text or ID) is received.
- Reformulate: LLM generates multiple search queries.
- Retrieve: Vector search finds candidate papers for each query.
- Aggregate: Reciprocal Rank Fusion (RRF) combines and ranks results.
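The aggregation step can be sketched as a straightforward Reciprocal Rank Fusion: each reformulated query produces its own ranked list, and a paper's fused score is the sum of `1 / (k + rank)` over every list it appears in. `k = 60` is the conventional constant from the original RRF formulation; the project's actual constant and tie-breaking may differ.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists into one ranking via RRF."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, paper_id in enumerate(ranking, start=1):
            scores[paper_id] = scores.get(paper_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A paper ranked well by several queries beats one ranked first by only one.
fused = rrf([
    ["A", "B", "C"],   # results for reformulated query 1
    ["B", "A", "D"],   # results for reformulated query 2
    ["B", "C", "A"],   # results for reformulated query 3
])
print(fused)  # ['B', 'A', 'C', 'D']
```

RRF needs only ranks, not raw similarity scores, which makes it robust when fusing result lists whose score scales are not comparable.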
- Python 3.12+
- Node.js 20+
- Access to a Qdrant endpoint (local install or Qdrant Cloud)
1. Clone the repository

   ```sh
   git clone https://github.com/nnayz/acl-anthology-rag.git
   cd acl-anthology-rag
   ```

2. Configure environment

   Copy `.env.example` to `.env` in `api/` and fill in your API keys (Groq/Fireworks):

   ```sh
   cp api/.env.example api/.env
   ```

   Then set the Qdrant connection in `api/.env`:
   - Set `QDRANT_ENDPOINT` to your running instance (e.g., `http://localhost:6333` for a local install, or your Qdrant Cloud URL).
   - Optionally set `QDRANT_API_KEY` if using Qdrant Cloud.

3. Run the backend

   ```sh
   cd api
   uv sync
   uv run fastapi dev app.py
   ```

4. Run the frontend

   ```sh
   cd client
   npm install
   npm run dev
   ```
See Installation Guide for detailed setup.
```
.
├── api/                 # Backend (FastAPI)
│   ├── src/
│   │   ├── ingestion/   # Data processing pipeline
│   │   ├── retrieval/   # Search logic & ranking
│   │   ├── llm/         # LLM integration
│   │   └── vectorstore/ # Qdrant interface
├── client/              # Frontend (React)
├── docs/                # Documentation
```
- Architecture: Deep dive into system components and design.
- Installation: Detailed setup and troubleshooting.
- Configuration: Environment variables and settings.
- Usage: How to use the system effectively.
- Workflows: Detailed offline and online pipeline steps.
- Evaluation: Offline evaluation pipeline, metrics, and reports.
- Operates on abstracts only (no full-text indexing).
- Requires external services: Groq (LLM for reformulation) and a Qdrant endpoint.
- Embedding generation via Fireworks API; throughput and cost depend on API limits.
- Limited quantitative evaluation included (primarily qualitative relevance checks).
- No cross-encoder re-ranking; ranking relies on dense similarity and RRF fusion.
- Not productionized (no auth, observability, or autoscaling in scope).
- Add hybrid retrieval (dense + BM25) and cross-encoder re-ranking.
- Incorporate full-text indexing and PDF parsing.
- Add user feedback loops and relevance learning.
- Enhance UI with advanced filters, export options, and saved searches.
- Batch/streaming ingestion pipeline with resumability and monitoring.
- Expand evaluation with curated benchmarks and inter-annotator agreement.
MIT License
