
Scientific Literature RAG - Introduction Generator

Python 3.11+ License: Apache 2.0 Status: Production Ready RAG LLM GPU ROUGE Evaluation Dash UI ChromaDB SPECTER2

A production-ready AI-powered system for automatically generating well-structured, literature-informed introductions for scientific papers. Using Retrieval-Augmented Generation (RAG) with semantic search, this tool indexes a corpus of research papers and leverages language models to synthesize relevant literature into comprehensive introductions with properly formatted citations.

Supported LLM Providers: Claude (Anthropic), OpenAI (GPT-4/GPT-4o), Google Gemini, and more.

RAG Interface in Action


Watch a live demo of the Scientific Literature RAG in action

Features

  • Multi-Provider LLM Support: Use Claude, OpenAI, Google Gemini, or other compatible models
  • Scientific Paper Processing: Intelligent PDF extraction with semantic chunking (500-word chunks with 50-word overlap)
  • SPECTER2 Embeddings: Domain-optimized embeddings trained on scientific literature, achieving superior semantic understanding
  • GPU Acceleration: Metal GPU support for Apple Silicon, CUDA for NVIDIA, automatic CPU fallback
  • Vector Search: ChromaDB-powered semantic similarity search with cosine distance metric
  • Automatic Citations: Extracts and formats references in BibTeX with sophisticated metadata extraction
  • No External Database Required: Lightweight, persistent local vector storage
  • Graceful API Fallback: Works with or without API keys—generates prompts for manual use via web interface
  • Interactive & Auto Modes: CLI supports both interactive menu and automatic pipeline
  • Fully Configurable: Centralized TOML-based configuration for all parameters
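
The 500-word chunking with 50-word overlap described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the project's actual document_processor implementation:

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (simplified sketch)."""
    words = text.split()
    step = chunk_size - overlap  # advance 450 words per chunk by default
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means the last 50 words of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is still retrievable as a whole.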

System Architecture

┌─────────────────────────┐
│   PDF Papers            │
│   (data/papers/)        │
└────────┬────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Document Processor            │
│  - Extract text + metadata     │
│  - Semantic chunking (500w)    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Embedding Model               │
│  - SPECTER2 (scientific)       │
│  - GPU acceleration (Metal/    │
│    CUDA/CPU fallback)          │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Vector Database               │
│  - ChromaDB (persistent)       │
│  - Cosine similarity search    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Retriever                     │
│  - Top-k semantic search       │
│  - Metadata-aware filtering    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Prompt Builder                │
│  - Context assembly            │
│  - Literature formatting       │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  LLM Provider                  │
│  - Claude, OpenAI, Gemini      │
│  - Generate introduction       │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  BibTeX Extractor              │
│  - Metadata extraction         │
│  - Format citations            │
└────────────────────────────────┘
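
The cosine-distance search at the heart of the Vector Database stage reduces to a small amount of arithmetic. The sketch below illustrates the metric in plain Python; the real system delegates this to ChromaDB's indexed search:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as used for vector search: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k nearest vectors by cosine distance."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]
```

Identical directions give a distance of 0, orthogonal vectors a distance of 1, so smaller is more similar.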

Installation

Prerequisites

  • Python 3.11+
  • macOS with Apple Silicon (for Metal GPU support) or any system with CUDA/CPU
  • uv package manager

Setup

  1. Clone the repository:
git clone https://github.com/jvachier/scientific-literature-rag.git
cd scientific-literature-rag
  2. Install dependencies with uv:
# uv will automatically create a virtual environment and install dependencies
uv sync
  3. Configure your LLM provider (choose one):

Option A: Anthropic Claude (Recommended)

export ANTHROPIC_API_KEY='your-api-key'
# Set in config.toml: provider = "anthropic"

Option B: OpenAI

export OPENAI_API_KEY='your-api-key'
# Set in config.toml: provider = "openai"

Option C: Google Gemini

export GOOGLE_API_KEY='your-api-key'
# Set in config.toml: provider = "google"

For permanent setup, add to your ~/.zshrc or ~/.bashrc:

# Choose your preferred provider
echo 'export ANTHROPIC_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export OPENAI_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export GOOGLE_API_KEY="your-key"' >> ~/.zshrc

source ~/.zshrc

Project Structure

scientific-literature-rag/
├── src/
│   ├── __init__.py
│   ├── config.py                   # Configuration loader
│   ├── document_processor.py       # PDF extraction & chunking
│   ├── embeddings.py               # SPECTER2 embeddings with Metal GPU
│   ├── retriever.py                # ChromaDB vector search
│   ├── prompt_builder.py           # Claude prompt templates
│   ├── bibtex_extractor.py         # Citation formatting
│   └── introduction_generator.py   # Main orchestrator
├── data/
│   └── papers/                     # Place your PDF papers here
├── output/
│   ├── introductions/              # Generated introductions
│   └── references/                 # BibTeX files
├── chroma_db/                      # Vector database (auto-created)
├── config.toml                     # Configuration file
├── main.py                         # CLI entry point
├── pyproject.toml                  # Project dependencies
├── USAGE.md                        # Detailed usage guide
└── README.md

Usage

Quick Start (Auto Pipeline Mode - Recommended)

The easiest way to use the system is with the automatic pipeline:

1. Add your research papers:

cp /path/to/your/papers/*.pdf data/papers/

2. Run the auto pipeline:

uv run python main.py --auto

This will:

  1. Automatically index all PDFs (or use existing index)
  2. Prompt you for research topic and context
  3. Generate introduction with proper BibTeX citations
  4. Save everything to output/ directory

Example session:

$ uv run python main.py --auto

╔═══════════════════════════════════════════════════════════╗
║   Scientific Literature RAG - AUTO PIPELINE MODE          ║
║   Automatic Indexing + Introduction Generation            ║
╚═══════════════════════════════════════════════════════════╝

Found 6 PDF files in data/papers/

============================================================
RESEARCH INFORMATION
============================================================

Research topic/title: Phase Transitions in Neural Networks

Research context (background, objectives, etc.):
This study investigates phase transitions in deep neural networks
during training, examining critical points and their implications
for optimization strategies.
[Press Enter twice]

Output filename (without extension) [introduction]: neural_phase_transitions

============================================================
GENERATING INTRODUCTION
============================================================
Retrieving 15 relevant chunks...
Retrieved 15 chunks from literature
Calling Claude API (claude-3-5-sonnet-20241022)...
Introduction generated successfully
Generating BibTeX references...

============================================================
PIPELINE COMPLETE!
============================================================
✓ Introduction: output/introductions/neural_phase_transitions.txt
✓ BibTeX file: output/references/neural_phase_transitions.bib
============================================================

Advanced: Interactive Menu Mode

For more control, use the interactive menu:

uv run python main.py --interactive

This provides a menu with options to:

  1. Index papers (first time or to update)
  2. Generate introduction
  3. View collection stats
  4. Exit

Generated Output

Files are saved in the output/ directory:

  • output/introductions/<name>.txt - Generated introduction with citations
  • output/references/<name>.bib - LaTeX-ready BibTeX file

Using BibTeX Citations in LaTeX

The generated .bib file is ready to use directly in your LaTeX documents:

\documentclass{article}
\bibliographystyle{plain}

\begin{document}

\section{Introduction}
Phase transitions in neural networks have been extensively studied
\cite{Ziyin2022, Halverson2021}...

\bibliography{references/neural_phase_transitions}
\end{document}

Example Output

Files are saved in the output/ directory after running the pipeline:

Generated Introduction

File: output/introductions/introduction_example.md

Active matter comprises systems of self-propelled particles that continuously consume
energy to generate directed motion, representing a frontier in soft matter physics and
nonequilibrium statistical mechanics. This fundamental departure from passive systems
in thermal equilibrium creates rich dynamical phenomena and novel collective behaviors
inaccessible to quiescent materials. The defining characteristic of active matter lies
in the intrinsic capacity of particles to convert chemical, thermal, or other energy
sources into mechanical work, driving the system far from equilibrium and enabling
self-organization, directed transport, and complex emergent dynamics.

Active particles span biological and synthetic scales with diverse mechanisms of
propulsion. In biological contexts, active particles frequently compete with multiple
environmental forces simultaneously. For instance, microorganisms embedded in ice
develop sophisticated strategies including the secretion of exopolymeric substances
(EPS) and antifreeze glycoproteins (AFP) that enhance interfacial liquidity, thereby
modifying their interaction with their frozen environment [Wettlaufer]. Understanding
how such bioparticles respond to environmental forcing—including temperature and
chemical gradients—proves essential for applications spanning astrobiology,
paleoclimatology, and materials science [VachierPhysicalReview2021].

[continues with well-structured, cited introduction]

BibTeX References

File: output/references/test_test.bib

The BibTeX extractor generates complete, LaTeX-ready citations with:

  • Proper author formatting
  • Title, year, journal, volume, pages
  • DOI when available
  • Clean citation keys (AuthorYYYY format)

@misc{CalvinKLee2020,
 author = {Lee, C. and others},
 note = {Source: lee-et-al-2020-social-cooperativity-of-bacteria-during-reversible-surface-attachment-in-young-biofilms-a-quantitative.pdf},
 title = {Social Cooperativity of Bacteria during Reversible Surface Attachment in Young Biofilms: a Quantitative Comparison of Pseudomonas aeruginosa PA14 and PAO1},
 year = {2020}
}

@misc{FragkopoulosAA2021,
 author = {Fragkopoulos, A. and others},
 doi = {10.1098/rsif.2021.0553},
 note = {Source: fragkopoulos-et-al-2021-self-generated-oxygen-gradients-control-collective-aggregation-of-photosynthetic-microbes.pdf},
 title = {Self-generated oxygen gradients control collective aggregation of photosynthetic microbes},
 year = {2021}
}

@article{VachierPhysicalReview2021,
 author = {Vachier, J.},
 journal = {Physical Review E},
 note = {Source: PhysRevE.105.024601.pdf},
 pages = {024601},
 title = {Premelting controlled active matter in ice},
 volume = {105},
 year = {2021}
}

These entries work directly in LaTeX with \cite{VachierPhysicalReview2021} commands!
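
For illustration, the AuthorYYYY key format mentioned above can be sketched as a small helper. This is a hypothetical function, not the project's bibtex_extractor:

```python
import re

def citation_key(author_last: str, year: int) -> str:
    """Build an AuthorYYYY citation key, stripping non-alphanumerics (sketch)."""
    clean = re.sub(r"[^A-Za-z0-9]", "", author_last)  # drop apostrophes, hyphens, spaces
    return f"{clean.capitalize()}{year}"
```

For example, citation_key("Vachier", 2021) yields a key usable directly in a \cite command.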

Configuration

All system settings are centralized in config.toml:

[paths]
data_dir = "./data/papers"
chroma_dir = "./chroma_db"
output_dir = "./output"

[processing]
chunk_size = 500          # Words per chunk
chunk_overlap = 50        # Overlap between chunks

[embedding]
model_name = "allenai/specter2_base"
use_mps = true            # Use Metal GPU on Apple Silicon
batch_size = 32

[retrieval]
collection_name = "scientific_papers"
n_chunks = 10             # Number of chunks to retrieve
similarity_metric = "cosine"

[generation]
provider = "anthropic"    # Options: anthropic, openai, google
model = "claude-3-5-sonnet-20241022"
max_tokens = 2000
temperature = 0.7

[logging]
level = "INFO"
format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

Common Customizations

Change chunk size for different document types:

[processing]
chunk_size = 1000  # Larger chunks for review articles
chunk_overlap = 100

Use different embedding model:

[embedding]
model_name = "allenai/specter2_aug2020"  # Alternative SPECTER version
use_mps = false  # Disable GPU on non-Apple systems

Adjust retrieval for more context:

[retrieval]
n_chunks = 15  # Retrieve more chunks for complex topics

Use different LLM models:

# Claude (Anthropic)
[generation]
provider = "anthropic"
model = "claude-3-5-sonnet-20241022"  # Best quality
# model = "claude-3-haiku-20240307"   # Faster, cheaper

# OpenAI
[generation]
provider = "openai"
model = "gpt-4-turbo"                 # High quality
# model = "gpt-4o"                     # Latest, efficient
# model = "gpt-3.5-turbo"              # Cost-effective

# Google Gemini
[generation]
provider = "google"
model = "gemini-pro"
# model = "gemini-1.5-pro"             # Latest version

Customize prompts for your domain:

[generation.prompts]
system_prompt = """You are an expert in theoretical physics and quantum mechanics..."""

introduction_template = """Write an introduction focusing on mathematical rigor..."""

See config.toml for full prompt templates that can be customized.

Performance

Embedding & Search (Apple M1/M2/M3 with Metal GPU):

  • Embedding generation: 50-100 chunks/second
  • Vector search: <100ms for 10 results
  • Total indexing: ~1 minute per 100 PDFs

Introduction Generation:

  • Claude API: 10-30 seconds per introduction
  • OpenAI GPT-4: 15-40 seconds per introduction
  • Google Gemini: 10-25 seconds per introduction
  • Times vary based on API latency and model selection

Typical API Costs (per introduction):

  • Claude 3.5 Sonnet: $0.05-0.10
  • OpenAI GPT-4-Turbo: $0.08-0.15
  • Google Gemini: $0.02-0.05

RAG Evaluation with ROUGE Scores

The system includes a comprehensive evaluation module for assessing the quality of generated introductions using ROUGE metrics.

ROUGE Metrics Overview

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures text similarity between generated and reference texts:

  • ROUGE-1: Unigram (single word) overlap between texts
  • ROUGE-2: Bigram (two-word sequence) overlap
  • ROUGE-L: Longest Common Subsequence similarity

Each metric reports:

  • Precision: fraction of the generated text's n-grams that also appear in the reference
  • Recall: fraction of the reference's n-grams that appear in the generated text
  • F1: Harmonic mean of precision and recall
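
A simplified, whitespace-tokenized version of the ROUGE-1 quantities can be computed as below. Real implementations (such as the rouge-score package) add proper tokenization and stemming; this sketch only illustrates the arithmetic:

```python
from collections import Counter

def rouge1(reference: str, generated: str) -> tuple[float, float, float]:
    """Unigram-overlap ROUGE-1 precision, recall, and F1 (simplified sketch)."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())  # clipped unigram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note how a verbose generation with perfect recall can still score a modest F1 when precision is low, which is exactly the pattern in the example results further below.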

Using the Evaluator

from src.rag_evaluator import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate a single generation
reference = "Your reference introduction text..."
generated = "Generated introduction from the RAG system..."

results = evaluator.evaluate_generation(reference, generated)

# Access ROUGE scores
print(f"ROUGE-1 F1: {results['rouge_scores']['rouge1']['f1']}")
print(f"ROUGE-2 F1: {results['rouge_scores']['rouge2']['f1']}")
print(f"ROUGE-L F1: {results['rouge_scores']['rougeL']['f1']}")

# Access text statistics
print(f"Token count: {results['text_statistics']['token_count']}")
print(f"Type-Token Ratio: {results['text_statistics']['type_token_ratio']}")

Batch Evaluation

Evaluate multiple introductions against reference texts:

references = ["ref_intro_1", "ref_intro_2", "ref_intro_3"]
generated = ["gen_intro_1", "gen_intro_2", "gen_intro_3"]

batch_results = evaluator.batch_evaluate(references, generated)

print(f"Average ROUGE-1 F1: {batch_results['rouge_averages']['rouge1_f1_avg']}")
print(f"Average token count: {batch_results['text_statistics_averages']['avg_token_count']}")

Evaluation with Retrieved Documents

Include retrieved documents in evaluation to assess retrieval quality:

retrieved_docs = ["chunk_1", "chunk_2", "chunk_3"]

results = evaluator.evaluate_generation(
    reference=reference,
    generated=generated,
    retrieved_docs=retrieved_docs
)

# Evaluate how well retrieved docs match the reference
print(f"Avg retrieval relevance: {results['retrieval_metrics']['average_relevance_score']}")

Metrics Interpretation

Good ROUGE Scores (>0.4):

  • Indicates substantial semantic overlap
  • Generated text captures main points from reference
  • Suitable for automatic evaluation

Moderate ROUGE Scores (0.2-0.4):

  • Some semantic overlap
  • Paraphrasing or different phrasing reduces scores
  • May still be acceptable depending on use case

Low ROUGE Scores (<0.2):

  • Minimal overlap with reference
  • May indicate retrieval or generation issues
  • Review retrieved documents and prompt engineering

Example Evaluation Results

Evaluation of the example introduction "Introduction to Active Matter" (from output/introductions/introduction_example.md) against reference content from the first three paragraphs:

Metric     F1 Score   Precision   Recall
ROUGE-1    0.5245     0.3555      1.0000
ROUGE-2    0.5178     0.3507      0.9890
ROUGE-L    0.5245     0.3555      1.0000

Generation Statistics:

  • Token Count: 758 tokens
  • Sentence Count: 33 sentences
  • Average Sentence Length: 22.97 tokens
  • Type-Token Ratio: 0.5554 (good vocabulary richness)
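
These statistics can be approximated with a short stand-alone sketch (naive whitespace tokenization and punctuation-based sentence splitting, not the evaluator's exact method):

```python
def text_statistics(text: str) -> dict:
    """Token count, sentence count, and type-token ratio (simplified sketch)."""
    tokens = text.split()
    # Treat '.', '!', and '?' uniformly as sentence terminators.
    sentences = [s for s in text.replace("?", ".").replace("!", ".").split(".")
                 if s.strip()]
    ttr = len(set(t.lower() for t in tokens)) / max(len(tokens), 1)
    return {
        "token_count": len(tokens),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": ttr,
    }
```

The type-token ratio (distinct tokens divided by total tokens) drops toward 0 for repetitive text and approaches 1 when every word is used only once, which is why ~0.55 indicates reasonably varied vocabulary.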

Interpretation:

  • ROUGE-1 (0.5245): Good score indicating strong word-level coverage of reference content
  • ROUGE-2 (0.5178): Good score showing strong phrase-level overlap
  • ROUGE-L (0.5245): Good score reflecting strong structural similarity (perfect recall = 1.0)

The good ROUGE scores (>0.4) demonstrate that the generated introduction successfully captures and reproduces key content from the reference while maintaining high-quality academic writing with varied phrasing and structure.

To run this evaluation yourself:

uv run python scripts/test_rouge.py

Troubleshooting

Metal GPU Not Detected

If you see "Using CPU" instead of "Using Metal GPU acceleration":

# Check if MPS is available
uv run python -c "import torch; print(torch.backends.mps.is_available())"

If False, ensure you're using a compatible PyTorch version on Apple Silicon.
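
The MPS → CUDA → CPU fallback described in the Features section can be sketched as a small helper (an illustration; the real device selection lives in src/embeddings.py):

```python
def pick_device() -> str:
    """Pick the best available torch device: MPS, then CUDA, then CPU (sketch)."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"   # Metal GPU on Apple Silicon
        if torch.cuda.is_available():
            return "cuda"  # NVIDIA GPU
    except ImportError:
        pass               # torch not installed: fall through to CPU
    return "cpu"
```

If this returns "cpu" on an Apple Silicon machine with torch installed, the PyTorch build likely lacks MPS support.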

Import Errors

Make sure you're running commands with uv run:

uv run python main.py

ChromaDB Issues

If you encounter ChromaDB errors, try clearing the database:

rm -rf chroma_db/

Then re-index your papers.

Development

Setup Development Environment

Install development dependencies:

# Install with dev tools
uv sync --all-extras

# Or install just dev dependencies
uv add --dev pytest pytest-cov black ruff mypy bandit interrogate pre-commit

Running Tests

# Run all tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test file
uv run pytest tests/test_config.py -v

# Run tests in parallel
uv run pytest -n auto

Code Quality

Pre-commit Hooks

Set up pre-commit hooks to run checks automatically on every commit:

# Install pre-commit hooks
uv run pre-commit install

# Run hooks manually on all files
uv run pre-commit run --all-files

Pre-commit checks:

  • Black: Code formatting
  • Ruff: Linting (import sorting, naming, etc.)
  • MyPy: Static type checking
  • Bandit: Security vulnerability scanning
  • Interrogate: Docstring coverage

Manual Code Quality Checks

# Format code with Black
uv run black src/ tests/

# Check linting with Ruff
uv run ruff check src/ tests/ --fix

# Type checking
uv run mypy src/ --ignore-missing-imports

# Security checks
uv run bandit -r src/

# Docstring coverage
uv run interrogate src/ -v -I -i -n

CI/CD Pipeline

The project uses GitHub Actions for continuous integration:

Workflows:

  • tests.yml: Runs tests on Python 3.11-3.12, macOS/Ubuntu/Windows
  • code-quality.yml: Linting, type checking, security checks
  • release.yml: Automated releases and PyPI publishing

View workflow status: Actions

Adding Dependencies

uv add package-name

# Add dev dependency
uv add --dev package-name

Why No Docker?

This project intentionally does not use Docker for the following reasons:

  1. GPU Access Complexity: Metal GPU (Apple Silicon) and CUDA access is significantly more complex in containers
  2. Model Download Size: SPECTER2 and sentence-transformers models (~500MB) would need re-downloading on each container rebuild
  3. Development Workflow: Local development with uv is simpler and faster for research tools
  4. File Persistence: ChromaDB and output files work more reliably with native filesystem access
  5. Performance: Native execution provides better performance for ML workloads

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

Apache License 2.0 - see LICENSE file for details

Citation

If you use this tool in your research, please cite:

@software{scientific_literature_rag,
  title = {Scientific Literature RAG - Introduction Generator},
  author = {Jeremy Vachier},
  year = {2025},
  url = {https://github.com/jvachier/scientific-literature-rag}
}

Contact

Author: Jeremy Vachier

GitHub: @jvachier

For questions or issues, please use the GitHub issue tracker.
