
Scientific Literature RAG - Introduction Generator

Python 3.11+ License: Apache 2.0 Status: Production Ready RAG LLM GPU ROUGE Evaluation Dash UI ChromaDB SPECTER2

A production-ready AI-powered system for automatically generating well-structured, literature-informed introductions for scientific papers. Using Retrieval-Augmented Generation (RAG) with semantic search, this tool indexes a corpus of research papers and leverages language models to synthesize relevant literature into comprehensive introductions with properly formatted citations.

Supported LLM Providers: Claude (Anthropic), OpenAI (GPT-4/GPT-4o), Google Gemini, and more.

RAG Interface in Action


Watch a live demo of the Scientific Literature RAG in action

Features

  • Multi-Provider LLM Support: Use Claude, OpenAI, Google Gemini, or other compatible models
  • Scientific Paper Processing: Intelligent PDF extraction with semantic chunking (500-word chunks with 50-word overlap)
  • SPECTER2 Embeddings: Domain-optimized embeddings trained on scientific literature, achieving superior semantic understanding
  • GPU Acceleration: Metal GPU support for Apple Silicon, CUDA for NVIDIA, automatic CPU fallback
  • Vector Search: ChromaDB-powered semantic similarity search with cosine distance metric
  • Automatic Citations: Extracts and formats references in BibTeX with sophisticated metadata extraction
  • No External Database Required: Lightweight, persistent local vector storage
  • Graceful API Fallback: Works with or without API keys—generates prompts for manual use via web interface
  • Interactive & Auto Modes: CLI supports both interactive menu and automatic pipeline
  • Fully Configurable: Centralized TOML-based configuration for all parameters
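
The 500-word chunking with 50-word overlap described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the project's actual document_processor implementation:

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (simplified sketch)."""
    words = text.split()
    step = chunk_size - overlap  # advance 450 words per chunk by default
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means the last 50 words of each chunk reappear at the start of the next one, so a sentence that straddles a chunk boundary is still retrievable as a whole.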

System Architecture

┌─────────────────────────┐
│   PDF Papers            │
│   (data/papers/)        │
└────────┬────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Document Processor            │
│  - Extract text + metadata     │
│  - Semantic chunking (500w)    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Embedding Model               │
│  - SPECTER2 (scientific)       │
│  - GPU acceleration (Metal/    │
│    CUDA/CPU fallback)          │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Vector Database               │
│  - ChromaDB (persistent)       │
│  - Cosine similarity search    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Retriever                     │
│  - Top-k semantic search       │
│  - Metadata-aware filtering    │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  Prompt Builder                │
│  - Context assembly            │
│  - Literature formatting       │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  LLM Provider                  │
│  - Claude, OpenAI, Gemini      │
│  - Generate introduction       │
└────────┬───────────────────────┘
         │
         ▼
┌────────────────────────────────┐
│  BibTeX Extractor              │
│  - Metadata extraction         │
│  - Format citations            │
└────────────────────────────────┘
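
The cosine-distance search at the heart of the Vector Database stage reduces to a small amount of arithmetic. The sketch below illustrates the metric in plain Python; the real system delegates this to ChromaDB's indexed search:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as used for vector search: 1 - cos(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Return indices of the k nearest vectors by cosine distance."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]
```

Identical directions give a distance of 0, orthogonal vectors a distance of 1, so smaller is more similar.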

Installation

Prerequisites

  • Python 3.11+
  • macOS with Apple Silicon (for Metal GPU support) or any system with CUDA/CPU
  • uv package manager

Setup

  1. Clone the repository:
git clone https://github.com/jvachier/scientific-literature-rag.git
cd scientific-literature-rag
  2. Install dependencies with uv:
# uv will automatically create a virtual environment and install dependencies
uv sync
  3. Configure your LLM provider (choose one):

Option A: Anthropic Claude (Recommended)

export ANTHROPIC_API_KEY='your-api-key'
# Set in config.toml: provider = "anthropic"

Option B: OpenAI

export OPENAI_API_KEY='your-api-key'
# Set in config.toml: provider = "openai"

Option C: Google Gemini

export GOOGLE_API_KEY='your-api-key'
# Set in config.toml: provider = "google"

For permanent setup, add to your ~/.zshrc or ~/.bashrc:

# Choose your preferred provider
echo 'export ANTHROPIC_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export OPENAI_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export GOOGLE_API_KEY="your-key"' >> ~/.zshrc

source ~/.zshrc

Project Structure

scientific-literature-rag/
├── src/
│   ├── __init__.py
│   ├── config.py                   # Configuration loader
│   ├── document_processor.py       # PDF extraction & chunking
│   ├── embeddings.py               # SPECTER2 embeddings with Metal GPU
│   ├── retriever.py                # ChromaDB vector search
│   ├── prompt_builder.py           # Claude prompt templates
│   ├── bibtex_extractor.py         # Citation formatting
│   └── introduction_generator.py   # Main orchestrator
├── data/
│   └── papers/                     # Place your PDF papers here
├── output/
│   ├── introductions/              # Generated introductions
│   └── references/                 # BibTeX files
├── chroma_db/                      # Vector database (auto-created)
├── config.toml                     # Configuration file
├── main.py                         # CLI entry point
├── pyproject.toml                  # Project dependencies
├── USAGE.md                        # Detailed usage guide
└── README.md

Usage

Quick Start (Auto Pipeline Mode - Recommended)

The easiest way to use the system is with the automatic pipeline:

1. Add your research papers:

cp /path/to/your/papers/*.pdf data/papers/

2. Run the auto pipeline:

uv run python main.py --auto

This will:

  1. Automatically index all PDFs (or use existing index)
  2. Prompt you for research topic and context
  3. Generate introduction with proper BibTeX citations
  4. Save everything to output/ directory

Example session:

$ uv run python main.py --auto

╔═══════════════════════════════════════════════════════════╗
║   Scientific Literature RAG - AUTO PIPELINE MODE          ║
║   Automatic Indexing + Introduction Generation            ║
╚═══════════════════════════════════════════════════════════╝

Found 6 PDF files in data/papers/

============================================================
RESEARCH INFORMATION
============================================================

Research topic/title: Phase Transitions in Neural Networks

Research context (background, objectives, etc.):
This study investigates phase transitions in deep neural networks
during training, examining critical points and their implications
for optimization strategies.
[Press Enter twice]

Output filename (without extension) [introduction]: neural_phase_transitions

============================================================
GENERATING INTRODUCTION
============================================================
Retrieving 15 relevant chunks...
Retrieved 15 chunks from literature
Calling Claude API (claude-3-5-sonnet-20241022)...
Introduction generated successfully
Generating BibTeX references...

============================================================
PIPELINE COMPLETE!
============================================================
✓ Introduction: output/introductions/neural_phase_transitions.txt
✓ BibTeX file: output/references/neural_phase_transitions.bib
============================================================

Advanced: Interactive Menu Mode

For more control, use the interactive menu:

uv run python main.py --interactive

This provides a menu with options to:

  1. Index papers (first time or to update)
  2. Generate introduction
  3. View collection stats
  4. Exit

Generated Output

Files are saved in the output/ directory:

  • output/introductions/<name>.txt - Generated introduction with citations
  • output/references/<name>.bib - LaTeX-ready BibTeX file

Using BibTeX Citations in LaTeX

The generated .bib file is ready to use directly in your LaTeX documents:

\documentclass{article}
\bibliographystyle{plain}

\begin{document}

\section{Introduction}
Phase transitions in neural networks have been extensively studied
\cite{Ziyin2022, Halverson2021}...

\bibliography{references/neural_phase_transitions}
\end{document}

Example Output

Files are saved in the output/ directory after running the pipeline:

Generated Introduction

File: output/introductions/introduction_example.md

Active matter comprises systems of self-propelled particles that continuously consume
energy to generate directed motion, representing a frontier in soft matter physics and
nonequilibrium statistical mechanics. This fundamental departure from passive systems
in thermal equilibrium creates rich dynamical phenomena and novel collective behaviors
inaccessible to quiescent materials. The defining characteristic of active matter lies
in the intrinsic capacity of particles to convert chemical, thermal, or other energy
sources into mechanical work, driving the system far from equilibrium and enabling
self-organization, directed transport, and complex emergent dynamics.

Active particles span biological and synthetic scales with diverse mechanisms of
propulsion. In biological contexts, active particles frequently compete with multiple
environmental forces simultaneously. For instance, microorganisms embedded in ice
develop sophisticated strategies including the secretion of exopolymeric substances
(EPS) and antifreeze glycoproteins (AFP) that enhance interfacial liquidity, thereby
modifying their interaction with their frozen environment [Wettlaufer]. Understanding
how such bioparticles respond to environmental forcing—including temperature and
chemical gradients—proves essential for applications spanning astrobiology,
paleoclimatology, and materials science [VachierPhysicalReview2021].

[continues with well-structured, cited introduction]

BibTeX References

File: output/references/test_test.bib

The BibTeX extractor generates complete, LaTeX-ready citations with:

  • Proper author formatting
  • Title, year, journal, volume, pages
  • DOI when available
  • Clean citation keys (AuthorYYYY format)

@misc{CalvinKLee2020,
 author = {Lee, C. and others},
 note = {Source: lee-et-al-2020-social-cooperativity-of-bacteria-during-reversible-surface-attachment-in-young-biofilms-a-quantitative.pdf},
 title = {Social Cooperativity of Bacteria during Reversible Surface Attachment in Young Biofilms: a Quantitative Comparison of Pseudomonas aeruginosa PA14 and PAO1},
 year = {2020}
}

@misc{FragkopoulosAA2021,
 author = {Fragkopoulos, A. and others},
 doi = {10.1098/rsif.2021.0553},
 note = {Source: fragkopoulos-et-al-2021-self-generated-oxygen-gradients-control-collective-aggregation-of-photosynthetic-microbes.pdf},
 title = {Self-generated oxygen gradients control collective aggregation of photosynthetic microbes},
 year = {2021}
}

@article{VachierPhysicalReview2021,
 author = {Vachier, J.},
 journal = {Physical Review E},
 note = {Source: PhysRevE.105.024601.pdf},
 pages = {024601},
 title = {Premelting controlled active matter in ice},
 volume = {105},
 year = {2021}
}

These entries work directly in LaTeX with \cite{VachierPhysicalReview2021} commands!
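
For illustration, the AuthorYYYY key format mentioned above can be sketched as a small helper. This is a hypothetical function, not the project's bibtex_extractor:

```python
import re

def citation_key(author_last: str, year: int) -> str:
    """Build an AuthorYYYY citation key, stripping non-alphanumerics (sketch)."""
    clean = re.sub(r"[^A-Za-z0-9]", "", author_last)  # drop apostrophes, hyphens, spaces
    return f"{clean.capitalize()}{year}"
```

For example, citation_key("Vachier", 2021) yields a key usable directly in a \cite command.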

Configuration

All system settings are centralized in config.toml:

[paths]
data_dir = "./data/papers"
chroma_dir = "./chroma_db"
output_dir = "./output"

[processing]
chunk_size = 500          # Words per chunk
chunk_overlap = 50        # Overlap between chunks

[embedding]
model_name = "allenai/specter2_base"
use_mps = true            # Use Metal GPU on Apple Silicon
batch_size = 32

[retrieval]
collection_name = "scientific_papers"
n_chunks = 10             # Number of chunks to retrieve
similarity_metric = "cosine"

[generation]
provider = "anthropic"    # Options: anthropic, openai, google
model = "claude-3-5-sonnet-20241022"
max_tokens = 2000
temperature = 0.7

[logging]
level = "INFO"
format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

Common Customizations

Change chunk size for different document types:

[processing]
chunk_size = 1000  # Larger chunks for review articles
chunk_overlap = 100

Use different embedding model:

[embedding]
model_name = "allenai/specter2_aug2020"  # Alternative SPECTER version
use_mps = false  # Disable GPU on non-Apple systems

Adjust retrieval for more context:

[retrieval]
n_chunks = 15  # Retrieve more chunks for complex topics

Use different LLM models:

# Claude (Anthropic)
[generation]
provider = "anthropic"
model = "claude-3-5-sonnet-20241022"  # Best quality
# model = "claude-3-haiku-20240307"   # Faster, cheaper

# OpenAI
[generation]
provider = "openai"
model = "gpt-4-turbo"                 # High quality
# model = "gpt-4o"                     # Latest, efficient
# model = "gpt-3.5-turbo"              # Cost-effective

# Google Gemini
[generation]
provider = "google"
model = "gemini-pro"
# model = "gemini-1.5-pro"             # Latest version

Customize prompts for your domain:

[generation.prompts]
system_prompt = """You are an expert in theoretical physics and quantum mechanics..."""

introduction_template = """Write an introduction focusing on mathematical rigor..."""

See config.toml for full prompt templates that can be customized.

Performance

Embedding & Search (Apple M1/M2/M3 with Metal GPU):

  • Embedding generation: 50-100 chunks/second
  • Vector search: <100ms for 10 results
  • Total indexing: ~1 minute per 100 PDFs

Introduction Generation:

  • Claude API: 10-30 seconds per introduction
  • OpenAI GPT-4: 15-40 seconds per introduction
  • Google Gemini: 10-25 seconds per introduction
  • Times vary based on API latency and model selection

Typical API Costs (per introduction):

  • Claude 3.5 Sonnet: $0.05-0.10
  • OpenAI GPT-4-Turbo: $0.08-0.15
  • Google Gemini: $0.02-0.05

RAG Evaluation with ROUGE Scores

The system includes a comprehensive evaluation module for assessing the quality of generated introductions using ROUGE metrics.

ROUGE Metrics Overview

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures text similarity between generated and reference texts:

  • ROUGE-1: Unigram (single word) overlap between texts
  • ROUGE-2: Bigram (two-word sequence) overlap
  • ROUGE-L: Longest Common Subsequence similarity

Each metric reports:

  • Precision: fraction of the generated text's n-grams that also appear in the reference
  • Recall: fraction of the reference's n-grams that appear in the generated text
  • F1: Harmonic mean of precision and recall
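
A simplified, whitespace-tokenized version of the ROUGE-1 quantities can be computed as below. Real implementations (such as the rouge-score package) add proper tokenization and stemming; this sketch only illustrates the arithmetic:

```python
from collections import Counter

def rouge1(reference: str, generated: str) -> tuple[float, float, float]:
    """Unigram-overlap ROUGE-1 precision, recall, and F1 (simplified sketch)."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((ref & gen).values())  # clipped unigram matches
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note how a verbose generation with perfect recall can still score a modest F1 when precision is low, which is exactly the pattern in the example results further below.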

Using the Evaluator

from src.rag_evaluator import RAGEvaluator

# Initialize evaluator
evaluator = RAGEvaluator()

# Evaluate a single generation
reference = "Your reference introduction text..."
generated = "Generated introduction from the RAG system..."

results = evaluator.evaluate_generation(reference, generated)

# Access ROUGE scores
print(f"ROUGE-1 F1: {results['rouge_scores']['rouge1']['f1']}")
print(f"ROUGE-2 F1: {results['rouge_scores']['rouge2']['f1']}")
print(f"ROUGE-L F1: {results['rouge_scores']['rougeL']['f1']}")

# Access text statistics
print(f"Token count: {results['text_statistics']['token_count']}")
print(f"Type-Token Ratio: {results['text_statistics']['type_token_ratio']}")

Batch Evaluation

Evaluate multiple introductions against reference texts:

references = ["ref_intro_1", "ref_intro_2", "ref_intro_3"]
generated = ["gen_intro_1", "gen_intro_2", "gen_intro_3"]

batch_results = evaluator.batch_evaluate(references, generated)

print(f"Average ROUGE-1 F1: {batch_results['rouge_averages']['rouge1_f1_avg']}")
print(f"Average token count: {batch_results['text_statistics_averages']['avg_token_count']}")

Evaluation with Retrieved Documents

Include retrieved documents in evaluation to assess retrieval quality:

retrieved_docs = ["chunk_1", "chunk_2", "chunk_3"]

results = evaluator.evaluate_generation(
    reference=reference,
    generated=generated,
    retrieved_docs=retrieved_docs
)

# Evaluate how well retrieved docs match the reference
print(f"Avg retrieval relevance: {results['retrieval_metrics']['average_relevance_score']}")

Metrics Interpretation

Good ROUGE Scores (>0.4):

  • Indicates substantial semantic overlap
  • Generated text captures main points from reference
  • Suitable for automatic evaluation

Moderate ROUGE Scores (0.2-0.4):

  • Some semantic overlap
  • Paraphrasing or different phrasing reduces scores
  • May still be acceptable depending on use case

Low ROUGE Scores (<0.2):

  • Minimal overlap with reference
  • May indicate retrieval or generation issues
  • Review retrieved documents and prompt engineering

Example Evaluation Results

Evaluation of the example introduction "Introduction to Active Matter" (from output/introductions/introduction_example.md) against reference content from the first three paragraphs:

Metric     F1 Score   Precision   Recall
ROUGE-1    0.5245     0.3555      1.0000
ROUGE-2    0.5178     0.3507      0.9890
ROUGE-L    0.5245     0.3555      1.0000

Generation Statistics:

  • Token Count: 758 tokens
  • Sentence Count: 33 sentences
  • Average Sentence Length: 22.97 tokens
  • Type-Token Ratio: 0.5554 (good vocabulary richness)
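
These statistics can be approximated with a short stand-alone sketch (naive whitespace tokenization and punctuation-based sentence splitting, not the evaluator's exact method):

```python
def text_statistics(text: str) -> dict:
    """Token count, sentence count, and type-token ratio (simplified sketch)."""
    tokens = text.split()
    # Treat '.', '!', and '?' uniformly as sentence terminators.
    sentences = [s for s in text.replace("?", ".").replace("!", ".").split(".")
                 if s.strip()]
    ttr = len(set(t.lower() for t in tokens)) / max(len(tokens), 1)
    return {
        "token_count": len(tokens),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(tokens) / max(len(sentences), 1),
        "type_token_ratio": ttr,
    }
```

The type-token ratio (distinct tokens divided by total tokens) drops toward 0 for repetitive text and approaches 1 when every word is used only once, which is why ~0.55 indicates reasonably varied vocabulary.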

Interpretation:

  • ROUGE-1 (0.5245): Good score indicating strong word-level coverage of reference content
  • ROUGE-2 (0.5178): Good score showing strong phrase-level overlap
  • ROUGE-L (0.5245): Good score reflecting strong structural similarity (perfect recall = 1.0)

The good ROUGE scores (>0.4) demonstrate that the generated introduction successfully captures and reproduces key content from the reference while maintaining high-quality academic writing with varied phrasing and structure.

To run this evaluation yourself:

uv run python scripts/test_rouge.py

Troubleshooting

Metal GPU Not Detected

If you see "Using CPU" instead of "Using Metal GPU acceleration":

# Check if MPS is available
uv run python -c "import torch; print(torch.backends.mps.is_available())"

If False, ensure you're using a compatible PyTorch version on Apple Silicon.
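
The MPS → CUDA → CPU fallback described in the Features section can be sketched as a small helper (an illustration; the real device selection lives in src/embeddings.py):

```python
def pick_device() -> str:
    """Pick the best available torch device: MPS, then CUDA, then CPU (sketch)."""
    try:
        import torch
        if torch.backends.mps.is_available():
            return "mps"   # Metal GPU on Apple Silicon
        if torch.cuda.is_available():
            return "cuda"  # NVIDIA GPU
    except ImportError:
        pass               # torch not installed: fall through to CPU
    return "cpu"
```

If this returns "cpu" on an Apple Silicon machine with torch installed, the PyTorch build likely lacks MPS support.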

Import Errors

Make sure you're running commands with uv run:

uv run python main.py

ChromaDB Issues

If you encounter ChromaDB errors, try clearing the database:

rm -rf chroma_db/

Then re-index your papers.

Development

Setup Development Environment

Install development dependencies:

# Install with dev tools
uv sync --all-extras

# Or install just dev dependencies
uv add --dev pytest pytest-cov black ruff mypy bandit interrogate pre-commit

Running Tests

# Run all tests
uv run pytest

# Run tests with coverage
uv run pytest --cov=src --cov-report=html

# Run specific test file
uv run pytest tests/test_config.py -v

# Run tests in parallel
uv run pytest -n auto

Code Quality

Pre-commit Hooks

Set up pre-commit hooks to run checks automatically on every commit:

# Install pre-commit hooks
uv run pre-commit install

# Run hooks manually on all files
uv run pre-commit run --all-files

Pre-commit checks:

  • Black: Code formatting
  • Ruff: Linting (import sorting, naming, etc.)
  • MyPy: Static type checking
  • Bandit: Security vulnerability scanning
  • Interrogate: Docstring coverage

Manual Code Quality Checks

# Format code with Black
uv run black src/ tests/

# Check linting with Ruff
uv run ruff check src/ tests/ --fix

# Type checking
uv run mypy src/ --ignore-missing-imports

# Security checks
uv run bandit -r src/

# Docstring coverage
uv run interrogate src/ -v -I -i -n

CI/CD Pipeline

The project uses GitHub Actions for continuous integration:

Workflows:

  • tests.yml: Runs tests on Python 3.11-3.12, macOS/Ubuntu/Windows
  • code-quality.yml: Linting, type checking, security checks
  • release.yml: Automated releases and PyPI publishing

View workflow status: Actions

Adding Dependencies

uv add package-name

# Add dev dependency
uv add --dev package-name

Why No Docker?

This project intentionally does not use Docker for the following reasons:

  1. GPU Access Complexity: Metal GPU (Apple Silicon) and CUDA access is significantly more complex in containers
  2. Model Download Size: SPECTER2 and sentence-transformers models (~500MB) would need re-downloading on each container rebuild
  3. Development Workflow: Local development with uv is simpler and faster for research tools
  4. File Persistence: ChromaDB and output files work more reliably with native filesystem access
  5. Performance: Native execution provides better performance for ML workloads

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

Apache License 2.0 - see LICENSE file for details

Citation

If you use this tool in your research, please cite:

@software{scientific_literature_rag,
  title = {Scientific Literature RAG - Introduction Generator},
  author = {Jeremy Vachier},
  year = {2025},
  url = {https://github.com/jvachier/scientific-literature-rag}
}

Contact

Author: Jeremy Vachier

GitHub: @jvachier

For questions or issues, please use the GitHub issue tracker.
