A production-ready AI-powered system for automatically generating well-structured, literature-informed introductions for scientific papers. Using Retrieval-Augmented Generation (RAG) with semantic search, this tool indexes a corpus of research papers and leverages language models to synthesize relevant literature into comprehensive introductions with properly formatted citations.
Supported LLM Providers: Claude (Anthropic), OpenAI (GPT-4/GPT-4o), Google Gemini, and more.
Scientific.Literature.RAG.mov
Watch a live demo of the Scientific Literature RAG in action
- Multi-Provider LLM Support: Use Claude, OpenAI, Google Gemini, or other compatible models
- Scientific Paper Processing: Intelligent PDF extraction with semantic chunking (500-word chunks with 50-word overlap)
- SPECTER2 Embeddings: Domain-optimized embeddings trained on scientific literature, giving stronger semantic matching on research text than general-purpose embedding models
- GPU Acceleration: Metal GPU support for Apple Silicon, CUDA for NVIDIA, automatic CPU fallback
- Vector Search: ChromaDB-powered semantic similarity search with cosine distance metric
- Automatic Citations: Extracts and formats references in BibTeX with sophisticated metadata extraction
- No External Database Required: Lightweight, persistent local vector storage
- Graceful API Fallback: Works with or without API keys—generates prompts for manual use via web interface
- Interactive & Auto Modes: CLI supports both interactive menu and automatic pipeline
- Fully Configurable: Centralized TOML-based configuration for all parameters
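The chunking scheme mentioned above (500-word chunks, 50-word overlap) can be sketched as a simple sliding window; this is an illustrative sketch, and the actual `document_processor.py` implementation may differ in detail:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks, each sharing `overlap` words with its neighbor."""
    words = text.split()
    step = chunk_size - overlap  # advance 450 words per chunk by default
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last window reached the end of the document
    return chunks
```

The overlap ensures that a sentence falling on a chunk boundary is fully contained in at least one chunk, which helps retrieval quality.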
┌─────────────────────────┐
│ PDF Papers │
│ (data/papers/) │
└────────┬────────────────┘
│
▼
┌────────────────────────────────┐
│ Document Processor │
│ - Extract text + metadata │
│ - Semantic chunking (500w) │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ Embedding Model │
│ - SPECTER2 (scientific) │
│ - GPU acceleration (Metal/ │
│ CUDA/CPU fallback) │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ Vector Database │
│ - ChromaDB (persistent) │
│ - Cosine similarity search │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ Retriever │
│ - Top-k semantic search │
│ - Metadata-aware filtering │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ Prompt Builder │
│ - Context assembly │
│ - Literature formatting │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ LLM Provider │
│ - Claude, OpenAI, Gemini │
│ - Generate introduction │
└────────┬───────────────────────┘
│
▼
┌────────────────────────────────┐
│ BibTeX Extractor │
│ - Metadata extraction │
│ - Format citations │
└────────────────────────────────┘
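The Vector Database stage ranks chunks by cosine similarity. The real system delegates this to ChromaDB, but the metric itself is easy to illustrate (a minimal sketch, not the project's code):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """dot(a, b) / (|a| * |b|); ChromaDB's cosine *distance* is 1 minus this value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 3) -> list[str]:
    """Return the ids of the k chunks most similar to the query embedding."""
    ranked = sorted(corpus, key=lambda cid: cosine_similarity(query, corpus[cid]), reverse=True)
    return ranked[:k]
```

In the actual pipeline the vectors are 768-dimensional SPECTER2 embeddings and the ranking happens inside ChromaDB's indexed search rather than a full sort.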
- Python 3.11+
- macOS with Apple Silicon (for Metal GPU support) or any system with CUDA/CPU
- uv package manager
- Clone the repository:
git clone https://github.com/jvachier/scientific-literature-rag.git
cd scientific-literature-rag
- Install dependencies with uv:
# uv will automatically create a virtual environment and install dependencies
uv sync
- Configure your LLM provider (choose one):
Option A: Anthropic Claude (Recommended)
export ANTHROPIC_API_KEY='your-api-key'
# Set in config.toml: provider = "anthropic"
Option B: OpenAI
export OPENAI_API_KEY='your-api-key'
# Set in config.toml: provider = "openai"
Option C: Google Gemini
export GOOGLE_API_KEY='your-api-key'
# Set in config.toml: provider = "google"
For permanent setup, add to your ~/.zshrc or ~/.bashrc:
# Choose your preferred provider
echo 'export ANTHROPIC_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export OPENAI_API_KEY="your-key"' >> ~/.zshrc
# OR
echo 'export GOOGLE_API_KEY="your-key"' >> ~/.zshrc
source ~/.zshrc
scientific-literature-rag/
├── src/
│ ├── __init__.py
│ ├── config.py # Configuration loader
│ ├── document_processor.py # PDF extraction & chunking
│ ├── embeddings.py # SPECTER2 embeddings with Metal GPU
│ ├── retriever.py # ChromaDB vector search
│ ├── prompt_builder.py # Claude prompt templates
│ ├── bibtex_extractor.py # Citation formatting
│ └── introduction_generator.py # Main orchestrator
├── data/
│ └── papers/ # Place your PDF papers here
├── output/
│ ├── introductions/ # Generated introductions
│ └── references/ # BibTeX files
├── chroma_db/ # Vector database (auto-created)
├── config.toml # Configuration file
├── main.py # CLI entry point
├── pyproject.toml # Project dependencies
├── USAGE.md # Detailed usage guide
└── README.md
The easiest way to use the system is with the automatic pipeline:
1. Add your research papers:
cp /path/to/your/papers/*.pdf data/papers/
2. Run the auto pipeline:
uv run python main.py --auto
This will:
- Automatically index all PDFs (or use existing index)
- Prompt you for research topic and context
- Generate introduction with proper BibTeX citations
- Save everything to the output/ directory
Example session:
$ uv run python main.py --auto
╔═══════════════════════════════════════════════════════════╗
║ Scientific Literature RAG - AUTO PIPELINE MODE ║
║ Automatic Indexing + Introduction Generation ║
╚═══════════════════════════════════════════════════════════╝
Found 6 PDF files in data/papers/
============================================================
RESEARCH INFORMATION
============================================================
Research topic/title: Phase Transitions in Neural Networks
Research context (background, objectives, etc.):
This study investigates phase transitions in deep neural networks
during training, examining critical points and their implications
for optimization strategies.
[Press Enter twice]
Output filename (without extension) [introduction]: neural_phase_transitions
============================================================
GENERATING INTRODUCTION
============================================================
Retrieving 15 relevant chunks...
Retrieved 15 chunks from literature
Calling Claude API (claude-3-5-sonnet-20241022)...
Introduction generated successfully
Generating BibTeX references...
============================================================
PIPELINE COMPLETE!
============================================================
✓ Introduction: output/introductions/neural_phase_transitions.txt
✓ BibTeX file: output/references/neural_phase_transitions.bib
============================================================
For more control, use the interactive menu:
uv run python main.py --interactive
This provides a menu with options to:
- Index papers (first time or to update)
- Generate introduction
- View collection stats
- Exit
Files are saved in the output/ directory:
- output/introductions/<name>.txt - Generated introduction with citations
- output/references/<name>.bib - LaTeX-ready BibTeX file
The generated .bib file is ready to use directly in your LaTeX documents:
\documentclass{article}
\bibliographystyle{plain}
\begin{document}
\section{Introduction}
Phase transitions in neural networks have been extensively studied
\cite{Ziyin2022, Halverson2021}...
\bibliography{references/neural_phase_transitions}
\end{document}
Example output after running the pipeline:
File: output/introductions/introduction_example.md
Active matter comprises systems of self-propelled particles that continuously consume
energy to generate directed motion, representing a frontier in soft matter physics and
nonequilibrium statistical mechanics. This fundamental departure from passive systems
in thermal equilibrium creates rich dynamical phenomena and novel collective behaviors
inaccessible to quiescent materials. The defining characteristic of active matter lies
in the intrinsic capacity of particles to convert chemical, thermal, or other energy
sources into mechanical work, driving the system far from equilibrium and enabling
self-organization, directed transport, and complex emergent dynamics.
Active particles span biological and synthetic scales with diverse mechanisms of
propulsion. In biological contexts, active particles frequently compete with multiple
environmental forces simultaneously. For instance, microorganisms embedded in ice
develop sophisticated strategies including the secretion of exopolymeric substances
(EPS) and antifreeze glycoproteins (AFP) that enhance interfacial liquidity, thereby
modifying their interaction with their frozen environment [Wettlaufer]. Understanding
how such bioparticles respond to environmental forcing—including temperature and
chemical gradients—proves essential for applications spanning astrobiology,
paleoclimatology, and materials science [VachierPhysicalReview2021].
[continues with well-structured, cited introduction]
File: output/references/test_test.bib
The BibTeX extractor generates complete, LaTeX-ready citations with:
- Proper author formatting
- Title, year, journal, volume, pages
- DOI when available
- Clean citation keys (AuthorYYYY format)
@misc{CalvinKLee2020,
author = {Lee, C. and others},
note = {Source: lee-et-al-2020-social-cooperativity-of-bacteria-during-reversible-surface-attachment-in-young-biofilms-a-quantitative.pdf},
title = {Social Cooperativity of Bacteria during Reversible Surface Attachment in Young Biofilms: a Quantitative Comparison of Pseudomonas aeruginosa PA14 and PAO1},
year = {2020}
}
@misc{FragkopoulosAA2021,
author = {Fragkopoulos, A. and others},
doi = {10.1098/rsif.2021.0553},
note = {Source: fragkopoulos-et-al-2021-self-generated-oxygen-gradients-control-collective-aggregation-of-photosynthetic-microbes.pdf},
title = {Self-generated oxygen gradients control collective aggregation of photosynthetic microbes},
year = {2021}
}
@article{VachierPhysicalReview2021,
author = {Vachier, J.},
journal = {Physical Review E},
note = {Source: PhysRevE.105.024601.pdf},
pages = {024601},
title = {Premelting controlled active matter in ice},
volume = {105},
year = {2021}
}
These entries work directly in LaTeX with \cite{VachierPhysicalReview2021} commands!
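Generating an AuthorYYYY-style key from paper metadata can be sketched as follows; this is a simplified, hypothetical helper, and the project's `bibtex_extractor.py` is more elaborate (it also handles multi-author names and source-file notes):

```python
import re

def citation_key(first_author_last_name: str, year: int) -> str:
    """Build an AuthorYYYY citation key, stripping non-alphanumeric characters."""
    clean = re.sub(r"[^A-Za-z0-9]", "", first_author_last_name)
    return f"{clean}{year}"
```

For example, `citation_key("Vachier", 2021)` yields a key usable directly in `\cite{...}`.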
All system settings are centralized in config.toml:
[paths]
data_dir = "./data/papers"
chroma_dir = "./chroma_db"
output_dir = "./output"
[processing]
chunk_size = 500 # Words per chunk
chunk_overlap = 50 # Overlap between chunks
[embedding]
model_name = "allenai/specter2_base"
use_mps = true # Use Metal GPU on Apple Silicon
batch_size = 32
[retrieval]
collection_name = "scientific_papers"
n_chunks = 10 # Number of chunks to retrieve
similarity_metric = "cosine"
[generation]
provider = "anthropic" # Options: anthropic, openai, google
model = "claude-3-5-sonnet-20241022"
max_tokens = 2000
temperature = 0.7
[logging]
level = "INFO"
format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
Change chunk size for different document types:
[processing]
chunk_size = 1000 # Larger chunks for review articles
chunk_overlap = 100
Use a different embedding model:
[embedding]
model_name = "allenai/specter2_aug2020" # Alternative SPECTER version
use_mps = false # Disable GPU on non-Apple systems
Adjust retrieval for more context:
[retrieval]
n_chunks = 15 # Retrieve more chunks for complex topics
Use different LLM models:
# Claude (Anthropic)
[generation]
provider = "anthropic"
model = "claude-3-5-sonnet-20241022" # Best quality
# model = "claude-3-haiku-20240307" # Faster, cheaper
# OpenAI
[generation]
provider = "openai"
model = "gpt-4-turbo" # High quality
# model = "gpt-4o" # Latest, efficient
# model = "gpt-3.5-turbo" # Cost-effective
# Google Gemini
[generation]
provider = "google"
model = "gemini-pro"
# model = "gemini-1.5-pro" # Latest version
Customize prompts for your domain:
[generation.prompts]
system_prompt = """You are an expert in theoretical physics and quantum mechanics..."""
introduction_template = """Write an introduction focusing on mathematical rigor..."""
See config.toml for the full prompt templates that can be customized.
Embedding & Search (Apple M1/M2/M3 with Metal GPU):
- Embedding generation: 50-100 chunks/second
- Vector search: <100ms for 10 results
- Total indexing: ~1 minute per 100 PDFs
Introduction Generation:
- Claude API: 10-30 seconds per introduction
- OpenAI GPT-4: 15-40 seconds per introduction
- Google Gemini: 10-25 seconds per introduction
- Times vary based on API latency and model selection
Typical API Costs (per introduction):
- Claude 3.5 Sonnet: $0.05-0.10
- OpenAI GPT-4-Turbo: $0.08-0.15
- Google Gemini: $0.02-0.05
The system includes a comprehensive evaluation module for assessing the quality of generated introductions using ROUGE metrics.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures text similarity between generated and reference texts:
- ROUGE-1: Unigram (single word) overlap between texts
- ROUGE-2: Bigram (two-word sequence) overlap
- ROUGE-L: Longest Common Subsequence similarity
Each metric reports:
- Precision: Fraction of words in the generated text that also appear in the reference
- Recall: Fraction of reference words that appear in the generated text
- F1: Harmonic mean of precision and recall
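These scores come from n-gram counting; ROUGE-1 can be illustrated with a minimal sketch (the evaluator uses a proper ROUGE implementation that also handles stemming and tokenization, so exact numbers will differ):

```python
from collections import Counter

def rouge1(reference: str, generated: str) -> dict[str, float]:
    """ROUGE-1: clipped unigram overlap between reference and generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    overlap = sum((ref_counts & gen_counts).values())  # per-word minimum of the two counts
    precision = overlap / max(sum(gen_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

For example, comparing "the cat sat" against "the cat ran" gives an overlap of 2 words out of 3 on each side, so precision, recall, and F1 are all 2/3.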
from src.rag_evaluator import RAGEvaluator
# Initialize evaluator
evaluator = RAGEvaluator()
# Evaluate a single generation
reference = "Your reference introduction text..."
generated = "Generated introduction from the RAG system..."
results = evaluator.evaluate_generation(reference, generated)
# Access ROUGE scores
print(f"ROUGE-1 F1: {results['rouge_scores']['rouge1']['f1']}")
print(f"ROUGE-2 F1: {results['rouge_scores']['rouge2']['f1']}")
print(f"ROUGE-L F1: {results['rouge_scores']['rougeL']['f1']}")
# Access text statistics
print(f"Token count: {results['text_statistics']['token_count']}")
print(f"Type-Token Ratio: {results['text_statistics']['type_token_ratio']}")
Evaluate multiple introductions against reference texts:
references = ["ref_intro_1", "ref_intro_2", "ref_intro_3"]
generated = ["gen_intro_1", "gen_intro_2", "gen_intro_3"]
batch_results = evaluator.batch_evaluate(references, generated)
print(f"Average ROUGE-1 F1: {batch_results['rouge_averages']['rouge1_f1_avg']}")
print(f"Average token count: {batch_results['text_statistics_averages']['avg_token_count']}")
Include retrieved documents in evaluation to assess retrieval quality:
retrieved_docs = ["chunk_1", "chunk_2", "chunk_3"]
results = evaluator.evaluate_generation(
reference=reference,
generated=generated,
retrieved_docs=retrieved_docs
)
# Evaluate how well retrieved docs match the reference
print(f"Avg retrieval relevance: {results['retrieval_metrics']['average_relevance_score']}")
Good ROUGE Scores (>0.4):
- Indicates substantial semantic overlap
- Generated text captures main points from reference
- Suitable for automatic evaluation
Moderate ROUGE Scores (0.2-0.4):
- Some semantic overlap
- Paraphrasing or different phrasing reduces scores
- May still be acceptable depending on use case
Low ROUGE Scores (<0.2):
- Minimal overlap with reference
- May indicate retrieval or generation issues
- Review retrieved documents and prompt engineering
Evaluation of the example introduction "Introduction to Active Matter" (from output/introductions/introduction_example.md) against reference content from the first three paragraphs:
| Metric | F1 Score | Precision | Recall |
|---|---|---|---|
| ROUGE-1 | 0.5245 | 0.3555 | 1.0000 |
| ROUGE-2 | 0.5178 | 0.3507 | 0.9890 |
| ROUGE-L | 0.5245 | 0.3555 | 1.0000 |
Generation Statistics:
- Token Count: 758 tokens
- Sentence Count: 33 sentences
- Average Sentence Length: 22.97 tokens
- Type-Token Ratio: 0.5554 (good vocabulary richness)
Interpretation:
- ROUGE-1 (0.5245): Good score indicating strong word-level coverage of reference content
- ROUGE-2 (0.5178): Good score showing strong phrase-level overlap
- ROUGE-L (0.5245): Good score reflecting strong structural similarity (perfect recall = 1.0)
The good ROUGE scores (>0.4) demonstrate that the generated introduction successfully captures and reproduces key content from the reference while maintaining high-quality academic writing with varied phrasing and structure.
To run this evaluation yourself:
uv run python scripts/test_rouge.py
If you see "Using CPU" instead of "Using Metal GPU acceleration":
# Check if MPS is available
uv run python -c "import torch; print(torch.backends.mps.is_available())"
If this prints False, ensure you are using an MPS-enabled PyTorch build on Apple Silicon.
Make sure you're running commands with uv run:
uv run python main.py
If you encounter ChromaDB errors, try clearing the database:
rm -rf chroma_db/
Then re-index your papers.
Install development dependencies:
# Install with dev tools
uv sync --all-extras
# Or install just dev dependencies
uv add --dev pytest pytest-cov black ruff mypy bandit interrogate pre-commit
# Run all tests
uv run pytest
# Run tests with coverage
uv run pytest --cov=src --cov-report=html
# Run specific test file
uv run pytest tests/test_config.py -v
# Run tests in parallel
uv run pytest -n auto
Set up pre-commit hooks to run checks automatically on every commit:
# Install pre-commit hooks
uv run pre-commit install
# Run hooks manually on all files
uv run pre-commit run --all-files
Pre-commit checks:
- Black: Code formatting
- Ruff: Linting (import sorting, naming, etc.)
- MyPy: Static type checking
- Bandit: Security vulnerability scanning
- Interrogate: Docstring coverage
# Format code with Black
uv run black src/ tests/
# Check linting with Ruff
uv run ruff check src/ tests/ --fix
# Type checking
uv run mypy src/ --ignore-missing-imports
# Security checks
uv run bandit -r src/
# Docstring coverage
uv run interrogate src/ -v -I -i -n
The project uses GitHub Actions for continuous integration:
Workflows:
- tests.yml: Runs tests on Python 3.11-3.12, macOS/Ubuntu/Windows
- code-quality.yml: Linting, type checking, security checks
- release.yml: Automated releases and PyPI publishing
View workflow status: Actions
uv add package-name
# Add dev dependency
uv add --dev package-name
This project intentionally does not use Docker, for the following reasons:
- GPU Access Complexity: Metal GPU (Apple Silicon) and CUDA access is significantly more complex in containers
- Model Download Size: SPECTER2 and sentence-transformers models (~500MB) would need re-downloading on each container rebuild
- Development Workflow: Local development with uv is simpler and faster for research tools
- File Persistence: ChromaDB and output files work more reliably with native filesystem access
- Performance: Native execution provides better performance for ML workloads
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Apache License 2.0 - see LICENSE file for details
If you use this tool in your research, please cite:
@software{scientific_literature_rag,
title = {Scientific Literature RAG - Introduction Generator},
author = {Jeremy Vachier},
year = {2025},
url = {https://github.com/jvachier/scientific-literature-rag}
}
Author: Jeremy Vachier
GitHub: @jvachier
For questions or issues, please use the GitHub issue tracker.