A reproducible pipeline for evaluating how language models acquire knowledge beyond their training data cutoff, comparing fine-tuning and retrieval-augmented generation (RAG) approaches.
Experiment Status: COMPLETED — This pipeline was tested on Qwen 2.5 0.5B (a small language model). Results and methodology are documented below. The pipeline is model-agnostic and can be adapted for larger models.
This project provides an end-to-end framework for investigating how LLMs can acquire new scientific knowledge published after their training cutoff. We compare two adaptation strategies:
| Strategy | Approach | Pros | Cons |
|---|---|---|---|
| Fine-tuning | Update model weights with new data | Internalizes knowledge | Expensive, may forget |
| RAG | Retrieve relevant context at inference | No retraining needed | Depends on retrieval quality |
The benchmark uses 2025 arXiv papers (post-cutoff for current open-weight models) as the knowledge source.
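As a minimal sketch of the two inference modes being compared, assuming a local Ollama server and the `ollama` Python client (`retrieve` is a placeholder for the repo's FAISS retrieval step):

```python
# Sketch of the two inference modes compared in this project.
# Assumes a running Ollama server and `pip install ollama`.
import ollama

MODEL = "qwen2.5:0.5b-instruct"

def answer_closed_book(question: str) -> str:
    """No RAG: the model must answer from its weights alone."""
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": question}])
    return resp["message"]["content"]

def answer_with_rag(question: str, retrieve) -> str:
    """RAG: top-k retrieved chunks are prepended to the prompt."""
    context = "\n\n".join(retrieve(question))  # chunks from the FAISS index
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]
```

Fine-tuning changes what `MODEL` points to; RAG changes what goes into the prompt. The experiment crosses both.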
| Aspect | Configuration |
|---|---|
| Subject Model | Qwen 2.5 0.5B Instruct (small language model, Q4_K_M quantized) |
| Judge Model | Qwen 3 8B (thinking mode) + Gemini 3 Pro (pairwise) |
| Dataset | 120 arXiv papers (2025), 154 evaluation QA pairs |
| Quantization | Q4_K_M (4-bit) vs F16 comparison |
Scope: The findings below apply to Qwen 2.5 0.5B — a small language model with limited capacity. Larger models (3B, 7B, 14B+) may exhibit different patterns, particularly regarding fine-tuning benefits. This pipeline is designed to test such hypotheses.
We evaluated six conditions crossing training approach (none, instruction-only FT, RAG-trained FT) with inference mode (with/without RAG). Key findings:
| Finding | Evidence |
|---|---|
| RAG provides substantial improvement | 4-5x improvement in pass rate (22% vs 4%) |
| Fine-tuning provides marginal gains | +1.3% when training format matches inference |
| Training format affects transfer | RAG-trained models perform better with RAG at inference |
| Fine-tuning alone insufficient | FT-only models (no RAG) perform similarly to base (4-6%) |
Pairwise evaluation of FT+RAG against Base+RAG (blind, judge-scored):

| Comparison | Result |
|---|---|
| FT+RAG vs Base+RAG | 54.9% vs 45.1% win rate (p=0.35, not statistically significant) |
| Judge | Gemini 3 Pro (62 FT wins, 51 Base wins, 41 ties) |
| Interpretation | Insufficient evidence that fine-tuning improves RAG performance at this model size |
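A p-value like this can be obtained with a two-sided exact binomial (sign) test over the decisive comparisons, ties excluded; whether the project used exactly this test is an assumption. A minimal check with SciPy:

```python
# Sign test on the pairwise outcome above: 62 FT wins vs 51 Base wins,
# 41 ties excluded. Assumes (not confirmed) this is the test behind p=0.35.
from scipy.stats import binomtest

ft_wins, base_wins = 62, 51
result = binomtest(k=ft_wins, n=ft_wins + base_wins, p=0.5)
print(f"FT win rate: {ft_wins / (ft_wins + base_wins):.1%}")  # 54.9%
print(f"two-sided p-value: {result.pvalue:.2f}")              # ~0.35
```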
Q4_K_M vs F16 quantization comparison:

| Metric | Result |
|---|---|
| Win rate (Q4_K_M vs F16) | 48.4% vs 51.6% (p=0.84, no significant difference) |
| Size reduction | 60% smaller (397 MB vs 994 MB) |
| Conclusion | Q4_K_M quantization does not degrade quality for this model |
For Qwen 2.5 0.5B on post-cutoff scientific QA:
- RAG is essential — provides the primary performance gain
- Fine-tuning has minimal impact — Base+RAG achieves ~95% of best performance
- Q4_K_M is viable — no measurable quality loss vs F16
- Open question — larger models may show different fine-tuning benefits
Detailed reports: Six-Condition Results | Pairwise Evaluation | Quantization Analysis
The project consists of 5 main stages:
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  1. FETCH   │───▶│  2. INGEST  │───▶│ 3. DATASET  │───▶│ 4. FINETUNE │───▶│ 5. EVALUATE │
│   Papers    │    │   & Index   │    │  Generate   │    │   Models    │    │   Compare   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
   arXiv              PDF→Text           QA/Summary         LoRA/PEFT          6 Conditions
   Harvest            FAISS Index        Train/Eval         2 Models           LLM Judge
```
Download recent arXiv papers that are beyond model training cutoffs.
```bash
python scripts/data/fetch_arxiv_corpus.py \
  --contact-email [email protected] \
  --total 100 \
  --output-dir data/raw/arxiv_2025
```
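Under the hood, fetching amounts to querying arXiv's public API and collecting PDF links. The script's actual logic may differ (batching, rate limiting, resumable downloads); a minimal sketch against the Atom endpoint:

```python
# Sketch of the fetch stage: query arXiv's public Atom API for recent
# papers and print their PDF links. Query and category are illustrative.
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
url = (
    "http://export.arxiv.org/api/query"
    "?search_query=cat:cs.CL&sortBy=submittedDate&sortOrder=descending"
    "&start=0&max_results=5"
)
# arXiv asks automated clients to identify themselves, which is
# presumably what --contact-email feeds into.
req = urllib.request.Request(
    url, headers={"User-Agent": "beyond-the-cutoff ([email protected])"}
)
with urllib.request.urlopen(req) as resp:
    feed = ET.fromstring(resp.read())

for entry in feed.findall(f"{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title", default="").strip()
    pdf = next(
        (link.get("href") for link in entry.findall(f"{ATOM}link")
         if link.get("title") == "pdf"),
        None,
    )
    print(title, "->", pdf)
```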
Convert PDFs to text, chunk documents, and build a FAISS retrieval index.

```bash
python scripts/data/ingest_and_index.py --config configs/default.yaml
```

Output: `data/processed/` (text), `data/external/index/` (FAISS)
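Conceptually, the indexing stage embeds text chunks and stores them in FAISS. The embedding model and index type below are assumptions (the settings actually used live in `configs/default.yaml`):

```python
# Sketch of the chunk -> embed -> index -> retrieve flow.
import faiss
from sentence_transformers import SentenceTransformer

chunks = ["...chunk 1 text...", "...chunk 2 text..."]  # from the PDF->text stage

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = encoder.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product = cosine on normalized vectors
index.add(embeddings)

# Retrieval: embed the query the same way, take the top-k nearest chunks.
query_vec = encoder.encode(["What method does paper X propose?"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 4)
print([chunks[i] for i in ids[0]])
```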
Create QA pairs, summaries, and citation tasks using a strong generator model.
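On disk each task is one JSON object per line; the field names below are illustrative, not the script's actual schema (pretty-printed here for readability):

```json
{
  "task_type": "qa",
  "paper_id": "2501.01234",
  "question": "What loss does the paper introduce for long-context alignment?",
  "reference_answer": "A length-normalized contrastive loss that ...",
  "context_chunks": ["...passage retrieved from the paper..."]
}
```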
```bash
# Generate offline dataset
python scripts/data/generate_offline_dataset.py --config configs/default.yaml

# Validate quality with LLM judge
python scripts/data/evaluate_dataset_quality.py \
  --dataset evaluation/datasets/offline_dataset.jsonl

# Split into train/eval sets
python scripts/data/split_dataset.py \
  --input evaluation/datasets/offline_dataset.jsonl \
  --train-output evaluation/datasets/train_dataset.jsonl \
  --eval-output evaluation/datasets/eval_dataset.jsonl
```

Train two LoRA models in cloud notebooks (Colab/Kaggle):
| Model | Notebook | Training Data |
|---|---|---|
| Instruction-only | `notebooks/finetuning/lora_science_v1_instruction_only.ipynb` | Without RAG contexts |
| RAG-trained | `notebooks/finetuning/lora_science_v1.ipynb` | With RAG contexts |
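The notebooks follow the standard PEFT recipe; a condensed sketch, with hyperparameters that are illustrative rather than the notebooks' actual values:

```python
# Condensed sketch of LoRA fine-tuning with PEFT; not the notebooks verbatim.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # adapters train only a small fraction of weights

train = load_dataset("json", data_files="evaluation/datasets/train_dataset.jsonl")["train"]
# ...tokenize `train` (with or without RAG contexts, depending on the variant),
# run a standard transformers Trainer loop, then merge the adapter and
# export to GGUF for Ollama.
```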
Register with Ollama after training. A sketch of a minimal Modelfile follows, then the registration commands (the actual files live under `ollama/`):
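Paths and parameters here are illustrative, not the repo's actual `ollama/Modelfile.*` contents:

```
# Illustrative Modelfile; see ollama/ for the real files.
FROM ./outputs/lora_science_0p5/qwen2.5-0.5b-merged-q4_k_m.gguf
PARAMETER temperature 0.2
SYSTEM You answer questions about recent scientific papers.
```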
```bash
ollama create lora_science_0p5_instruction_only -f ollama/Modelfile.instruction_only
ollama create lora_science_0p5 -f ollama/Modelfile.rag_trained
```

The evaluation crosses training and inference mode into six conditions:

| # | Condition | Training | RAG at Eval | Purpose |
|---|---|---|---|---|
| 1 | Base Baseline | None | ✗ | Lower bound |
| 2 | RAG Baseline | None | ✓ | RAG-only benefit |
| 3 | FT Only (instruction) | Instruction-only | ✗ | FT-only benefit |
| 4 | FT+RAG (instruction) | Instruction-only | ✓ | Transfer learning test |
| 5 | FT Only (RAG-trained) | RAG-trained | ✗ | Degradation test |
| 6 | FT+RAG (RAG-trained) | RAG-trained | ✓ | Optimal setup |
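The run is driven by a plan file; its real schema is whatever `scripts/core/compare_models.py` expects, so the keys below are purely illustrative:

```yaml
# Illustrative plan structure only; see
# configs/evaluation/six_condition_experiment.yaml for the real schema.
eval_dataset: evaluation/datasets/eval_dataset.jsonl
judge_model: qwen3:8b
conditions:
  - name: base_baseline            # condition 1
    model: qwen2.5:0.5b-instruct
    use_rag: false
  - name: ft_rag_trained_with_rag  # condition 6
    model: lora_science_0p5
    use_rag: true
```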
```bash
python scripts/core/compare_models.py --plan configs/evaluation/six_condition_experiment.yaml
```

See docs/experiment/ for detailed methodology.
- Python ≥ 3.10
- Ollama for local inference
- macOS (Apple Silicon recommended) or Linux
```bash
git clone https://github.com/ignaciolinari/beyond-the-cutoff.git
cd beyond-the-cutoff
make init
source .venv/bin/activate

# (Alternative) If you prefer not to use Make:
# python scripts/bootstrap_env.py

# Pull required models
ollama pull qwen2.5:0.5b-instruct   # Base model
ollama pull qwen3:8b                # Judge model
```

Once you have papers ingested, ask questions:
```bash
python scripts/ask.py "What are the main contributions of paper X?"
```

```
beyond-the-cutoff/
├── apps/ # Interactive Streamlit applications
│ ├── human_annotation.py # Human evaluation annotation UI
│ └── offline_task_viewer.py # Browse generated QA tasks
├── artifacts/ # Training artifacts and checkpoints (generated locally; gitignored)
├── configs/
│ ├── default.yaml # Main pipeline configuration
│ ├── models/ # Model configurations (base, fine-tuned)
│ ├── evaluation/ # Experiment plans (6-condition, pairwise, etc.)
│ └── judges/ # LLM judge configurations
├── data/
│ ├── raw/ # Downloaded PDFs from arXiv
│ ├── processed/ # Extracted text and manifests
│ └── external/ # FAISS index and embeddings
├── docs/
│ ├── experiment/ # Experiment design and methodology
│ ├── reports/ # Analysis results and findings
│ ├── reference/ # Technical documentation
│ ├── scaling/ # Guide for larger models
│ └── future/ # Planned features
├── evaluation/
│ ├── datasets/ # Train/eval JSONL datasets
│ ├── responses/ # Pre-generated model responses
│ ├── results/ # Evaluation metrics and details
│ └── exports/ # Batch files for external judges
├── notebooks/
│ ├── finetuning/ # LoRA training notebooks (Colab/Kaggle)
│ └── data_quality/ # Data analysis notebooks
├── ollama/ # Modelfiles for Ollama registration
├── outputs/ # Fine-tuned model weights (GGUF; generated locally; gitignored)
├── prompts/ # Prompt templates
├── scripts/
│ ├── ask.py # Interactive RAG assistant
│ ├── bootstrap_env.py # Environment setup
│ ├── core/ # Main evaluation pipeline
│ │ ├── compare_models.py
│ │ ├── generate_responses.py
│ │ └── interleaved_evaluation.py
│ ├── data/ # Data processing scripts
│ │ ├── fetch_arxiv_corpus.py
│ │ ├── ingest_and_index.py
│ │ ├── generate_offline_dataset.py
│ │ └── split_dataset.py
│ ├── future/ # Advanced evaluation (ELO, retrieval ablation)
│ ├── utility/ # Analysis and inspection tools
│ └── validation/ # Experiment validation scripts
├── src/beyond_the_cutoff/ # Core Python library
│ ├── data/ # PDF loading, chunking, extraction
│ ├── retrieval/ # FAISS indexing and querying
│ ├── evaluation/ # Judges, metrics, scoring
│ ├── datasets/ # Dataset generation
│ └── models/ # Model interfaces
├── tests/ # Pytest test suite
└── vintage/                 # Archived configs and scripts
```
| Section | Description |
|---|---|
| Experiment Setup | 6-condition design |
| Methodology | Evaluation metrics |
| Pipeline Reference | Full technical details |
See docs/README.md for the complete documentation index.
| Model | Purpose | Ollama Tag |
|---|---|---|
| Qwen 2.5 0.5B | Base model for experiments | qwen2.5:0.5b-instruct |
| Qwen 3 8B | LLM judge | qwen3:8b |
| Fine-tuned (instruction) | Trained without RAG contexts | lora_science_0p5_instruction_only |
| Fine-tuned (RAG-trained) | Trained with RAG contexts | lora_science_0p5 |
```bash
pytest tests/                  # Run tests
pre-commit run --all-files     # Linting & formatting
mypy src/                      # Type checking
```

This pipeline is model-agnostic. To test with different models:
| Model Size | Ollama Tag | Notes |
|---|---|---|
| 3B | `qwen2.5:3b-instruct` | Moderate capacity increase |
| 7B | `qwen2.5:7b-instruct-q4_K_M` | ~11 GB VRAM |
| 14B+ | `qwen2.5:14b-instruct-q4_K_M` | ~16 GB VRAM |
- Update `configs/models/base_ollama.yaml` with the new model tag (see the sketch below)
- Fine-tune using the notebooks (adjust for model size)
- Re-run the evaluation pipeline
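For the first step, the change is a one-line tag swap; the key names below are illustrative of `configs/models/base_ollama.yaml`, not its confirmed schema:

```yaml
# Illustrative keys; match the actual schema in configs/models/base_ollama.yaml.
provider: ollama
model: qwen2.5:7b-instruct-q4_K_M   # was: qwen2.5:0.5b-instruct
```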
| Aspect | 0.5B (Observed) | Larger Models (Hypothesis) |
|---|---|---|
| FT benefit | Marginal | May increase with capacity |
| Knowledge retention | Limited | Better with more parameters |
| RAG dependence | Critical | May decrease |
See docs/scaling/ for detailed instructions.
Features implemented but not executed in this experiment:
| Feature | Script | Purpose |
|---|---|---|
| Live retrieval evaluation | `scripts/future/evaluate_end_to_end.py` | Test with dynamic retrieval |
| Retrieval ablation | `scripts/future/run_retrieval_ablation.py` | Optimize top_k and rerankers |
| ELO tournament | `scripts/future/compute_elo_rankings.py` | Multi-model ranking |
See docs/future/ and scripts/future/README.md.
MIT License. See LICENSE for details.