Tools to convert PDF files to Markdown and embed them in ChromaDB on Linux. Designed for SPEAR.
This repository implements a two-stage document processing and retrieval pipeline for academic PDFs:
- Nougat OCR + merge + cleaning
- RAG ingestion (chunking + embeddings + Chroma persistence)
The stages are intentionally separated and run in different Conda environments to keep dependencies clean and reproducible.
Jupyter notebooks are optional and not required to run the pipeline.
PDFs
↓
[Stage 1] Nougat OCR + merge + clean (conda env: nougat)
↓
Clean Markdown
↓
[Stage 2] RAG ingestion (chunk + embed + persist) (conda env: rag_new)
↓
Chroma vector database
conda env create -f envs/nougat.yml
conda activate nougat
pip install -r envs/nougat.pip.txt

conda env create -f envs/rag.yml
conda activate rag_new
pip install -r envs/rag.pip.txt

Configure paths
Edit:
scripts/nougat_stage.conf

Key variables:
INPUT_PDF_DIR=... # directory containing input PDFs
NOUGAT_OUT_DIR=... # raw Nougat outputs
MERGED_MD_DIR=... # final cleaned Markdown output
LOG_DIR=... # per-PDF Nougat logs
CONDA_ENV=nougat
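For reference, a filled-in `nougat_stage.conf` might look like the following (all paths are hypothetical examples, not defaults shipped with the repository):

```shell
# Example scripts/nougat_stage.conf — paths below are illustrative only
INPUT_PDF_DIR=/data/papers/pdf          # directory containing input PDFs
NOUGAT_OUT_DIR=/data/papers/nougat_raw  # raw Nougat outputs
MERGED_MD_DIR=/data/papers/markdown     # final cleaned Markdown output
LOG_DIR=/data/papers/logs               # per-PDF Nougat logs
CONDA_ENV=nougat
```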
Run Stage 1
bash scripts/nougat_stage.sh scripts/nougat_stage.conf

This stage:
- Activates the nougat Conda environment
- Runs Nougat on every *.pdf in INPUT_PDF_DIR
- Captures per-PDF logs in LOG_DIR
- Merges Nougat output into one Markdown file per PDF
- Removes boilerplate text once, upstream
- Writes provenance files:
  - pdf_sha256.json
  - manifest.json
Output structure:
MERGED_MD_DIR/
├── paper1.md
├── paper2.md
├── pdf_sha256.json
└── manifest.json
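To illustrate what the provenance files capture, here is a minimal sketch of how they could be generated. The exact JSON layout written by `nougat_stage.sh` is an assumption; only the idea (a SHA-256 per input PDF plus a run manifest) is taken from the pipeline description.

```python
# Sketch of provenance-file generation (hypothetical JSON layout).
import hashlib
import json
from pathlib import Path


def write_provenance(pdf_dir: str, out_dir: str) -> dict:
    """Hash every input PDF and write pdf_sha256.json and manifest.json."""
    pdf_path, out_path = Path(pdf_dir), Path(out_dir)
    sha256 = {}
    for pdf in sorted(pdf_path.glob("*.pdf")):
        # SHA-256 of the raw bytes lets Stage 2 detect changed inputs.
        sha256[pdf.name] = hashlib.sha256(pdf.read_bytes()).hexdigest()
    manifest = {"source_dir": str(pdf_path), "files": list(sha256)}
    (out_path / "pdf_sha256.json").write_text(json.dumps(sha256, indent=2))
    (out_path / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

With these hashes recorded, "If PDFs change, rerun Stage 1" becomes checkable: a changed file produces a different digest.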
Configure paths
Edit:
scripts/rag_stage.conf

Key variables:
MERGED_MD_DIR=... # output from Stage 1
CHROMA_DIR=... # Chroma persistence directory
COLLECTION=... # Chroma collection name
EMBEDDING_MODEL=... # must match query usage
CHUNK_SIZE=1200
CHUNK_OVERLAP=150
CONDA_ENV=rag_new
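A filled-in `rag_stage.conf` might look like this (paths, collection name, and model name are hypothetical examples):

```shell
# Example scripts/rag_stage.conf — values below are illustrative only
MERGED_MD_DIR=/data/papers/markdown       # output from Stage 1
CHROMA_DIR=/data/papers/chroma_db         # Chroma persistence directory
COLLECTION=nougat_merged                  # Chroma collection name
EMBEDDING_MODEL=all-MiniLM-L6-v2          # must match query usage
CHUNK_SIZE=1200
CHUNK_OVERLAP=150
CONDA_ENV=rag_new
```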
Run Stage 2
bash scripts/rag_stage.sh scripts/rag_stage.conf

This stage:
- Activates the rag_new Conda environment
- Loads cleaned Markdown files
- Chunks documents
- Embeds text using the configured model
- Persists embeddings into Chroma
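The chunking step can be pictured as a sliding character window governed by `CHUNK_SIZE` and `CHUNK_OVERLAP`. This is a sketch of that idea, not the repository's actual implementation, which may split on tokens or Markdown structure instead:

```python
# Sliding-window chunking sketch using the Stage 2 defaults
# (CHUNK_SIZE=1200, CHUNK_OVERLAP=150). Character-based for simplicity.
def chunk_text(text: str, chunk_size: int = 1200, overlap: int = 150) -> list:
    """Split text into overlapping windows of at most chunk_size chars."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # window advances by this many characters
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already reached the end of the text
    return chunks
```

The overlap means the tail of each chunk reappears at the head of the next, so sentences straddling a boundary stay retrievable from at least one chunk.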
This stage is write-only: it builds the vector database but never queries it.
To query the database from the terminal:
conda activate rag_new
./scripts/query_chroma.py \
--chroma_dir /path/to/chroma_db \
--collection nougat_merged \
--query "Atlantic Meridional Overturning Circulation"

This is a read-only operation and does not modify the database.
- Each stage explicitly activates its required Conda environment
- Boilerplate removal happens only in Stage 1
- Stage 2 assumes inputs are already clean
- Stages can be run independently
- Scripts are suitable for batch jobs, cron, SLURM, or CI pipelines
- Environment separation avoids dependency conflicts between OCR and RAG tooling
- Do not run both stages in the same Conda environment
- Do not re-embed with a different embedding model unless rebuilding the database
- If PDFs change, rerun Stage 1 before Stage 2
- If only chunking parameters change, rerun Stage 2 only