A Python pipeline for extracting and analyzing references from academic PDFs using GROBID.
- Extract structured references from PDFs using GROBID
- Automatic DOI resolution via CrossRef API
- Generate annotated PDFs with highlighted references
- Checkpoint system for resuming interrupted processing
- Parallel processing support
- Full Unicode support
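
To illustrate the DOI resolution feature, here is a minimal sketch of a lookup against the public CrossRef REST API (`/works` with the `query.bibliographic` parameter). It is not taken from this project's code: the function names and the `min_score` threshold are hypothetical, and it uses only the standard library so it runs without extra dependencies.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_crossref_url(citation, rows=1):
    """Build a CrossRef bibliographic-search URL for a raw citation string."""
    query = urllib.parse.urlencode({"query.bibliographic": citation, "rows": rows})
    return f"{CROSSREF_WORKS_URL}?{query}"


def pick_doi(payload, min_score=60.0):
    """Return the DOI of the top hit, or None if there is no confident match.

    min_score is an illustrative cutoff on CrossRef's relevance score,
    not a value used by this project.
    """
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None
    best = items[0]
    if best.get("score", 0.0) < min_score:
        return None
    return best.get("DOI")


def resolve_doi(citation, timeout=10.0):
    """Query CrossRef over the network and return a DOI string or None."""
    url = build_crossref_url(citation)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return pick_doi(json.load(resp))
```

Keeping URL construction and response parsing separate from the network call makes the matching logic easy to test offline. A real run would call `resolve_doi("<raw citation text>")` and should handle network errors and rate limits.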
Start GROBID Docker:
docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2
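
The container can take a little while to become ready. As a small sketch (the helper name `grobid_is_alive` is hypothetical and not part of this project), the GROBID `/api/isalive` endpoint can be polled before starting a run, assuming the service is exposed on port 8070 as in the command above:

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if GROBID's /api/isalive endpoint answers with 'true'."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.read().decode("utf-8").strip().lower() == "true"
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat the service as not ready.
        return False
```

Calling `grobid_is_alive()` returns `False` until the service accepts connections, so it can be used in a simple wait loop before processing PDFs.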
Install dependencies:
pip install requests pandas tqdm lxml PyMuPDF
Run the pipeline:
cd reference_extraction/scripts
python master_pipeline.py --test  # Test mode
python master_pipeline.py         # Process all PDFs
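
For orientation, the GROBID step that such a pipeline builds on can be sketched as follows. This is an illustrative example, not the contents of `master_pipeline.py`: the function names are hypothetical, and the parser uses the standard library's `xml.etree.ElementTree` rather than `lxml` to stay self-contained. It assumes GROBID's `/api/processReferences` endpoint, which takes the PDF in a multipart field named `input` and returns TEI XML.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def fetch_references_tei(pdf_path, base_url="http://localhost:8070", timeout=120):
    """POST a PDF to GROBID's processReferences endpoint and return TEI XML."""
    import requests  # third-party; imported lazily so the parser works without it

    with open(pdf_path, "rb") as pdf:
        resp = requests.post(
            f"{base_url}/api/processReferences",
            files={"input": pdf},
            timeout=timeout,
        )
    resp.raise_for_status()
    return resp.text


def parse_references(tei_xml):
    """Extract title, year, and DOI from each <biblStruct> in GROBID TEI output."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        # First <title> in document order: the article title if present,
        # otherwise the title of the containing work (book, journal).
        title = bibl.find(".//tei:title", TEI_NS)
        date = bibl.find(".//tei:date[@when]", TEI_NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        refs.append({
            "title": title.text if title is not None else None,
            "year": date.get("when")[:4] if date is not None else None,
            "doi": doi.text if doi is not None else None,
        })
    return refs
```

With the container running, `parse_references(fetch_references_tei("paper.pdf"))` would yield one dictionary per extracted reference; `paper.pdf` is a placeholder path.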
- See reference_extraction/README.md for detailed usage
- See CLAUDE.md for development guidance
This project is for personal research use only.