xxmikexx1/goldfranks-grobid-pipeline

Goldfranks GROBID Reference Extraction

A Python pipeline for extracting and analyzing references from academic PDFs using GROBID.
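As a rough illustration of the extraction step, the sketch below sends a PDF to GROBID's `/api/processReferences` endpoint and pulls a few fields out of the returned TEI XML. It is not the code in `master_pipeline.py`: the function names, the returned fields, and the use of the standard-library XML parser (rather than `lxml`) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def fetch_references_tei(pdf_path, grobid_url="http://localhost:8070"):
    """POST a PDF to GROBID's processReferences endpoint and return the TEI bytes."""
    import requests  # imported lazily so parse_references() works without it

    with open(pdf_path, "rb") as fh:
        resp = requests.post(
            f"{grobid_url}/api/processReferences",
            files={"input": fh},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.content


def parse_references(tei_xml):
    """Return one dict per <biblStruct> with title, year, and DOI when present."""
    root = ET.fromstring(tei_xml)
    refs = []
    for bibl in root.iter("{http://www.tei-c.org/ns/1.0}biblStruct"):
        # The first <title> in document order is the article title when an
        # <analytic> block exists, otherwise the monograph/journal title.
        title = bibl.find(".//tei:title", TEI_NS)
        date = bibl.find(".//tei:date[@when]", TEI_NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        refs.append({
            "title": title.text if title is not None else None,
            "year": date.get("when")[:4] if date is not None else None,
            "doi": doi.text if doi is not None else None,
        })
    return refs
```

With a GROBID container running, `parse_references(fetch_references_tei("paper.pdf"))` would return a list of dictionaries, one per reference; `paper.pdf` is a placeholder path.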

Features

  • 📄 Extract structured references from PDFs using GROBID
  • 🔍 Automatic DOI resolution via CrossRef API
  • 📌 Generate annotated PDFs with highlighted references
  • 🔄 Checkpoint system for resuming interrupted processing
  • 🚀 Parallel processing support
  • 🌐 Full Unicode support
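The DOI resolution and checkpoint features listed above could fit together roughly as follows. This is a minimal sketch, not the pipeline's actual implementation: the function names, the JSON checkpoint layout, and the `min_score` threshold are all hypothetical. The CrossRef `/works` endpoint with the `query.bibliographic` parameter is the public search API; the network call itself is passed in as `lookup` so the logic can be exercised offline.

```python
import json
import os
from urllib.parse import urlencode


def crossref_query_url(citation, rows=1):
    """Build a CrossRef /works search URL for a free-text citation."""
    params = {"query.bibliographic": citation, "rows": rows}
    return "https://api.crossref.org/works?" + urlencode(params)


def pick_doi(response, min_score=50.0):
    """Return the top hit's DOI, or None if there is no hit or its score is low.

    min_score is an illustrative cutoff, not a value recommended by CrossRef.
    """
    items = response.get("message", {}).get("items", [])
    if not items or items[0].get("score", 0.0) < min_score:
        return None
    return items[0].get("DOI")


def resolve_with_checkpoint(citations, lookup, checkpoint_path):
    """Resolve each citation once, saving progress after every lookup."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as fh:
            done = json.load(fh)
    for citation in citations:
        if citation in done:
            continue  # already resolved in an earlier run
        done[citation] = pick_doi(lookup(crossref_query_url(citation)))
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(done, fh, ensure_ascii=False)
        os.replace(tmp, checkpoint_path)  # swap in the complete file in one step
    return done
```

In real use, `lookup` could be something like `lambda url: requests.get(url, timeout=30).json()`. If a run is interrupted, calling `resolve_with_checkpoint` again with the same checkpoint path skips every citation already stored in the file.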

Quick Start

  1. Start the GROBID Docker container:

    docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.2

  2. Install dependencies:

    pip install requests pandas tqdm lxml PyMuPDF

  3. Run the pipeline:

    cd reference_extraction/scripts
    python master_pipeline.py --test  # Test mode
    python master_pipeline.py         # Process all PDFs
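The GROBID container can take a little while to load its models after `docker run`, so it may help to confirm the service is up before step 3. The sketch below polls GROBID's `/api/isalive` health endpoint. The function names and retry settings are illustrative and not part of this repository, and the check assumes the endpoint answers with the plain text `true`.

```python
import time
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if GROBID's health endpoint answers 200 with body 'true'."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200 and resp.read().strip() == b"true"
    except OSError:  # covers connection refused, timeouts, and HTTP errors
        return False


def wait_for_grobid(check=grobid_is_alive, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll check() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    if not wait_for_grobid():
        raise SystemExit("GROBID did not respond on port 8070")
    print("GROBID is ready")
```

The `check` and `sleep` parameters exist only so the retry loop can be exercised without a running server.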

Documentation

License

This project is for personal research use only.
