Summary
The current DOI extraction logic (extractor.extract_doi) primarily supports modern formats such as https://doi.org/... and doi:10.xxxx/....
Legacy and alternative DOI citation formats that are common in PDFs are not reliably detected because of formatting variations.
Problem
In PDF-extracted text, DOIs often appear in inconsistent formats such as:
DOI 10.1007/s11089-018-0833-1
DOI: 10.1038/nature12373
doi 10.1126/science.123456
These variations (a missing colon, inconsistent spacing or casing, stray line breaks) are not always captured by the current extraction logic.
Proposed Enhancement
Improve extractor.extract_doi to:
Support legacy DOI prefixes (e.g., DOI, doi) with flexible formatting
Handle inconsistent whitespace around the prefix and colon
Be more robust to PDF text extraction artifacts (e.g., line breaks)
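The points above could be covered by a single tolerant pattern. The following is only a minimal sketch, not the existing implementation: the names LEGACY_DOI_RE and extract_legacy_doi are hypothetical, and the pattern assumes that a DOI ends at the first whitespace character and that trailing sentence punctuation is not part of it.

```python
import re

# Hypothetical pattern (an assumption, not the project's current regex).
# Matches "doi" in any casing, an optional colon, and any amount of
# whitespace (including line breaks) before the "10.xxxx/..." identifier.
LEGACY_DOI_RE = re.compile(
    r"\bdoi\s*:?\s*(10\.\d{4,9}/\S+)",
    re.IGNORECASE,
)


def extract_legacy_doi(text):
    """Return the first prefixed DOI found in text, or None."""
    match = LEGACY_DOI_RE.search(text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI in running text.
    return match.group(1).rstrip(".,;)")
```

Because \s also matches newlines, a prefix and identifier split across two lines by PDF extraction would still be joined. A DOI that is itself broken in the middle by a line break is not handled by this sketch.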
Expected Behaviour
All valid DOIs should be extracted consistently and normalized to the bare identifier form:
10.xxxx/...
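As an illustration of the expected normalization, the sketch below reduces several input forms to the bare identifier. The helper name normalize_doi and the set of recognised prefixes (including the older dx.doi.org host) are assumptions made for this example and are not taken from the existing code.

```python
import re

# Hypothetical prefix pattern: a doi.org URL, or a "doi" label with an
# optional colon and flexible whitespace. Both alternatives are assumptions.
_PREFIX_RE = re.compile(
    r"^(?:https?://(?:dx\.)?doi\.org/|doi\s*:?\s*)",
    re.IGNORECASE,
)


def normalize_doi(raw):
    """Strip a known prefix and trailing punctuation from a raw DOI string."""
    cleaned = _PREFIX_RE.sub("", raw.strip())
    return cleaned.rstrip(".,;")


samples = [
    "https://doi.org/10.1038/nature12373",
    "doi:10.1038/nature12373",
    "DOI: 10.1038/nature12373",
    "DOI\n10.1038/nature12373.",
]
# Every sample should collapse to the same bare identifier.
normalized = {normalize_doi(s) for s in samples}
```

With these inputs, normalized should contain a single value, which is the behaviour this issue asks for across modern and legacy formats.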
Notes
This issue focuses only on extraction robustness
No changes to reporting or classification logic are required
Changes should be minimal and avoid breaking existing detection behaviour