Enhance DOI extraction to support legacy and alternative citation formats #470

Description

@Aakash-pal

Summary

The current DOI extraction logic (via extractor.extract_doi) primarily supports modern formats such as https://doi.org/... and doi:10.xxxx/....

However, legacy and alternative DOI citation formats commonly found in PDFs are not reliably detected, because their prefix, spacing, and casing vary.

Problem

In PDF-extracted text, DOIs often appear in inconsistent formats such as:

DOI 10.1007/s11089-018-0833-1
DOI: 10.1038/nature12373
doi 10.1126/science.123456

These variations (missing colon, inconsistent spacing, casing, or line breaks) may not always be captured by the current extraction logic.

Proposed Enhancement

Improve extractor.extract_doi to:

Support legacy DOI prefixes (e.g., DOI, doi) with flexible formatting
Handle inconsistent whitespace and spacing
Be more robust to PDF text extraction artifacts (e.g., line breaks)
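
One way the points above could be approached is a single case-insensitive pattern that accepts either a resolver URL or a loose "doi" prefix. This is only a sketch: `DOI_PATTERN` and `extract_doi_sketch` are hypothetical names, the registrant length `\d{4,9}` and the trailing-punctuation stripping are assumptions, and nothing here reflects how extractor.extract_doi is currently written.

```python
import re

# Hypothetical pattern for illustration; not the project's actual implementation.
DOI_PATTERN = re.compile(
    r"""
    (?:https?://(?:dx\.)?doi\.org/   # resolver URL prefix
      |doi\s*:?\s*                   # "doi", "DOI:", "doi :" with any spacing or line break
    )
    (10\.\d{4,9}/[^\s<>]+)           # the DOI itself: 10.<registrant>/<suffix>
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_doi_sketch(text):
    """Return the first bare DOI found in text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Assumption: sentence punctuation directly after a DOI is not part of it.
    return match.group(1).rstrip(".,;)")


if __name__ == "__main__":
    for sample in (
        "DOI 10.1007/s11089-018-0833-1",
        "DOI: 10.1038/nature12373",
        "doi 10.1126/science.123456",
        "DOI:\n10.1038/nature12373.",
    ):
        print(repr(sample), "->", extract_doi_sketch(sample))
```

Because `\s*` also matches newlines, a line break between the prefix and the identifier is tolerated. A line break inside the DOI suffix itself is not handled by this sketch and would need separate treatment.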

Expected Behaviour

All valid DOI identifiers should be consistently extracted and normalized to:

10.xxxx/...
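
As a rough illustration of that normalization step, the helper below strips a known prefix and surrounding whitespace from an already-located DOI string. `normalize_doi_sketch` is a hypothetical name, and dropping trailing `.`, `,` and `;` is an assumption about what counts as noise; case is left untouched because the issue does not say whether output should be lowercased.

```python
import re

# Hypothetical helper for illustration; prefix list is an assumption.
_PREFIX = re.compile(
    r"^\s*(?:https?://(?:dx\.)?doi\.org/|doi\s*:?\s*)",
    re.IGNORECASE,
)


def normalize_doi_sketch(raw):
    """Reduce a DOI citation string to its bare 10.xxxx/... form."""
    bare = _PREFIX.sub("", raw).strip()
    # Assumption: trailing sentence punctuation is not part of the DOI.
    return bare.rstrip(".,;")


if __name__ == "__main__":
    print(normalize_doi_sketch("DOI: 10.1038/nature12373"))
    print(normalize_doi_sketch("https://doi.org/10.1007/s11089-018-0833-1"))
```

An input that is already bare, such as 10.1038/nature12373, passes through unchanged, which keeps the helper safe to apply to results from the existing detection paths.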

Notes

This issue focuses only on extraction robustness
No changes to reporting or classification logic are required
Changes should be minimal and avoid breaking existing detection behaviour
