Enhance DOI extraction to support legacy and alternative citation formats

### Summary

The current DOI extraction logic (via `extractor.extract_doi`) primarily supports modern formats such as `https://doi.org/...` and `doi:10.xxxx/...`.

However, legacy and alternative DOI citation formats commonly found in PDFs are not reliably detected because their formatting varies.

### Problem

In PDF-extracted text, DOIs often appear in inconsistent formats such as:

- `DOI 10.1007/s11089-018-0833-1`
- `DOI:    10.1038/nature12373`
- `doi 10.1126/science.123456`

These variations (missing colon, inconsistent spacing, casing, or line breaks) may not always be captured by the current extraction logic.
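To make the failure mode concrete, here is a minimal sketch using a deliberately strict, hypothetical pattern. It is an assumption for illustration only, not the regex actually used by `extractor.extract_doi`. It accepts just the two modern forms and shows which of the sample strings it would miss.

```python
import re

# Hypothetical strict pattern (NOT the project's real regex): accepts only
# "https://doi.org/<doi>" or a lowercase "doi:<doi>" with no space after the colon.
STRICT = re.compile(r"(?:https?://doi\.org/|doi:)(10\.\d{4,9}/\S+)")

samples = [
    "https://doi.org/10.1038/nature12373",
    "doi:10.1038/nature12373",
    "DOI 10.1007/s11089-018-0833-1",
    "DOI:    10.1038/nature12373",
    "doi 10.1126/science.123456",
]

# Map each sample to whether the strict pattern finds a DOI in it.
results = {s: bool(STRICT.search(s)) for s in samples}

for sample, hit in results.items():
    print(f"{'match' if hit else 'miss':5}  {sample!r}")
```

Under this assumed pattern, the two modern forms match and the three legacy forms are missed because of casing, the missing colon, or the whitespace after the colon.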

### Proposed Enhancement

Improve `extractor.extract_doi` to:

- Support legacy DOI prefixes (e.g., `DOI`, `doi`) with flexible formatting
- Handle inconsistent whitespace and spacing
- Be more robust to PDF text extraction artifacts (e.g., line breaks)

### Expected Behaviour

All valid DOI identifiers should be consistently extracted and normalized to:

`10.xxxx/...`
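One possible shape for the more tolerant matching is sketched below. The function name `extract_doi_sketch`, its signature, and the choice to require a prefix are assumptions made for illustration; the real `extractor.extract_doi` may differ. The sketch is case-insensitive, treats the colon as optional, and lets any whitespace (including a line break) sit between the prefix and the identifier. It does not try to repair a line break that falls inside the DOI itself.

```python
import re
from typing import Optional

# Prefix: a doi.org / dx.doi.org URL, or the word "doi" followed by an
# optional colon and any amount of whitespace (spaces, tabs, line breaks).
# Body: the common "10.<registrant>/<suffix>" shape, captured in group 1.
DOI_RE = re.compile(
    r"(?:https?://(?:dx\.)?doi\.org/|\bdoi\b\s*:?\s*)"
    r"(10\.\d{4,9}/[^\s\"<>]+)",
    re.IGNORECASE,
)


def extract_doi_sketch(text: str) -> Optional[str]:
    """Return the first DOI in `text` as a bare '10.xxxx/...' string, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Drop sentence punctuation that may trail the identifier in running text.
    return match.group(1).rstrip(".,;")


if __name__ == "__main__":
    for sample in (
        "DOI 10.1007/s11089-018-0833-1",
        "DOI:    10.1038/nature12373",
        "doi 10.1126/science.123456",
        "https://doi.org/10.1038/nature12373",
    ):
        print(extract_doi_sketch(sample))
```

Requiring a prefix keeps the sketch conservative about false positives; whether bare identifiers without any prefix should also be accepted is left open here.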

### Notes

- This issue focuses only on extraction robustness
- No changes to reporting or classification logic are required
- Changes should be minimal and avoid breaking existing detection behaviour

Enhance DOI extraction to support legacy and alternative citation formats #470